xpath
This brief tutorial will explain the basics of XPath and how it can serve as a processing tool for HTML (or XML) pages.
Often we have to gather useful information from web sites, in HTML or XML format, and we'd like to extract it. For instance, the extracted information can be indexed in a database or elsewhere. Imagine an online commerce website with a list of categories and brands you want to extract (this becomes boring when the list has more than 10 items), or a list of IP/port pairs, or bug descriptions, and so on. The purpose of this article is to put together a very simple but generic script to deal with these situations. Of course you've already realized that such a script could be used together with a web crawler, both for parsing links and for gathering sensitive data (at this point XSLT becomes a good and portable choice).
what's xpath
XPath is yet another devilry from the w3c. It allows you to specify paths (similar to filesystem ones) in order to search for one or more particular tags in an XML document. What makes XPath powerful is that paths are not just plain paths: you can specify several conditions on the tags themselves. It's easier to write than to explain!
A couple of examples for understanding basic xpaths:
<catalog>
  <product tested="true">Product 1</product>
  <product tested="false">Product 2</product>
  <product>
    Product 3
    <desc>Description 3</desc>
  </product>
  ...
</catalog>
The following paths on the left are valid xpaths:
/catalog → returns a list of all the <catalog> elements contained in the document root
/catalog/product → returns a list of all the <product> tags that... guess by yourself
/catalog/product[1] → returns the first <product> tag
/catalog/product[@tested="true"] → all the <product tested="true">
/catalog/product/@tested → all the tested attribute values of all product elements
//product → all the <product> of the document, also the nested ones
//product/desc → all the desc elements of all the product elements
//desc → all the desc elements, also the ones outside product tags
//desc/text() → all the text contained in each desc element of the document
//product[desc] → all the product elements that have at least one desc child element
With @ you select attributes. Notice that with / (slash) you define the path of what you want to select, while with [...] (square brackets) you specify conditions on the context node. This will become even clearer once you use XPath for real.
Also notice that product[1] is the first element, not product[0] (it's not like C arrays), and in this case it doesn't return a list but a single element, because the path is not ambiguous.
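To get a feel for these paths, here's a minimal sketch you can run (jumping ahead to the lxml library introduced later in this article; the catalog is the example above, inlined as a string):

import lxml.etree

catalog = """
<catalog>
  <product tested="true">Product 1</product>
  <product tested="false">Product 2</product>
  <product>
    Product 3
    <desc>Description 3</desc>
  </product>
</catalog>"""

# parse the XML string into a tree we can run xpaths on
tree = lxml.etree.fromstring(catalog)

# a condition on an attribute: only the tested products
print tree.xpath('/catalog/product[@tested="true"]')
# the text of every desc element in the document
print tree.xpath('//desc/text()')

The first print shows a list of matching elements, the second a list of strings: ['Description 3'].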
You can see that the xpath for obtaining a list of all the links contained in an XHTML page practically writes itself:

//a/@href

Ok but... where's the output? Well, XPath is just a language: it needs a sort of interpreter, which in most cases is a library for a programming language.
from html to xml
Before continuing, we must understand that XPath works with XML documents, that is, well-formed documents where every opened tag has its matching closing tag, attribute values are quoted, and so on. Most of the time you won't find XHTML pages but plain HTML, so XPath alone won't work.
As we're interested in HTML and even broken HTML documents, we need a tool to convert them to XML. High level programming languages definitely help us in this task. In this tutorial we'll use Python and the lxml library (which has some optimized parts written in C).
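As a quick sketch of what this conversion means in practice (the broken fragment below is just invented for illustration), lxml rebuilds a well-formed tree from sloppy markup:

import lxml.html

# a deliberately broken fragment: unclosed tags, unquoted attributes
broken = '<ul><li><a href=first.html>first<li><a href=second.html>second'
doc = lxml.html.fromstring(broken)

# serializing it back shows the repaired, well-formed markup
print lxml.html.tostring(doc)

The unclosed <li> and <a> tags get closed and the href values quoted, giving us something XPath can digest.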
Let's start our script with a simple skeleton:
import lxml.html
import sys

# parse the HTML file given as first argument into an XML tree
tree = lxml.html.parse(sys.argv[1])
Name the file parser.py, then download any HTML page as page.html. Now it's possible to invoke our mini-script as follows:
$ python parser.py page.html
In the program, the `tree' variable will contain the XML document, ready for evaluating xpaths. A variant of the above function exists for parsing strings directly: lxml.html.fromstring. That's useful with sockets, when you don't want to save data to disk.
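For instance, a minimal sketch of that variant reading straight from the network (the URL is just a placeholder):

import urllib
import lxml.html

# fetch the page into memory and parse it, no temporary file needed
page = urllib.urlopen('http://www.example.com/').read()
tree = lxml.html.fromstring(page)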
Let's finalize the script by adding the following line at the end:
print tree.xpath(sys.argv[2])
Perfect! Now we can pass an xpath as the second argument of the script. We'll get Python-style output, because the result of the function is, of course, Python data. It's your job to transform that Python data into something else, or to use XSLT instead.
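For example, here's a variant of the whole script that prints one match per line instead of the raw Python list (assuming the xpath selects text values, like the ones ending in text() above):

import sys
import lxml.html

tree = lxml.html.parse(sys.argv[1])

# print each result on its own line instead of dumping the list
for result in tree.xpath(sys.argv[2]):
    print result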
real usage examples
Here we have the list of the Debian mailing lists at http://lists.debian.org/completeindex.html... let's save it. Now we want to obtain the complete list of all the mailing list names in that page.
Let's analyze the HTML code. You'll notice that each name can be found inside an <a>, which is inside a <li>, which is inside a <ul>.
A possible xpath could be the following:

//ul/li/a/text()

But this could be ambiguous, because there might be other <ul><li><a> in other positions of the page. This xpath is really simple; we can make it a bit more elaborate and be sure to gather only the mailing lists. You've surely noticed that the list we're after contains at least 10 <li> elements; that's a good filter for us:

//ul[li[10]]/li/a/text()

This way we require the <ul> to have at least 10 <li> children.
Now let's launch our script:
$ python parser.py completeindex.html "//ul[li[10]]/li/a/text()"
['cdwrite', 'debian-68k', 'debian-accessibility', 'debian-admin', 'debian-admintool', 'debian-alpha', 'debian-amd64', 'debian-announce', 'debian-apache', 'debian-arm', 'debian-autobuild', 'debian-beowulf', 'debian-blends', 'debian-books', 'debian-boot', 'debian-bsd', 'debian-bugs-closed', ...
Now let's do a Yahoo! search; if the website hasn't changed since I wrote this code, the following script should give you the titles and URLs of the main results:
import sys
import lxml.html
import urllib

baseurl = "http://search.yahoo.com/search?"
xpath = "//li/div/div/h3/a"

# build the query string and fetch the results page
query = urllib.urlencode({'p': sys.argv[1]})
page = urllib.urlopen(baseurl + query).read()

# parse the HTML and extract the result links
tree = lxml.html.fromstring(page)
elements = tree.xpath(xpath)
for element in elements:
    print element.text_content(), '-', element.get('href')
I leave you with some interesting links on xpath, xslt and lxml :)
$ python search.py "xpath lxml xslt"
XML Path Language (XPath) - http://www.w3.org/TR/xpath
XSL Transformations - Wikipedia, the free encyclopedia - http://en.wikipedia.org/wiki/XSLT
XPath and XSLT with lxml - http://codespeak.net/lxml/xpathxslt.html
Why XSLT 2.0 and XPath 2.0? - http://www.altova.com/XSLT_XPath_2.html
...
conclusion
Even though I personally dislike many of the things built around XML, I think it's necessary to admit that something useful has finally come out of that contorted world.
In particular, XSLT is a sort of programming language in XML form (argh!) which makes use of XPath.
It can transform any XML document into any other document (not only XML). In the code above we used the Python `for' statement, which could be avoided if we used XSLT. Stylesheets written in XSLT are therefore portable among different programming languages, and often better optimized. Use your imagination... :)
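To close the circle, here's a sketch of that idea (the stylesheet below is an invented example, not part of the original script): the link extraction from the beginning of the article, with the loop moved from Python into an XSLT stylesheet applied through lxml:

import lxml.html
import lxml.etree

# a stylesheet that emits every link target, one per line;
# note that the for-each lives in XSLT, not in Python
stylesheet = lxml.etree.XML('''
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//a/@href">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>''')

transform = lxml.etree.XSLT(stylesheet)
tree = lxml.html.parse('page.html')
print transform(tree)

The same stylesheet could be fed, untouched, to any other XSLT processor: that's the portability mentioned above.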