lxml and Requests
lxml is a pretty extensive library written for parsing XML and HTML documents very quickly, even handling messed up tags in the process. We will also be using the Requests module instead of the already built-in urllib2 module due to improvements in speed and readability. You can easily install both using pip install lxml and pip install requests.
Let’s start with the imports:
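These are the two modules mentioned above:

```python
# The html module handles parsing; requests handles fetching.
from lxml import html
import requests
```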
Next we will use requests.get to retrieve the web page with our data, parse it using the html module, and save the results in tree:
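A minimal sketch of that step; the URL below is a placeholder, not a real data page:

```python
from lxml import html
import requests

def get_tree(url):
    """Fetch a page and parse it into an lxml element tree."""
    page = requests.get(url)
    # html.fromstring expects bytes, so pass page.content, not page.text.
    return html.fromstring(page.content)

# Placeholder URL; substitute the page that holds your data:
# tree = get_tree('http://example.com/data.html')
```

Wrapping the two calls in a small helper keeps the fetch-and-parse step reusable if you later loop over several pages.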
(We need to use page.content rather than page.text because html.fromstring implicitly expects bytes as input.)
tree now contains the whole HTML file in a nice tree structure which we can go over two different ways: XPath and CSSSelect. In this example, we will focus on the former.
XPath is a way of locating information in structured documents such as HTML or XML documents. A good introduction to XPath is on W3Schools.
There are also various tools for obtaining the XPath of elements such as FireBug for Firefox or the Chrome Inspector. If you’re using Chrome, you can right click an element, choose ‘Inspect element’, highlight the code, right click again, and choose ‘Copy XPath’.
After a quick analysis, we see that in our page the data is contained in two elements – one is a div with title ‘buyer-name’ and the other is a span with class ‘item-price’:
Knowing this we can create the correct XPath query and use the lxml xpath function like this:
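A sketch of those queries. Since the example page itself isn’t reproduced here, the snippet runs them against an inline sample that mirrors the two elements just described (the buyer and price values are made up for illustration):

```python
from lxml import html

# Inline stand-in for the scraped page: a div carrying the buyer's
# name in its title attribute, and a span holding the price.
sample = b"""
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
</body></html>
"""
tree = html.fromstring(sample)

# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
```

Each query returns a plain Python list of the matching text nodes, one entry per element found.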
Let’s see what we got exactly:
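Printing both lists, with illustrative values assumed so the example is runnable on its own:

```python
# Suppose the two xpath queries produced these (made-up) lists:
buyers = ['Carson Busses', 'Earl E. Bird']
prices = ['$29.95', '$8.37']

print('Buyers: ', buyers)
print('Prices: ', prices)
```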
Congratulations! We have successfully scraped all the data we wanted from a web page using lxml and Requests. We have it stored in memory as two lists. Now we can do all sorts of cool stuff with it: we can analyze it using Python or we can save it to a file and share it with the world.
Some more cool ideas to think about are modifying this script to iterate through the rest of the pages of this example dataset, or rewriting this application to use threads for improved speed.
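As a sketch of the threading idea: fetching pages is network-bound, so a thread pool can overlap the waits. Everything here, including the URL list and the helper names, is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

from lxml import html
import requests

def extract_buyers(content):
    """Pull buyer names out of one page's raw HTML bytes."""
    tree = html.fromstring(content)
    return tree.xpath('//div[@title="buyer-name"]/text()')

def scrape(url):
    """Fetch one page and extract its buyer names."""
    return extract_buyers(requests.get(url).content)

def scrape_all(urls, workers=4):
    """Fetch several pages concurrently. requests releases the GIL
    while waiting on the network, so threads overlap the slow part."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape, urls))

# Hypothetical paging scheme:
# all_buyers = scrape_all(['http://example.com/page1.html',
#                          'http://example.com/page2.html'])
```

Keeping the parsing in its own function (extract_buyers) also makes the logic easy to test without touching the network.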