Web Scraper Scenario #235
Merged
7 commits
faae04c Added scenario about web scraping using lxml (sirMackk)
3aef3bd 2nd draft of web scraping scenario (sirMackk)
c3d7bdd Third, final markup fixes. (sirMackk)
83c9cba Added a bit more code to improve understanding. (sirMackk)
a22a6e9 Fixing html code-block (sirMackk)
32dea94 Using requests instead of urllib2, final draft. (sirMackk)
aa7f9aa Final version (sirMackk)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| HTML Scraping | ||
| ============= | ||
|
|
||
| Web Scraping | ||
| ------------ | ||
|
|
||
| Web sites are written using HTML, which means that each web page is a | ||
| structured document. Sometimes it would be great to obtain some data from | ||
| them and preserve the structure while we're at it. Web sites don't always | ||
| provide their data in comfortable formats such as ``.csv``. | ||
|
|
||
| This is where web scraping comes in. Web scraping is the practice of using a | ||
| computer program to sift through a web page and gather the data that you need | ||
| in a format most useful to you while at the same time preserving the structure | ||
| of the data. | ||
|
|
||
| lxml and Requests | ||
| ----------------- | ||
|
|
||
| `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing | ||
| XML and HTML documents really fast. It even handles messed up tags. We will | ||
| also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the built-in ``urllib2`` | ||
| due to improvements in speed and readability. You can easily install both | ||
| using ``pip install lxml`` and ``pip install requests``. | ||
|
|
||
| Let's start with the imports: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from lxml import html | ||
| import requests | ||
|
|
||
| Next we will use ``requests.get`` to retrieve the web page with our data | ||
| and parse it using the ``html`` module and save the results in ``tree``: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| page = requests.get('http://econpy.pythonanywhere.com/ex/001.html') | ||
| # Pass the raw bytes so lxml can detect the document encoding itself | ||
| tree = html.fromstring(page.content) | ||
|
|
||
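The download step can fail (timeouts, 404 pages), so it may help to wrap it in a small function. The following is only a sketch: the name ``fetch_tree``, the ``get`` parameter and the 10 second timeout are assumptions made for illustration and are not part of the original scenario. It relies on the documented ``requests.get(..., timeout=...)``, ``Response.raise_for_status()`` and ``Response.content`` calls.

```python
import requests
from lxml import html


def fetch_tree(url, get=requests.get, timeout=10):
    """Download `url` and return the parsed lxml tree.

    `get` is a hypothetical hook: it defaults to requests.get, but any
    callable with the same shape can be passed in, e.g. for offline tests.
    """
    page = get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    page.raise_for_status()
    # Hand lxml the raw bytes so it can work out the encoding itself.
    return html.fromstring(page.content)
```

Calling ``fetch_tree('http://econpy.pythonanywhere.com/ex/001.html')`` would then return the same kind of tree as the two lines above, assuming the page is reachable.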
| ``tree`` now contains the whole HTML file in a nice tree structure which | ||
| we can query in two different ways: XPath and CSSSelect. In this example, we | ||
| will focus on the former. | ||
|
|
||
| XPath is a way of locating information in structured documents such as | ||
| HTML or XML documents. A good introduction to XPath is on `W3Schools <http://www.w3schools.com/xpath/default.asp>`_. | ||
|
|
||
| There are also various tools for obtaining the XPath of elements, such as | ||
| Firebug for Firefox. If you're using Chrome, you can right click an | ||
| element, choose 'Inspect element', highlight the code, and then right | ||
| click again and choose 'Copy XPath'. | ||
|
|
||
| After a quick analysis, we see that in our page the data is contained in | ||
| two elements - one is a div with title 'buyer-name' and the other is a | ||
| span with class 'item-price': | ||
|
|
||
| :: | ||
|
|
||
| <div title="buyer-name">Carson Busses</div> | ||
| <span class="item-price">$29.95</span> | ||
|
|
||
| Knowing this we can create the correct XPath query and use the lxml | ||
| ``xpath`` function like this: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| # This will create a list of buyers: | ||
| buyers = tree.xpath('//div[@title="buyer-name"]/text()') | ||
| # This will create a list of prices: | ||
| prices = tree.xpath('//span[@class="item-price"]/text()') | ||
|
|
||
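To experiment with these two queries without touching the network, they can be run against a small inline document. This is a minimal sketch: the ``SAMPLE`` string is hand-written for illustration and only mirrors the two elements shown earlier; the layout of the real page may differ.

```python
from lxml import html

# Hand-written sample (an assumption, not a copy of the real page).
SAMPLE = """
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
  <div title="buyer-name">Earl E. Byrd</div>
  <span class="item-price">$8.37</span>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# '//' searches the whole document; '[@attr="value"]' filters on an
# attribute; 'text()' returns the text nodes instead of the elements.
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')

print(buyers)
print(prices)
```

Both queries return plain lists of strings in document order, which is why the buyer and price lists line up by position.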
| Let's see what we got exactly: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| print('Buyers:', buyers) | ||
| print('Prices:', prices) | ||
|
|
||
| :: | ||
|
|
||
| Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', | ||
| 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', | ||
| 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', | ||
| 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', | ||
| 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell'] | ||
|
|
||
| Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', | ||
| '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', | ||
| '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', | ||
| '$15.00', '$114.07', '$10.09'] | ||
|
|
||
| Congratulations! We have successfully scraped all the data we wanted from | ||
| a web page using lxml and Requests. We have it stored in memory as two | ||
| lists. Now we can do all sorts of cool stuff with it: we can analyze it | ||
| using Python or we can save it to a file and share it with the world. | ||
|
|
||
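As one possible way to analyze and save the two lists, the sketch below pairs each buyer with a price, converts the price strings to ``Decimal`` and writes the result as CSV. The helper names ``to_decimal`` and ``write_csv`` are hypothetical, only the first three entries of each list are used to keep it short, and it assumes the lists are the same length and in matching order.

```python
import csv
import io
from decimal import Decimal

# First three entries of the scraped lists, copied from the output above.
buyers = ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes']
prices = ['$29.95', '$8.37', '$15.26']


def to_decimal(price):
    """Strip the leading dollar sign and parse the amount exactly."""
    return Decimal(price.lstrip('$'))


# Pair the lists by position.
rows = [(buyer, to_decimal(price)) for buyer, price in zip(buyers, prices)]
total = sum(amount for _, amount in rows)


def write_csv(rows, handle):
    """Write a header line followed by one line per (buyer, price) pair."""
    writer = csv.writer(handle)
    writer.writerow(['buyer', 'price'])
    writer.writerows(rows)


# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_csv(rows, buffer)

print(total)
print(buffer.getvalue())
```

To write a real file instead, the same function could be called with something like ``open('buyers.csv', 'w', newline='')``, where the file name is only an example.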
| A cool idea to think about is modifying this script to iterate through | ||
| the rest of the pages of this example dataset or rewriting this | ||
| application to use threads for improved speed. | ||
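
A rough sketch of the threaded idea, using the standard library's ``concurrent.futures``. Everything named here is an assumption made for illustration: ``URL_TEMPLATE`` guesses that the remaining pages are numbered like ``001.html``, and ``page_urls`` and ``fetch_all`` are hypothetical helpers. The ``fetch`` argument is any function that takes a URL and returns a result, so the download and parsing logic stays separate from the threading.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: the other pages follow the same numbering as 001.html.
URL_TEMPLATE = 'http://econpy.pythonanywhere.com/ex/{:03d}.html'


def page_urls(first, last):
    """Build the URLs for pages `first` through `last`, inclusive."""
    return [URL_TEMPLATE.format(number) for number in range(first, last + 1)]


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL on a thread pool.

    Results come back in the same order as `urls`, because
    Executor.map preserves input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

For example, ``fetch_all(page_urls(1, 5), fetch)`` with a ``fetch`` function that downloads and parses one page would return five results, assuming those five pages exist. Threads suit this job because most of the time is spent waiting on the network.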
Is this a typo?