By Gabor Laszlo Hajba | 11/21/2016 | General | Beginners

Website scraping with BeautifulSoup4


In this article I will give you an introduction to how you can scrape websites with BeautifulSoup4.

BeautifulSoup is a library for pulling data out of HTML and XML documents, which makes it ideal for website scraping. I will therefore split the article into two parts: a quick introduction to BeautifulSoup, and then a demonstration of how to extract data with the parser.

BeautifulSoup

As I mentioned in the introduction, BeautifulSoup is a content extractor. Its nicest feature is that it does not require a well-formed HTML document -- and let's be honest, many websites are not well-formed; dynamically generated ones in particular can have flaws.

To get the examples working, you will need Python 3 and BeautifulSoup 4 installed. You can install the library with pip using the following command:

pip install beautifulsoup4

Let's work with the following website snippet:

>>> html = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<meta name="Description" content="Beautiful Soup: a library designed for screen-scraping HTML and XML.">
<meta name="generator" content="Markov Approximation 1.4 (module: leonardr)">
<meta name="author" content="Leonard Richardson">
</head>
<body bgcolor="white" text="black" link="blue" vlink="660066" alink="red">
<img align="right" src="10.1.jpg" width="250"><br />

<p>You didn't write that awful page. You're just trying to get some
data out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround
screen scraping projects.</p>
</body>
</html>
"""
>>>
>>> from bs4 import BeautifulSoup
>>> site = BeautifulSoup(html, 'html.parser')
>>> site.title
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
>>> site.title.text
'Beautiful Soup: We called him Tortoise because he taught us.'
>>>


As you can see, it is easy to access the tags of the site. Let's extract meta information from the snippet:

>>> site.meta
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>

Well, this is not quite what we expected: this kind of access returns only the first meta tag. The dot notation runs the find method behind the scenes, which returns the first tag matching the given name.
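In fact, you can call find yourself and get the same result:

>>> site.find('meta')
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>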

But there is a way to get all the meta information:

>>> site.find_all('meta')
[<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, <meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>, <meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>, <meta content="Leonard Richardson" name="author"/>]

The find_all method, as its name suggests, finds all the tags with the given name and returns them in a list. To extract content from these tags we can access their attributes like keys in a dictionary:

>>> for meta in site.find_all('meta'):
...     meta['content']
...
'text/html; charset=utf-8'
'Beautiful Soup: a library designed for screen-scraping HTML and XML.'
'Markov Approximation 1.4 (module: leonardr)'
'Leonard Richardson'

What about the name attribute of these meta tags?

>>> for meta in site.find_all('meta'):
...     meta['name']
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/element.py", line 954, in __getitem__
    return self.attrs[key]
KeyError: 'name'

Wow, that was unexpected! The reason is that the first meta tag has no name attribute. Fortunately, this is simple to fix:

>>> for meta in site.find_all('meta'):
...     if meta.has_attr('name'):
...         meta['name']
...
'Description'
'generator'
'author'

With the has_attr method on Tag objects you can check whether a tag has a given attribute. This comes in handy while scraping websites, because some results inevitably differ from the rest and lack the information you want to extract -- and as you have just seen, that can raise errors which terminate your application unexpectedly.
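An alternative worth knowing: Tag objects also support the dictionary-style get method, which returns None (or a default you provide) instead of raising a KeyError:

>>> for meta in site.find_all('meta'):
...     meta.get('name', 'no name attribute')
...
'no name attribute'
'Description'
'generator'
'author'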

Gathering the whole contents

As I mentioned, BeautifulSoup is a content extractor, so we need some tool to gather the content from the web. The easiest way is urllib.request.urlopen from the standard library:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(urlopen("https://www.crummy.com/software/BeautifulSoup/"), 'html.parser')
>>> soup.title.text
'Beautiful Soup: We called him Tortoise because he taught us.'

As you can see, it is fairly simple. But if the website you target is more complex -- requiring cookies or dynamic POST calls, for example -- the bare standard library can become troublesome.

For this I suggest you take a look at the requests library and use it. It takes the cumbersome method calls off your hands -- and under the hood it builds on the standard library's HTTP machinery, so there is no big magic, just easier code. You can find a brief introduction to requests in this article.
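As a taste, here is a minimal sketch of the same title extraction with requests (assuming you have installed it with pip install requests):

import requests
from bs4 import BeautifulSoup

# requests handles the connection and decoding for us
response = requests.get('https://www.crummy.com/software/BeautifulSoup/')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)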

An example

In this example we will navigate through the blog and print each article's title and URL.

from bs4 import BeautifulSoup
from urllib.request import urlopen

def scrape(url):
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    # each article lives in a div with the class 'ckt-article'
    for article in soup.find_all('div', class_='ckt-article'):
        h2 = article.find('h2')
        a = h2.find('a', href=True)
        print('{}: {}'.format(a.text, 'http://discoversdk.com' + a['href']))

    # follow the pagination link until the 'next' button is disabled
    next_page = soup.find('a', class_='ckt-next-btn')
    if next_page and 'disabled' not in next_page['class']:
        scrape('http://discoversdk.com' + next_page['href'])

if __name__ == '__main__':
    scrape('http://www.discoversdk.com/blog')

This example is really simple: the main functionality lives in the scrape function, which takes the URL to scrape as its parameter.

First we download the contents of the URL and create a "soup" out of them. The 'html.parser' argument to the BeautifulSoup constructor tells it to use Python's built-in HTML parser.

After the contents are loaded we select all the div tags which have the ckt-article class and iterate over the results. In the for loop we extract the title and the URL of each article. This information is stored in an anchor (a tag) which resides inside an h2 tag. Once we have found the required information, we print it to the console. An alternative would be to gather the results and write them to a CSV file for future analysis, as sketched below.
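A minimal sketch of that CSV alternative, using the standard library's csv module (the collect-then-write structure and the write_articles helper are my own additions for illustration):

import csv

def write_articles(articles, filename='articles.csv'):
    # articles: a list of (title, url) tuples collected during scraping
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'url'])  # header row
        writer.writerows(articles)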

The last part of the function keeps the parsing of the blog going: it looks for the anchor with the ckt-next-btn class, and if it is not disabled (this information is stored in the tag's class attribute too) we call the scrape function recursively with the new URL -- which points to the next page of articles.

Handling limitations

Some websites impose limits to keep response times proper for their users. This means some providers use counters and filters to detect too many requests from one client in a given interval, and if they do, your IP address is blocked for a period of time (which can mean forever). So be careful. My usual practice is to let my Python code sleep for a short time between requests (1-2 seconds). With this approach I have avoided many bans while scraping.
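A minimal sketch of this throttling with the standard time module (the polite_scrape helper and its URL list are hypothetical):

import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

def polite_scrape(urls):
    for url in urls:
        soup = BeautifulSoup(urlopen(url), 'html.parser')
        print(soup.title.text)
        time.sleep(2)  # pause between requests so we do not flood the server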

Speed limits

Because we download content from the web, scraping can be slow depending on your network and the server you get the data from. This means your scraper can be very slow purely because of the target -- without needing any of the sleeping logic mentioned in the previous section.

Naturally this is not always optimal. If you want to scrape multiple websites, or multiple pages of the same website, it is good practice to write parallel code that splits up the tasks. For downloading, threads are fine: while one thread waits for network I/O, another can start its download. For parsing, multiple processes are better, because parsing is CPU intensive, and Python threads would run one after another rather than in parallel.
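For the download side, here is a minimal sketch with the standard library's concurrent.futures (the fetch_title helper and the URL list are placeholders):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_title(url):
    # network-bound work: while one thread waits on I/O, others can download
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    return soup.title.text

urls = ['https://www.crummy.com/software/BeautifulSoup/']  # placeholder list
with ThreadPoolExecutor(max_workers=4) as executor:
    for title in executor.map(fetch_title, urls):
        print(title)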

To read more about parallelism in Python, take a look at this article.

Conclusion

As you can see, BeautifulSoup itself is not capable of website scraping: you need to download the contents of the website yourself and feed the HTML into the parser to extract the content.

And always remember to honor the Terms of Service and Privacy Policy of the websites you want to scrape: sometimes they prohibit content scraping (mostly aimed at spiders and robots), and even a simple script is effectively a bot.

