BeautifulSoup4 vs Scrapy

Beautiful Soup's documentation lists three features that make it powerful:

1. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.
3. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

Scrapy's documentation, in turn, highlights:

* Fast and powerful - write the rules to extract the data and let Scrapy do the rest.
* Easily extensible - extensible by design, plug new functionality easily without having to touch the core.
* Portable, Python - written in Python and runs on Linux, Windows, Mac and BSD.
* Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
* An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.
* Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem).
* Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
* Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
* Wide range of built-in extensions and middlewares for handling cookies and sessions, and HTTP features like compression, authentication, caching, user-agent spoofing, robots.txt, and crawl depth restriction.
* A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler.

Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
In this article I will compare two solutions for website scraping with Python.
As mentioned previously, BeautifulSoup is a content extractor: it only parses a document, so it needs something else to download the website's source first. Scrapy, in contrast, is a full website scraping framework: it crawls and downloads the pages itself before extracting, and you don't have to write much code to achieve this.
You could say: "Then Scrapy is the tool I always want to use". Well, this works most of the time but sometimes I prefer BeautifulSoup over Scrapy, and I’ll explain why.
A year ago...
One year ago, there was one case where I used BeautifulSoup without hesitating: Python 3. Scrapy did not support this interpreter at the time (before version 1.1), so if I wanted to do scraping on Python 3, there was no option but to use BeautifulSoup together with a content downloader (requests) to achieve my goal.
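That combination is straightforward. Here is a minimal sketch of the requests + BeautifulSoup pairing; the function names are illustrative, not from any particular project:

```python
import requests
from bs4 import BeautifulSoup

def extract_title(html):
    """Parse the HTML with BeautifulSoup and return the page title."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

def fetch_title(url):
    """Download the page with requests, then hand it to the extractor."""
    response = requests.get(url)
    response.raise_for_status()
    return extract_title(response.text)
```

Every page you want requires an explicit request like this: there is no crawling machinery, which is exactly the trade-off against Scrapy.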
Today it is different, because Scrapy is available for Python 3 as well. Hands down, it is usually easier to use the full framework than the content extractor, but sometimes the simpler solution is better: you write the code yourself and you can clearly steer the workflow. With Scrapy a lot of things are hidden from your eyes, and you need to read through the documentation and follow different forums to find the right solution for your problem.
One example is LinkedIn. You may want to scrape some information from this website with Scrapy (note that LinkedIn's policy prohibits scraping; I mention it solely for the sake of the example). Scrapy handles cookies for you out of the box, but LinkedIn is a tricky site: some of its cookies should not be added to your request even though they are included in the response. They contain the text "delete me" (or something similar), which tells the server validating the request's cookies that something is not OK with the caller and it is not a regular browser.
With BeautifulSoup and requests you can customize this behavior. Naturally this involves more coding, but you have everything at your fingertips to control exactly which cookies you send. requests can send cookies automatically for you, of course, but you can customize them too.
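A minimal sketch of such cookie filtering, assuming a "delete me" marker like the one described above (the function and cookie names are illustrative assumptions):

```python
def filter_cookies(cookies, marker="delete me"):
    """Keep only cookies whose value does not contain the trap marker.

    The "delete me" marker comes from the LinkedIn example above;
    everything else here is an illustrative assumption.
    """
    return {name: value for name, value in cookies.items()
            if marker not in value}

# Cookies as they might arrive in a server response
received = {"session_id": "abc123", "trap": "please delete me"}
safe = filter_cookies(received)
# requests.get(url, cookies=safe) would then send only session_id
```

This is the kind of fine-grained control that is easy with a hand-rolled requests client and takes more digging to achieve inside Scrapy's cookie middleware.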
That said, if you want data from LinkedIn, use their API: it is the way they suggest, and you will not get banned.
Most websites do not use "tricky" cookies like the ones we have just seen, but they do count incoming requests per IP address in a given time frame. If the count is too high, the IP address is blocked for some period. And because Scrapy sends multiple requests at a time, you can easily run into this problem.
If this happens, try setting CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP in the settings.py file to a smaller number: I suggest 1 or 2 instead of the default of 16. Another option is to increase the time between requests: change the DOWNLOAD_DELAY parameter, again in settings.py, to a bigger value like 3-5 seconds.
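In settings.py this could look like the following; the values are the illustrative ones from above, and Scrapy's AutoThrottle extension is an alternative worth knowing about:

```python
# settings.py -- throttle the crawl so the target site is not overloaded
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # default is 16
DOWNLOAD_DELAY = 3                   # seconds between requests to the same site

# Alternatively, let Scrapy adapt the delay to the server's responses:
# AUTOTHROTTLE_ENABLED = True
```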
Naturally this makes your scraping slower, but you avoid overloading the target website, it can keep serving other users, and you do not get blocked, which in the end makes your scraping faster overall.
With BeautifulSoup there is no such problem by default, because such applications run in one thread and block until the server's response arrives and the parsing is done. Still, this can sometimes cause trouble: with a fast internet connection and computer you can gather and process information quickly enough to get blocked anyway. To avoid this you have to take care of throttling manually, for example by adding a sleep into your code between getting the website contents and extracting the information:
from time import sleep

soup = BeautifulSoup(urlopen(url), 'html.parser')
sleep(3)  # wait 3 seconds before extracting
for article in soup.find_all('div', class_='ckt-article'):
    print(article.get_text(strip=True))
In the example above we wait 3 seconds before we extract the contents. I know this can be slow but sometimes it is faster than a "working for 5 minutes -- getting banned for 24 hours -- working for 5 minutes -- getting banned ..." almost endless loop.
Just the one or the other?
You may ask yourself: "Which one shall I use? I like how BeautifulSoup handles parsing, but I love how little code Scrapy needs to get the job done." In this case my answer is: use both. Scrapy is a website scraper, and as such it uses content extractors; and because BeautifulSoup is a content extractor, you can include it in your project and do the extraction with it instead of Scrapy's built-in selectors:
from bs4 import BeautifulSoup

def parse(self, response):
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title
As you can see you can mix both approaches together so you do not have to learn a new syntax if you just want to switch over to Scrapy.
We have seen that even though Scrapy is more powerful, sometimes you want to write your own code and handle everything that can go wrong yourself: sometimes a custom, hand-crafted solution with BeautifulSoup beats an already existing power-tool. And you do not have to sacrifice BeautifulSoup if you switch to Scrapy: you can use both together.
If you want to start with website scraping, I suggest you get started with Scrapy, because in 90% of cases it gives you everything you need and you can customize your solution easily. If it does not cover everything, there are ways to extend its existing features to fulfill your needs, and if that does not help either, switch to a hand-crafted solution.