Scrapy

by scrapy

framework for crawling web sites and extracting structured data

Excellent 10.0

(1 Ratings)

Helps with: Scraping,Web Frameworks

Similar to:

More...

Source Type: Open

License Types:

Proprietary

Supported OS:

Languages:

Save Unsave Compare

What is it all about?

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Key Features

* Fast and powerful - write the rules to extract the data and let Scrapy do the rest. * Easily extensible - extensible by design, plug new functionality easily without having to touch the core. * Portable, Python - written in Python and runs on Linux, Windows, Mac and BSD. * Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions. * An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders. * Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem) * Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations. * Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines). * Wide range of built-in extensions and middlewares for handling: cookies and session handling HTTP features like compression, authentication, caching, user-agent spoofing, robots.txt, crawl depth restriction * A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching * DNS resolver, and much more!