Compare Products
Features
* Fetching and parsing are done separately by default, which reduces the risk of an error corrupting the fetch and parse stages of a Nutch crawl.
* Plugins have been overhauled as a direct result of the removal of the legacy Lucene dependency for indexing and search.
* The set of document-parsing plugins shipped with Nutch has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are now all parsed by the Tika plugin. The only parser plugins now shipped with Nutch are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika, and ZIP.
* Distributed filesystem (via Hadoop)
* Link-graph database
* NTLM authentication
|
Features
* Inject code and control the browser with JavaScript - Espion is a headless browser that enables you to inject JavaScript code directly into your target web pages.
* Schedule and monitor recurring jobs - No need to worry about CPUs, RAM, network access or data storage — your scraping jobs run in the cloud. Schedule them at any time in your dashboard or from the REST API. Monitor progress and debug running jobs in real-time with the console.
* Detect web site changes and measure data quality - Data quality is difficult to guarantee when you depend on third-party or hostile websites. Espion lets you define rules to measure how your data feeds look over time.
* Output ready-to-use data - Scraping doesn’t stop until your data is delivered in a structured, convenient format.
* Use the cloud for computing, storage and IP addresses - Your data resides in the cloud – shared or private – where resource availability stretches to meet your needs. No setup or provisioning is required so you can concentrate on building web scraping applications that get the job done.
* Extract text from images and solve CAPTCHAs - Built-in OCR means images can’t hide text from your code. Complex CAPTCHAs can be solved by plugging into an external CAPTCHA-solving API, whether automated or based on human agents.
* Manage your footprint to evade detection - Espion gives you full control over the browser. When the defaults don’t fit your needs, you can modify any HTTP header, choose the rate at which you hit pages, control cookies or decide how many times to retry a page when errors occur. You decide how much your scraper should behave like a human user to fool the best anti-bot technology.
|
Languages: Other |
Languages: JavaScript |
Source Type: Open |
Source Type: Closed |
License Type: Apache |
License Type: Proprietary |