Apache Nutch Scraping App

Apache Nutch

by Apache

A mature, production-ready web crawler.
Helps with: Scraping
Similar to: Aviva Host Integration SDK App, AlchemyAPI App, Connotate App, Screen Scraping App
Source Type: Open
License Types:
Supported OS:
Languages: Other

What is it all about?

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Key Features

* Fetching and parsing are done separately by default, which reduces the risk that an error corrupts the fetch/parse stage of a crawl.
* Plugins have been overhauled as a direct result of removing the legacy Lucene dependency for indexing and search.
* The set of plugins shipped with Nutch for processing various document types has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are now all parsed by the Tika plugin. The only parser plugins now shipped with Nutch are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika, and ZIP.
* Distributed filesystem (via Hadoop)
* Link-graph database
* NTLM authentication
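
To illustrate why separating fetching from parsing limits the damage an error can do, here is a minimal Python sketch of a two-stage crawl. It is not Nutch's API: the names `Record`, `fetch_stage`, `parse_stage`, and `FAKE_WEB` are hypothetical, the "web" is an in-memory dictionary, and the parser only extracts links. The point is that a parse failure is recorded per URL while the fetched bytes stay untouched, so parsing can be retried without fetching again.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Record:
    """One crawled URL: raw bytes from the fetch stage, links from the parse stage."""
    url: str
    raw: bytes
    links: Optional[List[str]] = None      # filled in by parse_stage
    parse_error: Optional[str] = None      # set if parsing this record failed


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_stage(urls: Iterable[str], fetcher: Callable[[str], bytes]) -> Dict[str, Record]:
    """Stage 1: store raw content only; no parsing happens here."""
    return {url: Record(url=url, raw=fetcher(url)) for url in urls}


def parse_stage(segment: Dict[str, Record]) -> None:
    """Stage 2: parse stored content; failures are noted and raw bytes are kept."""
    for record in segment.values():
        try:
            extractor = LinkExtractor()
            extractor.feed(record.raw.decode("utf-8"))
            record.links = extractor.links
            record.parse_error = None
        except Exception as exc:
            record.parse_error = f"{type(exc).__name__}: {exc}"


# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.org/a": b'<a href="http://example.org/b">b</a>',
    "http://example.org/b": b"\xff\xfe not valid utf-8",  # will fail to decode
}

segment = fetch_stage(FAKE_WEB.keys(), FAKE_WEB.__getitem__)
parse_stage(segment)

for rec in segment.values():
    print(rec.url, rec.links, rec.parse_error)
```

In this sketch the second page fails to decode, yet its raw bytes remain in `segment`, so only the parse step would need to be rerun once the parser is fixed.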




