I have used Apache Nutch to crawl around 600k URLs in 3 hours (roughly 55 URLs per second), running Nutch on a Hadoop cluster. However, a friend of mine used his own in-house crawler to crawl 1 million records in 1 hour (roughly 280 records per second).
Ideally, at this scale of records, you will need a distributed solution, so I recommend Nutch, as it suits your needs. You could also try StormCrawler. It is similar to Nutch but uses Apache Storm as its distribution engine. You might think building a new crawler is a better option, but I do not think so: Nutch is a mature and scalable solution.
Do you need Solr? It all depends on what you want to do with the end result. If you want to search it, then you will need Solr or Elasticsearch. If you want to crawl and push the data into a database, then you need to build a new indexer that pushes the crawled data to your desired data sink, which is possible in Nutch.
As mentioned by @ameertawfik, you could write a custom indexer class in Nutch so that it sends whichever data you need to keep to a DB. Whether you use Solr or ES as the indexing backend depends on how you need to use the data.
You could do the NLP processing as part of the Nutch parsing or indexing steps by implementing custom HtmlParseFilters or IndexingFilters. Either way would work fine, and since it would be running on Hadoop, the NLP part would also scale.
Nutch would work fine but is probably overkill for what you need to do. If you know that the URLs you want to fetch are new, then I'd just go for StormCrawler instead. Your crawls are non-recursive, so it could be a matter of having N queues in RabbitMQ (or any other distributed queue) and injecting your URLs into the queues daily, partitioned e.g. by hostname. The SC topology would then require a custom spout to connect to RabbitMQ (there are examples on the web you can use as starting points). Most of the other components would be the standard ones, except for a couple of bolts to do the NLP work and another one to send the results to the storage of your choice.
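To make the "N queues keyed by hostname" idea concrete, here is a minimal sketch in plain Java of how the daily URL list could be split into buckets before being published to the queues. It deliberately leaves out RabbitMQ and StormCrawler: the class name HostQueueRouter, the method names, and the hash-modulo scheme are all assumptions made for illustration, not part of either project's API.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of StormCrawler or RabbitMQ.
final class HostQueueRouter {

    /** Returns the lower-cased hostname of a URL, or null if none can be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
    }

    /** Maps a hostname to a queue index in [0, numQueues). */
    static int queueIndex(String host, int numQueues) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(host.hashCode(), numQueues);
    }

    /** Groups URLs into numQueues buckets; URLs without a parsable host are skipped. */
    static List<List<String>> partition(List<String> urls, int numQueues) {
        List<List<String>> queues = new ArrayList<>();
        for (int i = 0; i < numQueues; i++) {
            queues.add(new ArrayList<>());
        }
        for (String url : urls) {
            String host = hostOf(url);
            if (host != null) {
                queues.get(queueIndex(host, numQueues)).add(url);
            }
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://example.com/a",
            "http://example.com/b",
            "http://example.org/c");
        List<List<String>> queues = partition(urls, 4);
        for (int i = 0; i < queues.size(); i++) {
            // In a real setup each bucket would be published to its own queue.
            System.out.println("queue-" + i + ": " + queues.get(i));
        }
    }
}
```

Because every URL of a given host lands in the same bucket, whatever consumes that queue can enforce per-host politeness without coordinating with the other consumers. The trade-off is that one very large host can make its bucket much bigger than the rest.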
Either Nutch or SC would be a good way of doing it. SC would certainly be faster and probably easier to understand. I wrote a tutorial a few months ago which describes and compares Nutch and SC; it might be useful for understanding how they differ.
Asked in February 2016, viewed 2,826 times, voted 4, answered 2 times