Gideon February 2016

Web scraping in Java/Scala

I need to extract the keywords, title and description from a long list of URLs (initially ~250,000 URLs per day and eventually ~15,000,000 URLs per day).

How would you recommend approaching this? Ideally a solution that could be extended to 15,000,000 URLs per day, preferably in Scala or Java.
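
To make the task concrete: the per-page extraction itself is simple enough, and my real question is about doing it at this volume. A rough sketch of the per-page part (using jsoup here purely for illustration, it isn't one of the options I'm considering):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class PageExtractor {

        // Fetches one URL and pulls out the three fields I care about.
        public static String[] extract(String url) throws Exception {
            Document doc = Jsoup.connect(url)
                                .userAgent("my-crawler")   // placeholder user agent
                                .timeout(10_000)
                                .get();
            String title       = doc.title();
            String description = doc.select("meta[name=description]").attr("content");
            String keywords    = doc.select("meta[name=keywords]").attr("content");
            return new String[] { title, description, keywords };
        }

        public static void main(String[] args) throws Exception {
            for (String field : extract("http://example.com")) {
                System.out.println(field);
            }
        }
    }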

So far I've looked at:

  • Spray - I'm not very familiar with Spray yet so I can't quite evaluate it. Is it a useful framework for my task?
  • Vertx - I've worked with Vertx before. If it's a good fit, could you explain what the best way to implement this with Vertx would be?
  • Scala scraper - Not familiar with it at all. Is it a good framework for this use case and the load I need?
  • Nutch - I'm not sure how well it would work when used from inside my own code. I'm also not sure I need Solr for my use case. Has anyone had experience with it?

I'll be happy to hear of other options if you think they're better.

I know I could dig into each of these solutions and decide whether it's a good fit, but there seem to be so many options that any direction would be appreciated.

Thanks in advance

Answers


ameertawfik February 2016

I have used Apache Nutch to crawl around 600k URLs in 3 hours, running Nutch on a Hadoop cluster. A friend of mine, however, used his own in-house crawler to crawl 1 million records in 1 hour.

Ideally, at this scale you will need a distributed solution, so I recommend using Nutch, as it suits your need. You could also try StormCrawler; it is similar to Nutch but uses Storm as its distribution engine. You might think building a new crawler is a better option, but I do not think so. Nutch is a mature and scalable solution.

Do you need Solr? It all depends on what you want to do with the end result. If you want to search it, then you will need Solr or Elasticsearch. If you want to crawl and push the data into a database, then you need to build a new indexer that pushes crawled data to your desired data sink, which is possible in Nutch (see the sketch below).
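
For illustration, the write path of such a custom data sink boils down to something like this. This is a sketch using plain JDBC with a made-up table schema; the actual Nutch plugin wiring (implementing Nutch's IndexWriter extension point and registering the plugin) is omitted here:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Sketch of the "push crawled fields to a database" part of a custom indexer.
    // Table and column names are made up for the example.
    public class DbSink implements AutoCloseable {

        private final Connection conn;
        private final PreparedStatement stmt;

        public DbSink(String jdbcUrl, String user, String pass) throws Exception {
            conn = DriverManager.getConnection(jdbcUrl, user, pass);
            conn.setAutoCommit(false);
            stmt = conn.prepareStatement(
                "INSERT INTO pages (url, title, description, keywords) VALUES (?, ?, ?, ?)");
        }

        // Called once per successfully parsed document.
        public void write(String url, String title, String description, String keywords)
                throws Exception {
            stmt.setString(1, url);
            stmt.setString(2, title);
            stmt.setString(3, description);
            stmt.setString(4, keywords);
            stmt.addBatch();
        }

        // Flush the batch, e.g. from the indexer's commit hook.
        public void commit() throws Exception {
            stmt.executeBatch();
            conn.commit();
        }

        @Override
        public void close() throws Exception {
            stmt.close();
            conn.close();
        }
    }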

I hope that helps.


Julien Nioche February 2016

As mentioned by @ameertawfik, you could write a custom indexer class in Nutch so that it sends whichever data you need to keep into a DB. Whether to use SOLR or ES as the indexing backend depends on how you need to use the data.

You could do the NLP processing as part of the Nutch parsing or indexing steps by implementing custom HTMLParseFilters or IndexingFilters. Either way would work fine, and since it would be running on Hadoop, the NLP part would also scale. A sketch of the indexing-filter route is below.
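
As an illustration of the IndexingFilter route, here is a minimal sketch against the Nutch 1.x IndexingFilter interface. The exact signature and the available parse metadata vary a little between Nutch versions, and extractKeywords() is a placeholder, so treat this as a shape rather than copy-paste code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    // Adds NLP-derived fields to each document before it reaches the indexer.
    public class NlpIndexingFilter implements IndexingFilter {

        private Configuration conf;

        @Override
        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                    CrawlDatum datum, Inlinks inlinks) {
            String text  = parse.getText();             // plain text of the page
            String title = parse.getData().getTitle();  // parsed <title>

            // extractKeywords() stands in for whatever NLP you plug in here.
            doc.add("nlp_keywords", extractKeywords(title, text));
            return doc;
        }

        private String extractKeywords(String title, String text) {
            return title; // stub: replace with a real keyword extractor
        }

        @Override
        public void setConf(Configuration conf) { this.conf = conf; }

        @Override
        public Configuration getConf() { return conf; }
    }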

Nutch would work fine but is probably overkill for what you need to do. If you know that the URLs you want to fetch are new, then I'd just go for StormCrawler instead. Your crawls are non-recursive, so it could be a matter of having N queues with RabbitMQ (or any other distributed queue) and injecting your URLs daily into the queues based, e.g., on their hostname. The SC topology would then require a custom Spout to connect to RabbitMQ (there are examples on the web you can use as starting points); most of the other components would be the standard ones, except a couple of bolts for doing the NLP work plus another one to send the results to the storage of your choice. A bare-bones sketch of such a bolt follows.
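
To give an idea of what one of those custom bolts looks like, here is a bare-bones sketch using the plain Storm bolt API plus jsoup for the extraction. The tuple field names ("url", "content") follow StormCrawler's usual conventions but should be checked against the SC version you use, and older Storm releases expose the same classes under the backtype.storm package prefix:

    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Takes a fetched page and emits the extracted title / description / keywords.
    // Field names assume StormCrawler-style tuples; adjust to your topology.
    public class ExtractionBolt extends BaseRichBolt {

        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            String url     = input.getStringByField("url");
            byte[] content = input.getBinaryByField("content");

            Document doc = Jsoup.parse(new String(content, StandardCharsets.UTF_8), url);
            String title       = doc.title();
            String description = doc.select("meta[name=description]").attr("content");
            String keywords    = doc.select("meta[name=keywords]").attr("content");

            // A downstream bolt (e.g. the storage bolt) picks these up.
            collector.emit(input, new Values(url, title, description, keywords));
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url", "title", "description", "keywords"));
        }
    }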

Either Nutch or SC would be a good way of doing it. SC would certainly be faster and probably easier to understand. I wrote a tutorial a few months ago which describes and compares Nutch and SC; it might be useful for understanding how they differ.

Post Status

Asked in February 2016
Viewed 2,826 times
Voted 4
Answered 2 times
