bking007 February 2016

How do we know when Heritrix completes a crawl job?

In our application, Heritrix is the crawl engine. Once a crawl job finishes, we manually call an endpoint that downloads the PDFs from the website. We would like to automate this PDF download so that it starts as soon as the crawl job is complete. Does Heritrix provide a URI or web service method that returns the status of a job, or do we need to write a polling app that continuously monitors the job status?

Answers


zuups February 2016

I don't know of a way to do this without continuous monitoring, but you can use the Heritrix API to get the status of a job, something like:

curl -v -d "action=" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

This returns XML from which you can read the job status.
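If you do end up polling, a minimal Python sketch of the idea could look like the one below. Treat it as an illustration under assumptions rather than a tested client: it assumes the XML contains a `crawlControllerState` element whose text becomes `FINISHED` when the crawl is done (check this against the XML your own instance returns), that the API accepts HTTP digest auth, and that a plain GET with the `Accept: application/xml` header returns the same XML as the curl command above. The function names are made up for this example.

```python
import ssl
import time
import urllib.request
import xml.etree.ElementTree as ET


def fetch_job_xml(url, user, password):
    """GET the job page as XML (assumed equivalent to the curl call above)."""
    # Same effect as curl's -k: skip certificate checks for a self-signed cert.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    # ASSUMPTION: the server answers with an HTTP digest auth challenge.
    pw = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    pw.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=ctx),
        urllib.request.HTTPDigestAuthHandler(pw),
    )
    req = urllib.request.Request(url, headers={"Accept": "application/xml"})
    with opener.open(req, timeout=30) as resp:
        return resp.read().decode("utf-8")


def job_state(xml_text):
    """Return the text of <crawlControllerState>, or None if absent or empty."""
    # ASSUMPTION: the element name; verify it against your own XML output.
    node = ET.fromstring(xml_text).find(".//crawlControllerState")
    return node.text.strip() if node is not None and node.text else None


def wait_until_finished(get_xml, interval=60, sleep=time.sleep):
    """Call get_xml() repeatedly until the reported state is FINISHED."""
    while job_state(get_xml()) != "FINISHED":
        sleep(interval)
```

You would call it with something like `wait_until_finished(lambda: fetch_job_xml("https://localhost:8443/engine/job/myjob", "admin", "admin"))` and kick off the PDF download once it returns. The fetch function is passed in as a callable so the polling logic can be tried out without a running Heritrix instance.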

Another, maybe easier (though less 'professional') option is to check whether your job's warcs directory contains a file with the .open extension. If it does not, the job is finished.
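The file check could be sketched in Python roughly like this. The function name is made up, and the path of the warcs directory depends on your job setup, so it is passed in as a parameter. One addition beyond the idea above, which is my own assumption: the sketch also requires that the directory exists and holds at least one file, so that a job which has not written anything yet is not mistaken for a finished one.

```python
from pathlib import Path


def crawl_looks_finished(warcs_dir):
    """True if warcs_dir has at least one file and none of them ends in .open."""
    d = Path(warcs_dir)
    if not d.is_dir():
        # Directory not created yet: the crawl has probably not started.
        return False
    files = [p for p in d.iterdir() if p.is_file()]
    # Extra guard (an assumption, not part of the original suggestion):
    # an empty directory is treated as "not finished".
    return bool(files) and not any(p.name.endswith(".open") for p in files)
```

You could run this check from a scheduled task or a small loop and start the PDF download the first time it returns `True`.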

Post Status

Asked in February 2016
Viewed 2,341 times
Voted 6
Answered 1 times
