GBR24 February 2016

Distribute web-scraping write-to-file to parallel processes in Python?

I'm scraping some JSON data from a website, and need to do this ~50,000 times (all data is for distinct zip codes over a 3-year period). I timed the program over about 1,000 calls, and the average time per call was 0.25 seconds, leaving me with about 3.5 hours of runtime for the whole range (all 50,000 calls).

How can I distribute this process across all of my cores? The core of my code is pretty much this:

with open("U:/dailyweather.txt", "w") as f:
    f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
    writeData(f, zips, zip_weather_links, daypart)

Where writeData() looks like this (note that f has to be passed in, since it isn't in scope inside the function otherwise):

def writeData(f, zipcodes, links, dayparttime):
    for z in zipcodes:
        for pair in links[z]:
            ## do some logic ##
            f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (var1, var2, var3, var4, var5,
                                                              var6, var7, var8, var9))

zips looks like this:

zips = ['55111', '56789', '68111', ...]

and zip_weather_links is just a dictionary mapping each zip code to a list of (URL, date) tuples:

zip_weather_links['55111'] = [('https://website.com/55111/data', datetime.datetime(2013, 1, 1, 0, 0, 0)), ...]

How can I distribute this using Pool or multiprocessing? Or would distribution even save time?

Answers


René Kübler February 2016

You want to "distribute web-scraping write-to-file to parallel processes in Python". To start, let's look at where web scraping spends most of its time.

The latency of an HTTP request is much higher than that of a hard-disk write (link: latency comparison). Small writes to a hard disk are significantly slower than bigger ones; SSDs have much higher random-write speeds, so this effect hits them far less. So:

  1. Distribute the HTTP-Requests
  2. Collect all the results
  3. Write all the results at once to disk

Some example code with IPython parallel:

from ipyparallel import Client

rc = Client()
lview = rc.load_balanced_view()
worklist = ['http://xkcd.com/614/info.0.json',
            'http://xkcd.com/613/info.0.json']

@lview.parallel()
def get_webdata(w):
    # import inside the function so it is available on every engine
    import requests
    r = requests.get(w)
    if r.status_code != 200:
        return (w, r.status_code)
    return (w, r.json())

# get_webdata will be called once with every element of the worklist
proc = get_webdata.map(worklist)
results = proc.get()
# results is a list with all the return values
print(results[1])
# TODO: write the results to disk
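The TODO above, step 3 of the plan, amounts to buffering everything in memory and writing the file in one pass instead of 50,000 small writes. A minimal sketch, using dummy result rows (the var1..var9 fields are placeholders standing in for the real scraped values, and the temp-file path is only for illustration):

```python
import os
import tempfile

# hypothetical collected results: one tuple of nine fields per call,
# matching the var1..var9 columns from the question
results = [
    ("55111", "2013-01-01", "a3", "a4", "a5", "a6", "a7", "a8", "a9"),
    ("56789", "2013-01-02", "b3", "b4", "b5", "b6", "b7", "b8", "b9"),
]

header = "var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n"
path = os.path.join(tempfile.gettempdir(), "dailyweather.txt")

# build the whole file contents in memory, then do one big
# sequential write instead of thousands of tiny ones
lines = [header] + ["\t".join(row) + "\n" for row in results]
with open(path, "w") as f:
    f.write("".join(lines))
```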

You have to start the IPython parallel workers first:

(py35)River:~ rene$ ipcluster start -n 20     
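Since the question asks specifically about Pool: for network-bound requests the same fetch-then-collect pattern also works with the standard library's multiprocessing.dummy.Pool, a thread-backed pool with the same API as multiprocessing.Pool, and no cluster to start. A sketch, with the HTTP call replaced by a placeholder so it runs without a network (swap the body of get_webdata for the real requests.get logic):

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool

# placeholder for the real fetch; threads suit this job because the
# work is network-bound, not CPU-bound, so the GIL is not a bottleneck
def get_webdata(url):
    return (url, "fetched")

worklist = ['http://xkcd.com/614/info.0.json',
            'http://xkcd.com/613/info.0.json']

with Pool(20) as pool:  # 20 concurrent workers, like ipcluster -n 20
    results = pool.map(get_webdata, worklist)

# results preserves worklist order: one (url, payload) tuple per URL
```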

Post Status

Asked in February 2016
Viewed 2,842 times
Voted 10
Answered 1 time
