Lo Bellin February 2016

Scraping with a multithreaded queue + urllib3 suffers a drastic slowdown

I am trying to scrape a huge number of URLs (approximately 3 million) that contain JSON-formatted data, in the shortest time possible. To achieve this, I have a Python 3 script that uses Queue, multithreading and urllib3. Everything works fine during the first 3 minutes, then the code begins to slow down, and finally it appears to be totally stuck. I have read everything I could find on this issue, but unfortunately the solution seems to require knowledge that lies far beyond mine.

I tried to limit the number of threads: it did not fix anything. I also tried to limit the maxsize of my queue and to change the socket timeout, but that did not help either. The remote server is not blocking or blacklisting me, as I am able to relaunch my script at any time with good results at the beginning (the slowdown starts at a fairly random time). Besides, sometimes my internet connection seems to be cut, as I cannot browse any website, but this specific issue does not appear every time.

Here is my code (go easy on me please, I'm a beginner):

#!/usr/bin/env python
import urllib3,json,csv
from queue import Queue
from threading import Thread

csvFile = open("X.csv", 'wt', newline="")
writer = csv.writer(csvFile, delimiter=";")
writer.writerow(('A', 'B', 'C', 'D'))

def do_stuff(q):
    # One connection pool per worker thread, reused for every request
    http = urllib3.connectionpool.connection_from_url('http://www.XXYX.com/', maxsize=30, timeout=20, block=True)

    while True:
        try:
            url = q.get()
            response = http.request('GET', url)
            doc = json.loads(response.data.decode('utf8'))
            writer.writerow((doc['A'], doc['B'], doc['C'], doc['D']))

        except Exception:
            # Log the URL that failed so it can be retried later
            print(url)

        finally:
            q.task_done()

q = Queue(maxsize=200)
num_threads = 15

for i in range(num_threads):
    worker = Thread(target=do_stuff, args=(q,))
    worker.daemon = True    # daemon threads do not prevent the program from exiting
    worker.start()

for x in range(1,3000000):
    if x < 10:
        url = "http:        

Answers


thodnev February 2016

As shazow said, it's not a matter of the number of threads, but of the pace at which each thread is requesting data from the server. Try including some delay in your code:

finally:
    sleep(50)       # requires: from time import sleep
    q.task_done()

This could also be improved by making the delay adaptive: for example, you could measure how much data you successfully receive, increase the sleep time if that number decreases, and vice versa.
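A minimal sketch of what such an adaptive delay could look like (the class name, multipliers and limits below are illustrative assumptions, not part of the original code):

from time import sleep

class AdaptiveDelay:
    # Grow the pause after a failure, shrink it after a success
    def __init__(self, initial=1.0, minimum=0.1, maximum=60.0):
        self.delay = initial
        self.minimum = minimum
        self.maximum = maximum

    def success(self):
        # The server is keeping up: slowly reduce the pause
        self.delay = max(self.minimum, self.delay * 0.9)

    def failure(self):
        # A request failed or returned less data: back off harder
        self.delay = min(self.maximum, self.delay * 2)

    def wait(self):
        sleep(self.delay)

Each worker would then call success() after a good response, failure() in the except block, and wait() in the finally block before q.task_done().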

Post Status

Asked in February 2016
Viewed 1,579 times
Voted 8
Answered 1 time
