Pyderman February 2016

Detecting forms (and filling them in) with Scrapy

I'm struggling to find a generic approach to detecting a form in HTML and then submitting it. When the page structure is known in advance for a given page, we of course have several options:

-- Selenium/Webdriver (by filling in the fields and 'clicking' the button)

-- Determining the form of the POST query manually, then reconstructing it with urllib2 directly:

import urllib
import urllib2

url = "http://apply.ovoenergycareers.co.uk/vacancies/#results"
# URL-encode the form fields; passing a data argument makes urlopen send a POST
params = urllib.urlencode([('field_36[]', 73), ('field_37[]', 76), ('field_32[]', 82)])
response = urllib2.urlopen(url, params)

or with Requests:

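import requests
# Note: a string passed as data is sent as the raw request body; pass a dict to send named form fields
r = requests.post("http://apply.ovoenergycareers.co.uk/vacancies/#results", data='Manager')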

But although most forms involve a POST request, some input fields and a submit button, they vary greatly in their implementation under the hood. When the number of pages to be scraped gets into the hundreds, it's not feasible to define a custom form-filling approach for each.

My understanding is that Scrapy's main added value is its ability to follow links. I presume that this would also include links ultimately arrived at via form submission. Can this ability then be used to build a generic approach to "following" a form submission?

CLARIFICATION: In the case of a form with several dropdown menus, I will typically be leaving these at their default value, and only filling in the search term input field. So locating this field and 'filling it in' is ultimately the main challenge here.
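To make the clarified requirement concrete, here is a rough sketch (Python 3, standard library only) of what "leave the dropdowns at their defaults and fill in the first search field" could look like. The names `FormFinder` and `build_submission` are made up for illustration, and treating the first `text`/`search` input as the search box is only a heuristic that will not hold for every site. An `<option>` without a `value` attribute is recorded as an empty string here, which is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Input types treated as free-text search boxes (a heuristic, not a standard).
TEXT_TYPES = {"text", "search"}
# Input types that are never pre-filled into the payload in this sketch.
SKIPPED_TYPES = {"submit", "button", "image", "reset", "file"}


class FormFinder(HTMLParser):
    """Collect every <form> with its action, method and default field values."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._form = None    # form currently being parsed, if any
        self._select = None  # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._form = {
                "action": a.get("action") or "",
                "method": (a.get("method") or "get").lower(),
                "fields": {},
                "text_inputs": [],
            }
            self.forms.append(self._form)
        elif self._form is None:
            return
        elif tag == "input" and a.get("name"):
            kind = (a.get("type") or "text").lower()
            if kind in SKIPPED_TYPES:
                return
            if kind in ("checkbox", "radio"):
                # Only boxes that are checked by default are submitted.
                if "checked" in a:
                    value = a.get("value")
                    self._form["fields"][a["name"]] = "on" if value is None else value
                return
            if kind in TEXT_TYPES:
                self._form["text_inputs"].append(a["name"])
            self._form["fields"][a["name"]] = a.get("value") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
        elif tag == "option" and self._select:
            fields = self._form["fields"]
            # The first option is the default unless one is marked selected.
            if "selected" in a or self._select not in fields:
                fields[self._select] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._form = self._select = None
        elif tag == "select":
            self._select = None


def build_submission(html, page_url, term):
    """Return (method, absolute_url, data) for the first form that has a
    text-like input, or None if no such form is found."""
    finder = FormFinder()
    finder.feed(html)
    for form in finder.forms:
        if form["text_inputs"]:
            data = dict(form["fields"])
            data[form["text_inputs"][0]] = term  # fill in the search term
            return form["method"], urljoin(page_url, form["action"]), data
    return None
```

The returned tuple could then be handed to whatever HTTP client is in use. Forms that are built or submitted by JavaScript would not be covered by this kind of static parsing.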


alecxe February 2016

Link Extractors cannot follow form submissions in Scrapy. There is another mechanism, FormRequest, that is specifically designed to ease submitting forms.

Note that FormRequest cannot handle forms whose submission depends on JavaScript.

raggingfox February 2016

You can look into Selenium with PhantomJS. It can handle JS and then you can use the CSS selectors from Selenium to pick specific elements on the webpage.

