Blaszard February 2016

How to parse Web pages that don't return results initially in Python?

I want to load the list of images in this page in Python. However, when I opened the page in my browser (Chrome or Safari) and opened the dev tools, the inspector returned the list of images as <img class="grid-item--image">....

However, when I tried to parse it in Python, the result was different. Specifically, I got the images as <img class="carousel--image"...>, whereas soup.findAll("img", "grid-item--image") returned an empty list. Also, when I tried saving the images using their srcset attribute, most of the saved images were NOT the ones listed on the web page.

I think the web page uses some sort of technique when rendering. How can I parse such web pages successfully?

I used BeautifulSoup 4 on Python 3.5. I loaded the page as follows:

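import requests
from bs4 import BeautifulSoup

def load_page(url):  # function name assumed; only the body was posted
    # Pass raw bytes: from_encoding is ignored (with a warning) when the markup is already a str
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")
    return soup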


Martin Evans February 2016

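The grid images are most likely added by JavaScript after the initial page load, which requests does not execute. You would do better to use something like Selenium for this, as follows: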

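from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get(url)  # url is the address of the page from the question
html_source = browser.page_source  # HTML after the browser has rendered the page
browser.quit()
soup = BeautifulSoup(html_source, "html.parser")

for item in soup.find_all("img", {"class": "grid-item--image"}):
    print(item)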

This would display the following kind of output:

This allows the full rendering of the page to take place inside the browser, and the resulting HTML can then be obtained.
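
Once the rendered HTML is available, the srcset attribute mentioned in the question still needs to be split into individual image URLs before anything can be downloaded. The following is only a minimal sketch of that step, using the standard library alone. The function name parse_srcset is hypothetical, and it assumes a plain srcset value made of URL/descriptor pairs separated by commas; it does not cover every corner of the HTML specification.

```python
def parse_srcset(srcset):
    """Split a srcset value into (url, descriptor) pairs.

    Hypothetical helper, not part of BeautifulSoup or Selenium.
    A URL runs up to the next whitespace, so commas inside a URL survive.
    """
    candidates = []
    pos, n = 0, len(srcset)
    while True:
        # Skip whitespace and commas that separate candidates.
        while pos < n and (srcset[pos].isspace() or srcset[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # Collect the URL: everything up to the next whitespace.
        start = pos
        while pos < n and not srcset[pos].isspace():
            pos += 1
        url = srcset[start:pos]
        descriptor = ""
        if url.endswith(","):
            # A trailing comma ends the candidate; it has no descriptor.
            url = url.rstrip(",")
        else:
            # Collect the descriptor (e.g. "2x" or "600w") up to the next comma.
            start = pos
            while pos < n and srcset[pos] != ",":
                pos += 1
            descriptor = srcset[start:pos].strip()
        candidates.append((url, descriptor))
    return candidates


if __name__ == "__main__":
    for url, descriptor in parse_srcset("small.jpg 300w, large.jpg 600w"):
        print(url, descriptor)
```

Inside the loop over the img tags, this could be called as parse_srcset(item.get("srcset", "")), which yields each candidate URL together with its width or density descriptor, so that one specific size can be chosen for download.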

