Blaszard February 2016

How can I parse web pages in Python whose content is not in the initial response?

I want to load the list of images on this page in Python. When I open the page in a browser (Chrome or Safari) and inspect it with the dev tools, the images appear as <img class="grid-item--image">....

However, when I parse the page in Python, the result is different. I get the images as <img class="carousel--image"...>, and soup.findAll("img", "grid-item--image") returns an empty list. I also tried saving the images using the srcset attribute, but most of the saved images are not the ones shown on the web page.

I think the page uses some technique to build its content while rendering. How can I parse such pages successfully?

I am using BeautifulSoup 4 on Python 3.5 and load the page as follows:

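import requests
from bs4 import BeautifulSoup

def load_page(url):
    # Download the raw HTML; no JavaScript is executed at this point
    html = requests.get(url).text
    # html is already a decoded str, so from_encoding would be ignored
    soup = BeautifulSoup(html, "html.parser")
    return soup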

Answers


Martin Evans February 2016

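requests only downloads the initial HTML and does not run the page's JavaScript, which appears to be what adds the grid-item--image elements. You would do better to use something like Selenium for this, as follows: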

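from bs4 import BeautifulSoup
from selenium import webdriver

# Let a real browser load the page so that its JavaScript runs
browser = webdriver.Firefox()
browser.get("http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi#collection")
html_source = browser.page_source  # the HTML as rendered by the browser
browser.quit()  # close the browser once the HTML has been captured

soup = BeautifulSoup(html_source, "html.parser")

# Print the srcset attribute of every image in the grid
for item in soup.find_all("img", {"class": "grid-item--image"}):
    print(item.get("srcset"))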

This would display the following kind of output:

http://assets.vogue.com/photos/569d37e434324c316bd70f04/master/w_195/_FEN0016.jpg
http://assets.vogue.com/photos/569d37e5d928983d20a78e4f/master/w_195/_FEN0027.jpg
http://assets.vogue.com/photos/569d37e834324c316bd70f0a/master/w_195/_FEN0041.jpg
http://assets.vogue.com/photos/569d37e334324c316bd70efe/master/w_195/_FEN0049.jpg
http://assets.vogue.com/photos/569d37e702e08d8957a11e32/master/w_195/_FEN0059.jpg
...
...
...
http://assets.vogue.com/photos/569d3836486d6d3e20ae9625/master/w_195/_FEN0616.jpg
http://assets.vogue.com/photos/569d381834324c316bd70f3b/master/w_195/_FEN0634.jpg
http://assets.vogue.com/photos/569d3829fa6d6c9057f91d2a/master/w_195/_FEN0649.jpg
http://assets.vogue.com/photos/569d382234324c316bd70f41/master/w_195/_FEN0663.jpg
http://assets.vogue.com/photos/569d382b7dcd2a8a57748d05/master/w_195/_FEN0678.jpg
http://assets.vogue.com/photos/569d381334324c316bd70f2f/master/w_195/_FEN0690.jpg
http://assets.vogue.com/photos/569d382dd928983d20a78eb1/master/w_195/_FEN0846.jpg
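
Each value above happens to be a single URL, but in general a srcset attribute can hold several comma-separated candidates, each followed by a width or density descriptor such as 195w or 2x. That may be one reason why saving "the srcset" directly yields unexpected images. The following is a minimal sketch of how such a value could be split up, using only the standard library. The helper names parse_srcset and widest_candidate and the example.com URLs are made up for illustration, and the sketch assumes that the URLs themselves contain no commas.

```python
def parse_srcset(srcset):
    """Split a srcset attribute value into (url, descriptor) pairs.

    Assumption: no URL contains a comma, so a plain split is enough.
    """
    candidates = []
    for part in srcset.split(","):
        fields = part.split()
        if not fields:
            continue  # skip empty pieces, e.g. after a trailing comma
        url = fields[0]
        descriptor = fields[1] if len(fields) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


def widest_candidate(srcset):
    """Return the URL with the largest 'w' descriptor.

    Falls back to the first URL when no width descriptors are present,
    and returns None for an empty srcset.
    """
    candidates = parse_srcset(srcset)
    if not candidates:
        return None

    def width(candidate):
        descriptor = candidate[1]
        if descriptor.endswith("w") and descriptor[:-1].isdigit():
            return int(descriptor[:-1])
        return 0  # no usable width descriptor

    return max(candidates, key=width)[0]


# Hypothetical srcset value with two width candidates
example = "http://example.com/a_195.jpg 195w, http://example.com/a_400.jpg 400w"
print(widest_candidate(example))
```

In the loop from the answer, widest_candidate(item.get('srcset')) could be used in place of the raw attribute value when item.get('srcset') is not None.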

This allows the page to be fully rendered inside the browser, after which the resulting HTML can be obtained.
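
If page_source is read before the page's scripts have finished, the grid images may still be missing. One possible refinement is Selenium's explicit wait, which blocks until a given element is present. The sketch below is an untested illustration of that idea. The function name fetch_rendered_html, its parameters, and the 10 second default timeout are assumptions, and it needs Firefox with a matching driver installed.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in Firefox and return the HTML once css_selector matches."""
    # Selenium is imported inside the function so that the function can
    # be defined even on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        # Block until at least one matching element exists; a
        # TimeoutException is raised after `timeout` seconds otherwise.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even on errors
```

It could be called as fetch_rendered_html(url, "img.grid-item--image"), where url is the page address, and the returned string passed to BeautifulSoup in the same way as html_source.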
