Developers Planet

Giangio February 2016

Web scraping multiple pages with Beautiful Soup

This is my problem: I am writing some code to scrape data from Crowdcube.

The idea is to get the title, description, target capital, raised capital and category for each pitch.

First I made an attempt on a single page, and the code worked. Here it is:

from bs4 import BeautifulSoup
import urllib.request

data = {
    'title': [],
    'description': [],
    'target': [],
    'raised': [],
    'category': []
}

l = urllib.request.urlopen('https://www.crowdcube.com/investment/primo-18884')
tree = BeautifulSoup(l, 'lxml')

# title
title = tree.find_all('div', {'class': 'cc-pitch__title'})
data['title'].append(title[0].find('h2').get_text())

# description
description = tree.find_all('div', {'class': 'fullwidth'})
data['description'].append(description[1].find('p').get_text())

# target
target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
data['target'].append(target[0].find('dd').get_text())

# raised
raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
data['raised'].append(raised[0].find('b').get_text())

# category
category = tree.find_all('li', {'class': 'sectors'})
data['category'].append(category[0].find('span').get_text())

data

I need to download the same information for all the projects on the website.

All the links are listed on this page: (https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7)

To do so, I started creating a list of URLs with this code:

source = urllib.request.urlopen('https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7')

get_link = BeautifulSoup(source, 'lxml')

links_page = [a.attrs.get('href') for a in get_link.select('a[href]')]
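For reference, the CSS attribute-selector approach can be checked against a small static snippet before running it on the live listing page. The `/investment/` substring filter below is an assumption based on the example pitch URL above, not something confirmed by the site:

```python
from bs4 import BeautifulSoup

# Small static snippet standing in for the listing page.
html = '''
<a href="https://www.crowdcube.com/investment/primo-18884">Primo</a>
<a href="/about">About</a>
<a href="https://www.crowdcube.com/investment/foo-123">Foo</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# a[href*="/investment/"] matches anchors whose href contains that substring,
# so navigation links like /about are filtered out.
links_page = [a['href'] for a in soup.select('a[href*="/investment/"]')]
print(links_page)
```

This keeps only the pitch links and drops navigation links, which avoids crawling pages that the scraping code cannot parse.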

Answers


Padraic Cunningham February 2016

There are way more than three links on the page you linked to; I get 292. If you want to parse each of those, do the following:

import requests
from bs4 import BeautifulSoup

url = "https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7"


def parse(so):
    return {'title': so.title.text,
            'description': so.find("div", {"class": "pitch-tabs"}).p.text,
            'target': so.find("div", {"class": "cc-pitch__stats clearfix"}).dd.text,
            'raised': so.find("div", {"class": "cc-pitch__raised"}).b.text,
            'category': " ".join(so.find("li", {"class": "sectors"}).span.text.split())}


req = requests.get(url)

soup = BeautifulSoup(req.content, "lxml")

links = {h.a["href"] for h in soup.find_all("h2", {"class": "pitch__title"})}

for link in links:
    print(link)
    soup = BeautifulSoup(requests.get(link).content, "lxml")
    print(parse(soup))

A snippet of the output:

https://www.crowdcube.com/investment/property-moose-14045
{'category': u'Other, Internet Business, Technology', 'raised': u'\xa3169,010', 'target': u'\xa360,000', 'description': u'Property Moose is a new generation of property investment \u2013 taking the equity crowdfunding model and using it to allow users to invest in a wide range of properties from only \xa3500. Combining this with a fully integrated online platform, Property Moose aspires to take the Crowdfunding revolution by storm.', 'title': u'Property Moose raising \xa360,000 investment on Crowdcube. Capital At Risk.'}
https://www.crowdcube.com/investment/easyproperty-com-16655
{'category': u'Professional and Business Services, Internet Business', 'raised': u'\xa31,358,680', 'target': u'\xa31,000,000', 'description': u'easyProperty, the latest company from easyGroup, will offer individually priced property services. The venture, which has been founded by Sir Stelios (founder of easyJet)  
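If you want the results in the same column-oriented `data` dict shape used in the question, you can fold the per-pitch dicts returned by `parse()` into it. This is just a sketch; the sample rows below are stand-ins, not real scraped output:

```python
# Sample per-pitch dicts, shaped like the return value of parse() above.
rows = [
    {'title': 'Primo', 'description': 'd1', 'target': '£60,000',
     'raised': '£10,000', 'category': 'Food'},
    {'title': 'Foo', 'description': 'd2', 'target': '£100,000',
     'raised': '£5,000', 'category': 'Tech'},
]

# Column-oriented structure from the question.
data = {'title': [], 'description': [], 'target': [], 'raised': [], 'category': []}

# Append each field of each row to its column.
for row in rows:
    for key in data:
        data[key].append(row[key])

print(data['title'])   # ['Primo', 'Foo']
print(data['raised'])  # ['£10,000', '£5,000']
```

In the real loop you would replace `rows` with the dicts produced by `parse(soup)` for each link.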
