Giangio February 2016

Python 3.x: 'ascii' codec can't encode character '\xfc' in position 18: ordinal not in range(128)

I already checked existing questions; none of them works for me.

I wrote some code to scrape information from multiple pages in a website.

When I run the code, it returns this error: 'ascii' codec can't encode character '\xfc' in position 18: ordinal not in range(128)

When I test the code on a limited number of links it works.
The problem is probably this link:

'https://www.crowdcube.com/investment/brüpond-brewery-10622' 

because it contains the 'ü' character.

In this specific case I can simply drop that link, but I would like to know how to handle this problem in general.
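In general, non-ASCII characters in a URL can be percent-encoded before the request is made, using only the standard library. A minimal sketch (the helper name `encode_url` is my own, not from the question):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url(url):
    # Percent-encode the path and query so only ASCII reaches urlopen;
    # the scheme and host are left untouched.
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path),          # '/' is safe by default
        quote(parts.query, safe='=&'),
        parts.fragment,
    ))

print(encode_url('https://www.crowdcube.com/investment/brüpond-brewery-10622'))
# the 'ü' becomes its UTF-8 percent-escape '%C3%BC'
```

Passing the result of `encode_url` to `make_soup` should avoid the ascii-codec error without dropping the link.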

Here is the code:

from bs4 import BeautifulSoup
import urllib.request
from time import sleep 
import re



def make_soup(url):
    html = urllib.request.urlopen(url)
    return BeautifulSoup(html, "lxml")

def get_links(section_url):
    get_link = make_soup(section_url)
    links_page = [a.attrs.get('href') for a in get_link.select('a[href]')]
    links_page = list(set(links_page))


    links = [l for l in links_page if 'https://www.crowdcube.com/investment/' in l] 

    return links

def get_data(url):
    miss='.'
    tree = make_soup(url)
try:
    #title
    title = tree.find_all('h2')[0].get_text()

    #description
    description=tree.find_all('div',{'class':'fullwidth'})
    description= description[1].find('p').get_text()
    description=re.sub(r'[^\w.]', ' ', description)   

    #location
    location=tree.find_all('div',{'class':'pitch-profile'})
    location=location[0].find('li').get_text()
    l=0
    loc=list(location)
    while l < len(loc):
        if loc[l]==',':
            loc[l]='-'
        l+=1
    del(loc[0:10])
    location="".join(loc)
    #raised capital
    raised=tree.find_all('div',{'class':'cc-pitch__raised'})
    raised= raised[0].find('b').get_text()

    rais=list(raised)

    r=0
    while r < len(rais):
        if rais[r]==',':
            rais[r]='.'
        r+=1

    cur        
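As an aside, the character-by-character while-loops above can each be collapsed into a single `str.replace` call, combined with slicing to drop the leading label. A sketch with a made-up sample string (the real text comes from the scraped page):

```python
location = "Location: London, UK"   # hypothetical scraped value

# Replace commas with hyphens and drop the first 10 characters,
# equivalent to the while-loop plus del(loc[0:10]) in get_data.
location = location.replace(',', '-')[10:]
print(location)  # 'London- UK'
```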

Answers


René Kübler February 2016

urllib can't handle umlauts like the 'ü' in the URL:

'https://www.crowdcube.com/investment/brüpond-brewery-10622'

Use the requests library instead; it has no problems with umlauts.

For example, change your make_soup function to this:

import requests

def make_soup(url):
    html = requests.get(url).text
    return BeautifulSoup(html, "lxml")
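requests works here because it percent-encodes non-ASCII characters in the URL before sending the request. Its helper `requests.utils.requote_uri` shows the transformation that happens under the hood (no network access needed for this sketch):

```python
import requests

url = 'https://www.crowdcube.com/investment/brüpond-brewery-10622'

# requests applies this encoding automatically inside requests.get(url);
# the 'ü' becomes the UTF-8 percent-escape '%C3%BC'.
print(requests.utils.requote_uri(url))
```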

Post Status

Asked in February 2016
Viewed 3,701 times
Voted 5
Answered 1 time
