federico Guastadisegni February 2016

Scrapy doesn't get all data

I'm trying to scrape this page:

http://binpar.caicyt.gov.ar/cgi-bin/koha/opac-detail.pl?biblionumber=98723

with this code:

def parse_web9(self, response): #Conicet!!

    for publication in response.css('div#wrap > div.main > div.container-fluid > div.row-fluid > div.span9 > div#catalogue_detail_biblio > div.record'):

        pubtitle = publication.xpath('./h1[@class="title"]/text()').extract_first()

        author = publication.xpath('./span[@class="results_summary publisher"]/span/span/a/text()').extract()

        isxn = publication.xpath('./span[@class="results_summary issn"]/span/text()').re(r'\d+-\d+')

        yield {
            'titulo_publicacion': pubtitle,
            'anio_publicacion': None,
            'isbn': isxn,
            'nombre_autor': author,
            'url_link': None
        }

But I'm only getting the title of the publication, and I'm not sure why.

Cheers!

Answers


alecxe February 2016

You should select the inner fields by their property attributes rather than by class:

$ scrapy shell http://binpar.caicyt.gov.ar/cgi-bin/koha/opac-detail.pl?biblionumber=98723
>>> for publication in response.css('div#wrap > div.main > div.container-fluid > div.row-fluid > div.span9 > div#catalogue_detail_biblio > div.record'):
...     author = publication.css("span[property=contributor] span[property=name]::text").extract_first()
...     title = publication.css("h1[property=name]::text").extract_first()
...     issn = publication.css("span[property=issn]::text").extract_first()
...     print(author, title, issn)
... 
(u'Asociaci\xf3n Filat\xe9lica de la Rep\xfablica Argentina', u'AFRA, bolet\xedn informativo. ', u'0001-1193.')
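
Folding those selectors back into the original callback might look roughly like the sketch below. It assumes the page still carries the property attributes seen in the shell session above. clean and extract_issn are hypothetical helpers (not part of Scrapy), added only to drop the trailing period and whitespace visible in the shell output. Unlike the original code, nombre_autor and isbn come back as single strings here, not lists.

```python
import re

# Record selector copied from the question.
RECORD_CSS = ('div#wrap > div.main > div.container-fluid > div.row-fluid > '
              'div.span9 > div#catalogue_detail_biblio > div.record')

# An ISSN is four digits, a hyphen, three digits and a check digit (0-9 or X).
ISSN_RE = re.compile(r'\d{4}-\d{3}[\dXx]')


def clean(value):
    """Hypothetical helper: strip whitespace and a trailing period."""
    if value is None:
        return None
    return value.strip().rstrip('.').strip()


def extract_issn(value):
    """Hypothetical helper: return the first ISSN-shaped substring, or None."""
    if value is None:
        return None
    match = ISSN_RE.search(value)
    return match.group(0) if match else None


def parse_web9(self, response):
    """Spider callback; paste it into the spider class as a method."""
    for publication in response.css(RECORD_CSS):
        # Select by property attribute, as in the shell session above.
        title = publication.css('h1[property=name]::text').extract_first()
        author = publication.css(
            'span[property=contributor] span[property=name]::text'
        ).extract_first()
        issn = publication.css('span[property=issn]::text').extract_first()

        yield {
            'titulo_publicacion': clean(title),
            'anio_publicacion': None,
            'isbn': extract_issn(issn),
            'nombre_autor': clean(author),
            'url_link': None,
        }
```

If a record has several contributors, switching the author line back to extract() and cleaning each entry would keep all of them.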

Post Status

Asked in February 2016
Viewed 2,715 times
Voted 7
Answered 1 times
