Henk February 2016

Python LinkExtractor to go to next pages doesn't work

Below is a piece of code I have to try crawling a site with more than one page... I'm having trouble getting the Rule class to work. What am I doing wrong?

#import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

#    def parse_item(self, response):
    def parse(self, response):
        #self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            #print sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            #handle output (print or save to database)
            items.append(item)
        

Answers


paul trmbrth February 2016

A few things to change:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

  • as I mentioned in the comments, your XPath needs fixing: remove the extra /a at the end (an a element nested inside another a will not match anything)
  • CrawlSpider rules need a callback method if you want to extract items from the followed pages
  • to also parse elements from the start URLs, you need to define a parse_start_url method

This is a minimalistic CrawlSpider following the 3 pages from your sample input, and printing out how many "articles" there are in each page:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        # count the "articles" on this page; the container XPath is taken
        # from the question's own selectors
        articles = response.xpath('//*[@id="search-results"]/section[2]/article')
        self.logger.info('%d articles on %s', len(articles), response.url)

    # have the responses for the start URLs go through the same callback
    parse_start_url = parse_page
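Merging the question's item extraction into that spider gives something like the following. This is a minimal sketch rather than part of the answer above: the relative XPaths and text() calls are assumptions derived from the absolute XPaths in the question, and the extracted strings may still need whitespace stripped.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        # iterate over the article nodes and query each one with relative
        # XPaths, instead of re-indexing the whole document with a counter
        for article in response.xpath('//*[@id="search-results"]/section[2]/article'):
            item = SkodaItem()
            # relative paths are assumed equivalents of the question's absolute ones
            item["title"] = article.xpath('div/div[1]/div[1]/h2/a/span/text()').extract()
            item["leeftijd"] = article.xpath('div/div[1]/div[2]/span[1]/text()').extract()
            item["prijs"] = article.xpath('div/div[2]/div[1]/div/div/text()').extract()
            item["km"] = article.xpath('div/div[1]/div[2]/span[3]/text()').extract()
            # yield so Scrapy's pipelines and feed exporters see each item
            yield item

    parse_start_url = parse_page

Run with "scrapy crawl skodas -o items.json" and the items from the start page and every followed page should end up in items.json.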

Post Status

Asked in February 2016
Viewed 2,577 times
Voted 14
Answered 1 time
