Unable to get the next-page body text (page 2)

Question — votes: 0, answers: 1

Page 1 and page 2 are the URLs. I want to scrape everything from the first URL and only the body text from the second URL, appending it to the body text of the first. This is just one article; the function parse_indianexpress_archive_links() returns a list of news-article URLs. I get all the fields from page 1, but the next_maintext column for page 2 comes out as <GET http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/2>

class spider_indianexpress(scrapy.Spider):
    name = 'indianexpress'
    start_urls = parse_indianexpress_archive_links()

    def parse(self, response):
        items = ScrapycrawlerItem()
        separator = ''

        #article_url = response.xpath("//link[@rel = 'canonical']/@href").extract_first()
        article_url = response.request.url

        date_updated = max(response.xpath("//div[@class = 'story-date']/text()").extract() , key=len)[-27:]  #Call max(list, key=len) to return the longest string in list by comparing the lengths of all strings in a list
        if len(date_updated) <= 10:
            date_updated = max(response.xpath("//div[@class = 'story-date']/p/text()").extract() , key=len)[-27:]

        headline = response.xpath("(//div[@id = 'ie2013-content']/h1//text())").extract()
        headline = separator.join(headline)

        image_url = response.css("div.storybigpic.ssss img").xpath("@src").extract_first()

        maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        maintext = ' '.join(map(str, maintext))
        maintext = maintext.replace('\r','')
        contd = response.xpath("//div[@class = 'ie2013-contentstory']/p[@align = 'right']/text()").extract_first()

        items['date_updated'] = date_updated
        items['headline'] = headline
        items['maintext'] = maintext
        items['image_url'] = image_url
        items['article_url'] = article_url

        next_page_url = response.xpath("//a[@rel='canonical']/@href").extract_first()

        if next_page_url:
            items['next_maintext'] = scrapy.Request(next_page_url , callback = self.parse_page2)

        yield items

    def parse_page2(self, response):
        next_maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        next_maintext = ' '.join(map(str, next_maintext))
        next_maintext = next_maintext.replace('\r','')
        yield {next_maintext}

Output:

article_url,date_publish,date_updated,description,headline,image_url,maintext,next_maintext

http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/,,"Fri Apr 03 2009, 14:49 hrs ",,Congress approves 2010 budget plan,http://static.indianexpress.com/m-images/M_Id_69893_Obama.jpg,"The Democratic-controlled US Congress on Thursday approved budget blueprints embracing President Barack Obama's agenda but leaving many hard choices until later and a government deeply in the red. With no Republican support, the House of Representatives and Senate approved slightly different, less expensive versions of Obama's $3.55 trillion budget plan for fiscal 2010, which begins on October 1. The differences will be worked out over the next few weeks. Obama, who took office in January after eight years of the Republican Bush presidency, has said the Democrats' budget is critical to turning around the recession-hit US economy and paving the way for sweeping healthcare, climate change and education reforms he hopes to push through Congress this year. Obama, traveling in Europe, issued a statement praising the votes as ""an important step toward rebuilding our struggling economy."" Vice President Joe Biden, who serves as president of the Senate, presided over that chamber's vote. Democrats in both chambers voted down Republican alternatives that focused on slashing massive deficits with large cuts to domestic social spending but also offered hefty tax breaks for corporations and individuals. ""Democrats know that those policies are the wrong way to go,"" House Majority Leader Steny Hoyer told reporters. ""Our budget lays the groundwork for a sustained, shared and job-creating recovery."" But Republicans have argued the Democrats' budget would be a dangerous expansion of the federal government and could lead to unnecessary taxes that would only worsen the country's long-term fiscal situation. ""The Democrat plan to increase spending, to increase taxes, and increase the debt makes no difficult choices,"" said House Minority Leader John Boehner. ""It's a roadmap to disaster."" The budget measure is nonbinding but it sets guidelines for spending and tax bills Congress will consider later this year. BIPARTISANSHIP ABSENT AGAIN Obama has said he hoped to restore bipartisanship when he arrived in Washington but it was visibly absent on Thursday. ... contd.",<GET http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/2>
xpath scrapy web-crawler html-parsing
1 Answer (1 vote)

That is not how Scrapy works (I mean the next_page request); see "How can I get the Response object of a Request synchronously in Scrapy?".

But in fact you do not need a synchronous request. You only need to check for a next page and pass the current state (the item) into the callback that processes that next page. I am using cb_kwargs (the recommended way nowadays); you may need to use request.meta instead if you are on an older Scrapy version (there is a sketch of that variant after the code below).

import scrapy

class spider_indianexpress(scrapy.Spider):
    name = 'indianexpress'
    start_urls = ['http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/']

    def parse(self,response):
        item = {}
        separator = ''

        #article_url = response.xpath("//link[@rel = 'canonical']/@href").extract_first()
        article_url = response.request.url

        date_updated = max(response.xpath("//div[@class = 'story-date']/text()").extract() , key=len)[-27:]  #Call max(list, key=len) to return the longest string in list by comparing the lengths of all strings in a list
        if len(date_updated) <=10:
            date_updated = max(response.xpath("//div[@class = 'story-date']/p/text()").extract() , key=len)[-27:]

        headline = response.xpath("(//div[@id = 'ie2013-content']/h1//text())").extract()
        headline=separator.join(headline)

        image_url = response.css("div.storybigpic.ssss img").xpath("@src").extract_first()        

        maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        maintext = ' '.join(map(str, maintext))
        maintext = maintext.replace('\r','')
        contd = response.xpath("//div[@class = 'ie2013-contentstory']/p[@align = 'right']/text()").extract_first()

        item['date_updated'] = date_updated
        item['headline'] = headline
        item['maintext'] = maintext
        item['image_url'] = image_url
        item['article_url'] = article_url

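        # The next page, if any, is the <a> immediately following the currently
        # active pagination link (rel="canonical", id="active").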
        next_page_url = response.xpath('//a[@rel="canonical"][@id="active"]/following-sibling::a[1]/@href').extract_first()

        if next_page_url :
            yield scrapy.Request(
                url=next_page_url, 
                callback = self.parse_next_page,
                cb_kwargs={
                    'item': item,
                }
            )
        else:
            yield item

    def parse_next_page(self, response, item):
        next_maintext = response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        next_maintext = ' '.join(map(str, next_maintext))
        next_maintext = next_maintext.replace('\r','')
        item["maintext"] += next_maintext

        next_page_url = response.xpath('//a[@rel="canonical"][@id="active"]/following-sibling::a[1]/@href').extract_first()
        if next_page_url :
            yield scrapy.Request(
                url=next_page_url, 
                callback = self.parse_next_page,
                cb_kwargs={
                    'item': item,
                }
            )
        else:
            yield item
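For reference, on Scrapy versions that predate cb_kwargs (it was added around Scrapy 1.7, as far as I recall) the same partially-built item can travel in request.meta instead. Below is a minimal sketch of that variant, reusing the selectors from the answer above; the spider name is made up, and only the article_url and maintext fields are kept for brevity:

import scrapy

class SpiderIndianexpressMeta(scrapy.Spider):
    # Hypothetical variant for older Scrapy releases without cb_kwargs:
    # the partially-built item travels in request.meta instead.
    name = 'indianexpress_meta'
    start_urls = ['http://archive.indianexpress.com/news/congress-approves-2010-budget-plan/442712/']

    def parse(self, response):
        item = {}
        item['article_url'] = response.request.url
        item['maintext'] = ' '.join(
            response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        ).replace('\r', '')

        # Same "link after the active pagination link" XPath as in the answer above.
        next_page_url = response.xpath(
            '//a[@rel="canonical"][@id="active"]/following-sibling::a[1]/@href'
        ).extract_first()
        if next_page_url:
            # Attach the item to the request; the callback reads it from response.meta.
            yield scrapy.Request(next_page_url, callback=self.parse_next_page,
                                 meta={'item': item})
        else:
            yield item

    def parse_next_page(self, response):
        item = response.meta['item']
        item['maintext'] += ' ' + ' '.join(
            response.xpath("//div[@class = 'ie2013-contentstory']//p//text()").extract()
        ).replace('\r', '')

        next_page_url = response.xpath(
            '//a[@rel="canonical"][@id="active"]/following-sibling::a[1]/@href'
        ).extract_first()
        if next_page_url:
            yield scrapy.Request(next_page_url, callback=self.parse_next_page,
                                 meta={'item': item})
        else:
            yield item

Either way the design is the same: the item is only yielded once the last page of the article has been appended, so each article produces exactly one output row with the full body text.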