Extracting DOIs from multiple pages with Scrapy

Question

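I have this web page (https://academic.oup.com/plphys/search-results?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1), from which I want to extract information such as the title, author names, DOI, etc. I can do this easily for the first page, but the search results span many pages and I cannot get the spider to crawl the rest. My code is: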

import scrapy

class PhotosynSpiderSpider(scrapy.Spider):
    name = 'photosyn_spider'    
    allowed_domains = ['https://academic.oup.com/plphys']
    start_urls = ['https://academic.oup.com/plphys/search-results?q=photosynthesis&allJournals=1&fl_SiteID=6323']

    def parse(self, response):
        # Step 1: Locate the first page in div class 'pageNumbers al-pageNumbers'
        page_numbers = response.css('div.pageNumbers.al-pageNumbers')
        current_page = page_numbers.css('span.current-page::text').get()
        total_pages = page_numbers.css('span.total-pages::text').get()

        # Step 2: Locate link in a class 'al-citation-list', and extract all the href for doi in the element 'a'
        citation_list = response.css('a.al-citation-list')
        dois = citation_list.css('a::attr(href)').getall()

        for doi in dois:
            yield {'doi': doi}

        # Step 3: Open url for the next page in the element 'a' and class 'sr-nav-next al-nav-next' and repeat step 2
        if current_page != total_pages:
            next_page_url = response.css('a.sr-nav-next.al-nav-next::attr(href)').get()
            yield scrapy.Request(next_page_url, callback=self.parse)

I am trying to dump the results into a JSON file, but the output is empty. Can anyone help me with this? Thanks.

python web-scraping beautifulsoup scrapy web-crawler
1 Answer

If you inspect the next-page element, you will see that its href attribute is not an actual URL:

<a role="button" aria-label="Next" href="javascript:;" class="sr-nav-next al-nav-next" data-url="q=photosynthesis&amp;allJournals=1&amp;fl_SiteID=6323&amp;page=2" data-google-interstitial="false">
   Next
</a>

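This is because clicking the Next button does not actually take you to a new page. Instead, JavaScript makes an AJAX call and swaps out the contents of the article section.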
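As a rough illustration of the point above, the query string for the next page sits in the link's data-url attribute rather than in href. The sketch below uses only the standard library to read that attribute from the snippet shown earlier and join it to the AJAX endpoint. The names NextLinkParser and next_ajax_url are hypothetical helpers introduced here, and it is an assumption that every results page carries a Next link of this shape.

```python
from html.parser import HTMLParser

# Endpoint of the AJAX call behind the Next button on this site.
AJAX_ENDPOINT = "https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults"


class NextLinkParser(HTMLParser):
    """Collect the data-url attribute of the 'Next' pagination link (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.data_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "sr-nav-next" in classes and self.data_url is None:
            # HTMLParser has already decoded &amp; to & in attribute values.
            self.data_url = attrs.get("data-url")


def next_ajax_url(html):
    """Return the AJAX URL for the next page, or None if no Next link is found."""
    parser = NextLinkParser()
    parser.feed(html)
    if not parser.data_url:
        return None
    return f"{AJAX_ENDPOINT}?{parser.data_url}"


snippet = (
    '<a role="button" aria-label="Next" href="javascript:;" '
    'class="sr-nav-next al-nav-next" '
    'data-url="q=photosynthesis&amp;allJournals=1&amp;fl_SiteID=6323&amp;page=2">Next</a>'
)
print(next_ajax_url(snippet))
```

Inside a Scrapy callback, the same attribute can be read with response.css('a.sr-nav-next::attr(data-url)').get(), so the standard-library parser is only needed to keep this sketch self-contained.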

By requesting the URL used in that AJAX call directly, and following its pattern, we can fetch the results of every subsequent page.

For example:

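import scrapy

class PhotosynSpiderSpider(scrapy.Spider):
    name = 'photosyn_spider'

    def start_requests(self):
        # URL of the AJAX call made by the Next button; only the page number changes.
        ajax_url = 'https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page='
        # Requests pages 1 through 49; adjust the upper bound to the number of result pages.
        for i in range(1, 50):
            yield scrapy.Request(ajax_url + str(i))

    def parse(self, response):
        # Each search result is rendered as one article box.
        for row in response.css("div.sr-list.al-article-box.al-normal.clearfix"):
            # The DOI link sits inside the citation block of the result.
            doi = row.xpath(".//div[@class='al-citation-list']//a/@href").get()
            yield {"doi": doi}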

Output for pages 1-2:

{'doi': 'https://doi.org/10.1093/plphys/kiac484'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa026'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa032'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.120.2.599'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.109.139378'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.085167'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.085886'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.090449'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.119.2.553'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.015479'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.97.1.415'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.2.283'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.2.228'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.6.728'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.1.149'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.29.1.64'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.16.4.721'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa119'}
2023-05-09 23:07:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1> (referer: None)
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.73.4.1002'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.59.5.868'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.75.1.82'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.68.4.894'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.81.4.1115'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.59.5.859'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.93.4.1466'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.95.4.1270'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.48.6.712'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.89.2.409'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.89.4.1231'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.26.3.581'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.100.2.947'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.71.4.855'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.62.1.127'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.72.1.16'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.61.2.150'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.20.00264'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1093/plphys/kiac602'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1093/plphys/kiad183'}

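Note: at the time of writing this answer, the site was serving a CAPTCHA. If you try to scrape the site while the CAPTCHA is active, all you need to do is copy the cookies from your browser and attach them to each request in the start_requests method.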
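One possible way to do this, sketched under assumptions: copy the Cookie request header from the browser's developer tools and turn it into a dict, since scrapy.Request accepts a cookies argument in that form. The helper name parse_cookie_header and the cookie names and values below are hypothetical placeholders, not the cookies the site actually sets.

```python
def parse_cookie_header(header):
    """Turn a raw 'Cookie' header string into a {name: value} dict (hypothetical helper)."""
    cookies = {}
    for part in header.split(";"):
        # Split on the first '=' only, since cookie values may contain '=' themselves.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Placeholder value; paste the real Cookie header copied from your browser here.
COOKIE_HEADER = "session_id=abc123; captcha_token=xyz=="
cookies = parse_cookie_header(COOKIE_HEADER)
print(cookies)
```

In the spider above, the request line in start_requests would then become scrapy.Request(ajax_url + str(i), cookies=cookies). Cookies copied this way usually expire, so they may need to be refreshed between runs.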
