Selenium requests work, but scrapy-selenium does not. The page loads and I get a 200 response from the site, but I don't see any errors because the spider simply doesn't yield any output.
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class SeamdbTestSpider(scrapy.Spider):
    name = 'steam_db_test'
    start_urls = ['https://steamdb.info/graph/']

    def start_requests(self):
        for link in self.start_urls:
            yield SeleniumRequest(
                url=link,
                wait_time=10,
                callback=self.parse)

    def parse(self, response):
        driver = response.meta['driver']
        initial_page = driver.page_source
        r = Selector(text=initial_page)
        table = r.xpath('//*[@id="table-apps"]/tbody')
        rows = table.css('tr[class="app"]')[0:2]
        for element in rows:
            info_link = "https://steamdb.info" + element.css('::attr(href)').get()
            name = element.css('a ::text').get()
            yield {"Name": name, "Link": info_link}
Indeed, SeleniumRequest with Scrapy is not always reliable. The same element selection works with selenium plus bs4, but with Scrapy it produces empty output, just as you describe.
Scrapy SeleniumRequest doesn't work:
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class SeamdbTestSpider(scrapy.Spider):
    name = 'steam_db_test'
    start_urls = ['https://steamdb.info/graph/']

    def start_requests(self):
        for link in self.start_urls:
            yield SeleniumRequest(
                url=link,
                wait_time=10,
                callback=self.parse)

    def parse(self, response):
        driver = response.meta['driver']
        initial_page = driver.page_source
        r = Selector(text=initial_page)
        rows = r.css('table#table-apps tbody tr')
        for element in rows:
            info_link = "https://steamdb.info" + element.css('td:nth-child(3) > a::attr(href)').get()
            name = element.css('td:nth-child(3) > a::text').get()
            yield {"Name": name, "Link": info_link}
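As a side note (a minor suggestion, not part of the original spiders): rather than concatenating strings, the absolute link can be built with `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute:

```python
from urllib.parse import urljoin

# The page URL against which relative hrefs should be resolved.
BASE = "https://steamdb.info/graph/"

# A root-relative href resolves against the site root...
print(urljoin(BASE, "/app/730/graphs/"))
# ...and an already-absolute href passes through unchanged.
print(urljoin(BASE, "https://steamdb.info/app/570/graphs/"))
```

Inside a Scrapy callback, `response.urljoin(href)` does the same thing using the response URL as the base.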
Selenium with bs4 works fine:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# keep Chrome open after the script finishes
options.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://steamdb.info/graph/")
time.sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
for tr in soup.select('table#table-apps tbody tr'):
    link = tr.select_one('td:nth-child(3) > a').get('href')
    link = "https://steamdb.info" + link
    name = tr.select_one('td:nth-child(3) > a').text
    print(link)
    print(name)
Output:
https://steamdb.info/app/730/graphs/
Counter-Strike: Global Offensive
https://steamdb.info/app/570/graphs/
Dota 2
https://steamdb.info/app/578080/graphs/
PUBG: BATTLEGROUNDS
https://steamdb.info/app/1172470/graphs/
Apex Legends
https://steamdb.info/app/1599340/graphs/
Lost Ark
https://steamdb.info/app/271590/graphs/
Grand Theft Auto V
https://steamdb.info/app/440/graphs/
Team Fortress 2
https://steamdb.info/app/1446780/graphs/
MONSTER HUNTER RISE
https://steamdb.info/app/346110/graphs/
ARK: Survival Evolved
https://steamdb.info/app/252490/graphs/
Rust
https://steamdb.info/app/431960/graphs/
Wallpaper Engine
https://steamdb.info/app/1506830/graphs/
FIFA 22
https://steamdb.info/app/1085660/graphs/
Destiny 2
https://steamdb.info/app/1569040/graphs/
Football Manager 2022
https://steamdb.info/app/230410/graphs/
Warframe
https://steamdb.info/app/1203220/graphs/
NARAKA: BLADEPOINT
https://steamdb.info/app/359550/graphs/
Tom Clancy's Rainbow Six Siege
https://steamdb.info/app/381210/graphs/
Dead by Daylight
https://steamdb.info/app/236390/graphs/
...and so on
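If you also need the numeric Steam app id from those graph links, a small helper (hypothetical, not in the original code) can pull it out with a regular expression:

```python
import re

def app_id(link):
    """Extract the numeric Steam app id from a steamdb graph URL, or None."""
    m = re.search(r"/app/(\d+)/", link)
    return m.group(1) if m else None

print(app_id("https://steamdb.info/app/730/graphs/"))      # 730
print(app_id("https://steamdb.info/app/1172470/graphs/"))  # 1172470
```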
Correct. I ran into the same problem when I tried to scrape a website: I used the scrapy-selenium middleware to do the job and got no results for any of the items I tried to yield. However, when I wrote the same project with selenium and bs4, things changed and I was easily able to get all the information I needed. It still crashed a few times, though I was able to sort that out with this.