Selenium requests work, but scrapy-selenium does not. The page loads and I get a 200 response from the site, but I don't see any errors because the spider simply doesn't yield any output.
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class SeamdbTestSpider(scrapy.Spider):
    name = 'steam_db_test'
    start_urls = ['https://steamdb.info/graph/']

    def start_requests(self):
        for link in self.start_urls:
            yield SeleniumRequest(
                url=link,
                wait_time=10,
                callback=self.parse)

    def parse(self, response):
        driver = response.meta['driver']
        initial_page = driver.page_source
        r = Selector(text=initial_page)
        table = r.xpath('//*[@id="table-apps"]/tbody')
        rows = table.css('tr[class="app"]')[0:2]
        for element in rows:
            info_link = "https://steamdb.info" + element.css('::attr(href)').get()
            name = element.css('a ::text').get()
            yield {"Name": name, "Link": info_link}
Indeed, SeleniumRequest with Scrapy is not always reliable. The same element selection works with selenium plus bs4, but with Scrapy it produces empty output, just as you describe.
Scrapy SeleniumRequest doesn't work:
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest


class SeamdbTestSpider(scrapy.Spider):
    name = 'steam_db_test'
    start_urls = ['https://steamdb.info/graph/']

    def start_requests(self):
        for link in self.start_urls:
            yield SeleniumRequest(
                url=link,
                wait_time=10,
                callback=self.parse)

    def parse(self, response):
        driver = response.meta['driver']
        initial_page = driver.page_source
        r = Selector(text=initial_page)
        rows = r.css('table#table-apps tbody tr')
        for element in rows:
            info_link = "https://steamdb.info" + element.css('td:nth-child(3) > a::attr(href)').get()
            name = element.css('td:nth-child(3) > a::text').get()
            yield {"Name": name, "Link": info_link}
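As a side note (a minor suggestion, not part of the original spiders): rather than concatenating strings, the absolute link can be built with `urllib.parse.urljoin` from the standard library, which also copes with hrefs that are already absolute:

```python
from urllib.parse import urljoin

# The page URL against which relative hrefs should be resolved.
BASE = "https://steamdb.info/graph/"

# A root-relative href resolves against the site root...
print(urljoin(BASE, "/app/730/graphs/"))
# ...and an already-absolute href passes through unchanged.
print(urljoin(BASE, "https://steamdb.info/app/570/graphs/"))
```

Inside a Scrapy callback, `response.urljoin(href)` does the same thing using the response URL as the base.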
Selenium with bs4 works fine:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# keep Chrome open after the script finishes
options.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://steamdb.info/graph/")
time.sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
for tr in soup.select('table#table-apps tbody tr'):
    link = tr.select_one('td:nth-child(3) > a').get('href')
    link = "https://steamdb.info" + link
    name = tr.select_one('td:nth-child(3) > a').text
    print(link)
    print(name)
Output:
https://steamdb.info/app/730/graphs/
Counter-Strike: Global Offensive
https://steamdb.info/app/570/graphs/
Dota 2
https://steamdb.info/app/578080/graphs/
PUBG: BATTLEGROUNDS
https://steamdb.info/app/1172470/graphs/
Apex Legends
https://steamdb.info/app/1599340/graphs/
Lost Ark
https://steamdb.info/app/271590/graphs/
Grand Theft Auto V
https://steamdb.info/app/440/graphs/
Team Fortress 2
https://steamdb.info/app/1446780/graphs/
MONSTER HUNTER RISE
https://steamdb.info/app/346110/graphs/
ARK: Survival Evolved
https://steamdb.info/app/252490/graphs/
Rust
https://steamdb.info/app/431960/graphs/
Wallpaper Engine
https://steamdb.info/app/1506830/graphs/
FIFA 22
https://steamdb.info/app/1085660/graphs/
Destiny 2
https://steamdb.info/app/1569040/graphs/
Football Manager 2022
https://steamdb.info/app/230410/graphs/
Warframe
https://steamdb.info/app/1203220/graphs/
NARAKA: BLADEPOINT
https://steamdb.info/app/359550/graphs/
Tom Clancy's Rainbow Six Siege
https://steamdb.info/app/381210/graphs/
Dead by Daylight
https://steamdb.info/app/236390/graphs/
...and so on
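If you also need the numeric Steam app id from those graph links, a small helper (hypothetical, not in the original code) can pull it out with a regular expression:

```python
import re

def app_id(link):
    """Extract the numeric Steam app id from a steamdb graph URL, or None."""
    m = re.search(r"/app/(\d+)/", link)
    return m.group(1) if m else None

print(app_id("https://steamdb.info/app/730/graphs/"))      # 730
print(app_id("https://steamdb.info/app/1172470/graphs/"))  # 1172470
```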
Correct. I ran into the same problem when I tried to scrape a website: I used the scrapy-selenium middleware to do the job and got no results for any of the items I tried to yield. However, when I wrote the same project with selenium and bs4, things changed and I was easily able to get all the information I needed. It still crashed a few times, though I was able to sort that out with this.