Scrapy response returns an empty array


I am crawling this page with Scrapy and trying to extract all rows of the main table.

The following XPath expression should give me the result I want:

//div[@id='TableWithRules']//tbody/tr

Testing in the Scrapy shell, however, showed that this expression actually returns an empty array:

#This response is empty: []
response.xpath("//div[@id='TableWithRules']//tbody").extract()
#This one is not:
response.xpath("//div[@id='TableWithRules']//thead").extract()

My guess is that the site owner is trying to restrict scraping of the table data. Is there any way to work around this?

python shell web-scraping xpath scrapy
1 Answer

If you run this JavaScript in the browser console, it extracts every name and description from the page.

let trs = document.querySelectorAll('#TableWithRules tbody tr')

trs.forEach((el) => {
    let tds = el.querySelectorAll('td')
    let name = tds[0].innerText;
    let description = tds[1].innerText;
    console.log(name, description)
})

The same logic with Selenium, for example:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hp")

trs = driver.find_elements(By.XPATH, "//div[@id='TableWithRules']//tbody//tr")
for tr in trs:
    tds = tr.find_elements(By.XPATH, ".//td")
    name = tds[0].text
    description = tds[1].text
    print(name, description)

# quit() ends the whole browser session, not just the current window
driver.quit()

Output

...
CVE-1999-0016 Land IP denial of service.
CVE-1999-0014 Unauthorized privileged access or denial of service via dtappgather program in CDE.
CVE-1999-0011 Denial of Service vulnerabilities in BIND 4.9 and BIND 8 Releases via CNAME record and zone transfer.
CVE-1999-0010 Denial of Service vulnerability in BIND 8 Releases via maliciously formatted DNS messages.
CVE-1999-0009 Inverse query buffer overflow in BIND 4.9 and BIND 8 Releases.
...
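
If you would rather stay within Scrapy, one thing worth checking: browsers insert a <tbody> element into tables whose markup omits it, so the tbody visible in the developer tools may simply be absent from the raw HTML that Scrapy downloads. That is only an assumption about this page, not something I have confirmed. If it holds, an XPath that does not mention tbody, such as //div[@id='TableWithRules']//tr[td], should match the rows. The sketch below illustrates the idea with the standard library's ElementTree instead of Scrapy, on a hypothetical, simplified snippet (RAW_HTML is made up for illustration, and ElementTree paths start with .// rather than //):

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw HTML sent by the server:
# the data rows sit directly under <table>, with no <tbody> wrapper.
RAW_HTML = """
<body>
  <div id="TableWithRules">
    <table>
      <thead><tr><th>Name</th><th>Description</th></tr></thead>
      <tr><td>CVE-1999-0016</td><td>Land IP denial of service.</td></tr>
      <tr><td>CVE-1999-0009</td><td>Inverse query buffer overflow in BIND 4.9 and BIND 8 Releases.</td></tr>
    </table>
  </div>
</body>
"""

root = ET.fromstring(RAW_HTML)

# The tbody-based path finds nothing, mirroring the empty result in the question.
with_tbody = root.findall(".//div[@id='TableWithRules']//tbody/tr")

# Selecting rows that contain <td> cells works whether or not <tbody> exists,
# and it skips the header row, which only holds <th> cells.
rows = root.findall(".//div[@id='TableWithRules']//tr[td]")
records = [[td.text for td in tr.findall("td")] for tr in rows]

print(with_tbody)
for name, description in records:
    print(name, description)
```

In a spider or the Scrapy shell, the equivalent call would be response.xpath("//div[@id='TableWithRules']//tr[td]"). If that also comes back empty, the rows are probably not in the downloaded HTML at all, and a browser-driven approach like the Selenium code above is the safer route.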