I am crawling this page with Scrapy and trying to extract all the rows of the main table.
The following XPath expression should give me the result I want:
//div[@id='TableWithRules']//tbody/tr
Testing in the Scrapy shell, however, I noticed that this expression actually returns an empty list:
# This returns an empty list: []
response.xpath("//div[@id='TableWithRules']//tbody").extract()
# This one does not:
response.xpath("//div[@id='TableWithRules']//thead").extract()
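My guess is that the site owner is trying to limit scraping of the table data, but is there a way to work around it?
If you run this JavaScript in the browser console, it extracts every name and description from the page.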
let trs = document.querySelectorAll('#TableWithRules tbody tr');
trs.forEach((el) => {
    let tds = el.querySelectorAll('td');
    let name = tds[0].innerText;
    let description = tds[1].innerText;
    console.log(name, description);
});
The same logic using Selenium, for example:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser so the page is parsed the same way as in the console
driver = webdriver.Firefox()
driver.get("https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=hp")

# Same selection as the JavaScript version: every row inside the table body
trs = driver.find_elements(By.XPATH, "//div[@id='TableWithRules']//tbody//tr")
for tr in trs:
    tds = tr.find_elements(By.XPATH, ".//td")
    name = tds[0].text
    description = tds[1].text
    print(name, description)

driver.quit()  # end the browser session
Output:
...
CVE-1999-0016 Land IP denial of service.
CVE-1999-0014 Unauthorized privileged access or denial of service via dtappgather program in CDE.
CVE-1999-0011 Denial of Service vulnerabilities in BIND 4.9 and BIND 8 Releases via CNAME record and zone transfer.
CVE-1999-0010 Denial of Service vulnerability in BIND 8 Releases via maliciously formatted DNS messages.
CVE-1999-0009 Inverse query buffer overflow in BIND 4.9 and BIND 8 Releases.
...
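One thing that may be worth checking before reaching for a browser, though nothing above confirms it is the cause here: browsers insert a missing tbody element when they build the DOM, so tbody can show up in the developer tools (and in Selenium) while being absent from the raw HTML that Scrapy downloads. The sketch below is a rough, standard-library-only way to count tags in raw markup. SAMPLE_HTML, TagCounter and count_tags are hypothetical names and made-up markup for illustration, not the real page source; in the Scrapy shell you would pass response.text instead.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags exactly as they appear in the raw markup."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase and does not
        # add implied elements such as <tbody>.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of opening tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical stand-in for the downloaded page: a table with <thead>
# but no literal <tbody>. This is an assumption, not the real source.
SAMPLE_HTML = """
<div id="TableWithRules">
<table>
<thead><tr><th>Name</th><th>Description</th></tr></thead>
<tr><td>CVE-1999-0016</td><td>Land IP denial of service.</td></tr>
<tr><td>CVE-1999-0009</td><td>Inverse query buffer overflow.</td></tr>
</table>
</div>
"""

counts = count_tags(SAMPLE_HTML)
print("thead:", counts["thead"], "tbody:", counts["tbody"], "tr:", counts["tr"])
```

If the tbody count for the real response turns out to be zero, an XPath without that step, such as //div[@id='TableWithRules']//tr, might match the rows directly in Scrapy. If tbody is present in the raw HTML and the query is still empty, the cause is something else and the Selenium approach above remains the safer route.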