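I am trying to scrape New York Times search results using pure Python and Selenium (via rpaframework), but I am not getting the right results. I need to get the title, date, and description of each result.
When I print the title using the code shown further down, I get this error: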
selenium.common.exceptions.InvalidArgumentException: Message: unknown variant `//h4[@class='css-2fgx4k']`, expected one of `css selector`, `link text`, `partial link text`, `tag name`, `xpath` at line 1 column 37
from RPA.Browser.Selenium import Selenium
# Search term
search_term = "climate change"
# Open the NY Times search page and search for the term
browser = Selenium()
browser.open_available_browser("https://www.nytimes.com/search?query=" + search_term)
# Find all the search result articles
articles = browser.find_elements("//ol[@data-testid='search-results']/li")
# Extract title, date, and description for each article and add to the list
for article in articles:
    # Extract the title
    title = article.find_element("//h4[@class='css-2fgx4k']")
    print(title)
# Close the browser window
browser.close_all_browsers()
Any help would be appreciated.
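I'm not an expert on the RPA framework, but have you considered simplifying your code like this? You may only need to target the `h4` heading tags of the search results: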
from selenium.webdriver.common.by import By
# After you get the search results with this command:
# browser.open_available_browser("https://www.nytimes.com/search?query=" + search_term)
# browser.driver is the underlying Selenium WebDriver, which accepts By locators
title_elements = browser.driver.find_elements(By.TAG_NAME, "h4")
for title_element in title_elements:
    print(title_element.text)
Disclaimer: I'm not sure whether the code above works, as I haven't tested it.
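For what it's worth, the error in the question appears to come from calling `find_element` on a plain Selenium `WebElement` with only the XPath. That method expects the locator strategy first and the selector second, so the XPath was read as the strategy. Here is a minimal sketch of a corrected loop, assuming `articles` is the list returned by the question's `browser.find_elements(...)` call. `extract_titles` is a hypothetical helper name, and the strategy is written as the plain string `"xpath"`, which is the value behind Selenium's `By.XPATH`:

```python
# Locator strategy as a plain string; selenium's By.XPATH has this same value.
XPATH = "xpath"


def extract_titles(articles):
    """Return the h4 text of each search-result element (hypothetical helper)."""
    titles = []
    for article in articles:
        # Strategy first, selector second. The leading "." keeps the XPath
        # relative to this article instead of searching the whole page.
        title_element = article.find_element(XPATH, ".//h4")
        titles.append(title_element.text)
    return titles
```

With the question's code this could be called as `print(extract_titles(articles))` before closing the browser. A result without any `h4` would still raise `NoSuchElementException` here, so treat it as a starting point only.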
With the Browserist package, which I have tested, you only need a few lines of code:
from browserist import Browser
from selenium.webdriver.common.by import By
search_term = "climate"
with Browser() as browser:
    browser.open.url("https://www.nytimes.com/search?query=" + search_term)
    title_elements = browser.get.elements_by_tag("h4")
    for title_element in title_elements:
        print(title_element.text)
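This is the result I get in the terminal:
Full disclosure: I'm the author of the Browserist package. Browserist is a lightweight, less verbose extension of the Selenium web driver that makes browser automation easier. Simply install the package with `pip install browserist` and try this: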
from browserist import Browser
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
search_term = "climate"
with Browser() as browser:
    browser.open.url("https://www.nytimes.com/search?query=" + search_term)
    search_result_elements = browser.get.elements("//ol[@data-testid='search-results']/li")
    for element in search_result_elements:
        try:
            title = element.find_element(By.TAG_NAME, "h4").text
            print(title)
        except NoSuchElementException:
            # Skip list items that have no h4 title
            pass
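Notes:
- I searched for `climate`, which yields more, but still related, results such as `climate crisis`. Change the search term as you see fit.
- Targeting the titles by their `h4` heading tag, rather than by CSS class values that may change over time, is easier and more reliable.
- The `try` and `except` clauses prevent breaking errors.
- To run a different browser, for example Firefox, pass `BrowserSettings`:
from browserist import Browser, BrowserType, BrowserSettings
...
with Browser(BrowserSettings(type=BrowserType.FIREFOX)) as browser: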
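If you'd rather not rely on `try`/`except`, one alternative is Selenium's plural `find_elements`, which returns an empty list when nothing matches. A rough sketch, assuming each search result is a standard Selenium `WebElement` (as the `element.find_element(By.TAG_NAME, ...)` call in the example suggests). `first_text_or_none` and `collect_titles` are hypothetical helper names, and `"tag name"` is the string value behind `By.TAG_NAME`:

```python
# Locator strategy as a plain string; selenium's By.TAG_NAME has this same value.
TAG_NAME = "tag name"


def first_text_or_none(element, tag):
    """Return the text of the first matching child tag, or None if absent."""
    # find_elements (plural) returns an empty list instead of raising
    # when nothing matches, so no try/except is needed.
    matches = element.find_elements(TAG_NAME, tag)
    return matches[0].text if matches else None


def collect_titles(search_result_elements):
    """Collect h4 titles, skipping results that have no h4 at all."""
    titles = []
    for element in search_result_elements:
        title = first_text_or_none(element, "h4")
        if title is not None:
            titles.append(title)
    return titles
```

In the Browserist example, `collect_titles(search_result_elements)` could stand in for the `for` loop. The same helper could be reused for the date and description once you know which tags hold them.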
That's what I've got, and I hope you find it useful. Let me know if you have any questions.