I'm learning Python and trying to scrape the search results from python.org. I'm using Selenium.
The steps I want to do.
My code:
from selenium import webdriver
import time
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
#waiting to find the element before throwing error no element found
driver.implicitly_wait(10)
#driver.maximize_window()
#getting the website
driver.get("https://www.python.org/")
driver.implicitly_wait(5)
#finding element by id
driver.find_element_by_id("id-search-field").send_keys("arrays")
driver.find_element_by_id("submit").click()
print("Test Successful")
SearchResults = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul")
print(SearchResults.text)
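-> This prints all of the results.
Now I want the individual result items and their titles. When I inspect the search results on the site, I get something like this: <a href="/dev/peps/pep-0209/">PEP 209 -- Multi-dimensional Arrays</a>
There is no tag, class, or name I can use.
How can I get all of the titles this way?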
Your SearchResults uses a static XPath that points to the "main" tag, the one containing the list of results you want:
SearchResults = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul")
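If you inspect the search results page, you will see that this UL tag contains several <li> elements, each with an <h3> whose <a> holds the headline. From what you asked, I assume those are the elements you want to capture, so you could try: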
SearchResults = driver.find_elements_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul/li[*]/h3/a")
Or something along these lines:
SearchResults = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul")
ChildResults = SearchResults.find_elements_by_xpath('.//*')
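I haven't actually tested this code, but the idea should work, at least for a first try with Selenium. My main point is this: you are trying to read a list of elements while locating only their parent; you need to go one step further and locate the children.
That said, I do recommend searching online for best practices on XPath and element lookup in Selenium, because those huge static XPaths become a nightmare in the long run. The more flexible your element locators are, the easier your code is to maintain and the more robust it will be.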
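To make the parent-versus-children point concrete without launching a browser, here is a minimal sketch that uses Python's standard xml.etree.ElementTree on a hand-written fragment. The markup below is an assumption modeled on the structure described above (ul > li > h3 > a), not a copy of the live page, and ElementTree supports only a subset of XPath. The idea of a short relative path from the parent carries over to Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that mimics the structure described above.
SAMPLE = """
<ul class="list-recent-events menu">
  <li><h3><a href="/dev/peps/pep-0209/">PEP 209 -- Multi-dimensional Arrays</a></h3><p>first summary</p></li>
  <li><h3><a href="/dev/peps/pep-0207/">PEP 207 -- Rich Comparisons</a></h3><p>second summary</p></li>
</ul>
""".strip()

# Selecting only the parent yields a single element: the whole list.
parent = ET.fromstring(SAMPLE)

# A short relative path from that parent reaches every headline link.
links = parent.findall("./li/h3/a")
titles = [a.text for a in links]
hrefs = [a.get("href") for a in links]

print(titles)
print(hrefs)
```

The relative path depends only on the list's internal structure, so it keeps working if the elements above the list are rearranged.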
Can you try this? Instead of XPath, try a CSS selector and break each element down.
from selenium import webdriver
import json
import time
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
# Getting the website
driver.get("https://www.python.org/")
# Finding element by id
driver.find_element_by_id("id-search-field").send_keys("arrays")
driver.find_element_by_id("submit").click()
print("Test Successful")
for elem in driver.find_elements_by_css_selector("section.main-content ul li"):
    elem_data = {
        'title': elem.find_element_by_css_selector("h3").text,
        'content': elem.find_element_by_css_selector("p").text,
        'link': elem.find_element_by_css_selector("h3 a").get_attribute('href'),
    }
    print(json.dumps(elem_data, indent=4))
    # Stop after the first result; remove this break to print every result
    break
# {
# "title": "PEP 209 -- Multi-dimensional Arrays",
# "content": "...arrays comprised of simple types, like numeric. How are masked-arrays implemented? Masked-arrays in Numeric 1 are implemented as a separate array class. With the ability to add new array types to Numeric 2, it is possible that masked-arrays in Numeric 2 could be implemented as a new array type instead of an array class. How are numerical errors handled (IEEE floating-point errors in particular)? It is not clear to the proposers (Paul Barrett and Travis Oliphant) what is the best or preferre...",
# "link": "https://www.python.org/dev/peps/pep-0209/"
# }
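If every result is needed rather than only the first, the same selectors can feed a list. The sketch below is one possible variant under stated assumptions: collect_results is a hypothetical helper name, it gathers only the title and link, and it passes the locator strategy as the plain string "css selector" (the value of By.CSS_SELECTOR) so that it works with any object exposing find_elements and find_element.

```python
def collect_results(driver):
    """Return one dict per search result found on the current page."""
    results = []
    # Same list-item selector as in the loop above.
    for elem in driver.find_elements("css selector", "section.main-content ul li"):
        heading = elem.find_element("css selector", "h3")
        link = elem.find_element("css selector", "h3 a")
        results.append({
            "title": heading.text,
            "link": link.get_attribute("href"),
        })
    return results
```

With a live session this would be called as collect_results(driver) once the search has been submitted.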
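You can use Selenium's selector methods if you prefer.
Personally, I like to write the extraction in JavaScript, inject it, and return the results.
Have a JavaScript file with the following content: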
return (() => {
    const parsed_results = [];
    const search_results = document.getElementsByClassName('list-recent-events')[0].children;
    for (let i = 0; i < search_results.length; i++) {
        const result = search_results[i];
        const text = result.innerText;
        const title = result.getElementsByTagName('a')[0].innerText;
        const href = 'https://www.python.org' + result.getElementsByTagName('a')[0].getAttribute('href');
        parsed_results.push([title, text, href]);
    }
    return parsed_results;
})();
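You can use it like this, after the page has loaded:
search_results = driver.execute_script(open('path/to/file.js').read())
Then you can loop over the results in Python as usual:
for r in search_results:
    # each row is [title, text, href], matching the order pushed in the JavaScript
    title = r[0]
    text = r[1]
    href = r[2]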
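Because each row comes back as a plain list in the order [title, text, href], index-based access is easy to get wrong. Here is a small sketch of one way to name the fields instead. Both rows_to_dicts and the sample row are hypothetical; the sample stands in for whatever execute_script would return.

```python
def rows_to_dicts(rows):
    """Map [title, text, href] rows to dictionaries with named keys."""
    keys = ("title", "text", "href")
    return [dict(zip(keys, row)) for row in rows]

# Hypothetical stand-in for the list returned by driver.execute_script(...)
sample_rows = [
    [
        "PEP 209 -- Multi-dimensional Arrays",
        "PEP 209 -- Multi-dimensional Arrays\n...summary text...",
        "https://www.python.org/dev/peps/pep-0209/",
    ],
]

for result in rows_to_dicts(sample_rows):
    print(result["title"], "->", result["href"])
```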
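To print the titles of all the individual search results using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR: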
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.list-recent-events.menu li>h3>a")))])
Using XPATH:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='list-recent-events menu']//li/h3/a")))])
Console output:
['PEP 209 -- Multi-dimensional Arrays', 'PEP 207 -- Rich Comparisons', 'PEP 335 -- Overloadable Boolean Operators', 'PEP 535 -- Rich comparison chaining', 'Python Success Stories', 'PEP 574 -- Pickle protocol 5 with out-of-band data', 'Parade of the PEPs', 'PEP 3118 -- Revising the buffer protocol', 'PEP 465 -- A dedicated infix operator for matrix multiplication', 'PEP 358 -- The "bytes" Object', 'PEP 225 -- Elementwise/Objectwise Operators', 'Highlights: Python 2.4', 'PEP 211 -- Adding A New Outer Product Operator', 'EDU-SIG: Python in Education', 'PEP 204 -- Range Literals', 'PEP 455 -- Adding a key-transforming dictionary to collections', 'PEP 252 -- Making Types Look More Like Classes', 'PEP 586 -- Literal Types', 'PEP 579 -- Refactoring C functions and methods', 'PEP 3116 -- New I/O']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC