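I am scraping links with Selenium. I can print the links in a loop, but I cannot navigate to them and collect all the information, because I get the following error: Message: The element reference is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed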
from selenium import webdriver
author=[]
MAX_PAGE_NUM = 2
url=r"C:\\Users\\PERSONL\\Downloads\\geckodriver-v0.26.0-win64\\geckodriver.exe"
driver=webdriver.Firefox(executable_path=url)
with open('results.csv', 'w') as f:
    f.write("Name")
for i in range(1, MAX_PAGE_NUM + 1):
    url = url = "https://www.oddsportal.com/soccer/england/premier-league-2017-2018/results/" + "#/page/" + str(i)
    driver.get(url)
    names = driver.find_elements_by_xpath('//td[@class="name table-participant"]')
    num_page_items = len(names)
    with open('results.csv', 'a') as f:
        for i in range(num_page_items):
            author.append(names[i].text)
            f.write(names[i].text)
driver.close()
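I made a few adjustments to the script.
The key to avoiding the StaleElementReferenceException is to let the table load before collecting names. To do that, use a WebDriverWait on the visibility of the table element.
You can also iterate over names directly, without indexing (see the "for name in names:" line). I also added a .rstrip(), which removes any trailing whitespace from the collected text. You can take it out and look at the resulting .csv to see why it is needed.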
author=[]
MAX_PAGE_NUM = 2
with open('resultss.csv', 'w') as f:
    f.write("Name\n")
for i in range(1, MAX_PAGE_NUM + 1):
    url = "https://www.oddsportal.com/soccer/england/premier-league-2017-2018/results/" + "#/page/" + str(i)
    driver.get(url)
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'table#tournamentTable')))
    names = driver.find_elements_by_xpath('.//td[@class="name table-participant"]')
    print(len(names))
    print(names[0].text)
    with open('resultss.csv', 'a') as f:
        for name in names:
            author.append(name.text.rstrip())
            f.write(name.text.rstrip()+"\n")
driver.close()
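The effect of .rstrip() and of iterating without an index can be tried without a browser. The following is only a small sketch: write_names is a hypothetical helper, and the sample strings are made up to imitate cell text that carries trailing whitespace; they are not real scraped output.

```python
import io


def write_names(out, raw_names):
    """Write one cleaned name per line to `out` and return the cleaned list.

    `out` is any writable text file object; `raw_names` is any iterable of
    strings (for example the .text values of the scraped cells).
    """
    cleaned = []
    for name in raw_names:           # iterate directly, no index needed
        text = name.rstrip()         # drop trailing spaces and newlines
        cleaned.append(text)
        out.write(text + "\n")       # exactly one newline per row
    return cleaned


# Made-up sample values standing in for scraped cell text.
sample = ["Burnley - Bournemouth \n", "Crystal Palace - West Brom"]
buffer = io.StringIO()               # in-memory stand-in for the .csv file
author = write_names(buffer, sample)
print(repr(buffer.getvalue()))
```

Without the .rstrip() call, the first sample row would be followed by a blank line in the output, which is the kind of artifact the answer suggests looking for in the .csv.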
WebDriverWait requires these imports:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
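Recent Selenium 4 releases no longer provide find_elements_by_xpath, so on a current installation the same approach may need to be written with find_elements(By.XPATH, ...). The sketch below assumes Selenium 4 and reuses the selectors from the script above; page_url and scrape_names are hypothetical helper names, not part of Selenium, and driver is assumed to be an already-created WebDriver instance.

```python
def page_url(base_url, page):
    """Build the URL of one results page (the page is chosen via the fragment)."""
    return f"{base_url}#/page/{page}"


def scrape_names(driver, url, timeout=10):
    """Load `url`, wait for the results table, and return the names as strings."""
    # Imported here so page_url() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Wait until the table is visible before looking up any cells.
    WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "table#tournamentTable"))
    )
    cells = driver.find_elements(By.XPATH, '//td[@class="name table-participant"]')
    # Read the text right away so no element references are kept around.
    return [cell.text.rstrip() for cell in cells]


BASE_URL = "https://www.oddsportal.com/soccer/england/premier-league-2017-2018/results/"
print(page_url(BASE_URL, 2))
```

Returning plain strings instead of WebElement objects means nothing collected on one page is touched again after the next driver.get call, which is where a stale reference would otherwise surface.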