I need to scrape this webpage: https://www.mef.gov.it/. When I try to run the following code, which types a query into the search bar:
# find searchbar
click_on_search = driver.find_element(By.ID, "search-button")
click_on_search.click()
print('Search:', word)
searchbar = driver.find_element(By.ID, "strinput")
# put keyword in searchbar and press ENTER
searchbar.send_keys(word)
searchbar.send_keys(Keys.ENTER)
time.sleep(5) # wait for results
I get the following error:
ElementClickInterceptedException: element click intercepted: Element <a id="search-button" class="search-link rounded-icon" aria-label="Funzione di ricerca sul sito. Il sistema è basato sul motore di ricerca esterno di Google" href="#" data-bs-toggle="modal" data-bs-target="#search-modal" title="Ricerca">...</a> is not clickable at point (1809, 111). Other element would receive the click: <div class="cb-dialog-overlay"></div>
I am using this approach to click the cookie banner:
try:
    print('Clicking cookie banner')
    cookie_banner = driver.find_element(By.ID, "cb-close")
    cookie_banner.click()
except Exception as e:
    print('Exception:', e)
But as you can see from the error, the click is intercepted and I can't scrape the page. Can anyone help me?
Finally, I think it is useful to attach the full code, since I'm not very experienced with web scraping. The code below may also contain errors:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time
# ---
import selenium
print('Selenium:', selenium.__version__)
# ---
def scrape_page(driver, keyword):
    try:
        try:
            print('Clicking cookie banner')
            cookie_banner = driver.find_element(By.ID, "cb-close")
            cookie_banner.click()
        except Exception as e:
            print('Exception:', e)
        elements_dt = driver.find_elements(By.XPATH, "//div[@class='gs-title']")
        #elements_dd = driver.find_elements(By.XPATH, "//dl[@class='sample-list.results']/dd/a")
        print('[DEBUG] len(elements_dt):', len(elements_dt))
        # List to store the extracted data
        data = []
        # Iterate over the result elements
        #for index, (element_dt, element_dd) in enumerate(zip(elements_dt, elements_dd), 1):  # use `enumerate(..., 1)` to start `index` at `1`
        for index, element in enumerate(elements_dt, 1):  # use `enumerate(..., 1)` to start `index` at `1`
            try:
                article_url = element.find_element(By.XPATH, './/a').get_attribute("href")
                article_title = element.text
                # ... DON'T CLICK LINKS BECAUSE IT WILL REMOVE THE CURRENT PAGE FROM MEMORY
                # ... AND YOU WILL LOSE ACCESS TO THE OTHER `elements` ON THE CURRENT PAGE
                # ...
                # ... Get `href` and later (after the loop) use `.get(href)` to access subpages.
                data.append({
                    'keyword': keyword,
                    'Titolo': article_title,
                    'URL': article_url,
                    #'Data': article_date,
                    #'Contenuto': article_content
                })
                print('[DEBUG] data:', data[-1])
                # Go back to the previous page
                #driver.back()
            except Exception as e:
                print("Error while clicking the element:", e)
        # work with subpages
        # for item in data:
        #     print('[DEBUG] subpage:', item['URL'])
        #     driver.get(item['URL'])
        #     #article_date = ...
        #     #article_content = ...
        #     #item['Data'] = article_date
        #     #item['Contenuto'] = article_content
    except Exception as e:
        print("Error while scraping the page:", e)
        return None
    return data
# --- main ---
driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)
# ---
start_url = "https://www.mef.gov.it/index.html"
all_data = []
keywords = ['big data', 'machine learning', 'algoritm', 'calcolo', 'punteggio', 'predittiv', 'cloud', 'statistic',
'automa', 'internet delle cose', 'intelligenza artificiale']
for word in keywords:
    print("Main Page:", start_url)
    # open main page
    driver.get(start_url)
    # find searchbar
    click_on_search = driver.find_element(By.ID, "search-button")
    click_on_search.click()
    print('Search:', word)
    searchbar = driver.find_element(By.ID, "strinput")
    # put keyword in searchbar and press ENTER
    searchbar.send_keys(word)
    searchbar.send_keys(Keys.ENTER)
    time.sleep(5)  # wait for results
    # get current url (because it could load a different URL to show the results)
    search_results_url = driver.current_url
    # start scraping results (with pagination):
    #while True:  # try to get all pages
    for _ in range(999):  # safety limit on the number of result pages
        print("Scraping:", search_results_url)
        page_data = scrape_page(driver, word)  # <--- only scraping, without `.get(url)`; `word` is passed only to add it to `data`
        if page_data:
            all_data.extend(page_data)
        driver.get(search_results_url)  # go back to the results after visiting subpages - to get the link to the next page
        try:
            next_page_link = driver.find_element(By.XPATH, "//div[@class='gs-cursor-page']")
            # search_results_url = next_page_link.get_attribute("href")
            # driver.get(search_results_url)  # <--- open the next page of results using its URL
            next_page_link.click()  # <--- or click the link
        except Exception as e:
            print('[DEBUG] Exception:', e)
            print('[DEBUG] break')
            #input('Press ENTER to continue')
            break  # exit loop
driver.quit()
import pandas as pd
df = pd.DataFrame(all_data)
print(df)
input("Press ENTER to close")
df.to_excel('miur_scrape.xlsx')
<div class="cb-dialog-overlay"></div>
The element above is intercepting the click on your search element. You can get rid of it by closing the cookie popup.
Although your code for closing the cookie popup looks fine, you may want to add a Selenium wait so the click() is handled reliably.
Change this code:
cookie_banner = driver.find_element(By.ID, "cb-close")
cookie_banner.click()
to:
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "cb-close"))).click()
Required imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC