Cookie banner click error when enabling the search bar in Selenium


I need to scrape the webpage https://www.mef.gov.it/, and when I run the following code, which types a query into the search bar:

# find searchbar
click_on_search = driver.find_element(By.ID, "search-button")
click_on_search.click()
print('Search:', word)
searchbar = driver.find_element(By.ID, "strinput")
# put keyword in searchbar and press ENTER
searchbar.send_keys(word)
searchbar.send_keys(Keys.ENTER)

time.sleep(5) # wait for results

I get the following error:

ElementClickInterceptedException: element click intercepted: Element <a id="search-button" class="search-link rounded-icon" aria-label="Funzione di ricerca sul sito. Il sistema è basato sul motore di ricerca esterno di Google" href="#" data-bs-toggle="modal" data-bs-target="#search-modal" title="Ricerca">...</a> is not clickable at point (1809, 111). Other element would receive the click: <div class="cb-dialog-overlay"></div>

I am using this code to click (dismiss) the cookie banner:

try:
    print('Clicking cookie banner')            
    cookie_banner = driver.find_element(By.ID, "cb-close")
    cookie_banner.click()
except Exception as e:
    print('Exception:', e)

But as you can see in the error, the click is intercepted and I cannot scrape the page. Can someone help me?
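
(To double-check which element is actually on top at the point from the traceback, a small diagnostic like the sketch below can be used; the coordinates 1809, 111 are copied from the error message:)

# Sketch: ask the browser what sits at the point reported in the error
blocking = driver.execute_script(
    "return document.elementFromPoint(arguments[0], arguments[1]);", 1809, 111)
if blocking is not None:  # elementFromPoint returns null if the point is off-screen
    print(blocking.get_attribute("outerHTML"))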

Finally, I think it is useful to attach the full code, since I am not very experienced with web scraping; the code below may contain other errors too:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC
#from selenium.common.exceptions import NoSuchElementException, TimeoutException

import time

# ---

import selenium
print('Selenium:', selenium.__version__)

# ---

def scrape_page(driver, keyword):
    try:
        # dismiss the cookie banner if it is present
        try:
            print('Clicking cookie banner')
            cookie_banner = driver.find_element(By.ID, "cb-close")
            cookie_banner.click()
        except Exception as e:
            print('Exception:', e)

        
        elements_dt = driver.find_elements(By.XPATH, "//div[@class='gs-title']")
        #elements_dd = driver.find_elements(By.XPATH, "//dl[@class='sample-list.results']/dd/a")
        
        print('[DEBUG] len(elements_dt):', len(elements_dt))
        # List to store the extracted data
        data = []

        # Go through each result element
        #for index, element_dt, element_dd in enumerate(zip(elements_dt, elements_dd), 1):  # you can use `enumerate(..., 1)` to start `index` with `1`
        for index, element in enumerate(elements_dt, 1):  # `enumerate(..., 1)` starts `index` at 1
            
            try:
                article_url = element.find_element(By.XPATH, './/a').get_attribute("href")
                article_title = element.text
                
                # ... DON'T CLICK LINKS BECAUSE IT WILL REMOVE THE CURRENT PAGE FROM MEMORY
                # ... AND YOU WILL LOSE ACCESS TO THE OTHER `elements` ON THE CURRENT PAGE
                # ...
                # ... Get `href` and later (after loop) use `.get(href)` to access subpages. 
                
                data.append({
                    'keyword': keyword,
                    'Titolo': article_title, 
                    'URL': article_url, 
                    #'Data': article_date, 
                    #'Contenuto': article_content
                })
                
                print('[DEBUG] data:', data[-1])
                # Go back to the previous page
                #driver.back()
            except Exception as e:
                print("Errore durante il clic sull'elemento:", e)
                
        # work with subpages

        # for item in data:
        #     print('[DEBUG] subpage:', item['URL'])
        #     driver.get(item['URL'])
        #     #article_date = ...
        #     #article_content = ...
        #     #item['Data'] = article_date
        #     #item['Contenuto'] = article_content
             
    except Exception as e:
        print("Errore durante lo scraping della pagina:", e)
        return None

    return data

# --- main ---

driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)

# ---

start_url = "https://www.mef.gov.it/index.html"

all_data = []

keywords = ['big data', 'machine learning', 'algoritm', 'calcolo', 'punteggio', 'predittiv', 'cloud', 'statistic',
            'automa', 'internet delle cose', 'intelligenza artificiale']

for word in keywords:

    print("Main Page:", start_url)

    # open main page 
    driver.get(start_url)

    # find searchbar
    click_on_search = driver.find_element(By.ID, "search-button")
    click_on_search.click()
    print('Search:', word)
    searchbar = driver.find_element(By.ID, "strinput")
    # put keyword in searchbar and press ENTER
    searchbar.send_keys(word)
    searchbar.send_keys(Keys.ENTER)
    
    time.sleep(5) # wait for results
    
    # get the current URL (the results may be shown under a different URL)
    search_results_url = driver.current_url
    
    # start scraping results (with pagination):
    #while True:  # try to get all pages
    for _ in range(999):  # hard safety limit so the loop cannot run forever
        print("Scraping:", search_results_url)
        
        page_data = scrape_page(driver, word)  # <--- only scraping, without `.get(url)`, I send `word` only to add it to `data`
        
        if page_data:
            all_data.extend(page_data)

        driver.get(search_results_url) # go back to the results page after visiting subpages - to get the link to the next page
        
        try:
            # `find_element` (singular) - a list returned by `find_elements` has no .click();
            # also the XPath was missing `=` after @class
            next_page_link = driver.find_element(By.XPATH, "//div[@class='gs-cursor-page']")
            # search_results_url = next_page_link.get_attribute("href")
            # driver.get(search_results_url)  # <--- open next page with results using URL
            next_page_link.click()   # <--- or click the link; note this matches the first page div, you may need to select the *next* one
        except Exception as e:
            print('[DEBUG] Exception:', e)
            print('[DEBUG] break')
            #input('Press ENTER to continue')
            break  # exit loop
            
driver.quit()

import pandas as pd
df = pd.DataFrame(all_data)
print(df)

input("Press ENTER to close")

df.to_excel('miur_scrape.xlsx')
1 Answer
<div class="cb-dialog-overlay"></div>

The element above is intercepting your click on the search element. You can get rid of it by closing the cookie popup.
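
If you want to be sure the overlay is really gone before interacting with the page, one option (a sketch, assuming the overlay keeps the cb-dialog-overlay class from your traceback) is an explicit invisibility wait:

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the cookie overlay from the traceback to disappear
WebDriverWait(driver, 10).until(
    EC.invisibility_of_element_located((By.CLASS_NAME, "cb-dialog-overlay")))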

While your code for closing the cookie popup looks fine, you may need to add a Selenium wait so that the click() is handled reliably.

Change the following code:

cookie_banner = driver.find_element(By.ID, "cb-close")
cookie_banner.click()

to:

WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.ID, "cb-close"))).click()
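
Note that element_to_be_clickable only waits for the element to be visible and enabled; it does not guarantee that no other element is covering it, so if the overlay fades out slowly you may still want the invisibility wait sketched above.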

Required imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
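
If the click is still intercepted now and then (for example while the overlay is animating out), a common fallback, sketched below under the same imports, is to dispatch the click with JavaScript, which bypasses Selenium's check for covering elements:

from selenium.common.exceptions import ElementClickInterceptedException

try:
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, "cb-close"))).click()
except ElementClickInterceptedException:
    # Fallback: fire the click inside the browser itself, ignoring
    # whichever element Selenium thinks is on top
    close_button = driver.find_element(By.ID, "cb-close")
    driver.execute_script("arguments[0].click();", close_button)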