Scraping a specific category of a hierarchical website


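I am trying to scrape the following page: "https://esco.ec.europa.eu/en/classification/skill_main". Specifically, I want to click every plus button under S - skills until there are no more plus buttons left to click, and then save the page source. When inspecting the page, I found that the plus buttons sit under the CSS selector ".api_hierarchy.has-child-link", so I tried the following: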


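import time

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager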

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://esco.ec.europa.eu/en/classification/skill_main")
driver.implicitly_wait(10)

wait = WebDriverWait(driver, 20)

# Define a function to click all expandable "+" buttons
def click_expand_buttons():
    while True:
        try:
            # Find all expandable "+" buttons
            expand_buttons = wait.until(EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, ".api_hierarchy.has-child-link"))
            )

            # If no expandable buttons are found, we are done
            if not expand_buttons:
                break

            # Click each expandable "+" button
            for button in expand_buttons:
                try:
                    driver.implicitly_wait(10)
                    driver.execute_script("arguments[0].click();", button)
                    # Wait for the dynamic content to load
                    time.sleep(1)
                except StaleElementReferenceException:
                    # If the element is stale, we find the elements again
                    break
        except StaleElementReferenceException:
            continue

# Call the function to start clicking "+" buttons
click_expand_buttons()

html_source = driver.page_source

# Save the HTML to a file
with open("/Users/federiconutarelli/Desktop/escodata/expanded_esco_skills_page.html", "w", encoding="utf-8") as file:
    file.write(html_source)

# Close the browser
driver.quit()

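However, the code above keeps closing and reopening the first-level "+" buttons. With my limited scraping knowledge, I suspect this is because I am only telling Selenium to click plus buttons for as long as any plus button exists, so once the page returns to its original state the script simply carries on forever. So my question is: how can I open all the plus buttons (until none remain) only for S - skills: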

<a href="#overlayspin" class="change_right_content" data-version="ESCO dataset - v1.1.2" data-link="http://data.europa.eu/esco/skill/335228d2-297d-4e0e-a6ee-bc6a8dc110d9" data-id="84527">S - skills</a>

Thank you in advance, and sorry that I could not get any further on my own, but I think I have hit the limit of my scraping knowledge.

python html selenium-webdriver web-scraping
1 Answer

I think this will help you, although I have not tested it. You clearly put effort into your own code.

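I am more familiar with XPath, so I changed the CSS selector to an XPath expression that only matches the plus buttons inside the "S - skills" block.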

The rest of your code should stay the same and still work.

# Find all expandable "+" buttons
expand_buttons = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//div[@class='main_item classification_item' and ./a[text()='S - skills']]//span[@class='api_hierarchy has-child-link']"))
    )
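
Scoping the selector alone may not stop the open/close toggling described in the question, because a button that was already clicked is still matched on the next pass. Below is a minimal, untested-against-the-live-site sketch of one way to handle that: remember which nodes were already clicked and re-query after every click. The names `scoped_button_xpath`, `expand_all`, `find_buttons`, `click`, and `key_of` are hypothetical helpers introduced here for illustration; they are not part of Selenium or of the original code. The loop is written against plain callables so it can be tried without a browser.

```python
from typing import Callable, Hashable, List, Set, TypeVar

T = TypeVar("T")


def scoped_button_xpath(label: str) -> str:
    """Build the XPath used in the answer for a given top-level label.

    Hypothetical helper. Assumes the label contains no single quote,
    since it is embedded in a single-quoted XPath string literal.
    """
    return (
        "//div[@class='main_item classification_item' "
        f"and ./a[text()='{label}']]"
        "//span[@class='api_hierarchy has-child-link']"
    )


def expand_all(
    find_buttons: Callable[[], List[T]],
    click: Callable[[T], None],
    key_of: Callable[[T], Hashable],
    max_rounds: int = 10_000,
) -> int:
    """Click every matched button exactly once; return the number of clicks.

    find_buttons: returns the buttons currently matched (may be empty).
    click:        performs the click on one button.
    key_of:       returns a stable identifier for a button, so that a
                  button is recognised again after the page re-renders.
    max_rounds:   safety bound so the loop cannot run forever.
    """
    clicked: Set[Hashable] = set()
    total = 0
    for _ in range(max_rounds):
        # Re-query on every round: earlier references may be stale
        # once a click has changed the page.
        pending = [b for b in find_buttons() if key_of(b) not in clicked]
        if not pending:
            break  # everything matched has already been clicked once
        button = pending[0]
        clicked.add(key_of(button))  # record the key before clicking
        click(button)
        total += 1
    return total
```

With Selenium, `find_buttons` could be `lambda: driver.find_elements(By.XPATH, scoped_button_xpath("S - skills"))` and `click` could wrap the `driver.execute_script("arguments[0].click();", button)` call from the question. Using `find_elements` here matters: `wait.until(EC.presence_of_all_elements_located(...))` raises a `TimeoutException` instead of returning an empty list, so the `if not expand_buttons` check in the question is never reached. What `key_of` should read (for example some identifying attribute on or near the span) depends on the page's markup, which is not confirmed here.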