Selenium WebDriver skips some links when iterating over a df - python

Question

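I am trying to use Selenium with Python to scrape data from about 1,800 web pages that share a similar table format. I have a DataFrame containing the link to each page I need. However, when I use Selenium to collect the data from each page (namely, whether particular products are present), the program simply skips some of the links.

Here is an example page with the data I am working with (https://www.xpel.com/clearbra-installers/united-states/arizona/tempe). For each of the 5 stores on this page, I extract the store name, the address, and the class names of the 4 available products (each class name marks the product as "active" or inactive).

I tried the following solution, which takes each link from df_in, navigates to the page, finds the number of stores for that city, and extracts the required data for each store.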

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# driver, wait (a WebDriverWait), df_in and df_out are assumed to be created earlier
for link in df_in['Link']:
    # get link, then wait for page to load
    driver.get(link)
    wait.until(lambda d: d.execute_script('return document.readyState') == 'complete')
    # find the number of stores
    dealership_options = driver.find_elements(By.CLASS_NAME, "dealer-list-cell")
    num_dealership_options = len(dealership_options)

    for i in range(1, num_dealership_options + 1):
        # for each store, collect the name, address, and products
        name = driver.find_element(By.XPATH, "(//div[@class='dealer-list-cell-name'])[" + str(i) + "]")
        address = driver.find_element(By.XPATH, "(//div[@class='dealer-list-cell-address'])[" + str(i) + "]")
        wait.until(EC.presence_of_element_located((By.XPATH, "(//div[@class='dealer-list-cell-xpel-logos'])[" + str(i) + "]/div[1]")))
        p1 = driver.find_element(By.XPATH, "(//div[@class='dealer-list-cell-xpel-logos'])[" + str(i) + "]/div[1]")
        p2 = driver.find_element(By.XPATH, "(//div[@class='dealer-list-cell-xpel-logos'])[" + str(i) + "]/div[2]")
        p3 = driver.find_element(By.XPATH, "(//div[@class='dealer-list-cell-xpel-logos'])[" + str(i) + "]/div[3]")
        p4 = driver.find_element(By.XPATH, "(//div[@class='dealer-list-cell-xpel-logos'])[" + str(i) + "]/div[4]")

        # add the new data to df_out
        new_row = {"Link": link, 
                   "Address": address.text, 
                   "Name": name.text, 
                   "P1": p1.get_attribute("class"), 
                   "P2": p2.get_attribute("class"), 
                   "P3": p3.get_attribute("class"), 
                   "P4": p4.get_attribute("class")}
        df_out = pd.concat([df_out, pd.DataFrame([new_row])], ignore_index=True)

        # print the link, just to keep track of where we are
        print(link + " is done")

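Most of the time, this program works fine. However, every once in a while it seems to get stuck: it stops printing "link is done" for about 2 minutes, then resumes with a link much further down the list. For example, in one run it printed ".../new-mexico/clovis is done", printed nothing for a few minutes, and then printed ".../united-states/south-carolina/richland is done". In the meantime it must have run through everything from New York to Rhode Island (in alphabetical order), but it never printed those names, collected their data, or added them to df_out.

I have already tried using wait.until. I also tried adding explicit waits. I even tried working through a small portion of the list at a time. Nothing seems to help with these seemingly random blocks of states.

What could be happening? Is this a wait-time issue? If so, why does it keep running in the background and return to normal after a few minutes?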

Answer 1

You could try using the Beautiful Soup library instead of Selenium for the scraping. Give it a try; hopefully it resolves the issue.
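
Whichever library does the parsing, it may help to first record which links produced no rows. In the question's loop, the print call sits inside the per-store loop, so a page where find_elements returns an empty list prints nothing and adds nothing. One possible cause, which is an assumption about this site rather than something confirmed here, is that the store cells are rendered after document.readyState is already 'complete'. The sketch below is one way to make such pages visible and retry them. The names scrape_links, scrape_one and max_attempts are hypothetical; scrape_one stands for a function that wraps the body of the question's loop and returns one dict per store.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def scrape_links(
    links: List[str],
    scrape_one: Callable[[str], List[Row]],
    max_attempts: int = 3,
) -> Tuple[List[Row], List[str]]:
    """Scrape every link; return (rows, links that never produced a row)."""
    rows: List[Row] = []
    empty_links: List[str] = []
    for link in links:
        found: List[Row] = []
        for attempt in range(1, max_attempts + 1):
            try:
                found = scrape_one(link)
            except Exception as exc:  # e.g. a timeout raised by the scraper
                print(f"{link}: attempt {attempt} failed ({exc!r})")
                found = []
            if found:
                break  # at least one store was collected, stop retrying
        if found:
            rows.extend(found)
            print(f"{link} is done ({len(found)} stores)")
        else:
            # Report the link instead of skipping it silently.
            empty_links.append(link)
            print(f"{link} produced no rows after {max_attempts} attempts")
    return rows, empty_links
```

The returned rows can be turned into a DataFrame once at the end with pd.DataFrame(rows), and empty_links shows exactly which pages need another look.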
