Scraping apartments with Selenium/Python - can't scrape a new tab?


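With the help of my humble assistant ChatGPT, I put together a small data collector that scrapes Apartments.com and finds the names, price ranges, numbers, and so on of all the apartment complexes in my city.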

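It partly works, but it can't seem to find the ".phoneNumber" CSS selector on the new tab. I've tried having it look for various obvious CSS, HTML, and href targets in several different ways. As soon as it leaves the main tab, it seems unable to find anything.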

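I admit I'm very inexperienced with coding and have never put together anything complex, but it looks like it should work. I'd really appreciate some help! The output and code are below: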

Console log:

beginning pagination
Park Wilshire
2424 Wilshire Blvd, Los Angeles, CA 90057
$1,495 - 2,870
Studio - 1 Bed
Traceback (most recent call last):
  File "C:\Users\...\aptScraper\main.py", line 1108, in <module>
    phone_link = driver.find_element(By.XPATH, "//a[contains(@class,'.phoneNumber')]")

And now the actual code:

from selenium import webdriver
import csv
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
import re

# Open a browser and navigate to apartments.com
driver = webdriver.Chrome()
driver.get("https://www.apartments.com/los-angeles-ca/")

# Find the search box and input "Los Angeles Ca"
search_box = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "searchBarLookup")))
search_box.send_keys("Los Angeles, CA")

# Click the search button
search_box.send_keys(Keys.RETURN)

# Wait for the first page of listings to load
apartments = WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".placard")))
print("beginning pagination")

# Store the information in a CSV file
with open('apartments.csv', mode='w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Address', 'Rent', 'Bedrooms', 'Phone'])

    while True:
        for apartment in apartments:
            try:
                name = apartment.find_element(By.CSS_SELECTOR, ".property-title, .js-placardTitle")
                print(name.text)
            except:
                continue

            address = apartment.find_element(By.CSS_SELECTOR, ".property-address")
            print(address.text)

            try:
                rent = apartment.find_element(By.CSS_SELECTOR, ".property-pricing, .property-rents")
                print(rent.text)
            except:
                continue

            bedrooms = apartment.find_element(By.CSS_SELECTOR, ".property-beds")
            print(bedrooms.text)

            phone_number = None
            apartment_link = apartment.find_element(By.CSS_SELECTOR, ".property-link").get_attribute("href")
            driver.execute_script(f"window.open('{apartment_link}');")
            driver.switch_to.window(driver.window_handles[-1])
            time.sleep(1)
            # problem code is here
            phone_link = driver.find_element(By.XPATH, "//a[contains(@class,'.phoneNumber')]")

            if phone_link:
                phone_number = re.search(r'\d{10}', phone_link.get_attribute('href')).group()
                print("phone number found!", phone_number, " for: ", name.text)
                writer.writerow([name.text, address.text, rent.text, bedrooms.text, phone_number])
            else:
                print(f"No phone number found for {name.text} at {address.text}")

            driver.close()
            driver.switch_to.window(driver.window_handles[0])

        # Check if there is a next page button
        time.sleep(1)
        next_button = driver.find_element(By.CSS_SELECTOR, ".next")

        if "disabled" in next_button.get_attribute("class"):
            break

        next_button.click()
        apartments = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".placard")))

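And, needless to say, it crashes before finishing because it can't find a valid ".phoneNumber" entry. Yet when I inspect the page elements, it is definitely there. What gives?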
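One likely culprit in the failing lookup is the leading dot: `contains(@class,'.phoneNumber')` searches the class attribute for the literal text ".phoneNumber", but the dot belongs to CSS selector syntax and is not part of the attribute value. A minimal sketch of building the XPath without it follows. The helper name `xpath_for_class` is hypothetical, and it assumes the link really carries a class named `phoneNumber`, as the inspection described above suggests.

```python
def xpath_for_class(tag: str, class_name: str) -> str:
    """Build an XPath matching `tag` elements that carry `class_name` as a whole class token."""
    if not class_name or class_name.startswith(".") or " " in class_name:
        raise ValueError("pass a bare class name such as 'phoneNumber'")
    # Pad the normalized class attribute with spaces so 'phoneNumber' does not
    # also match a longer token such as 'phoneNumberLabel'.
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {class_name} ')]"
    )


# Hypothetical helper applied to the class name from the question.
phone_xpath = xpath_for_class("a", "phoneNumber")
print(phone_xpath)
```

With Selenium, the result could be passed as `driver.find_element(By.XPATH, phone_xpath)`. The CSS form `driver.find_element(By.CSS_SELECTOR, "a.phoneNumber")` expresses the same match more briefly, and there the dot is correct.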

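The tab for the desired apartment opens, but I can't figure out how to get Selenium to find the ".phoneNumber" element, or any element, in the new tab. Please advise.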
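Separately from locating the element, the script's `re.search(r'\d{10}', ...).group()` call would raise if the href is missing or the digits are split by punctuation. Below is a hedged sketch of a more forgiving parse. The function name `phone_from_href` is hypothetical, the sample numbers are made up, and it assumes ten-digit US numbers, possibly preceded by a country code.

```python
import re
from typing import Optional


def phone_from_href(href: Optional[str]) -> Optional[str]:
    """Return the last ten digits found in a tel: style href, or None."""
    if not href:
        return None
    # Drop everything that is not a digit: 'tel:', '+', dashes, spaces, parentheses.
    digits = re.sub(r"\D", "", href)
    if len(digits) < 10:
        return None
    # Keep the national number and ignore a leading country code (assumption).
    return digits[-10:]


print(phone_from_href("tel:+1-213-555-0100"))
print(phone_from_href(None))
```

Returning `None` instead of raising lets the existing `else` branch report a missing number and move on to the next listing.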

python python-3.x selenium-webdriver web-scraping export-to-csv
1 Answer

Solved!

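The problem was twofold. First, once I updated find_element() to point to a valid element on the page, the writer would not run without crashing, because the original element references from the listings page were lost.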

My solution was to use copy.deepcopy(rent, etc.) on anything that was lost in the transition.
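Another way to keep those values, offered as an assumption and not as what this answer did, is to read each element's `.text` into plain strings while the listings tab is still focused. Strings do not depend on which tab the driver is currently pointed at. The `Listing` class and `snapshot` function are hypothetical names, the stand-in objects replace real WebElements so the sketch runs without a browser, and the phone number is a placeholder.

```python
from dataclasses import astuple, dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class Listing:
    name: str
    address: str
    rent: str
    bedrooms: str


def snapshot(name_el, address_el, rent_el, bedrooms_el) -> Listing:
    """Copy the .text of each element into plain strings."""
    return Listing(name_el.text, address_el.text, rent_el.text, bedrooms_el.text)


# Stand-ins for WebElements, using the values from the console log above.
fake_elements = [
    SimpleNamespace(text=value)
    for value in (
        "Park Wilshire",
        "2424 Wilshire Blvd, Los Angeles, CA 90057",
        "$1,495 - 2,870",
        "Studio - 1 Bed",
    )
]

listing = snapshot(*fake_elements)
# Placeholder phone number; in the script it would come from the detail tab.
row = [*astuple(listing), "2135550100"]
print(row)
```

In the script, `snapshot(name, address, rent, bedrooms)` would be called before `driver.execute_script(...)` opens the new tab, and `row` would then be passed to `writer.writerow(row)` after the phone lookup.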
