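With the help of my humble assistant ChatGPT, I put together a small data scraper that crawls Apartments.com and collects the names, price ranges, unit counts, etc. of all the apartment complexes in my city.
It partly works, but it can't seem to find the ".phoneNumber" CSS selector on the new tab. I've tried having it look for several different obvious CSS selectors, HTML elements, and hrefs in a few different ways. Once it leaves the main tab, it doesn't seem to find anything.
Now, I admit I'm very inexperienced at coding and have never put together anything complex, but it looks to me like it should work. I'd greatly appreciate any help! Output and code are below:
Console log: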
beginning pagination
Park Wilshire
2424 Wilshire Blvd, Los Angeles, CA 90057
$1,495 - 2,870
Studio - 1 Bed
Traceback (most recent call last):
File "C:\Users\...\aptScraper\main.py", line 1108, in <module>
phone_link = driver.find_element(By.XPATH, "//a[contains(@class,'.phoneNumber')]")
Now for the actual code:
from selenium import webdriver
import csv
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
import re
# Open a browser and navigate to apartments.com
driver = webdriver.Chrome()
driver.get("https://www.apartments.com/los-angeles-ca/")
# Find the search box and input "Los Angeles Ca"
search_box = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "searchBarLookup")))
search_box.send_keys("Los Angeles, CA")
# Click the search button
search_box.send_keys(Keys.RETURN)
# Wait for the first page of listings to load
apartments = WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".placard")))
print("beginning pagination")
# Store the information in a CSV file
with open('apartments.csv', mode='w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Address', 'Rent', 'Bedrooms', 'Phone'])
    while True:
        for apartment in apartments:
            try:
                name = apartment.find_element(By.CSS_SELECTOR, ".property-title, .js-placardTitle")
                print(name.text)
            except:
                continue
            address = apartment.find_element(By.CSS_SELECTOR, ".property-address")
            print(address.text)
            try:
                rent = apartment.find_element(By.CSS_SELECTOR, ".property-pricing, .property-rents")
                print(rent.text)
            except:
                continue
            bedrooms = apartment.find_element(By.CSS_SELECTOR, ".property-beds")
            print(bedrooms.text)
            phone_number = None
            apartment_link = apartment.find_element(By.CSS_SELECTOR, ".property-link").get_attribute("href")
            driver.execute_script(f"window.open('{apartment_link}');")
            driver.switch_to.window(driver.window_handles[-1])
            time.sleep(1)
            #problem code is here
            phone_link = driver.find_element(By.XPATH, "//a[contains(@class,'.phoneNumber')]")
            if phone_link:
                phone_number = re.search(r'\d{10}', phone_link.get_attribute('href')).group()
                print("phone number found!", phone_number, " for: ", name.text)
                writer.writerow([name.text, address.text, rent.text, bedrooms.text, phone_number])
            else:
                print(f"No phone number found for {name.text} at {address.text}")
            driver.close()
            driver.switch_to.window(driver.window_handles[0])
        # Check if there is a next page button
        time.sleep(1)
        next_button = driver.find_element(By.CSS_SELECTOR, ".next")
        if "disabled" in next_button.get_attribute("class"):
            break
        next_button.click()
        apartments = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".placard")))
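And, needless to say, it crashes before it finishes because it can't find a valid ".phoneNumber" entry. When I inspect the page elements, though, it's definitely there. What gives?
The tab for the desired apartment opens, but I can't figure out how to get Selenium to find the ".phoneNumber" element, or any element, in the new tab. Please advise.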
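One likely contributor, sketched here under assumptions: in a CSS selector the leading dot in ".phoneNumber" means "class", but XPath's contains(@class, ...) is a plain substring test, so the dot is searched for literally. The snippet below imitates that substring test in plain Python. The sample class attribute value and the helper name xpath_contains are hypothetical and not copied from Apartments.com; the locator pairs use the string values that Selenium's By.XPATH and By.CSS_SELECTOR stand for.

```python
# Hypothetical value of the class attribute on the phone link;
# the real markup on the site may differ.
class_attr = "phoneNumber js-call"


def xpath_contains(attribute_value: str, needle: str) -> bool:
    """Mimic XPath's contains(): a plain substring test."""
    return needle in attribute_value


# With the dot, the literal text ".phoneNumber" is not in the attribute.
with_dot = xpath_contains(class_attr, ".phoneNumber")
# Without the dot, the class name is found.
without_dot = xpath_contains(class_attr, "phoneNumber")
print(with_dot, without_dot)

# Two locators that target the class name itself, as (strategy, value) pairs.
# "xpath" and "css selector" are the values of By.XPATH and By.CSS_SELECTOR.
XPATH_LOCATOR = ("xpath", "//a[contains(@class,'phoneNumber')]")
CSS_LOCATOR = ("css selector", "a.phoneNumber")
```

Either pair could be passed as driver.find_element(*CSS_LOCATOR), assuming the phone link really is an a element carrying that class.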
Solved!
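The problem was twofold. First, once I updated find_element() to point to a valid element on the page, the writer still couldn't run without crashing, because the references to the original elements from the previous page had been lost.
My solution was to use copy.deepcopy(rent, etc.) on anything that was lost during the transition.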
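A sketch of one alternative to copying the elements, not the exact fix described above: read each field's .text into plain strings while the listings tab is still active, so that nothing later depends on live element references. snapshot_listing and FIELD_SELECTORS are hypothetical names; the selectors are the ones from the question. The function only relies on find_element(by, value) and .text, so it should accept a Selenium WebElement. As a simplification, it skips a listing when any field is missing.

```python
CSS = "css selector"  # same string value as Selenium's By.CSS_SELECTOR

# Selectors taken from the question's code.
FIELD_SELECTORS = {
    "name": ".property-title, .js-placardTitle",
    "address": ".property-address",
    "rent": ".property-pricing, .property-rents",
    "bedrooms": ".property-beds",
}


def snapshot_listing(apartment):
    """Return the listing's fields as a dict of plain strings.

    Returns None if any field cannot be found, so the caller can skip
    that listing instead of crashing.
    """
    row = {}
    for field, selector in FIELD_SELECTORS.items():
        try:
            row[field] = apartment.find_element(CSS, selector).text
        except Exception:  # Selenium raises NoSuchElementException here
            return None
    return row
```

Called as row = snapshot_listing(apartment) before driver.switch_to.window(...), the strings in row can still be passed to writer.writerow after the detail tab has been closed.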