Multiple values under the same tag are not being scraped

Question (votes: 0, answers: 2)

My "Number of Rooms" and "Room" lookups are not returning any values.

https://www.zoopla.co.uk/property/uprn/906032139/

I can see on that page that something should be returned, but I get nothing back.

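Can anyone point me in the right direction to fix this? I don't even know what to search for, because there is no error. I assumed it would pull all the data into the list and that I would then need to find a way to separate the values. Do I need to scrape it into a dictionary?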

import requests
from bs4 import BeautifulSoup as bs
import numpy as np
import pandas as pd
import matplotlib as plt
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "https://google.co.uk",
    "DNT": "1"
}

page = 1
addresses = []
while page != 2:
    url = f"https://www.zoopla.co.uk/house-prices/edinburgh/?pn={page}"
    print(url)
    response = requests.get(url, headers=headers)
    print(response)
    html = response.content
    soup = bs(html, "lxml")
    time.sleep(1)
    for address in soup.find_all("div", class_="c-rgUPM c-rgUPM-pnwXf-hasUprn-true"):
        details = {}
        # Getting the address
        details["Address"] = address.h2.get_text(strip=True)
        # Getting each address's unique URL
        scotland_house_url = f'https://www.zoopla.co.uk{address.find("a")["href"]}'
        details["URL"] = scotland_house_url
        scotland_house_url_response = requests.get(
            scotland_house_url, headers=headers)
        scotland_house_soup = bs(scotland_house_url_response.content, "lxml")
        # Lists status of the property
        try:
            details["Status"] = [status.get_text(strip=True) for status in scotland_house_soup.find_all(
                "span", class_="css-10o3xac-Tag e164ranr11")]
        except AttributeError:
            details["Status"] = ""
        # Lists the date of the status of the property
        try:
            details["Status Date"] = [status_date.get_text(
                strip=True) for status_date in scotland_house_soup.find_all("p", class_="css-1jq4rzj e164ranr10")]
        except AttributeError:
            details["Status Date"] = ""
        # Lists the value of the property
        try:
            details["Value"] = [value.get_text(strip=True).replace(",", "").replace(
                "£", "") for value in scotland_house_soup.find_all("p", class_="css-1x01gac-Text eczcs4p0")]
        except AttributeError:
            details["Value"] = ""
        # Lists the number of rooms
        try:
            details["Number of Rooms"] = [number_of_rooms.get_text(strip=True) for number_of_rooms in scotland_house_soup.find_all(
                "p", class_="css-82kmy1 e13gx5i3")]
        except AttributeError:
            details["Number of Rooms"] = ""
        # Lists type of room
        try:
            details["Room"] = [room.get_text(strip=True) for room in scotland_house_soup.find_all(
                "span", class_="css-1avcdf2 e13gx5i4")]
        except AttributeError:
            details["Room"] = ""
        addresses.append(details)
    page = page + 1

for address in addresses[:]:
    print(address)
print(response)
python web-scraping beautifulsoup python-requests
2 answers
1 vote

Selecting by class_="css-1avcdf2 e13gx5i4" looks fragile, because the class names may change at any time. Try a different CSS selector:

import requests
from bs4 import BeautifulSoup

url = "https://www.zoopla.co.uk/property/uprn/906032139/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

tag = soup.select_one('#timeline p:has(svg[data-testid="bed"]) + p')

no_beds, beds = tag.get_text(strip=True, separator=" ").split()
print(no_beds, beds)

Prints:

1 bed

If you want all room types:

for detail in soup.select("#timeline p:has(svg[data-testid]) + p"):
    n, type_ = detail.get_text(strip=True, separator="|").split("|")
    print(n, type_)

Prints:

1 bed
1 bath
1 reception
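
On the dictionary part of the question: one possible approach is to collect the count/type pairs into a dict keyed by room type, which could then be stored under a single key of `details`. The following is only a sketch. The helper name `rooms_to_dict` is hypothetical, and the hard-coded `sample` list stands in for the strings that `detail.get_text(strip=True, separator="|")` produced in the loop above; it is not live data from the site.

```python
def rooms_to_dict(texts, separator="|"):
    """Turn strings such as "1|bed" into a dict such as {"bed": 1}."""
    rooms = {}
    for text in texts:
        # partition() never raises; a missing separator leaves room_type empty.
        count, _, room_type = text.partition(separator)
        count, room_type = count.strip(), room_type.strip()
        # Skip anything that is not "<number><separator><label>".
        if not count.isdigit() or not room_type:
            continue
        rooms[room_type] = int(count)
    return rooms


# Hard-coded sample (an assumption) mirroring the output printed above.
sample = ["1|bed", "1|bath", "1|reception"]
print(rooms_to_dict(sample))
```

With the sample above this prints {'bed': 1, 'bath': 1, 'reception': 1}. In the scraper, the list passed in would be built from the `soup.select(...)` loop, and a property with no matching elements would simply yield an empty dict.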

0 votes

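I'm doing a research analysis on property valuation and need to collect data similar to yours. Did you eventually manage to sort out the data scraped from Zoopla?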
