Scraping a specific value from TripAdvisor seems impossible

Problem description

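I'm new to web scraping and want to extract a specific value from certain TripAdvisor pages, such as the hotel page in the code below. I need the Cleanliness rating, which is 4.5 in this example. No matter which part of the HTML I target, I cannot retrieve it. On sites like Booking or HolidayCheck, the same approach works like a charm.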

The value needed is 4.5.

import requests
from lxml import html
import time

url = 'https://www.tripadvisor.com/Hotel_Review-g187399-d200757-Reviews-Pullman_Dresden_Newa_Hotel-Dresden_Saxony.html'

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Sec-Ch-Ua': '"Not A(Brand";v="99", "Google Chrome";v="121", "Chromium";v="121"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
}

time.sleep(5)

response = requests.get(url, headers=headers)
tree = html.fromstring(response.content)


cleanliness_xpath = "//div[@class='uqMDf z BGJxv YGfmd YQkjl']//div[@class='ZPHZV']//div[@class='tJRnI']/span[contains(text(), 'Cleanliness')]/following-sibling::div[@class='BqYzr']/span[@class='MUlry']"

cleanliness_element = tree.xpath(cleanliness_xpath)

# Check whether a value was found
if cleanliness_element:
    cleanliness_rating = float(cleanliness_element[0].text) / 10  
    print(f"Cleanliness rating: {cleanliness_rating}")
else:
    print("Cleanliness rating not found")

python web-scraping python-requests lxml
1 Answer

To get the cleanliness value, you need to change the value of cleanliness_xpath.

cleanliness_xpath = "//div[contains(@data-tab, 'TABS_ABOUT')]//span[contains(., 'Cleanliness')]/following-sibling::span"

The XPath above works fine for me.

Explanation:

//div[contains(@data-tab, 'TABS_ABOUT')] - This will go to the About section in the web page
//span[contains(., 'Cleanliness')] - This will go to the Cleanliness rating line

HTML block for the Cleanliness rating row:

<div class="tJRnI">
        <span>Cleanliness</span>
        <div class="BqYzr">
                <div class="WXMiS" style="width:89.457368px"></div>
        </div>
        <span class="MUlry">4.5</span>
</div>

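Based on the HTML above, the rating sits in a span that follows the Cleanliness label. Note that this span is a sibling of the div with class BqYzr, not a child of it.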

/following-sibling::span - This selects the span elements that share the same parent as the current element and come after it
//div[contains(@data-tab, 'TABS_ABOUT')]//span[contains(., 'Cleanliness')]/following-sibling::span - This goes to the About section, then to the span whose text contains Cleanliness, and finally selects the following span at the same level, which holds the rating.
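To check the XPath without hitting the live site, here is a minimal sketch that runs it against the HTML block shown above. The surrounding html/body tags and the data-tab="TABS_ABOUT" wrapper div are assumptions added for illustration; the real page markup may differ.

```python
from lxml import html

# The rating row from the answer, wrapped in a hypothetical container.
# The data-tab wrapper and the html/body tags are assumptions, not
# markup copied from the live TripAdvisor page.
SAMPLE = """
<html><body>
<div data-tab="TABS_ABOUT">
  <div class="tJRnI">
    <span>Cleanliness</span>
    <div class="BqYzr">
      <div class="WXMiS" style="width:89.457368px"></div>
    </div>
    <span class="MUlry">4.5</span>
  </div>
</div>
</body></html>
"""

cleanliness_xpath = (
    "//div[contains(@data-tab, 'TABS_ABOUT')]"
    "//span[contains(., 'Cleanliness')]/following-sibling::span"
)

tree = html.fromstring(SAMPLE)
# Collect the text of every span that follows the Cleanliness label.
ratings = [el.text_content().strip() for el in tree.xpath(cleanliness_xpath)]
print(ratings)
```

On this sample the only following span is the one with class MUlry, so the list contains the single string '4.5'.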

Below is the output I get. It is not 4.5 because your code divides the rating by 10.

Cleanliness rating: 0.45
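If you want the printed value to be 4.5, one option is to drop the division and convert the span text directly. A small sketch follows; parse_rating is a hypothetical helper name, and the comma handling assumes that a localized page might render the rating as 4,5.

```python
def parse_rating(text):
    """Convert rating text such as '4.5' or '4,5' to a float.

    Hypothetical helper: the comma replacement is an assumption about
    localized pages, not something the live site is known to require.
    """
    cleaned = text.strip().replace(",", ".")
    return float(cleaned)


# No division by 10: the span already holds the rating on a 0-5 scale.
cleanliness_rating = parse_rating("4.5")
print(f"Cleanliness rating: {cleanliness_rating}")
```

In the original script this would replace the line float(cleanliness_element[0].text) / 10 with parse_rating(cleanliness_element[0].text).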
