XPath HTML scraping does not return text/numbers: usefulness score

Question (votes: 0, answers: 1)

I am scraping the usefulness score of reviews using XPath and lxml.

#%% Step 1: Import all of the extensions and packages.
from lxml import html
from urllib import request
import requests
from datetime import datetime
import csv
import re
import glob
import pandas as pd

reviewcontent = []  # review texts collected across all pages
helpfulness = []    # "Useful" labels collected across all pages

#%%
import glob
path = pathx  # pathx is not defined in this snippet; presumably the folder of saved pages
for files in glob.glob(path + "*.htm*"):
    with open(files, "r", encoding="utf-8", errors="ignore") as f:
        page = f.read()
        tree = html.fromstring(page)
        # Review text: replace line breaks with spaces and trim leading whitespace.
        reviews = tree.xpath('//*[@class="styles_reviewContent__0Q2Tg"]')
        reviews = [r.text_content() for r in reviews]
        reviews = [r.replace('\n', ' ') for r in reviews]
        reviews = [r.replace('\r', ' ') for r in reviews]
        reviews = [r.lstrip() for r in reviews]
        reviewcontent += reviews
        # "Useful" label: the vote count, if any, follows the word "Useful".
        useful = tree.xpath('//*[@class="typography_body-m__xgxZ_ typography_appearance-inherit__D7XqR styles_usefulLabel__qz3JV"]')
        useful = [u.text_content() for u in useful]
        useful = [u.lstrip() for u in useful]
        helpfulness += useful

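Although I can extract the review content perfectly, for some reason the code fails to extract the usefulness score. It did work at one point and produced this output: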

'Useful'
'Useful1' 
'Useful'
'Useful2' 

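That is, the second review received 1 vote and the fourth received 2. However, I must have changed something, and I don't know what, because it no longer produces any output.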

Example link: https://www.trustpilot.com/review/trivago.com

So my goal is to collect, for each review, the number of votes it received, including 0 votes.
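For illustration, the labels shown above ('Useful', 'Useful1', ...) could be turned into integer vote counts roughly as follows. This is only a sketch: the helper name votes_from_label is hypothetical, and it assumes the count, when present, appears as digits inside the label and that a label without digits means 0 votes.

```python
import re


def votes_from_label(label):
    """Return the vote count embedded in a label such as 'Useful2', or 0 if there is none."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else 0


# Labels in the shape reported in the question (some with trailing spaces).
labels = ["Useful", "Useful1 ", "Useful", "Useful2 "]
counts = [votes_from_label(label) for label in labels]
print(counts)  # [0, 1, 0, 2]
```

Keeping one count per label, including the zeros, keeps the list aligned with the list of review texts as long as every review has a label.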

I tried different configurations and Stack Overflow threads, and also looked at the span markup, but nothing helped.

Thanks!

python web-scraping xpath lxml
1 answer
0 votes

To get the rating, title, and text of reviews across several pages, you can use the following example:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.trustpilot.com/review/trivago.com?page="

for page in range(1, 4):  # <-- adjust number of pages here
    soup = BeautifulSoup(requests.get(url + str(page)).content, "html.parser")
    data = soup.select_one("#__NEXT_DATA__")
    data = json.loads(data.text)

    for review in data["props"]["pageProps"]["reviews"]:
        print(f"{review['rating']}/5", review["title"])
        print(review["text"])
        print("-" * 80)

Prints:


...

--------------------------------------------------------------------------------
1/5 ziro trust to this website
ziro trust to this website, I did sign in and creat an account, find my hotel, made reservation step by step, gave all information and at least credit card number and reserved, suddenly the page disappear. no E-mail, attention, such a good website.
--------------------------------------------------------------------------------
5/5 Trivago is the best option on the market
Not clear why so many negative feedbacks.
Trivago is the way to go to choose for your hotel.
There is no better place for you to compare the prices of so many booking sites. The UI can be improved but honestly it's great. Honest review
--------------------------------------------------------------------------------
5/5 We booked through An online booking…
We booked through An online booking agency Via trivago, the other agency, aroma or something, went out of business some six months before our holiday but the first we knew was when we went to book into our hotel. They told us that the booking had been cancelled months earlier because the company had not sent the money we paid to them and just told them that they were going out of business, we had to pay for our holiday again at a higher price... our Travel insurance Company told us “not our problem” so we were stuck. After our holiday we contacted Trivago and we got a refund of what we paid aroma, it took some time because of this corana thing but we understood that and it was great that they honoured the booking faulted bu the other company. Also the communication between us and trivago was sensational, They answered our concerns within 12 hours or so, which is great since they are on the other side of the world..well done Trivago and thank you...😀😀😀
--------------------------------------------------------------------------------

...
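If you would rather stay with lxml and XPath on saved pages, as in the question, one possible reason for an empty result is that the XPath compares the whole class attribute against an exact string, so any change to one of the generated suffixes (such as __qz3JV) makes it match nothing. The sketch below matches on a class-name prefix with contains() instead. It is not a confirmed diagnosis: the sample markup is invented for illustration, and it assumes the styles_usefulLabel prefix stays stable while the suffixes change.

```python
from lxml import html

# Invented sample markup; real pages will differ. Note the differing class suffixes.
SAMPLE_PAGE = """
<html><body>
  <article>
    <div class="styles_reviewContent__0Q2Tg"><p>First review</p></div>
    <span class="typography_body-m__xgxZ_ styles_usefulLabel__qz3JV">Useful</span>
  </article>
  <article>
    <div class="styles_reviewContent__AAAA"><p>Second review</p></div>
    <span class="typography_body-m__BBBB styles_usefulLabel__CCCC">Useful<span>1</span></span>
  </article>
</body></html>
"""


def useful_labels(page):
    """Return the text of every element whose class attribute contains 'styles_usefulLabel'."""
    tree = html.fromstring(page)
    nodes = tree.xpath('//*[contains(@class, "styles_usefulLabel")]')
    return [node.text_content().strip() for node in nodes]


labels = useful_labels(SAMPLE_PAGE)
print(labels)  # ['Useful', 'Useful1']
```

The same contains() pattern could be applied to the review-content selector, since its suffix is generated in the same way.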