Parsing multiple Python data frames within a single object


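I am trying to loop through multiple pages of a website (2 pages in this example), scrape the relevant customer review data, and ultimately combine everything into a single data frame.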

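The challenge I am running into is that my code seems to produce two separate data frames inside a single data frame object (df in the attached code). I could be mistaken, but that is how I interpret it.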

Here is a screenshot of what I described above:

Separate data frames within single data frame object

Here is the code that produces the result in the screenshot:

from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([ sub['id'] for sub in reviews ])
    filtered = pd.Series([ sub['filtered'] for sub in reviews ])
    pending = pd.Series([ sub['pending'] for sub in reviews ])
    rating = pd.Series([ sub['rating'] for sub in reviews ])
    title = pd.Series([ sub['title'] for sub in reviews ])
    likes = pd.Series([ sub['likes'] for sub in reviews ])
    experienced = pd.Series([ sub['dates']['experiencedDate'] for sub in reviews ])
    published = pd.Series([ sub['dates']['publishedDate'] for sub in reviews ])
    source = url
    df = pd.DataFrame({'id': ids, 'filtered': filtered, 'pending': pending, 'rating': rating,
                   'title': title, 'likes': likes, 'experienced': experienced,
                   'published': published, 'source': source})  
    print(df)

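I have been relying on these posts as potential solutions, but without any luck:

Rbind, data frames within a data frame cause an error?

Analyzing data frames in a list of data frames and storing all results in a single data frame

Merging multiple data frames into a single data frame in Python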

Specifically, I keep getting the following error:

TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid

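The "<class 'str'>" part is some clue as to what is wrong, but I have been spinning my wheels and feel I need to "put the pencil down" and ask for help. I am relatively new to Python, and my gut tells me that something needs to be fixed upstream in the code to avoid this problem in the first place. In other words, while there may be a way to combine these two data frames into one, I feel the root of the problem occurs earlier and should be addressed there. Any help is much appreciated.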
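The attempt that raised this error is not shown in the question, so the following is only a guess at how it can arise: pd.concat raises this TypeError whenever the list it receives contains a plain string instead of a Series or DataFrame. The frame contents and the example.com URL below are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for one page of scraped reviews.
frame = pd.DataFrame({"id": ["a1", "a2"], "rating": [5, 2]})

try:
    # The second item is a str, not a Series or DataFrame, so concat rejects it.
    pd.concat([frame, "https://example.com/review?page=2"])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

If this matches what happened, the fix is to make sure every item handed to pd.concat is itself a data frame (or Series), for example by collecting the per-page frames in a list.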

python dataframe for-loop web-scraping beautifulsoup
1 answer

Here is an example of how to build a data frame from each page and, as a final step, concatenate them into one final data frame:

import json

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

all_dfs = []
for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([sub["id"] for sub in reviews])
    filtered = pd.Series([sub["filtered"] for sub in reviews])
    pending = pd.Series([sub["pending"] for sub in reviews])
    rating = pd.Series([sub["rating"] for sub in reviews])
    title = pd.Series([sub["title"] for sub in reviews])
    likes = pd.Series([sub["likes"] for sub in reviews])
    experienced = pd.Series([sub["dates"]["experiencedDate"] for sub in reviews])
    published = pd.Series([sub["dates"]["publishedDate"] for sub in reviews])
    source = url
    df = pd.DataFrame(
        {
            "id": ids,
            "filtered": filtered,
            "pending": pending,
            "rating": rating,
            "title": title,
            "likes": likes,
            "experienced": experienced,
            "published": published,
            "source": source,
        }
    )
    all_dfs.append(df)

final_df = pd.concat(all_dfs)
print(final_df)

Prints:

                          id  filtered  pending  rating                                                                            title  likes               experienced                 published                                                  source
0   660c4b524ff85128f3cd5665     False    False       5                                                                Amazing Insurance      0  2024-04-01T00:00:00.000Z  2024-04-02T20:15:47.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
1   660b08acec6384757dfabdf9     False    False       5                                                  Enrollment was quick and easy!       0  2024-03-21T00:00:00.000Z  2024-04-01T21:19:09.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
2   66098c1b0353405fb0313ae2     False    False       5                                            Extremely easy to understand website…      0  2024-03-28T00:00:00.000Z  2024-03-31T18:15:23.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
3   660b1e164e75ffb01ee011f1     False    False       2                                                                   Too expensive       0  2024-04-01T00:00:00.000Z  2024-04-01T22:50:31.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
4   66099003e9b2fe025035baef     False    False       5                                         The coverage seems really comprehensive…      0  2024-03-28T00:00:00.000Z  2024-03-31T18:32:04.000Z  https://www.trustpilot.com/review/trupanion.com?page=1
5   660b0af515413b0620a7d617     False    False       4                                            Everything was explained to us in an…      0  2024-03-29T00:00:00.000Z  2024-04-01T21:28:54.000Z  https://www.trustpilot.com/review/trupanion.com?page=1

...
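In the question's loop, df is reassigned on every iteration and print(df) runs once per page, which is why the output looks like two frames inside one object: only the last page's frame survives the loop. The sketch below shows the same collect-then-concat pattern without any network access. The pages dictionary, the example.com URLs, and the helper reviews_to_frame are hypothetical stand-ins for the parsed "reviews" lists, and pd.json_normalize is swapped in for the per-column Series. Passing ignore_index=True is an optional addition that renumbers the rows, which otherwise restart at 0 for each page.

```python
import pandas as pd

# Hypothetical stand-ins for json_object["props"]["pageProps"]["reviews"],
# keyed by the (made-up) URL each list came from.
pages = {
    "https://example.com/review?page=1": [
        {"id": "r1", "rating": 5, "title": "Great",
         "dates": {"experiencedDate": "2024-04-01", "publishedDate": "2024-04-02"}},
        {"id": "r2", "rating": 2, "title": "Too expensive",
         "dates": {"experiencedDate": "2024-04-01", "publishedDate": "2024-04-01"}},
    ],
    "https://example.com/review?page=2": [
        {"id": "r3", "rating": 4, "title": "Good support",
         "dates": {"experiencedDate": "2024-03-29", "publishedDate": "2024-04-01"}},
    ],
}


def reviews_to_frame(reviews, source):
    """Turn one page's list of review dicts into a data frame."""
    # json_normalize flattens the nested "dates" dict into dotted column names.
    frame = pd.json_normalize(reviews)
    frame = frame.rename(
        columns={
            "dates.experiencedDate": "experienced",
            "dates.publishedDate": "published",
        }
    )
    frame["source"] = source
    return frame


# Collect one frame per page, then concatenate once after the loop.
all_dfs = [reviews_to_frame(reviews, url) for url, reviews in pages.items()]
final_df = pd.concat(all_dfs, ignore_index=True)
print(final_df)
```

Concatenating once at the end, rather than growing a frame inside the loop, also avoids copying the accumulated rows on every iteration.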