I am trying to loop through multiple pages of a website (2 pages in this example), scrape the relevant customer review data, and ultimately combine it into a single data frame.
The challenge I'm running into is that my code appears to produce two separate data frames inside a single DataFrame object (df in the attached code). I may be mistaken, but that's how I interpret it.
Here is a screenshot of what I described above:
Here is the code that produced the screenshot result:
from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([ sub['id'] for sub in reviews ])
    filtered = pd.Series([ sub['filtered'] for sub in reviews ])
    pending = pd.Series([ sub['pending'] for sub in reviews ])
    rating = pd.Series([ sub['rating'] for sub in reviews ])
    title = pd.Series([ sub['title'] for sub in reviews ])
    likes = pd.Series([ sub['likes'] for sub in reviews ])
    experienced = pd.Series([ sub['dates']['experiencedDate'] for sub in reviews ])
    published = pd.Series([ sub['dates']['publishedDate'] for sub in reviews ])
    source = url
    df = pd.DataFrame({'id': ids, 'filtered': filtered, 'pending': pending, 'rating': rating,
                       'title': title, 'likes': likes, 'experienced': experienced,
                       'published': published, 'source': source})
    print(df)
I have been relying on these posts as potential solutions, but without any luck:
Specifically, I keep getting the following error:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
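For reference, that TypeError is what pandas raises when pd.concat is given anything other than Series/DataFrame objects, for example a bare string such as a URL. A minimal sketch with hypothetical data reproducing both the correct call and the error:

```python
import pandas as pd

# Two stand-in DataFrames (hypothetical data, not the scraped reviews).
df1 = pd.DataFrame({"rating": [5, 4]})
df2 = pd.DataFrame({"rating": [3, 2]})

# Correct: pass an iterable (list) of DataFrames to pd.concat.
combined = pd.concat([df1, df2], ignore_index=True)
print(len(combined))  # 4 rows

# Incorrect: a plain string in the iterable triggers the TypeError above.
try:
    pd.concat([df1, "https://example.com"])
except TypeError as e:
    print(e)
```

The practical takeaway is to collect each page's DataFrame into a list first and concatenate the list once at the end, which is exactly what the answer below does.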
Here is an example of how to build a DataFrame from each page and, as a final step, concatenate them into one final DataFrame:
import json

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

page = 1
urls = []
while page != 3:
    url = f"https://www.trustpilot.com/review/trupanion.com?page={page}"
    urls.append(url)
    page = page + 1

all_dfs = []
for url in urls:
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    results = soup.find(id="__NEXT_DATA__")
    json_object = json.loads(results.contents[0])
    reviews = json_object["props"]["pageProps"]["reviews"]
    ids = pd.Series([sub["id"] for sub in reviews])
    filtered = pd.Series([sub["filtered"] for sub in reviews])
    pending = pd.Series([sub["pending"] for sub in reviews])
    rating = pd.Series([sub["rating"] for sub in reviews])
    title = pd.Series([sub["title"] for sub in reviews])
    likes = pd.Series([sub["likes"] for sub in reviews])
    experienced = pd.Series([sub["dates"]["experiencedDate"] for sub in reviews])
    published = pd.Series([sub["dates"]["publishedDate"] for sub in reviews])
    source = url
    df = pd.DataFrame(
        {
            "id": ids,
            "filtered": filtered,
            "pending": pending,
            "rating": rating,
            "title": title,
            "likes": likes,
            "experienced": experienced,
            "published": published,
            "source": source,
        }
    )
    all_dfs.append(df)

final_df = pd.concat(all_dfs)
print(final_df)
Prints:
id filtered pending rating title likes experienced published source
0 660c4b524ff85128f3cd5665 False False 5 Amazing Insurance 0 2024-04-01T00:00:00.000Z 2024-04-02T20:15:47.000Z https://www.trustpilot.com/review/trupanion.com?page=1
1 660b08acec6384757dfabdf9 False False 5 Enrollment was quick and easy! 0 2024-03-21T00:00:00.000Z 2024-04-01T21:19:09.000Z https://www.trustpilot.com/review/trupanion.com?page=1
2 66098c1b0353405fb0313ae2 False False 5 Extremely easy to understand website… 0 2024-03-28T00:00:00.000Z 2024-03-31T18:15:23.000Z https://www.trustpilot.com/review/trupanion.com?page=1
3 660b1e164e75ffb01ee011f1 False False 2 Too expensive 0 2024-04-01T00:00:00.000Z 2024-04-01T22:50:31.000Z https://www.trustpilot.com/review/trupanion.com?page=1
4 66099003e9b2fe025035baef False False 5 The coverage seems really comprehensive… 0 2024-03-28T00:00:00.000Z 2024-03-31T18:32:04.000Z https://www.trustpilot.com/review/trupanion.com?page=1
5 660b0af515413b0620a7d617 False False 4 Everything was explained to us in an… 0 2024-03-29T00:00:00.000Z 2024-04-01T21:28:54.000Z https://www.trustpilot.com/review/trupanion.com?page=1
...
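One detail worth noting: by default pd.concat keeps each page's original row index, so final_df contains repeated index values (0, 1, 2, … once per page). If a unique running index is preferred, ignore_index=True renumbers the rows. A small sketch with hypothetical per-page frames standing in for the scraped pages:

```python
import pandas as pd

# Hypothetical per-page DataFrames (stand-ins for the scraped pages).
page1 = pd.DataFrame({"rating": [5, 4]})
page2 = pd.DataFrame({"rating": [3, 2]})

# ignore_index=True produces a single 0..n-1 index across all pages.
final_df = pd.concat([page1, page2], ignore_index=True)
print(final_df.index.tolist())  # [0, 1, 2, 3]
```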