I'm able to loop the web-scraping process, but the data collected from each subsequent page replaces the data from the previous page, so the CSV ends up containing only the last page's data. What should I do?
from bs4 import BeautifulSoup
import requests
import pandas as pd

print('all imported successfully')

for x in range(1, 44):
    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    names = soup.find_all('div', attrs={'class': 'consumer-information__name'})
    headers = soup.find_all('h2', attrs={'class': 'review-content__title'})
    bodies = soup.find_all('p', attrs={'class': 'review-content__text'})
    ratings = soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})
    dates = soup.find_all('div', attrs={'class': 'review-content-header__dates'})
    print('pass1')

df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Date': dates})
df.to_csv('birchbox006.csv', index=False, encoding='utf-8')
print('excel done')
Because you're using a loop, the variables keep getting overwritten on every iteration. Normally what you want in this situation is a container that you keep appending to across the whole loop:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()

for x in range(1, 44):
    # Per-page lists: rebuilt for each page, then merged into df below
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('article', {'class': 'review'})
    for article in articles:
        names.append(article.find('div', attrs={'class': 'consumer-information__name'}).text.strip())
        headers.append(article.find('h2', attrs={'class': 'review-content__title'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class': 'review-content__text'}).text.strip())
        except AttributeError:
            bodies.append('')
        try:
            # The rating lives in the star-rating div, not in the review text
            ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'}).text.strip())
        except AttributeError:
            ratings.append('')
        dateElements = article.find('div', attrs={'class': 'review-content-header__dates'}).text.strip()
        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])

    # Create a temporary dataframe for this page, then append it to your "final" dataframe.
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat instead.)
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings,
                            'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = pd.concat([df, temp_df], sort=False, ignore_index=True)
    print('pass1')

df.to_csv('birchbox006.csv', index=False, encoding='utf-8')
print('excel done')
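A variant of the same pattern: collect the per-page frames in a plain list and call `pd.concat` once after the loop, since concatenating inside the loop copies the growing dataframe on every pass. A minimal sketch, where `scrape_page` is a hypothetical stand-in for the BeautifulSoup parsing above:

```python
import pandas as pd

# Hypothetical stand-in for parsing one page of reviews
def scrape_page(page):
    return pd.DataFrame({'User Name': [f'user{page}'], 'Header': [f'title{page}']})

frames = []                      # one small DataFrame per page
for page in range(1, 4):
    frames.append(scrape_page(page))

# Concatenate once at the end instead of once per iteration
df = pd.concat(frames, ignore_index=True)
```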
The reason is that you overwrite the variables on every iteration. If you want to extend them instead, you can do, for example:
names = []
headers = []
bodies = []
ratings = []
dates = []

for x in range(1, 44):
    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    # += extends the lists instead of replacing them, so earlier pages are kept
    names += soup.find_all('div', attrs={'class': 'consumer-information__name'})
    headers += soup.find_all('h2', attrs={'class': 'review-content__title'})
    bodies += soup.find_all('p', attrs={'class': 'review-content__text'})
    ratings += soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})
    dates += soup.find_all('div', attrs={'class': 'review-content-header__dates'})
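To see why `+=` fixes the overwriting, here is the pattern in isolation; `page_results` is a made-up stand-in for the `soup.find_all(...)` results of one page:

```python
names = []                       # initialized once, outside the loop
for page in range(1, 4):
    # Stand-in for the elements scraped from one page
    page_results = [f'name{page}a', f'name{page}b']
    names += page_results        # extends the list; earlier pages survive
```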
You have to store the data somewhere after each iteration. There are several ways to do it: you can store everything in lists and build the dataframe at the end, or, as I did here, create a "temporary" dataframe on each iteration and append it to a final dataframe. Think of it like collecting water: you fill a small bucket, then pour it into a big bucket that holds all the water you collect.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()

for x in range(1, 44):
    published = []
    updated = []
    reported = []

    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    names = [el.text.strip() for el in soup.find_all('div', attrs={'class': 'consumer-information__name'})]
    headers = [el.text.strip() for el in soup.find_all('h2', attrs={'class': 'review-content__title'})]
    bodies = [el.text.strip() for el in soup.find_all('p', attrs={'class': 'review-content__text'})]
    ratings = [el.text.strip() for el in soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})]
    dateElements = soup.find_all('div', attrs={'class': 'review-content-header__dates'})
    for date in dateElements:
        jsonData = json.loads(date.text.strip())
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])

    # Create a temporary dataframe for this page, then append it to your "final" dataframe.
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat instead.)
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings,
                            'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = pd.concat([df, temp_df], sort=False, ignore_index=True)
    print('pass1')

df.to_csv('birchbox006.csv', index=False, encoding='utf-8')
print('excel done')
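For reference, the `json.loads` step above assumes the dates `<div>` contains a small JSON blob. A self-contained sketch of parsing it defensively; the field names follow the answer's code, but the values here are invented for illustration:

```python
import json

# Sample blob using the field names from the answer's code (values made up)
raw = '{"publishedDate": "2020-01-02T10:00:00Z", "updatedDate": null, "reportedDate": null}'
jsonData = json.loads(raw)

# .get() returns None instead of raising KeyError if a field is absent
published = jsonData.get('publishedDate')
updated = jsonData.get('updatedDate')
reported = jsonData.get('reportedDate')
```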