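I'm having trouble scraping data and saving it into consistent columns. More specifically, the site I'm scraping has no separate tag for each item I scrape; every item uses the same generic key and value tags.
As a result, I get a CSV file with two columns, key and value, holding the corresponding text, whereas my idea was to get one column per key.
Is it possible to keep the headers fixed and append the values beneath them, or is that impossible given how the site is structured?
Thank you.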
import requests
import bs4
import pandas as pd

keys = []
values = []
for pagenumber in range(0, 2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    txt = requests.get(url + str(pagenumber))
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
    for car in soup_table.find_all('li'):
        # Follow the link to the detail page of each listing
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')
        sub_soup = requests.get(sub_url)
        soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
        soup1 = soup1.find('div', {'id': 'car-attributes'})
        for car_item in soup1.find_all('div', {'class': 'spec-table-item'}):
            # Every attribute ends up in the same two flat lists
            key = car_item.find('span', {'class': 'key'}).text
            keys.append(key)
            value = car_item.find('span', {'class': 'value'}).text
            values.append(value)

auto_database = pd.DataFrame({
    'key': keys,
    'value': values,
})
auto_database.to_csv('auto_database.csv')
print("Successfully saved..")
Result:
Merk & Model: Lako
Bouwjaar: 1996
Uitvoering: 233 C
Carrosserie: Open wagen
Kenteken: OD-31-VD
APK tot: 29 juni 2020
Prijs: € 7.500,00
Merk & Model: RAM
Bouwjaar: 2020
Carrosserie: SUV of Terreinwagen
Brandstof: LPG
Kilometerstand: 70 km
Transmissie: Automaat
Prijs: Zie omschrijving
Motorinhoud: 5.700 cc
Opties:
Desired result:
Merk & Model Bouwjaar
RAM 2020
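I'd suggest saving all the metadata of each car into its own data frame, setting the keys as the index, and concatenating all the intermediate data frames into one at the end.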
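To see what the concatenation step does in isolation, here is a minimal sketch on hard-coded data. The `cars` list is made up for illustration and only stands in for the scraped key/value pairs; the variable names `wide` and `table` are mine as well.

```python
import pandas as pd

# Hypothetical stand-in for scraped data: one list of [key, value] pairs per car.
cars = [
    [["Merk & Model:", "Lako"], ["Bouwjaar:", "1996"], ["Prijs:", "€ 7.500,00"]],
    [["Merk & Model:", "RAM"], ["Bouwjaar:", "2020"], ["Brandstof:", "LPG"]],
]

# One single-column frame per car, indexed by the attribute name (column 0).
frames = [pd.DataFrame(pairs).set_index(0) for pairs in cars]

# The outer join on the index keeps every key seen in any car;
# a car that lacks a key gets NaN in that position.
wide = pd.concat(frames, axis=1, join="outer")

# Transpose so that each car becomes a row and each key becomes a column.
table = wide.T.reset_index(drop=True)
table.columns.name = None
print(table)
```

Each car contributes one row, and a key that a car does not have simply shows up as NaN in that row.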
Try this:
import requests
import bs4
import pandas as pd

frames = []
for pagenumber in range(0, 2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    txt = requests.get(url + str(pagenumber))
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
    for car in soup_table.find_all('li'):
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')
        sub_soup = requests.get(sub_url)
        soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
        soup1 = soup1.find('div', {'id': 'car-attributes'})
        # Collect the [key, value] pairs of this one car
        tmp = []
        for car_item in soup1.find_all('div', {'class': 'spec-table-item'}):
            key = car_item.find('span', {'class': 'key'}).text
            value = car_item.find('span', {'class': 'value'}).text
            tmp.append([key, value])
        # One single-column frame per car, indexed by the key
        frames.append(pd.DataFrame(tmp).set_index(0))

# Outer-join all cars on the key, then transpose: one row per car, one column per key
df_final = pd.concat(frames, axis=1, join='outer').T
df_final.reset_index(inplace=True, drop=True)
df_final.columns.name = None
df_final.to_csv('auto_database.csv')
print(df_final.head(3))  # in a Jupyter notebook, display(df_final.head(3)) also works
Output:
Bouwjaar: Brandstof: Kilometerstand: Transmissie: Prijs: Motorinhoud: Kenteken: Opties: Merk & Model: Carrosserie: Uitvoering: APK tot: Energielabel: Verbruik: Topsnelheid: Kosten p/m: Vermogen: APK: Datum van registratie:
0 2014 Diesel 10.000 km Automaat € 10.950,00 400 cc NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2011 Diesel 25.000 km Handgeschakeld Op aanvraag 1.500 cc VR-921-X \n\nParkeersensor\nMetallic lak\nBoordcomputer... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2016 Benzine 95.545 km Handgeschakeld € 230,00 1.395 cc NaN \n\nParkeersensor\nMetallic lak\nRadio\nMistla... A3 Sedan NaN NaN NaN NaN NaN NaN NaN NaN NaN
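A possible alternative, if you'd rather not build one data frame per car, is to collect one dict per car and derive the header from the union of all keys. The sketch below uses only the standard library and hard-coded sample data. `pairs_to_record`, `records_to_table` and the `scraped` list are hypothetical names of mine, not part of the scraper above. It also strips the trailing colon that the column names in the output above still carry.

```python
import csv
import io


def pairs_to_record(pairs):
    """Build one dict per car, dropping the trailing colon from each key."""
    return {key.strip().rstrip(":"): value.strip() for key, value in pairs}


def records_to_table(records):
    """Return (header, rows) with one column per key seen in any record."""
    header = []
    for record in records:
        for key in record:
            if key not in header:
                header.append(key)  # keep first-seen order
    # A car that lacks a key gets an empty cell
    rows = [[record.get(key, "") for key in header] for record in records]
    return header, rows


# Hypothetical stand-in for the key/value pairs scraped per car
scraped = [
    [("Merk & Model:", "Lako"), ("Bouwjaar:", "1996"), ("Prijs:", "€ 7.500,00")],
    [("Merk & Model:", "RAM"), ("Bouwjaar:", "2020"), ("Brandstof:", "LPG")],
]

records = [pairs_to_record(pairs) for pairs in scraped]
header, rows = records_to_table(records)

# StringIO stands in for open('auto_database.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

If you stay with pandas, passing the same list of dicts to `pd.DataFrame(records)` gives one row per car as well, with NaN where a key is missing.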