Web scraping: columns instead of rows

Question  Votes: 1  Answers: 1

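I'm having trouble scraping data and saving it into consistent columns. More specifically, the site I'm scraping has no separate tag for each item I scrape; the only tags available are a generic key and a generic value.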

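As a result, I get a CSV file with only two columns, key and value, holding the corresponding text, whereas my idea is to get one column per key.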

Is it possible to keep the headers fixed and append the values under them, or is that impossible given how the site is built?

Thank you.

import requests
import bs4
import pandas as pd

keys = []
values = []

for pagenumber in range(0, 2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    txt = requests.get(url + str(pagenumber))
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')

    for car in soup_table.find_all('li'):
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')

        sub_soup = requests.get(sub_url)
        soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
        soup1 = soup1.find('div', {'id': 'car-attributes'})

        for car_item in soup1.find_all('div', {'class': 'spec-table-item'}):
            key = car_item.find('span', {'class': 'key'}).text
            keys.append(key)

            value = car_item.find('span', {'class': 'value'}).text
            values.append(value)


auto_database = pd.DataFrame({
    'key': keys,
    'value': values,
})

auto_database.to_csv('auto_database.csv')
print("Successfully saved..")

Result

Merk & Model: Lako
Bouwjaar: 1996
Uitvoering: 233 C
Carrosserie: Open wagen
Kenteken: OD-31-VD
APK tot: 29 juni 2020
Prijs: € 7.500,00


Merk & Model: RAM
Bouwjaar: 2020
Carrosserie: SUV of Terreinwagen
Brandstof: LPG
Kilometerstand: 70 km
Transmissie: Automaat
Prijs: Zie omschrijving
Motorinhoud: 5.700 cc
Opties: 

Desired result

Merk & Model    Bouwjaar
RAM              2020
python web-scraping
1 answer
0 votes

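I suggest saving all the attributes of each car into its own DataFrame with the keys as the index, and then concatenating all of those intermediate DataFrames into a final one.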
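
The idea can be tried without any network access first. The following is only a minimal sketch on hand-written sample data: the `cars` list and the name `wide` are assumptions made for illustration, with values borrowed from the question's sample output. Each car becomes a one-column frame indexed by key, the frames are aligned on their keys with an outer join, and the result is transposed so that each car ends up as one row.

```python
import pandas as pd

# Hypothetical sample data shaped like the scraped (key, value) pairs.
cars = [
    [("Merk & Model:", "Lako"), ("Bouwjaar:", "1996"), ("Kenteken:", "OD-31-VD")],
    [("Merk & Model:", "RAM"), ("Bouwjaar:", "2020"), ("Brandstof:", "LPG")],
]

# One single-column frame per car, with the keys as the index.
frames = [pd.DataFrame(pairs).set_index(0) for pairs in cars]

# Outer join on the keys, then transpose: one row per car, one column per key.
wide = pd.concat(frames, axis=1, join="outer").T.reset_index(drop=True)
wide.columns.name = None  # drop the leftover index name from set_index(0)

print(wide)
```

Keys that a car does not have show up as NaN in its row. Note that this alignment assumes each key occurs at most once per car.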

Try this:

import requests
import bs4
import pandas as pd

frames = []

for pagenumber in range(0, 2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    txt = requests.get(url + str(pagenumber))
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')

    for car in soup_table.find_all('li'):
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')

        sub_soup = requests.get(sub_url)
        soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
        soup1 = soup1.find('div', {'id': 'car-attributes'})

        # Collect the [key, value] pairs of this one car.
        tmp = []

        for car_item in soup1.find_all('div', {'class': 'spec-table-item'}):
            key = car_item.find('span', {'class': 'key'}).text
            value = car_item.find('span', {'class': 'value'}).text
            tmp.append([key, value])

        # One single-column frame per car, indexed by key.
        frames.append(pd.DataFrame(tmp).set_index(0))


# Align all per-car frames on their keys, then transpose so that
# each car becomes one row and each key becomes one column.
df_final = pd.concat(frames, axis=1, join='outer').T
df_final.reset_index(drop=True, inplace=True)
df_final.columns.name = None
df_final.to_csv('auto_database.csv')
print(df_final.head(3))

Output:

Bouwjaar:   Brandstof:  Kilometerstand:     Transmissie:    Prijs:  Motorinhoud:    Kenteken:   Opties:     Merk & Model:   Carrosserie:    Uitvoering:     APK tot:    Energielabel:   Verbruik:   Topsnelheid:    Kosten p/m:     Vermogen:   APK:    Datum van registratie:
0   2014    Diesel  10.000 km   Automaat    € 10.950,00     400 cc  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   2011    Diesel  25.000 km   Handgeschakeld  Op aanvraag     1.500 cc    VR-921-X    \n\nParkeersensor\nMetallic lak\nBoordcomputer...   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   2016    Benzine     95.545 km   Handgeschakeld  € 230,00    1.395 cc    NaN     \n\nParkeersensor\nMetallic lak\nRadio\nMistla...   A3  Sedan   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
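
The output above still carries trailing colons in the column names and embedded newlines in the Opties values. A possible clean-up step is sketched below; `clean_key`, `clean_value`, and `to_record` are hypothetical helper names that do not appear in the original code, and the sample pairs are illustrative.

```python
def clean_key(key):
    """Trim whitespace and the trailing colon: 'Bouwjaar:' -> 'Bouwjaar'."""
    return key.strip().rstrip(":").strip()


def clean_value(value):
    """Collapse a multi-line value into a single line joined by '; '."""
    parts = [part.strip() for part in value.splitlines()]
    return "; ".join(part for part in parts if part)


def to_record(pairs):
    """Turn one car's (key, value) pairs into a dict with cleaned entries."""
    return {clean_key(key): clean_value(value) for key, value in pairs}


# Sample pairs shaped like the scraped output (values are illustrative).
record = to_record([
    ("Bouwjaar:", "2011"),
    ("Opties:", "\n\nParkeersensor\nMetallic lak\nBoordcomputer"),
])
print(record)
```

A list of such records can also be passed directly to `pd.DataFrame(records)`, which builds one row per dict and fills keys that a car lacks with NaN.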