BeautifulSoup scraping multiple pages: values missing on the next page

Question

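I am using BeautifulSoup to scrape a list of car names and prices from a multi-page website. Each page contains 40 entries, and the code works fine when I scrape a single page. When I scrape multiple pages (here only two, to check that the code works), I find that data in the "price" column is always missing at the start of the next page, so the data is misaligned from entry 41 onward.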

A note on the price column: the listed price appears under either the class 'ads_price_highlight' or the class 'ads_price', which is a discounted price.

Below is my code for parsing multiple pages. I still can't figure out why data is missing in the price column while the other column is correct.

from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np

from time import sleep
from random import randint

headers = {"Accept-Language": "en-US, en;q=0.5"}

car = []
price = []

pages = np.arange(1,3,1)

for page in pages:

  url = 'https://www.mudah.my/malaysia/cars-for-sale/perodua?o='+ str(page) +'&q=&so=1&th=1'
  page = requests.get(url, headers=headers)

  soup = BeautifulSoup(page.text, 'html.parser')
  car_list = soup.find_all('li', class_='listing_ads_params')

  sleep(randint(2,10))

  for container in car_list:
        cars = container.find('div', {'class':'top_params_col1'})
        if cars is not None:
            car.append(cars.find('h2', {'class': 'list_title'}).text)   

        prices2 = container.find('div', class_='ads_price_highlight')
        if prices2 is not None:
            price.append(prices2.text)

        prices = container.find('div', class_='ads_price')
        if prices is not None:
            price.append(prices.text)

df = pd.DataFrame(data = list(zip(car, price)),
                    columns = ['car', 'price'])

df.to_csv(r'carprice.csv', index = False)

Tags: python web-scraping beautifulsoup html-parsing

1 Answer

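There are two things:

1.) The standard html.parser doesn't parse this page well; use lxml or html5lib instead.

2.) The page contains "dummy" ad listings with class="honey-pot" between the regular ads, so the script needs to take care of them.

For example: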

import requests
from bs4 import BeautifulSoup


url = 'https://www.mudah.my/malaysia/cars-for-sale/perodua?o={page}&q=&so=1&th=1'
headers = {"Accept-Language": "en-US, en;q=0.5"}

for page in range(1, 3):
    soup = BeautifulSoup(requests.get(url.format(page=page), headers=headers).content, 'lxml')

    for title, price in zip(soup.select('#list-view-ads .list_ads:not(.honey-pot) .list_title'),
                            soup.select('#list-view-ads .list_ads:not(.honey-pot) div[class^="ads_price"]')):
        print('{:<60}{}'.format(title.get_text(strip=True), price.get_text(strip=True)))

Prints:

Ladies Owner/SE B.Kit-2008 Perodua MYVI 1.3 EZ (A)          RM 15 800
Perodua MYVI 1.3 EZ (A) LIMETED EDITION                     RM 16 800
Perodua MYVI 1.3 SX FACELIFT (M)                            RM 10 990
Perodua VIVA 1.0 (A) ONE OWNER ACC FREE                     RM 9 800
Perodua KELISA 1.0 SE EZS (A) Jaga Baik                     RM 13 990
Perodua MYVI 1.3 EZi (A) PASSO RACY~17" RIMS                RM 22 990
Perodua MYVI 1.3 (A) EZi tru 2007                           RM 14 800
23k KM SUPER CARKING 2010 Perodua MYVI 1.3 EZ (A)           RM 16 800
Perodua MYVI 1.3(M) SX 1 owner Ori mielage                  RM 10 800
Perodua MYVI H/AV 1.5L (A) R3Bat3 2XXX                      RM 50 600
Perodua ARUZ X 1.5L (A) R3BaT3 2XXX                         RM 72 600
Perodua AXIA GXTRA R3BAT3 1XXX                              RM 35 300

...and so on.
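
The question collects names and prices into two separate lists and zips them afterwards, so any ad that contributes zero or two prices shifts every later row. A minimal sketch of one way to avoid that, assuming the selectors from the answer above: visit each non-honey-pot ad once and read both fields from that same ad, so a missing price becomes None instead of moving the rows that follow. The SAMPLE_HTML markup and the extract_rows helper are hypothetical and only modelled on those selectors; the real page may differ. The sample is parsed with html.parser purely to avoid an extra dependency, while the answer recommends lxml or html5lib for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the selectors used above;
# the structure of the real page may differ.
SAMPLE_HTML = """
<div id="list-view-ads">
  <div class="list_ads">
    <h2 class="list_title">Perodua MYVI 1.3 EZ (A)</h2>
    <div class="ads_price_highlight">RM 15 800</div>
  </div>
  <div class="list_ads honey-pot">
    <h2 class="list_title">dummy listing</h2>
  </div>
  <div class="list_ads">
    <h2 class="list_title">Perodua VIVA 1.0 (A)</h2>
    <div class="ads_price">RM 9 800</div>
  </div>
  <div class="list_ads">
    <h2 class="list_title">Perodua KELISA 1.0</h2>
  </div>
</div>
"""


def extract_rows(html, parser="lxml"):
    """Return one {'car', 'price'} dict per regular ad, in page order."""
    soup = BeautifulSoup(html, parser)
    rows = []
    # Skip the dummy "honey-pot" listings entirely.
    for ad in soup.select("#list-view-ads .list_ads:not(.honey-pot)"):
        title = ad.select_one(".list_title")
        # Matches both ads_price and ads_price_highlight.
        price = ad.select_one('div[class^="ads_price"]')
        if title is None:
            continue
        rows.append({
            "car": title.get_text(strip=True),
            # Keep a placeholder so the row stays aligned with its title.
            "price": price.get_text(strip=True) if price is not None else None,
        })
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML, parser="html.parser"):
        print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) and written out with to_csv, as in the question.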