Unable to scrape the correct wikitable with BeautifulSoup4 (beginner)

Question (votes: 0, answers: 1)

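Complete beginner here. I am trying to scrape the constituents table from this Wikipedia page, but the table I get back is the annual returns table (the first table on the page) rather than the constituents table I need (the second table). Is there a way to target that specific table with BeautifulSoup4?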

import bs4 as bs
import pickle
import requests

def save_klci_tickers():
    resp = requests.get('https://en.wikipedia.org/wiki/FTSE_Bursa_Malaysia_KLCI')
    soup = bs.BeautifulSoup(resp.text)
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)

    with open("klcitickers.pickle", "wb") as f:
        pickle.dump(tickers, f)

    print(tickers)
    return tickers


save_klci_tickers()
python web-scraping beautifulsoup datatable wikipedia
1 answer
0
votes

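The find method stops at the first HTML tag that matches, whereas findAll (spelled find_all in current BeautifulSoup releases) returns a list of every match. Both tables on the page carry the class "wikitable sortable", and the one you want is not the first, so collect them all with findAll and take the second element of the result. This should work: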

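import bs4 as bs
import pickle
import requests

def save_klci_tickers():
    resp = requests.get(
        'https://en.wikipedia.org/wiki/FTSE_Bursa_Malaysia_KLCI')
    # Name the parser explicitly; omitting it makes bs4 emit a warning
    soup = bs.BeautifulSoup(resp.text, 'html.parser')
    # findAll returns every matching table; find would return only the first
    tables = soup.findAll('table', {'class': 'wikitable sortable'})
    tickers = []
    # tables[1] is the second matching table; [1:] skips its header row
    for row in tables[1].findAll('tr')[1:]:
        cells = row.findAll('td')
        if cells:  # ignore rows that contain no <td> cells
            # strip() drops the trailing newline in the cell text
            tickers.append(cells[0].text.strip())

    with open("klcitickers.pickle", "wb") as f:
        pickle.dump(tickers, f)

    print(tickers)
    return tickers

save_klci_tickers()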
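
Selecting the table by position breaks if the page gains or loses a table. As a rough sketch of one alternative, the table could be picked by the text of its header cells instead. Everything below is an assumption for illustration: the helper names find_table_by_header and first_column are hypothetical, the inline HTML is an invented stand-in for the page with placeholder values, and the header label "Ticker" may not match the heading used on the real Wikipedia table, so check the page before relying on it.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the page: two tables with the same classes.
# All cell values are placeholders, not real data.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Return</th></tr>
  <tr><td>Y1</td><td>R1</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Ticker</th><th>Company</th></tr>
  <tr><td>AAA</td><td>Alpha Co</td></tr>
  <tr><td>BBB</td><td>Beta Co</td></tr>
</table>
"""


def find_table_by_header(soup, header_text):
    """Return the first wikitable whose <th> cells include header_text, else None."""
    for table in soup.find_all("table", class_="wikitable"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None


def first_column(table):
    """Collect the stripped text of the first <td> in each data row."""
    values = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if cells:  # header rows hold only <th> cells, so they are skipped
            values.append(cells[0].get_text(strip=True))
    return values


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_match = soup.find("table", class_="wikitable")      # only the first table
all_matches = soup.find_all("table", class_="wikitable")  # a list of both tables
target = find_table_by_header(soup, "Ticker")             # chosen by header text
tickers = first_column(target)
print(tickers)
```

On the real page, the same helpers would be applied to a soup built from resp.text, with a check for a None result before reading rows.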