Web scraping - get a tag by the text in its "sibling" tag - Beautiful Soup

Question (votes: 0, answers: 1)

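I am trying to get text from a table on Wikipedia, and I will be doing this for many pages (all of them books, in this case). I want to get the genre of the book.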

(Screenshot: HTML code for the page)

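I need to extract the td that contains the genre, that is, the td in the row whose label text is "Genre".

This is what I am doing: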

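import urllib.request
from bs4 import BeautifulSoup

# url2 holds the URL of the book's Wikipedia page (defined elsewhere)
page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')

for table in soup2.find_all('table', class_='infobox vcard'):
    # Only the sixth row is read, so this breaks when the row count differs
    for tr in table.find_all('tr')[5:6]:
        for td in tr.find_all('td'):
            print(td.get_text(separator="\n"))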

This gets me the genre, but only on some pages, because the number of rows in the table differs from page to page.

Example of a page where this does not work:

https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)

Does anyone know how to find the row by searching for the string "Genre"? Thank you.
beautifulsoup wikipedia
1 answer
0 votes

In this case, you don't have to go through all that trouble. Just try:

import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])

Output:

                     0                                       1
0   First edition cover                     First edition cover
1                Author                          J. D. Salinger
2          Cover artist               E. Michael Mitchell[1][2]
3               Country                           United States
4              Language                                 English
5                 Genre  Realistic fictionComing-of-age fiction
6             Published                           July 16, 1951
7             Publisher               Little, Brown and Company
8            Media type                                   Print
9                 Pages                          234 (may vary)
10                 OCLC                                  287628
11        Dewey Decimal                                  813.54

From here you can extract what you need using standard pandas methods.
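For instance, a minimal sketch of looking up the row by its label instead of its position might look like this. The helper name `infobox_value` is hypothetical, and the sample frame is just a few rows copied from the output above, standing in for `tables[0]`. It assumes the labels sit in the first column and the values in the second, which may not hold for every page or pandas version.

```python
import pandas as pd

def infobox_value(table, label):
    """Return the value next to `label` in a two-column table, or None if absent.

    Hypothetical helper: assumes labels are in the first column and
    values in the second, as in the output shown above.
    """
    # Normalize the label column so the match ignores case and stray spaces
    labels = table.iloc[:, 0].astype(str).str.strip().str.lower()
    matches = table[labels == label.lower()]
    if matches.empty:
        return None
    return matches.iloc[0, 1]

# A few rows copied from the output above, standing in for tables[0]
sample = pd.DataFrame([
    ["Author", "J. D. Salinger"],
    ["Language", "English"],
    ["Genre", "Realistic fictionComing-of-age fiction"],
])

genre = infobox_value(sample, "Genre")
print(genre)
```

With the real page you would pass `tables[0]` instead of `sample`. Because the lookup goes by label, it does not matter which row the genre ends up in.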

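If you would rather stay with Beautiful Soup, as in the question, one option is to find the label cell by its text and then take its sibling td. The sketch below is only an illustration: `find_infobox_value` is a hypothetical helper, and the HTML string is a simplified stand-in for the infobox. It assumes the label is in a th and the value in a td of the same row, so the real Wikipedia markup may need adjustments.

```python
from bs4 import BeautifulSoup

def find_infobox_value(soup, label):
    """Return the text of the <td> that follows the <th> reading `label`.

    Hypothetical helper: assumes each row is <tr><th>label</th><td>value</td></tr>.
    """
    for th in soup.find_all("th"):
        # Compare the header text case-insensitively, ignoring surrounding whitespace
        if th.get_text(strip=True).lower() == label.lower():
            td = th.find_next_sibling("td")
            if td is not None:
                return td.get_text(separator="\n", strip=True)
    return None

# Simplified stand-in for the infobox markup (an assumption, not the real page)
html = """
<table class="infobox vcard">
  <tr><th>Author</th><td>J. D. Salinger</td></tr>
  <tr><th>Genre</th><td>Realistic fiction<br>Coming-of-age fiction</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
print(find_infobox_value(soup, "Genre"))
```

On a real page you would build `soup` from the downloaded HTML, as in the question, and call the helper with "Genre". The row position no longer matters, since the lookup goes through the label text.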