Scraping a Wikipedia table using BeautifulSoup

Question

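I have been trying to scrape a table from Wikipedia using BeautifulSoup, but I have run into some problems.

Page: https://en.wikipedia.org/wiki/New_York_City

Table: "Racial composition"

In the page source, the table appears to start at line 1470.

Here is the code I tried first: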

website_url = requests.get('https://en.wikipedia.org/wiki/New_York_City').text
soup = BeautifulSoup(website_url,'lxml')
table = soup.find('table',{'class':'wikitable sortable collapsible'})

headers = [header.text for header in table.find_all('th')]

table_rows = table.find_all('tr')        
rows = []
for row in table_rows:
   td = row.find_all('td')
   row = [row.text for row in td]
   rows.append(row)

with open('NYC_DEMO.csv', 'w') as f:
   writer = csv.writer(f)
   writer.writerow(headers)
   writer.writerows(row for row in rows if row)

Here is the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-e6000bdafe11> in <module>
      3 table = soup.find('table',{'class':'wikitable sortable collapsible'})
      4 
----> 5 headers = [header.text for header in table.find_all('th')]
      6 
      7 table_rows = table.find_all('tr')

AttributeError: 'NoneType' object has no attribute 'find_all'

I think this is the markup we need to get from the Wikipedia page:

<tbody><tr>
<th>Racial composition</th>
<th>2010<sup id="cite_ref-QuickFacts2010_226-1" class="reference"><a href="#cite_note-QuickFacts2010-226">&#91;224&#93;</a></sup></th>
<th>1990<sup id="cite_ref-pop_228-0" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup></th>
<th>1970<sup id="cite_ref-pop_228-1" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup></th>
<th>1940<sup id="cite_ref-pop_228-2" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup>
</th></tr>
<tr>
<td><a href="/wiki/White_American" class="mw-redirect" title="White American">White</a></td>
<td>44.0%</td>
<td>52.3%</td>
<td>76.6%</td>
<td>93.6%
</td></tr>
<tr>
...

I guess it can't find the right table? There are many tables on that page, so how do I point to the right one?

Thanks for your help.

Answer 1

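"I guess it can't find the right table?"

It seems so, yes. If you inspect the value of table, you will see that it is None, which is why calling find_all on it fails.

If you inspect the table on the page, you will see that its classes are wikitable collapsible collapsed mw-collapsible mw-made-collapsible, and there is no sortable class among them. That is why your program does not find any matching table element.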
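To make the class mismatch concrete, here is a minimal sketch. The HTML string is a hypothetical, trimmed-down stand-in for the real page (not the actual Wikipedia markup), and the variable names are illustrative only. It shows that a multi-word class string has to equal the whole class attribute, while a single class name or a CSS selector is more forgiving.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup standing in for the real page.
html = """
<table class="wikitable collapsible collapsed mw-collapsible">
<tr><th>Racial composition</th><th>2010</th></tr>
<tr><td>White</td><td>44.0%</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-word string is compared against the whole class attribute,
# so this finds nothing and returns None.
missing = soup.find("table", {"class": "wikitable sortable collapsible"})

# A single class name matches any element that carries that class.
by_one_class = soup.find("table", class_="wikitable")

# A CSS selector requires all listed classes, in any order.
by_selector = soup.select_one("table.wikitable.collapsible")

print(missing is None, by_one_class is not None, by_selector is not None)
```

Checking the result against None before calling find_all on it would also turn the AttributeError into an explicit "table not found" case.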

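"There are many tables on that page, so how do I point to the right one?"

First, you could hook onto some unique identifier, such as the element's id, but none is available in your case. If the table had a thead or some kind of caption, you could try that, but again, that is not the case here.

Next, you would go further up the DOM tree and check whether a parent has a unique identifier. If it did, you could add that parent to your selector. Unfortunately, the body of a Wikipedia article appears to be wrapped in one big element, without separating the sections semantically. That makes it harder to scrape.

At this point, I am left with looking at the page in the browser and thinking about how I would identify the table naturally (non-programmatically). You look at it and see that it has Racial composition in its header. You can grab it with something like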

table_heading = soup.find('th', string='Racial composition')   # this gives you the `th`
if table_heading:
    # find_parent returns the nearest enclosing <table>; find_parents would return a list
    table = table_heading.find_parent('table')

There may be other BeautifulSoup APIs that I am not aware of, but you can put this into your code and it should work.
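As a rough end-to-end sketch of this approach: the snippet below locates the table through its header cell and collects the rows for CSV output. The inline HTML is a shortened, hypothetical version of the markup quoted in the question, and an in-memory buffer stands in for the NYC_DEMO.csv file so the example runs without network or disk access. Note that string= only matches a th whose sole content is exactly that text, which is an assumption about the live page.

```python
import csv
import io
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup quoted in the question.
html = """<table class="wikitable">
<tbody><tr>
<th>Racial composition</th>
<th>2010</th>
</tr>
<tr>
<td><a href="/wiki/White_American">White</a></td>
<td>44.0%
</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

# Find the header cell first, then walk up to its enclosing table.
heading = soup.find("th", string="Racial composition")
table = heading.find_parent("table") if heading else None

rows = []
if table is not None:
    for tr in table.find_all("tr"):
        # Collect header and data cells alike, with whitespace trimmed.
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

buffer = io.StringIO()  # stands in for open('NYC_DEMO.csv', 'w', newline='')
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

When writing to a real file, passing newline='' to open avoids blank lines between rows on Windows.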

Answer 2

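The problem is that it won't return a table with class="wikitable sortable collapsible", because no table in the HTML has exactly that class value. You would need to use a regular expression to find a class containing that substring for this to work. Second, .find() only returns the first element it finds. Unless the table you're trying to get has a specific, unique attribute that identifies it, using .find() won't work. If there are multiple elements, you need to use .find_all(), and even then you need to iterate through those elements to get the table you want.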
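A small sketch of the regular-expression idea combined with .find_all(), using made-up markup with three tables (the table contents and the variable names are assumptions for illustration). The pattern is searched within the class names, so it can match several tables, and the loop then picks the one whose first header reads Racial composition.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: three tables, two of which carry the "wikitable" class.
html = """
<table class="infobox"><tr><th>Other table</th></tr></table>
<table class="wikitable collapsible collapsed mw-collapsible">
<tr><th>Racial composition</th></tr>
</table>
<table class="wikitable sortable"><tr><th>Borough</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# The regex is searched within the class names, so both wikitable tables match.
candidates = soup.find_all("table", class_=re.compile("wikitable"))

# Iterate over the candidates and keep the one with the expected first header.
wanted = None
for table in candidates:
    first_header = table.find("th")
    if first_header and first_header.get_text(strip=True) == "Racial composition":
        wanted = table
        break

print(len(candidates), wanted is not None)
```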

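As someone mentioned, you can also use pandas' .read_html(). It returns all of the table tags as a list of DataFrames, and then it is just a matter of finding the index position of the table you want. I've given you both options:

Using pandas: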

import pandas as pd

url = 'https://en.wikipedia.org/wiki/New_York_City'

df = pd.read_html(url)[9]
df.to_csv('NYC_DEMO.csv',index=False)
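The hard-coded index 9 depends on the page layout at the time of writing. As a possible alternative, read_html accepts a match argument that keeps only tables containing the given text. The sketch below uses a small inline HTML string (hypothetical data, not the live page) so it runs offline; it assumes pandas plus one of its HTML parser backends (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page; only the second table mentions the header text.
html = """
<table><tr><th>Borough</th><th>County</th></tr>
<tr><td>Manhattan</td><td>New York</td></tr></table>
<table><tr><th>Racial composition</th><th>2010</th><th>1990</th></tr>
<tr><td>White</td><td>44.0%</td><td>52.3%</td></tr>
<tr><td>Asian</td><td>12.7%</td><td>7.0%</td></tr></table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(html), match="Racial composition")
df = tables[0]
print(df)
```

With the real URL in place of the StringIO object, the same call should avoid having to count tables by hand, as long as the header text stays the same.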

Using BeautifulSoup:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/New_York_City'
website_url = requests.get(url).text
soup = BeautifulSoup(website_url,'html.parser')
tables = soup.find_all('table')
for table in tables:
    # pick the table by the text it contains rather than by its class
    if 'Racial composition' in table.text:
        headers = [header.text.strip() for header in table.find_all('th')]
        rows = []
        for tr in table.find_all('tr'):
            cells = [td.text.strip() for td in tr.find_all('td')]
            if cells:  # the header row has no <td> cells, so skip it
                rows.append(cells)

df = pd.DataFrame(rows, columns=headers)

Output:

print (df)
                 Racial composition 2010[224] 1990[226]   1970[226] 1940[226]
0                             White     44.0%     52.3%       76.6%     93.6%
1                     —Non-Hispanic     33.3%     43.2%  62.9%[227]     92.0%
2         Black or African American     25.5%     28.7%       21.1%      6.1%
3  Hispanic or Latino (of any race)     28.6%     24.4%  16.2%[227]      1.6%
4                             Asian     12.7%      7.0%        1.2%         –
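The headers and some cells in this output still carry footnote markers such as [224] and [227]. If those are unwanted in the CSV, one possible cleanup is a small regular expression. This is only a sketch; strip_footnotes is a hypothetical helper name, and the sample values are copied from the output above.

```python
import re

# Sample values copied from the printed output.
headers = ["Racial composition", "2010[224]", "1990[226]", "1970[226]", "1940[226]"]
row = ["Hispanic or Latino (of any race)", "28.6%", "24.4%", "16.2%[227]", "1.6%"]

def strip_footnotes(text):
    """Remove bracketed numeric citation markers such as [224]."""
    return re.sub(r"\[\d+\]", "", text).strip()

clean_headers = [strip_footnotes(h) for h in headers]
clean_row = [strip_footnotes(c) for c in row]

print(clean_headers)
print(clean_row)
```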
