Wikipedia scraping - need help structuring it

Question    Votes: 3    Answers: 1

I am trying to scrape this Wikipedia page (the URL is in the code below).

I have run into a few problems and would really appreciate your help:

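  1. Some rows have multiple names or links, and I want all of them assigned to the correct country/region. Is there a way to do that?
  2. I want to skip the "Name (native)" column. How can I do that?
  3. If I do scrape the "Name (native)" column, I get gibberish. Is there a way to encode it properly?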
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
source = requests.get(url).text

soup = BeautifulSoup(source, 'lxml')
table = soup.find('table', class_='wikitable').tbody

rows = table.findAll('tr')

columns = [col.text.encode('utf').replace('\xc2\xa0','').replace('\n', '') for col in rows[1].find_all('td')]
print(columns)

Tags: python, pandas, python-2.7, beautifulsoup, wikipedia
1 Answer
2 votes

You can use the pandas function read_html, which returns a list of DataFrames, and take the second DataFrame from that list:

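import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
# read_html returns one DataFrame per table on the page; index 1 is the gazette table
df = pd.read_html(url)[1].head()
print(df)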
       Country/region                                              Name  \
0              Albania       Official Gazette of the Republic of Albania   
1              Algeria                                  Official Gazette   
2              Andorra  Official Bulletin of the Principality of Andorra   
3  Antigua and Barbuda              Antigua and Barbuda Official Gazette   
4            Argentina     Official Gazette of the Republic of Argentina   

                                 Name (native)                    Website  
0  Fletorja Zyrtare E Republikës Së Shqipërisë                 qbz.gov.al  
1                   Journal Officiel d'Algérie              joradp.dz/HAR  
2     Butlletí Oficial del Principat d'Andorra                www.bopa.ad  
3         Antigua and Barbuda Official Gazette    www.legalaffairs.gov.ag  
4    Boletín Oficial de la República Argentina  www.boletinoficial.gob.ar 
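
To skip the "Name (native)" column (point 2 of the question), one option is to drop it after reading the table. The following is a minimal sketch, not part of the original answer: the helper name `drop_native_name` is made up for illustration, and the small `sample` frame stands in for the scraped table so the example runs without network access.

```python
import pandas as pd


def drop_native_name(df, column="Name (native)"):
    """Return a copy of df without the given column (no error if it is absent)."""
    return df.drop(columns=[column], errors="ignore")


# Stand-in for the scraped table, built from two rows of the output above.
sample = pd.DataFrame({
    "Country/region": ["Albania", "Algeria"],
    "Name": ["Official Gazette of the Republic of Albania", "Official Gazette"],
    "Name (native)": ["Fletorja Zyrtare E Republikës Së Shqipërisë",
                      "Journal Officiel d'Algérie"],
    "Website": ["qbz.gov.al", "joradp.dz/HAR"],
})

trimmed = drop_native_name(sample)
print(list(trimmed.columns))
```

With the real page, the same call would be applied to `pd.read_html(url)[1]`, assuming the column is still labelled "Name (native)" there.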

If you check the output, row 26 is problematic, because the wiki page itself contains wrong data there.

The fix is to set the value by row index and column name:

import numpy as np
df.loc[26, 'Name (native)'] = np.nan  # df must be the full table here, not the .head() preview
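
For points 1 and 3 of the question (several names or links in one row, and garbled text), a BeautifulSoup pass over the rows can keep every name and link in a list per country. This is only a sketch under assumptions: it targets Python 3, where `.text` is already Unicode and no `.encode(...)` call is needed (the "gibberish" in the question likely comes from printing a list of encoded byte strings in Python 2). The `HTML` snippet, "Exampleland", the `.example` URLs, and the helper names `clean` and `rows_to_records` are all made up for illustration, and the real table has a different column layout.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the wiki table; the real page has more columns and rows.
HTML = """
<table class="wikitable">
  <tr><th>Country/region</th><th>Name</th><th>Website</th></tr>
  <tr>
    <td>Exampleland</td>
    <td>Official Gazette<br>Official\xa0Bulletin</td>
    <td><a href="http://gazette.example">gazette.example</a>
        <a href="http://bulletin.example">bulletin.example</a></td>
  </tr>
</table>
"""


def clean(text):
    """Replace non-breaking spaces with ordinary ones and collapse whitespace."""
    return " ".join(text.replace("\xa0", " ").split())


def rows_to_records(html):
    """Return one dict per data row, keeping every name and link in a list."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    records = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = tr.find_all("td")
        if len(cells) < 3:
            continue  # not a complete data row
        records.append({
            "country": clean(cells[0].get_text()),
            # each text fragment in the cell (e.g. split by <br>) becomes one name
            "names": [clean(s) for s in cells[1].stripped_strings],
            # every anchor in the cell contributes one link
            "links": [a["href"] for a in cells[2].find_all("a", href=True)],
        })
    return records


records = rows_to_records(HTML)
print(records)
```

Keeping names and links as lists means each country stays on one record; if a flat table is needed afterwards, the lists can be joined into a single string or expanded into one row per value.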