I'm having some trouble cleaning data


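I scraped a table from a Wikipedia page, and the next step is to clean the data. I have loaded the data into a pandas DataFrame, and now I'm running into some problems while cleaning it.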

Below is the code I ran to scrape the table from the Wikipedia page:

import requests
import pandas as pd
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
PostalCode=[]
for row in My_table.findAll('tr')[1:]:
    PostalCode_cell=row.findAll('td')[0]
    PostalCode.append(PostalCode_cell.text)    
print(PostalCode)   
Borough=[]
for row in My_table.findAll('tr')[1:] :
    Borough_cell=row.findAll('td')[1]
    Borough.append(Borough_cell.text)   
print(Borough)
Neighbourhood=[]
for row in My_table.findAll('tr')[1:]:
    Neighbourhood_cell=row.findAll('td')[2]
    Neighbourhood_cell.text.rstrip('\n')
    Neighbourhood.append(Neighbourhood_cell.text)
print(Neighbourhood)
canada=pd.DataFrame({'PostalCode':PostalCode,'Borough':Borough,'Neighborhood':Neighbourhood})
canada.rename(columns = {'PostalCode':'PostalCode','Borough':'Borough','Neighborhood':'Neighborhood'}, inplace = True) 
canada

I tried the groupby function, hoping to get the second expected result below, but it did not work:

canada.groupby(['PostalCode', 'Borough'])

I also tried to remove the "Not assigned" values from Borough:

canada=canada.Borough.drop("Not assigned",axis=0)

But it raises: "['Not assigned'] not found in axis"
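The error likely arises because `drop` removes rows by index label rather than by cell value, and the index here holds integers, not the string "Not assigned". Note also that `canada.Borough.drop(...)` would return a single Series, so assigning it back to `canada` would discard the other columns. A minimal sketch of filtering by value instead, using a small hand-made frame (the sample rows are illustrative assumptions, not the scraped data):

```python
import pandas as pd

# Illustrative sample rows (made up for this sketch, not the scraped table).
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Build a boolean mask on the column values and keep only the matching rows.
filtered = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(filtered)
```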

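Here is the result I expect after cleaning the data:
1. Ignore cells whose Borough value is "Not assigned".
2. Neighbourhoods that share the same PostalCode and Borough should appear in the same row, separated by commas.
3. If a cell has a borough but a "Not assigned" neighbourhood, the neighbourhood should be the same as the borough.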
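The three expected results can be sketched as one short pipeline. This is only one possible approach, run on hand-made sample rows (assumptions for illustration, not the scraped data); the names `raw`, `cleaned`, and `combined` are hypothetical. A bare `groupby` call only builds a grouping object, so an aggregation such as joining strings is needed to produce rows.

```python
import pandas as pd

# Illustrative sample rows (made up for this sketch, not the scraped table).
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Result 1: ignore rows whose Borough is 'Not assigned'.
cleaned = raw[raw['Borough'] != 'Not assigned'].copy()

# Result 3: a 'Not assigned' neighbourhood takes the name of its borough.
missing = cleaned['Neighborhood'] == 'Not assigned'
cleaned.loc[missing, 'Neighborhood'] = cleaned.loc[missing, 'Borough']

# Result 2: join neighbourhoods that share a PostalCode and Borough into one row.
combined = (
    cleaned.groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood']
    .agg(', '.join)
)
print(combined)
```

Step 3 is applied before step 2 here so that the placeholder text never ends up inside a joined string.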

Also, I noticed that the table I scraped has "\n" at the end of every value in Neighbourhood. Should I add any code during scraping to get rid of it?
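One likely cause is that the scraping loop calls `rstrip('\n')` but never stores the result: Python strings are immutable, so `rstrip` returns a new string and leaves the original untouched. A minimal sketch, where `cell_text` is a hypothetical stand-in for `Neighbourhood_cell.text`:

```python
# Stand-in for Neighbourhood_cell.text (an assumed value for illustration).
cell_text = 'Parkwoods\n'

# Calling rstrip() without using its return value changes nothing.
cell_text.rstrip('\n')
unchanged = cell_text

# Keeping the returned string is what actually removes the newline.
stripped = cell_text.rstrip('\n')

Neighbourhood = []
Neighbourhood.append(stripped)
print(Neighbourhood)  # ['Parkwoods']
```

If the data is already in a DataFrame, applying `.str.strip()` to the column is another option.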

Thank you very much for your help.

python pandas dataframe data-cleaning
1 answer

This feels a bit long-winded.

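import pandas as pd

# read_html returns a list with one DataFrame per table on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
canada = tables[0]
# Use the first row as the header, then drop it from the data
canada.columns = canada.iloc[0]
canada = canada.iloc[1:]
# Keep only rows with an assigned borough; copy to avoid chained-assignment warnings
canada = canada[canada.Borough != 'Not assigned'].copy()
# Where the neighbourhood is 'Not assigned', fall back to the borough name
mask = canada['Neighbourhood'] == 'Not assigned'
canada.loc[mask, 'Neighbourhood'] = canada.loc[mask, 'Borough']
canada['Location'] = canada.Borough + ', ' + canada.Neighbourhood
canada = canada.drop(['Borough', 'Neighbourhood'], axis=1)
# reset_index returns a new frame, so assign the result back
canada = canada.reset_index(drop=True)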

Reference:

https://stackoverflow.com/a/49161313/6241235

Edit:

I think @bubble's point about a case-insensitive search is a good one. They suggested canada = canada[canada.loc[:, 'Borough'].str.contains('Not assigned', case=False)], which I had not thought of.
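As quoted, that expression keeps the rows that match "Not assigned" rather than dropping them, so the mask probably needs to be negated with `~` to get the intended effect. A minimal sketch on hand-made rows (the sample data and the names `is_unassigned` and `kept` are assumptions for illustration):

```python
import pandas as pd

# Illustrative sample rows with mixed capitalisation (not the scraped table).
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A'],
    'Borough': ['Not assigned', 'not assigned', 'North York'],
})

# True for every row whose Borough contains the phrase, ignoring case.
is_unassigned = sample['Borough'].str.contains('Not assigned', case=False)

# Negate the mask so the unassigned rows are dropped rather than kept.
kept = sample[~is_unassigned].reset_index(drop=True)
print(kept)
```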
