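I scraped a table from Wikipedia using Pandas and BeautifulSoup and got a list. I want to convert it into a DataFrame, but when I use the pd.DataFrame() function, the result is not what I expected. Please help.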
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))
Everything works fine up to this point, but after that, when I try the following code:
neigh = pd.DataFrame(df)
it returns output with only one row and one column.
You already have a pandas DataFrame, wrapped in a list. You only need to take the first element:
neigh = df[0]
print(neigh)
     Postcode           Borough          Neighbourhood
0         M1A      Not assigned           Not assigned
1         M2A      Not assigned           Not assigned
2         M3A        North York              Parkwoods
3         M4A        North York       Victoria Village
4         M5A  Downtown Toronto           Harbourfront
..        ...               ...                    ...
282       M8Z         Etobicoke              Mimico NW
283       M8Z         Etobicoke     The Queensway West
284       M8Z         Etobicoke  Royal York South West
285       M8Z         Etobicoke         South of Bloor
286       M9Z      Not assigned           Not assigned

[287 rows x 3 columns]
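To see why indexing is all that is needed, here is a minimal offline sketch. It assumes a hand-built list standing in for the value that pd.read_html returns (the two sample rows and the name `tables` are illustrative, not fetched from Wikipedia). The point is only that the outer object is a plain Python list and its elements are already DataFrames.

```python
import pandas as pd

# Stand-in for the result of pd.read_html(...): a plain list of DataFrames.
# The rows below are a hand-typed sample, not the real scraped table.
tables = [
    pd.DataFrame(
        {
            "Postcode": ["M3A", "M4A"],
            "Borough": ["North York", "North York"],
            "Neighbourhood": ["Parkwoods", "Victoria Village"],
        }
    )
]

print(type(tables).__name__)   # the container is a list, not a DataFrame

# Indexing the list gives back the DataFrame itself; no conversion is needed.
neigh = tables[0]
print(type(neigh).__name__, neigh.shape)
```

Passing the whole list to pd.DataFrame() asks pandas to treat each DataFrame as a single cell value, which is why the result did not look like the table.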
You can use the pandas read_html function to read the tables directly from the URL:
>>> url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
>>> tables = pd.read_html(url)
>>> len(tables)
3
>>> tables[0]
     Postcode           Borough          Neighbourhood
0         M1A      Not assigned           Not assigned
1         M2A      Not assigned           Not assigned
2         M3A        North York              Parkwoods
3         M4A        North York       Victoria Village
4         M5A  Downtown Toronto           Harbourfront
..        ...               ...                    ...
282       M8Z         Etobicoke              Mimico NW
283       M8Z         Etobicoke     The Queensway West
284       M8Z         Etobicoke  Royal York South West
285       M8Z         Etobicoke         South of Bloor
286       M9Z      Not assigned           Not assigned

[287 rows x 3 columns]
>>> type(tables[0])
<class 'pandas.core.frame.DataFrame'>
read_html reads all the table tags from the URL and returns a list of DataFrames.
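Because the result is a list, relying on position 0 can break if the page layout changes. As a hedged sketch, the helper below picks a table by the column names it is expected to have. `pick_table` is a hypothetical helper, not part of pandas, and the list here is hand-built sample data standing in for the output of pd.read_html(url).

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame containing all required columns.

    Hypothetical helper for illustration; not a pandas API.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table has columns {sorted(required)}")


# Stand-ins for the list returned by pd.read_html(url); illustrative data only.
tables = [
    pd.DataFrame({"Code": ["M"], "Region": ["Toronto"]}),
    pd.DataFrame(
        {
            "Postcode": ["M3A", "M4A"],
            "Borough": ["North York", "North York"],
            "Neighbourhood": ["Parkwoods", "Victoria Village"],
        }
    ),
]

# Select by content rather than by position in the list.
postcodes = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(postcodes)
```

read_html also accepts a match argument that keeps only tables containing text matching a string or regex, which may be enough on its own when a distinctive word such as a column header appears in the table you want.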
You already have the DataFrame inside df:
print(df[0])