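I'm trying to get two data tables from a web page. I'm using the BeautifulSoup Python library in Google Colab. The URL is: https://www.avamet.org/mx-consulta-diaria.php?id=%%25%%2525&ini=2024-02-27&fin=2024-02-28&token=s0141%21 and you can see that there are two tables on that page.
This is what I'm trying: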
import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_avamet_data(START_DATE=None, END_DATE=None, region='%%25%%25'):
    # All stations: %%25%%25 -------------------------------------------> Doesn't work
    # Only the station 15: c15%25 ---------------------------------------> It works!!!
    url = 'https://www.avamet.org/mx-consulta-diaria.php?id=' + region + '25&ini=' + START_DATE + '&fin=' + END_DATE + '&token=s0141%21'
    print('Getting data from: ' + url)
    response = requests.get(url)
    web_html = BeautifulSoup(response.content, 'html.parser')
    selector = 'table'
    tables = web_html.find_all(selector)
    table_1 = tables[0]
    table_2 = tables[1]
    data_1 = []
    for row in table_1.find_all('tr'):
        cols = row.find_all(['td', 'th'])
        cols = [ele.text.strip() for ele in cols]
        data_1.append([ele for ele in cols if ele])
    data_2 = []
    for row in table_2.find_all('tr'):
        cols = row.find_all(['td', 'th'])
        cols = [ele.text.strip() for ele in cols]
        data_2.append([ele for ele in cols if ele])
    return([pd.DataFrame(data_1), pd.DataFrame(data_2)])
And then:
a = get_avamet_data(START_DATE='2024-02-27', END_DATE='2024-02-28')
a[0]
a[1]
But I get an empty list in the variable tables. However, when I change the region parameter from '%%25%%25' to 'c15%25', it works.
What is the problem?
You can use pandas.read_html() to scrape the tables directly into DataFrames:
import pandas as pd

def get_avamet_data(START_DATE=None, END_DATE=None, region='%%25%%25'):
    url = 'https://www.avamet.org/mx-consulta-diaria.php?id=' + region + '25&ini=' + START_DATE + '&fin=' + END_DATE + '&token=s0141%21'
    print('Getting data from: ' + url)
    return(pd.read_html(url))
a = get_avamet_data(START_DATE='2024-02-27', END_DATE='2024-02-28')
print(a[0])
print(a[1])
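
The function above assembles the query string by hand, which makes it easy to end up with doubled or stray percent-escapes. As a minimal sketch, the URL could instead be built with urllib.parse.urlencode from the standard library, which percent-encodes each value once. The helper name build_avamet_url and the parameter station_id are hypothetical, and the sketch assumes (without confirmation from the site) that the "all stations" value is a literal % and that the token is s0141! before encoding.

```python
from urllib.parse import urlencode, parse_qs

BASE_URL = 'https://www.avamet.org/mx-consulta-diaria.php'

def build_avamet_url(start_date, end_date, station_id='%'):
    """Build the query URL, letting urlencode do the percent-encoding.

    ASSUMPTIONS (not confirmed by the site): a literal '%' selects all
    stations, and the raw token is 's0141!' ('!' encodes to '%21').
    """
    params = {
        'id': station_id,    # raw value; '%' is encoded to '%25' exactly once
        'ini': start_date,   # start date, e.g. '2024-02-27'
        'fin': end_date,     # end date, e.g. '2024-02-28'
        'token': 's0141!',   # assumed raw token
    }
    return BASE_URL + '?' + urlencode(params)

url = build_avamet_url('2024-02-27', '2024-02-28')
print(url)

# Decode the query string again to check what the server would receive.
decoded = parse_qs(url.split('?', 1)[1])
print(decoded['id'], decoded['token'])
```

The resulting URL can be passed to pd.read_html() in the same way as before. Decoding it with parse_qs is a quick way to check which raw values a given URL actually carries before sending the request.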