No response when getting tables from the web with the BeautifulSoup library

I am trying to get two data tables from the web. I am using the BeautifulSoup Python library in Google Colab. The URL is: https://www.avamet.org/mx-consulta-diaria.php?id=%%25%%2525&ini=2024-02-27&fin=2024-02-28&token=s0141%21 You can see that the page contains two tables.

I am trying the following:

import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_avamet_data(START_DATE=None, END_DATE=None, region='%%25%%25'):

  # All stations: %%25%%25  -------------------------------------------> Doesn't work
  # Only the station 15: c15%25 ---------------------------------------> It works!!!

  url = 'https://www.avamet.org/mx-consulta-diaria.php?id=' + region + '25&ini=' + START_DATE + '&fin=' + END_DATE + '&token=s0141%21'

  print('Getting data from: ' + url)

  response = requests.get(url)

  web_html = BeautifulSoup(response.content, 'html.parser')

  selector = 'table'
  tables = web_html.find_all(selector)

  table_1 = tables[0]
  table_2 = tables[1]

  data_1 = []
  for row in table_1.find_all('tr'):
    cols = row.find_all(['td', 'th'])
    cols = [ele.text.strip() for ele in cols]
    data_1.append([ele for ele in cols if ele])

  data_2 = []
  for row in table_2.find_all('tr'):
    cols = row.find_all(['td', 'th'])
    cols = [ele.text.strip() for ele in cols]
    data_2.append([ele for ele in cols if ele])

  return([pd.DataFrame(data_1), pd.DataFrame(data_2)])

Then:

a = get_avamet_data(START_DATE='2024-02-27', END_DATE='2024-02-28')

a[0]
a[1]

However, I get an empty list in the tables variable. When I change the region parameter from '%%25%%25' to 'c15%25', it works.
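
One way to compare the two cases is to rebuild each URL offline and look at the raw id value that gets sent. The sketch below is only a diagnostic, not a confirmed explanation: build_url and raw_id are hypothetical helper names, and build_url simply repeats the string concatenation from get_avamet_data above.

```python
from urllib.parse import urlsplit

BASE = 'https://www.avamet.org/mx-consulta-diaria.php'


def build_url(region, start_date, end_date):
    """Assemble the URL exactly as get_avamet_data does."""
    return (BASE + '?id=' + region + '25&ini=' + start_date
            + '&fin=' + end_date + '&token=s0141%21')


def raw_id(url):
    """Return the id parameter as written in the URL, without percent-decoding."""
    for pair in urlsplit(url).query.split('&'):
        key, _, value = pair.partition('=')
        if key == 'id':
            return value
    return None


for region in ('%%25%%25', 'c15%25'):
    url = build_url(region, '2024-02-27', '2024-02-28')
    print('id=' + raw_id(url))
```

This prints id=%%25%%2525 for the all-stations setting and id=c15%2525 for station 15. In the first value a % is followed by another % instead of two hex digits, which is not a valid percent-escape under RFC 3986, so its interpretation is left to the server. Whether that is related to the empty result is an assumption I have not verified.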

Where is the problem?

python html web-scraping beautifulsoup google-colaboratory
1 Answer

You can use pandas.read_html() to scrape the tables directly into DataFrames:

import pandas as pd

def get_avamet_data(START_DATE=None, END_DATE=None, region='%%25%%25'):

  url = 'https://www.avamet.org/mx-consulta-diaria.php?id=' + region + '25&ini=' + START_DATE + '&fin=' + END_DATE + '&token=s0141%21'

  print('Getting data from: ' + url)

  return(pd.read_html(url))

a = get_avamet_data(START_DATE='2024-02-27', END_DATE='2024-02-28')

print(a[0])
print(a[1])
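
Note that when a page contains no table elements, pandas.read_html() raises a ValueError instead of returning an empty list, so it is worth guarding the call. A minimal sketch of such a guard, assuming the lxml parser is installed; tables_from_html is a hypothetical helper name, and it takes the HTML text rather than the URL so the response can be inspected first:

```python
from io import StringIO

import pandas as pd


def tables_from_html(html):
    """Parse every <table> in an HTML string; return [] when there is none."""
    try:
        # StringIO avoids passing literal HTML, which newer pandas versions deprecate.
        # flavor='lxml' is an assumption: it requires the lxml package.
        return pd.read_html(StringIO(html), flavor='lxml')
    except ValueError:
        # pandas raises ValueError when it finds no tables in the document.
        return []


# Example with the URL built above (needs network access):
#   import requests
#   response = requests.get(url)
#   tables = tables_from_html(response.text)
#   print(response.status_code, len(tables))
```

Seeing an empty list next to the status code and response.text makes it easier to tell whether the server returned an error page or simply no data for that id.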