Scraping a website table with Python and BeautifulSoup returns "None" or an empty list


I am trying to scrape data from this table with a simple request, but when I look the table up by its class, the result is None:

table = soup.find("table", class_ = "hp")

Searching for any table at all returns an empty list:

table = soup.find_all("table")

How can I fix this?

The full code is below:

import requests
import pandas as pd 
from bs4 import BeautifulSoup

url = "https://aviation-safety.net/database/year/2024/1"
response = requests.get(url)
#print(response)

soup = BeautifulSoup(response.text, "lxml")

table = soup.find("table", class_ = "hp")
print(table)

I import pandas because I want to save the data to a .csv file later.

python web-scraping beautifulsoup
1 Answer

The site is blocking you, most likely because of the default User-Agent header (see MDN) that requests sends:

python-requests/<version>
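
To see the exact default value on your own machine, requests provides a helper that builds this string. A minimal sketch that needs no network access:

```python
import requests

# requests builds its default User-Agent from the library name and its version.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # has the form python-requests/<version>
```

A site that filters on this prefix can recognize any request sent without a custom header.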

If you inspect the value of response.text, you will see something like:

Sorry, something went wrong. You can contact us via <email>, should the problem persist.
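
One way to catch this early is to check for a table before parsing any further. The sketch below is only an illustration: the helper name `looks_blocked` and both sample HTML strings are made up, and it uses the built-in "html.parser" instead of "lxml" so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup


def looks_blocked(html: str) -> bool:
    """Return True when the page contains no <table> element at all.

    Hypothetical heuristic: an error page like the one above has no tables,
    which is why soup.find_all("table") came back empty.
    """
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table") is None


# Made-up sample pages, standing in for response.text.
blocked_html = "<html><body>Sorry, something went wrong.</body></html>"
normal_html = '<html><body><table class="hp"><tr><td>x</td></tr></table></body></html>'

print(looks_blocked(blocked_html))  # True
print(looks_blocked(normal_html))   # False
```

In the real script you would pass response.text to the helper and stop with a clear message when it returns True.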

You should set the User-Agent header to something else. For example:

import requests
import pandas as pd 
from bs4 import BeautifulSoup

url = "https://aviation-safety.net/database/year/2024/1"
response = requests.get(url, headers={"User-Agent": "your-user-agent-string"})
#print(response.text)

soup = BeautifulSoup(response.text, "lxml")

table = soup.find("table", class_ = "hp")
print(table)
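
Since the goal is a .csv file, the found table can be turned into a pandas DataFrame. This is a rough sketch under stated assumptions, not the site's actual layout: the function name `table_to_dataframe`, the sample HTML, and its column names are invented, and the first row is assumed to be the header. It uses "html.parser" so it runs without lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html: str, table_class: str = "hp") -> pd.DataFrame:
    """Extract the first table with the given class into a DataFrame.

    Assumes the first row of the table holds the column names.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        raise ValueError(f"no <table class={table_class!r}> found; the request may be blocked")

    # Collect the text of every header and data cell, row by row.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    rows = [row for row in rows if row]  # skip rows without cells
    header, data = rows[0], rows[1:]
    return pd.DataFrame(data, columns=header)


# Made-up sample standing in for response.text; the real columns may differ.
sample_html = """
<table class="hp">
  <tr><th>Date</th><th>Type</th></tr>
  <tr><td>01 JAN 2024</td><td>Example A</td></tr>
  <tr><td>02 JAN 2024</td><td>Example B</td></tr>
</table>
"""

df = table_to_dataframe(sample_html)
csv_text = df.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```

With the real page you would call table_to_dataframe(response.text) and then df.to_csv("output.csv", index=False), where the file name is only a placeholder.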