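I am seeing strange behavior when I try to scrape a web page with the Python requests library. For a reason I don't understand, when I scrape the content of one page I get back the data of a different, apparently random page. Here is an example: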
import requests
from bs4 import BeautifulSoup
def scrape_webpage(url):
    """
    Function to scrape some data from given url
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    data = {'event_title': soup.find('h1').text.lower()}
    data['event_date'] = soup.find('li', {'class': 'header'}).text.split()[1]
    return data
# Test URL
url = 'https://www.tapology.com/fightcenter/events/67412-ufc-on-espn-33'
# First try returns the correct info
first = scrape_webpage(url)
print(first)
# {'event_date': '05.16.2020', 'event_title': 'ufc fight night: overeem vs. harris'}
# A second try changing nothing returns wrong info
second = scrape_webpage(url)
print(second)
# {'event_date': '06.20.2020', 'event_title': 'efm 3'}
# A third try also fails to retrieve the correct data
third = scrape_webpage(url)
print(third)
# {'event_date': '10.05.2010', 'event_title': 'bystriy fight club 1'}
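This behavior repeats with no apparent logic. It is also worth mentioning that I am running this in Google Colab. If I try to scrape a list of URLs, only the first one returns the correct data (and only if it is the first attempt); the rest return data from some random URL. So the question is: how can I fix this behavior?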
You should imitate a real browser, at the very least by sending a User-Agent header.
def scrape_webpage(url):
    """
    Function to scrape some data from given url
    """
    # Send a browser-like User-Agent instead of the default requests one
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4103.61 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    data = {'event_title': soup.find('h1').text.lower()}
    data['event_date'] = soup.find('li', {'class': 'header'}).text.split()[1]
    return data
# Test URL
url = 'https://www.tapology.com/fightcenter/events/67412-ufc-on-espn-33'
for _ in range(10):
    # Repeat the same request; every attempt should now return the same event
    result = scrape_webpage(url)
    print(result)
Output:
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
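Since the question also mentions scraping a list of URLs, here is a minimal sketch of how the same User-Agent fix could be applied once through a shared `requests.Session`, so every request in the loop sends it. The helper names `make_session`, `parse_event` and `scrape_all`, the `timeout` value, and the added `None` check are my own assumptions for illustration; I have not verified this against the site beyond the single-URL test above.

```python
import requests
from bs4 import BeautifulSoup

# Same browser-like User-Agent as in the answer above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/81.0.4103.61 Safari/537.36"
    )
}


def make_session():
    """Return a Session that sends the browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


def parse_event(html):
    """Extract the event title and date with the same selectors as above."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    date_item = soup.find("li", {"class": "header"})
    # Fail loudly if the page layout is not what we expect
    if title is None or date_item is None:
        raise ValueError("expected <h1> and <li class='header'> were not found")
    return {
        "event_title": title.text.strip().lower(),
        "event_date": date_item.text.split()[1],
    }


def scrape_all(urls, timeout=10):
    """Scrape each URL with one shared session; returns {url: parsed data}."""
    results = {}
    with make_session() as session:
        for url in urls:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # stop on HTTP errors instead of parsing them
            results[url] = parse_event(response.content)
    return results
```

Calling `scrape_all([url])` with the test URL above should give the same dictionary as `scrape_webpage(url)`, keyed by URL. Keeping the parsing in `parse_event` also makes it possible to check the selectors against saved HTML without hitting the site.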