Python requests returns the content of a different, random URL

Question (votes: 0, answers: 1)

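I'm seeing strange behavior when I try to scrape a web page with the Python requests library. For some reason I don't understand, when I scrape the page's content I get data from another, apparently random page. Here is an example.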

import requests
from bs4 import BeautifulSoup

def scrape_webpage(url):
    """
    Function to scrape some data from given url
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    data = {'event_title': soup.find('h1').text.lower()}
    data['event_date'] = soup.find('li', {'class': 'header'}).text.split()[1]

    return data

# Test URL 
url = 'https://www.tapology.com/fightcenter/events/67412-ufc-on-espn-33'

# First try returns the correct info
first = scrape_webpage(url)
print(first)   
# {'event_date': '05.16.2020', 'event_title': 'ufc fight night: overeem vs. harris'}

# A second try changing nothing returns wrong info
second = scrape_webpage(url)
print(second)
# {'event_date': '06.20.2020', 'event_title': 'efm 3'}

# A third try also fails to retrieve the correct data
third = scrape_webpage(url)
print(third)
# {'event_date': '10.05.2010', 'event_title': 'bystriy fight club 1'}

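This behavior repeats with no apparent pattern. It may also be worth mentioning that I'm running this in Google Colab. If I try to scrape a list of URLs, only the first one returns the correct data (and only if it is the first attempt); the rest return data from a random URL. So the question is: how can I fix this behavior?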

python web-scraping python-requests urllib3
1 answer
1 vote

You should mimic a real browser, at minimum by sending a User-Agent header.

def scrape_webpage(url):
    """
    Function to scrape some data from given url
    """
    # Send a browser-like User-Agent instead of the default python-requests one
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4103.61 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    data = {'event_title': soup.find('h1').text.lower()}
    data['event_date'] = soup.find('li', {'class': 'header'}).text.split()[1]

    return data

# Test URL 
url = 'https://www.tapology.com/fightcenter/events/67412-ufc-on-espn-33'

for x in range(10):
    # Repeat the same request to check that every attempt returns the same data
    result = scrape_webpage(url)
    print(result)

Output:

{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
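
Since the question mentions scraping a list of URLs, it may be convenient to set the User-Agent once on a requests.Session rather than passing headers on every call. The following is only a minimal sketch under that assumption: make_session and fetch_html are hypothetical helper names, not part of requests, and the User-Agent string is simply the one used in the answer above. Whether the site behaves the same way with a session has not been verified here.

```python
import requests

# Assumed User-Agent string, copied from the answer above; any realistic
# browser string should serve the same purpose.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/81.0.4103.61 Safari/537.36"
)


def make_session(user_agent=USER_AGENT):
    """Return a requests.Session that sends user_agent on every request."""
    session = requests.Session()
    # Session-level headers are merged into each request made via the session.
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch url with the given session and return the raw body bytes."""
    response = session.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

To use it, create one session with `session = make_session()` and pass `fetch_html(session, url)` to BeautifulSoup in place of `response.content` for each URL in the list.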