How do I avoid HTTP error 403 when web scraping with Python?

Question

When I try this code to scrape a web page:

#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re

webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')

row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)

print(len(row_array))

iterator = []

I get an error like this:

  File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 479, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 517, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Does the website think I'm a bot? How can I fix this?

Answer 1

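This is probably because of mod_security or some similar server security feature that blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, which is easily detected). Try setting a known browser user agent with: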

from urllib.request import Request, urlopen

req = Request(
    url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', 
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

This works for me.
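To see what the server sees, it can help to print the User-Agent that urllib sends by default. The following is a minimal sketch using only the standard library; http://example.com/ is a placeholder URL and no request is actually sent.

```python
import urllib.request

# A fresh opener carries the headers urllib sends by default.
opener = urllib.request.build_opener()
default_agent = dict(opener.addheaders)["User-agent"]
print(default_agent)  # something like "Python-urllib/3.x", depending on the interpreter

# Headers set on the Request take precedence over the opener defaults.
# Request stores header names capitalized, hence the "User-agent" lookup key.
req = urllib.request.Request(
    "http://example.com/",  # placeholder URL, not from the question
    headers={"User-Agent": "Mozilla/5.0"},
)
print(req.get_header("User-agent"))
```

A server-side filter that matches on the default value would let the second request through, which is consistent with the fix above.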

By the way, in your code you are missing the () after .read in the urlopen line, but I think that's a typo.

TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason...

Answer 2

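The site is almost certainly blocking you because urllib is identified by its user agent. The same thing happened to me with OfferUp. You can create a new class called AppURLopener that overrides the user agent with Mozilla.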

import urllib.request

# Note: FancyURLopener has been deprecated since Python 3.3.
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')


Answer 3

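"This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected)" - as already mentioned by Stefano Sanfilippo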

from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()

webpage = web_byte.decode('utf-8')

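web_byte is a bytes object returned by the server, and the content of most web pages is encoded as utf-8. Therefore you need to decode web_byte using the decode method.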
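Since not every page is utf-8, one option is to read the charset from the response's Content-Type header and fall back to utf-8 only when none is declared. This is a sketch under that assumption; decode_body is a hypothetical helper name, and the Message object below stands in for the headers of a real response so the example runs without a network connection.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode raw bytes using the charset declared in Content-Type, if any."""
    charset = headers.get_content_charset() or fallback
    # errors="replace" avoids an exception on stray bytes.
    return raw.decode(charset, errors="replace")


# Stand-in for response headers (hypothetical values for illustration).
fake_headers = Message()
fake_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
print(decode_body("café".encode("latin-1"), fake_headers))
```

With a real response the call would look like decode_body(resp.read(), resp.headers), because the headers object returned by urlopen provides the same get_content_charset method.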

This solved the whole problem I was having while trying to scrape a website using PyCharm.

P.S. -> I use Python 3.4

Answer 4

Based on the previous answers, this worked for me with Python 3.7, with the timeout increased to 10.

from urllib.request import Request, urlopen

req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)

Answer 5

Adding a cookie to the request headers worked for me.

from urllib.request import Request, urlopen

# Function to get the page content
def get_page_content(url, head):
  """
  Function to get the page content
  """
  req = Request(url, headers=head)
  return urlopen(req)

url = 'https://example.com'
head = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
  'Accept-Encoding': 'none',
  'Accept-Language': 'en-US,en;q=0.8',
  'Connection': 'keep-alive',
  'Referer': 'https://example.com',
  'cookie': """your cookie value ( you can get that from your web page) """
}

data = get_page_content(url, head).read()
print(data)
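As an alternative to pasting a cookie value by hand, the standard library can collect cookies set by the server and send them back automatically. This is a different technique from the one above, sketched under the assumption that the site hands out its cookie on an ordinary first request; make_session_opener is a hypothetical helper name.

```python
import http.cookiejar
import urllib.request


def make_session_opener(user_agent="Mozilla/5.0"):
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_session_opener()
# opener.open(url) would now store any cookies the server sets in `jar`
# and include them in later requests made through the same opener.
print(len(jar))  # no cookies yet, since nothing has been requested
```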

Answer 6

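If you feel guilty about faking the user agent as Mozilla (see the comment on the top answer from Stefano), it can also work with a non-urllib user agent. This worked for the sites I reference: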

    import urllib.request as urlrequest

    req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
    urlrequest.urlopen(req, timeout=10).read()

My application tests validity by scraping specific links that I refer to in my articles. It is not a generic scraper.

Answer 7

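Since the page works in a browser but not when called from a Python program, it seems that the web app serving that URL recognizes that you are not requesting the content with a browser.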

Demonstration:

curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'

...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>

The content in r.txt has the status line:

HTTP/1.1 403 Forbidden

Try sending a 'User-Agent' header that mimics a web client.

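NOTE: The page contains Ajax calls that create the table you probably want to parse. You'll need to check the JavaScript logic of the page, or simply use a browser debugger (like the Firebug / Net tab), to see which URL you need to call to get the table's content.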

Answer 8

You can use urllib's build_opener like this:

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'), ('Accept','text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8'), ('Accept-Encoding','gzip, deflate, br'),\
    ('Accept-Language','en-US,en;q=0.5' ), ("Connection", "keep-alive"), ("Upgrade-Insecure-Requests",'1')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(url, "test.xlsx")
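One caveat with advertising 'gzip, deflate, br' in Accept-Encoding: urllib does not decompress response bodies itself, so the saved file may contain compressed bytes if the server honours the header. The sketch below handles only the gzip case and passes everything else through unchanged; decode_content is a hypothetical helper name, and deflate and br are deliberately not covered.

```python
import gzip


def decode_content(body, content_encoding):
    """Undo gzip compression when the server reports it; return other bodies as-is."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(body)
    return body


# Self-contained demonstration with locally compressed bytes.
compressed = gzip.compress(b"hello")
print(decode_content(compressed, "gzip"))
```

With a real response this would be called as decode_content(resp.read(), resp.headers.get("Content-Encoding")). A simpler option is to leave Accept-Encoding out of the headers entirely.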

Answer 9

You can try it in two ways. The details are in this link.

1) Via pip

pip install --upgrade certifi

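2) If that doesn't work, try running the Install Certificates.command script that comes bundled with Python 3.* for Mac (go to your Python installation location and double-click the file):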

open /Applications/Python\ 3.*/Install\ Certificates.command

Answer 10

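I ran into this same problem and was not able to solve it using the answers above. I ended up getting around the issue by using requests.get() and then using the .text of the result instead of read():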

from requests import get

req = get(link)
result = req.text

Answer 11

A simple and direct way:

from bs4 import BeautifulSoup
import requests

response = requests.get(url)
web_page = response.text

soup = BeautifulSoup(web_page, "html.parser")

Answer 12

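I pulled my hair out over this for a while, and the answer turned out to be quite simple. I checked the response text and found a "URL signature expired" message, which you normally wouldn't see unless you inspect the response text.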

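This means some URLs simply expire, usually for security purposes. Try getting the URL again and updating it in your script. If there is no new URL for the content you're trying to scrape, then unfortunately you can't scrape it.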
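With urllib, a 403 raises HTTPError before you get a chance to look at the body, but the exception itself can be read like a response. Here is a minimal sketch of checking the response text on failure; the fetch helper and its opener parameter are hypothetical names introduced for illustration (the parameter only exists so the function can be exercised without a network connection).

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch(url, user_agent="Mozilla/5.0", timeout=10, opener=urlopen):
    """Return (status_code, body_bytes), even when the server answers with an error."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as err:
        # HTTPError doubles as a response object, so the error page is readable.
        return err.code, err.read()
```

Calling status, body = fetch(url) and printing body.decode('utf-8', errors='replace') when status is 403 would surface messages such as the one described above.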

Answer 13

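Open the browser's developer tools and go to the Network tab. Select the request for the item you want to scrape; its expanded details include the User-Agent, which you can copy into your request headers.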

Answer 14

Sometimes none of these tricks work. In that case the last resort is to get the content from Google Cache.

import requests

# The headers 
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}

# The URL you want to scrap
url_2_scrap = 'https://www.my_url.com'

# Full URL to get the content 
url_full = 'https://webcache.googleusercontent.com/search?q=cache:' + url_2_scrap

# Response of the request
response = requests.get(url_full, headers=headers)

# If the status is good,
if response.status_code == 200:
    print("OK! It works fine! ;-)")
# If its not good,
else:
    print("It doesn't work :-(")
