The website always returns a <Response 403>

Question

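I'm trying to fetch an HTML page from a website (cardmarket) with the requests library, but I always get a 403 response. I noticed that they have added a "connection security check" page, which reappears every time I delete my cookies. I think they have just added Cloudflare.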

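I read online that the headers being sent might be the problem, and that using a proxy could help, but the result is the same. Here is the code I'm using (I load the headers from a file that contains several different header sets):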

import requests
import pandas as pd
import yaml

with open("headers.yml") as f_headers:
    browser_headers = yaml.safe_load(f_headers)

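# NOTE: `response` must already hold the HTML of a proxy-list page; the request
# that fetched it is not included in this snippet.
proxy_list = pd.read_html(response.text)[0]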
proxy_list["url"] = "http://" + proxy_list["IP Address"] + ":" + proxy_list["Port"].astype(str)
print(proxy_list.head())

https_proxies = proxy_list[proxy_list["Https"] == "yes"]
print(https_proxies.count())  # a bare expression prints nothing outside a notebook

url = "https://httpbin.org/ip"
good_proxies = set()
headers = browser_headers["Chrome"]
for proxy_url in https_proxies["url"]:
    proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }
    
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=2)
        good_proxies.add(proxy_url)
        print(f"Proxy {proxy_url} OK, added to good_proxy list")
    except Exception:
        pass


url = "https://www.cardmarket.com/en/Magic"
for browser, headers in browser_headers.items():
    print(f"\n\nUsing {browser} headers\n")
    for proxy_url in good_proxies:
        proxies = {
            "http": proxy_url,
            "https": proxy_url,
        }
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=2)
            # The page is HTML, not JSON, so response.json() would raise and be
            # misreported as a proxy failure; print the status code instead.
            print(proxy_url, response.status_code)
        except requests.RequestException:
            print(f"Proxy {proxy_url} failed, trying another one")

I'm new to scraping, so I don't understand why it keeps returning this response. Can someone explain why this happens and how to fix it?

Answer 1

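I have read your question. Could you first check whether the website allows the use of proxies? I looked at their terms of use on this page: https://www.cardmarket.com/en/Magic/Policies/GeneralTermsAndConditions, and it seems to me that the site already offers an API for third parties. Why scrape when accessing the API would be faster? If web scraping is not illegal on this site, try increasing the timeout first.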

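I also found that, when visiting the site, the home page cannot be accessed directly because of a security verification step. The site may use proxy-detection technology. If it is not illegal, you could try following a tutorial such as this one, which simulates a web browser that can get through the CAPTCHA: https://www.quora.com/How-do-you-bypass-Captcha-when-scraping.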
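To tell a security-check page apart from an ordinary "forbidden" response, a small helper can inspect the response before deciding what to do next. The following is only a minimal sketch: the function name `looks_like_cloudflare_challenge` is hypothetical, and the fallback on the `Server` header plus the "Just a moment" wording is an assumption about what the interstitial page usually looks like, not something confirmed for cardmarket.

```python
from typing import Mapping


def looks_like_cloudflare_challenge(
    status_code: int, headers: Mapping[str, str], body: str = ""
) -> bool:
    """Heuristically decide whether a response is a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # Cloudflare marks challenge responses with this header.
    if lowered.get("cf-mitigated") == "challenge":
        return True

    # Fallback heuristic (assumption): a blocked status served by Cloudflare
    # together with the wording the interstitial page typically contains.
    served_by_cloudflare = "cloudflare" in lowered.get("server", "")
    blocked = status_code in (403, 503)
    has_marker = "just a moment" in body.lower()
    return served_by_cloudflare and blocked and has_marker
```

With requests, it could be called as `looks_like_cloudflare_challenge(response.status_code, response.headers, response.text)`. If it returns True, changing headers or proxies is unlikely to help, because the page expects a real browser to complete the check.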

Answer 2

If anyone else is struggling with this problem, I found a solution that uses undetected_chromedriver. This is the code I used:

import time

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--headless")
driver = uc.Chrome(use_subprocess=True, options=options)
try:
    # driver.get() returns None, so there is nothing to assign
    driver.get("https://www.cardmarket.com/en/Magic/Products/Singles/March-of-the-Machine/Ozolith-the-Shattered-Spire")
    driver.maximize_window()
    time.sleep(6)  # give the security check time to finish
    html_code = driver.page_source  # get the HTML code of the page
    driver.save_screenshot("datacamp.png")
finally:
    driver.quit()  # close the browser and stop the chromedriver process

# write the HTML code to the text file
with open("output.txt", "w", encoding="utf-8") as text_file:
    text_file.write(html_code)
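The fixed `time.sleep(6)` in the code above either wastes time or is too short when the check is slow. One alternative is to poll until the check page is gone. This is a rough sketch using only the standard library; `wait_until` is a hypothetical helper, and the page-title text in the usage comment is an assumption about what the security check shows.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage with the `driver` object from the answer; the title text
# is an assumption about the interstitial page:
#   cleared = wait_until(lambda: "Just a moment" not in driver.title, timeout=20)
```

The helper returns False on timeout instead of raising, so the caller can decide whether to retry, save a screenshot for debugging, or give up.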
