Python requests.get(url) returns empty content in Colab

Problem description · votes: 0 · answers: 1

I am scraping a website with requests, but although response.status_code returns 200, there is no content in response.text or response.content.

The same code works fine against another site, and it also runs fine in my local Jupyter environment, but for some reason I cannot get past the firewall for the URL below when running in Colab.

Can you give me some advice?

Problem URL:

https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1

import requests
from bs4 import BeautifulSoup as bs

url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Whale/3.25.232.19 Safari/537.36'}
response = requests.get(url, headers=headers, data={'buscar':100000})
soup = bs(response.content, "html.parser")
soup
<br/>
<br/>
<center>
<h2>
The request / response that are contrary to the Web firewall security policies have been blocked.
</h2>
<table>
<tr>
<td>Detect time</td>
<td>2024-03-12 21:52:05</td>
</tr>
<tr>
<td>Detect client IP</td>
<td>35.236.245.49</td>
</tr>
<tr>
<td>Detect URL</td>
<td>https://gall.dcinside.com/board/view/</td>
</tr>
</table>
</center>
<br/>

I have tried changing the User-Agent, switching https to http, and the other suggestions from similar questions; none of it works.

python python-requests web-crawler
1 Answer
0 votes

If you are running into problems making HTTP requests with the requests module in Google Colab, there are several possible causes for this behavior:

1. Firewall or network restrictions: Sometimes network or firewall restrictions prevent the notebook from reaching external resources. If you are behind a proxy or firewall, you may need to configure proxy settings in the notebook.

Set the proxy settings in your notebook with the following snippet:

import os

os.environ['HTTP_PROXY'] = 'http://your_proxy_address:your_proxy_port'
os.environ['HTTPS_PROXY'] = 'http://your_proxy_address:your_proxy_port'
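
Alternatively, requests accepts a proxies mapping on the call itself, which keeps the setting scoped to one request instead of mutating the whole process environment. The proxy address below is the same placeholder as above; substitute a proxy you actually control:

import requests

# Placeholder proxy endpoint -- replace with a real proxy you control.
proxies = {
    'http': 'http://your_proxy_address:your_proxy_port',
    'https': 'http://your_proxy_address:your_proxy_port',
}

# Scoped to this single request rather than the whole environment.
response = requests.get(
    'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1',
    proxies=proxies,
    timeout=10,
)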

2. Blocked site: If the website you are trying to reach is blocked for the Colab environment (or itself blocks Colab's IP range), you will not be able to make requests to it.
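
In this case the block is visible in the body itself: the firewall answers with HTTP 200 but returns the warning page shown in the question. A quick check for the marker string from that page (a minimal sketch, not an official API) makes the failure explicit instead of silently parsing an empty page:

import requests

url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# The firewall responds with 200, so status_code alone is misleading;
# look for the warning text embedded in the body instead.
if 'Web firewall security policies' in response.text:
    print('Blocked by the web application firewall (likely the Colab egress IP).')
else:
    print('Got real page content, length:', len(response.text))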

Also, add all the headers a real browser would send to avoid being blocked. Here is a modified version of your code:

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse, parse_qs

import os
os.environ['HTTP_PROXY'] = 'http://your_proxy_address:your_proxy_port'
os.environ['HTTPS_PROXY'] = 'http://your_proxy_address:your_proxy_port'

def get_response_by_passing_headers(url):

    # We are parsing query parameters from the URL to pass it to the request
    parsed_url = urlparse(url)
    query_params = parse_qs(parsed_url.query)
    params = {key: value[0] for key, value in query_params.items()}

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-GB,en;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
        }

    # Rebuild the base URL from the parsed pieces and send the request
    # with the query parameters and the full browser-like header set
    base_url = f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}"
    response = requests.get(base_url, params=params, headers=headers)
    return response

url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
response = get_response_by_passing_headers(url)
soup = bs(response.content, "html.parser")
print(soup)
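
One further refinement, my own sketch rather than part of the original answer: moving the headers onto a requests.Session means any cookies set by the first response are replayed automatically on later requests, which some firewalls expect from a real browser:

import requests

session = requests.Session()
# Headers set once on the session apply to every request made through it.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Accept-Language': 'en-GB,en;q=0.9',
})

# Cookies from each response are stored on the session and sent back
# automatically with subsequent requests.
response = session.get('https://gall.dcinside.com/board/view/',
                       params={'id': 'piano', 'no': '1',
                               'exception_mode': 'notice', 'page': '1'})
print(response.status_code, len(response.text))

If the firewall is blocking Colab's egress IP range itself, though, no combination of headers or cookies will help; routing through a proxy or running the notebook locally (which already works for you) remains the fallback.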