Strange error using requests and proxies in a Heroku production deployment

Question

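I built an application that uses proxies and Python's requests module to check for websites that are not indexed. I scrape the Google results page,

www.google.com/search?site:{url}&num=3

and check for the specific phrase that Google shows when it cannot find the site.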

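# checking logic (excerpt from a method; needs `import re` and `import bs4`)
            response = self.proxy_request(INDEXING_SEARCH_STRING.format(current_url))
            if response.status_code != 200:
                # The search request itself failed, so the result is unknown.
                return current_url, False, "failed"
            soup = bs4.BeautifulSoup(response.text, "html.parser")
            # English phrase expected on the results page when nothing is found.
            not_indexed_regex = re.compile("did not match any documents")
            if soup(text=not_indexed_regex):
                return current_url, False, "checked"
            else:
                # Phrase not found: the URL is treated as indexed.
                print(response.text)
                return current_url, True, "checked"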

# proxy requests (method excerpt; needs `import time` and `import requests`)
    def proxy_request(self, url, **kwargs):
        fail_count = 0
        max_failures = 3  # Adjust this threshold as needed
        print("Evaluating: ", self.url_manager.current_url_index, "URL: ", url)
        while fail_count < max_failures:
            current_proxy = self.proxy_manager.get_proxy_for_request()

            if current_proxy is None:
                # No proxy available: fall back to a direct request.
                ProgressManager.update_progress("All given proxy failed")
                return requests.get(url, **kwargs)

            try:
                response = requests.get(url, proxies=current_proxy, timeout=20)
                if response.status_code == 200:
                    print("Success!")
                    self.proxy_manager.update_proxy()
                    return response
                else:
                    # Non-200 response: rotate to the next proxy and try again.
                    print("Failed!", response.status_code)
                    ProgressManager.update_progress("Proxy failing with status code: " + str(response.status_code))
                    time.sleep(0.5)
                    self.proxy_manager.update_proxy()
            except Exception as e:
                print("Failed!", e)
                fail_count += 1
                self.proxy_manager.update_proxy()
                ProgressManager.update_progress(f"Request failed! {e.__class__.__name__}. ")
                # Leaves the retry loop after the first exception.
                break
        # Final fallback: wait, then send the request without a proxy.
        time.sleep(5)
        return requests.get(url, timeout=20)
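
For comparison, here is a minimal sketch of a bounded retry loop in which every failed attempt, whether an exception or a non-200 status, counts toward the limit. In the loop above, the break in the except branch ends the loop after the first exception, and non-200 responses never increment fail_count. This is not a drop-in replacement: fetch_with_proxies, fetch, and the proxies argument are hypothetical stand-ins for proxy_request, requests.get, and the proxy manager, and fetch is injected so the logic can be exercised without network access.

```python
from typing import Any, Callable, Iterable, Optional


def fetch_with_proxies(
    url: str,
    proxies: Iterable[Optional[dict]],
    fetch: Callable[..., Any],
    max_failures: int = 3,
):
    """Try each proxy in turn; stop using proxies after `max_failures` failures.

    `fetch` is assumed to behave like `requests.get`: it accepts `proxies=`
    and `timeout=` and returns an object with a `status_code` attribute.
    """
    failures = 0
    for proxy in proxies:
        if failures >= max_failures:
            break
        try:
            response = fetch(url, proxies=proxy, timeout=20)
        except Exception as exc:  # e.g. a proxy or connection error
            print("Request failed:", exc.__class__.__name__)
            failures += 1
            continue
        if response.status_code == 200:
            return response
        print("Proxy returned status", response.status_code)
        failures += 1
    # Every proxy failed or the limit was hit: fall back to a direct request.
    return fetch(url, timeout=20)
```

With the real library, fetch would be requests.get and proxies a list of dicts such as the ones the proxy manager hands out.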

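On my local machine it works fine, with or without proxies. But when I deploy it on Heroku, it marks some unindexed websites as True, "checked", even though the same application running on my own device handles those sites correctly.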

However, it works fine when no proxies are given; the error appears only when proxies are supplied.

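Also, if there is a simpler way to get around the H12 timeout error for long-running processes that does not require running an additional server, please let me know.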

It works on localhost, so I cannot debug the deployment effectively. Sometimes the proxies also fail with this error:

HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=site:{{URL}}/&num=1 (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 401 Auth Failed ip_blacklisted: 3.85.57.0/24')))

How can I solve this?

Answer 1

I found that the results were being returned in different languages, so the specified pattern "did not match any documents" may or may not appear.
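
To illustrate why this produces a false True, here is a small sketch that mirrors the check from the question with plain regular expressions. The helper looks_indexed and both sample pages are made up for illustration; the second page is only a placeholder for a localized notice, not real Google output.

```python
import re

NOT_INDEXED = re.compile("did not match any documents")


def looks_indexed(html: str) -> bool:
    """Mirror the question's check: indexed unless the English phrase appears."""
    return NOT_INDEXED.search(html) is None


# Made-up sample pages, used only to exercise the check.
english_page = "<p>Your search did not match any documents.</p>"
localized_page = "<p>(no-results notice in another language)</p>"

print(looks_indexed(english_page))    # False: phrase found, site not indexed
print(looks_indexed(localized_page))  # True: phrase missing, false positive
```

Any page that lacks the English phrase, including a no-results page in another language, is reported as indexed.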

A simple fix is to use a modified Google query,

www.google.com/search?site:{url}&num=3&hl=en

The hl=en part forces Google to return the page in English.
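
As a rough sketch of this fix (not the asker's actual code), the query string can be built with the standard library so that hl=en is always present. The helper name build_index_query is hypothetical, and the q=site:... form is an assumption taken from the URL shown in the error message in the question.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com/search"


def build_index_query(url: str, num: int = 3, lang: str = "en") -> str:
    """Return a Google `site:` search URL that pins the interface language.

    Hypothetical helper: `hl` is the parameter named in the answer, and
    `q=site:...` follows the URL that appears in the question's error message.
    """
    params = {"q": f"site:{url}", "num": num, "hl": lang}
    return f"{GOOGLE_SEARCH}?{urlencode(params)}"


print(build_index_query("example.com"))
# https://www.google.com/search?q=site%3Aexample.com&num=3&hl=en
```

The resulting string can be passed to proxy_request in place of the formatted INDEXING_SEARCH_STRING.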
