如何通过python webScrapping避免“请确认您是人类”?

问题描述 投票:0回答:1

我一直在尝试使用python获取网站的一些信息。我曾尝试使用请求和硒来获取网站的HTML代码,但我总是会得到此HTML。我猜该网站意识到不是真正的人在进行搜索,因此拒绝访问。有什么方法可以解决此问题并获取该网站的HTML代码?

<html lang="en"><head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Access to this page has been denied.</title>
    <link href="https://fonts.googleapis.com/css?family=Open+Sans:300" rel="stylesheet">
    <style>
        html, body {
            margin: 0;
            padding: 0;
            font-family: 'Open Sans', sans-serif;
            color: #000;
        }

        a {
            color: #c5c5c5;
            text-decoration: none;
        }

        .container {
            align-items: center;
            display: flex;
            flex: 1;
            justify-content: space-between;
            flex-direction: column;
            height: 100%;
        }

        .container > div {
            width: 100%;
            display: flex;
            justify-content: center;
        }

        .container > div > div {
            display: flex;
            width: 80%;
        }

        .customer-logo-wrapper {
            padding-top: 2rem;
            flex-grow: 0;
            background-color: #fff;
            visibility: visible;
        }

        .customer-logo {
            border-bottom: 1px solid #000;
        }

        .customer-logo > img {
            padding-bottom: 1rem;
            max-height: 50px;
            max-width: 100%;
        }

        .page-title-wrapper {
            flex-grow: 2;
        }

        .page-title {
            flex-direction: column-reverse;
        }

        .content-wrapper {
            flex-grow: 5;
        }

        .content {
            flex-direction: column;
        }

        .page-footer-wrapper {
            align-items: center;
            flex-grow: 0.2;
            background-color: #000;
            color: #c5c5c5;
            font-size: 70%;
        }

        @media (min-width: 768px) {
            html, body {
                height: 100%;
            }
        }
    </style>
    <!-- Custom CSS -->

        <link rel="stylesheet" type="text/css" href="https://d33a4decm84gsn.cloudfront.net/static/partners/perimeterx/perimeterx.css">

<script type="text/javascript" async="" src="https://www.gstatic.com/recaptcha/releases/zItNOfzbrqVGbb4QFYpPpcrw/recaptcha__es.js"></script><script src="/Z5wgH7n9/captcha/captcha.js?a=c&amp;u=ad14b320-8116-11ea-9d99-a1ff7eeb44e0&amp;v=&amp;m=0"></script><script src="https://www.recaptcha.net/recaptcha/api.js?hl=es-ES"></script><script src="/Z5wgH7n9/init.js"></script><a tabindex="-1" aria-hidden="true" href="/colleges/yale-university/?_pxhc=1587174500133" rel="nofollow" target="_blank" style="width: 0px; height: 0px; font-size: 0px; line-height: 0;"></a></head>

<body>
<section class="container">
    <div class="customer-logo-wrapper">
        <div class="customer-logo">
            <img src="https://www.niche.com/about/wp-content/themes/niche-about/images/about-home/stacked-green.svg" alt="Logo">
        </div>
    </div>
    <div class="page-title-wrapper">
        <div class="page-title">
            <h1>Please verify you are a human</h1>
        </div>
    </div>
    <div class="content-wrapper">
        <div class="content">

            <div id="px-captcha"><div class="g-recaptcha" data-sitekey="6Lcj-R8TAAAAABs3FrRPuQhLMbp5QrHsHufzLf7b" data-callback="handleCaptcha" data-theme="dark"><div style="width: 304px; height: 78px;"><div><iframe src="https://www.google.com/recaptcha/api2/anchor?ar=1&amp;k=6Lcj-R8TAAAAABs3FrRPuQhLMbp5QrHsHufzLf7b&amp;co=aHR0cHM6Ly93d3cubmljaGUuY29tOjQ0Mw..&amp;hl=es&amp;v=zItNOfzbrqVGbb4QFYpPpcrw&amp;theme=dark&amp;size=normal&amp;cb=19z4nanjwlu" width="304" height="78" role="presentation" name="a-s7me84fdbal4" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox"></iframe></div><textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid rgb(193, 193, 193); margin: 10px 25px; padding: 0px; resize: none; display: none;"></textarea></div><iframe style="display: none;"></iframe></div></div>
            <p>
                Access to this page has been denied because we believe you are using automation tools to browse the
                website.
            </p>
            <p>
                This may happen as a result of the following:
            </p>
            <ul>
                <li>
                    Javascript is disabled or blocked by an extension (ad blockers for example)
                </li>
                <li>
                    Your browser does not support cookies
                </li>
            </ul>
            <p>
                Please make sure that Javascript and cookies are enabled on your browser and that you are not blocking
                them from loading.
            </p>
            <p>
                Reference ID: #ad14b320-8116-11ea-9d99-a1ff7eeb44e0
            </p>
        </div>
    </div>
    <div class="page-footer-wrapper">
        <div class="page-footer">
            <p>
                Powered by
                <a href="https://www.perimeterx.com/whywasiblocked">PerimeterX</a>
                , Inc.
            </p>
        </div>
    </div>
</section>
<!-- Px -->
<script>
    window._pxAppId = 'PXZ5wgH7n9';
    window._pxJsClientSrc = '/Z5wgH7n9/init.js';
    window._pxFirstPartyEnabled = true;
    window._pxVid = '';
    window._pxUuid = 'ad14b320-8116-11ea-9d99-a1ff7eeb44e0';
    window._pxHostUrl = '/Z5wgH7n9/xhr';
</script>
<script>
    var s = document.createElement('script');
    s.src = '/Z5wgH7n9/captcha/captcha.js?a=c&u=ad14b320-8116-11ea-9d99-a1ff7eeb44e0&v=&m=0';
    var p = document.getElementsByTagName('head')[0];
    p.insertBefore(s, null);
    if (true) {
        s.onerror = function () {
            s = document.createElement('script');
            var suffixIndex = '/Z5wgH7n9/captcha/captcha.js?a=c&u=ad14b320-8116-11ea-9d99-a1ff7eeb44e0&v=&m=0'.indexOf('captcha.js');
            var temperedBlockScript = '/Z5wgH7n9/captcha/captcha.js?a=c&u=ad14b320-8116-11ea-9d99-a1ff7eeb44e0&v=&m=0'.substring(suffixIndex);
            s.src = '//captcha.px-cdn.net/PXZ5wgH7n9/' + temperedBlockScript;
            p.parentNode.insertBefore(s, p);
        };
    }
</script>

<!-- Custom Script -->



</body></html>
python selenium python-requests screen-scraping
1个回答
0
投票

很明显,该网站能够识别您的机器人。由于我不知道您要抓取哪个网站,因此无法确定此特定方法是否有效。

尝试更改用户代理。默认情况下,chromedriver的用户代理与通常的Chrome浏览器不同。

from selenium.webdriver.chrome.options import Options
from selenium import webdriver

options = Options()
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36")

driver = webdriver.Chrome(chromedriver,chrome_options=options)
© www.soinside.com 2019 - 2024. All rights reserved.