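I have been trying to use Python to get some information from a website. I have tried both requests and Selenium to fetch the site's HTML, but I always get the HTML below instead. My guess is that the site detects that the request is not coming from a real person and denies access. Is there any way to get around this and retrieve the site's actual HTML?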
<html lang="en"><head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Access to this page has been denied.</title>
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300" rel="stylesheet">
<style>
html, body {
margin: 0;
padding: 0;
font-family: 'Open Sans', sans-serif;
color: #000;
}
a {
color: #c5c5c5;
text-decoration: none;
}
.container {
align-items: center;
display: flex;
flex: 1;
justify-content: space-between;
flex-direction: column;
height: 100%;
}
.container > div {
width: 100%;
display: flex;
justify-content: center;
}
.container > div > div {
display: flex;
width: 80%;
}
.customer-logo-wrapper {
padding-top: 2rem;
flex-grow: 0;
background-color: #fff;
visibility: visible;
}
.customer-logo {
border-bottom: 1px solid #000;
}
.customer-logo > img {
padding-bottom: 1rem;
max-height: 50px;
max-width: 100%;
}
.page-title-wrapper {
flex-grow: 2;
}
.page-title {
flex-direction: column-reverse;
}
.content-wrapper {
flex-grow: 5;
}
.content {
flex-direction: column;
}
.page-footer-wrapper {
align-items: center;
flex-grow: 0.2;
background-color: #000;
color: #c5c5c5;
font-size: 70%;
}
@media (min-width: 768px) {
html, body {
height: 100%;
}
}
</style>
<!-- Custom CSS -->
<link rel="stylesheet" type="text/css" href="https://d33a4decm84gsn.cloudfront.net/static/partners/perimeterx/perimeterx.css">
<script type="text/javascript" async="" src="https://www.gstatic.com/recaptcha/releases/zItNOfzbrqVGbb4QFYpPpcrw/recaptcha__es.js"></script><script src="/Z5wgH7n9/captcha/captcha.js?a=c&u=ad14b320-8116-11ea-9d99-a1ff7eeb44e0&v=&m=0"></script><script src="https://www.recaptcha.net/recaptcha/api.js?hl=es-ES"></script><script src="/Z5wgH7n9/init.js"></script><a tabindex="-1" aria-hidden="true" href="/colleges/yale-university/?_pxhc=1587174500133" rel="nofollow" target="_blank" style="width: 0px; height: 0px; font-size: 0px; line-height: 0;"></a></head>
<body>
<section class="container">
<div class="customer-logo-wrapper">
<div class="customer-logo">
<img src="https://www.niche.com/about/wp-content/themes/niche-about/images/about-home/stacked-green.svg" alt="Logo">
</div>
</div>
<div class="page-title-wrapper">
<div class="page-title">
<h1>Please verify you are a human</h1>
</div>
</div>
<div class="content-wrapper">
<div class="content">
<div id="px-captcha"><div class="g-recaptcha" data-sitekey="6Lcj-R8TAAAAABs3FrRPuQhLMbp5QrHsHufzLf7b" data-callback="handleCaptcha" data-theme="dark"><div style="width: 304px; height: 78px;"><div><iframe src="https://www.google.com/recaptcha/api2/anchor?ar=1&k=6Lcj-R8TAAAAABs3FrRPuQhLMbp5QrHsHufzLf7b&co=aHR0cHM6Ly93d3cubmljaGUuY29tOjQ0Mw..&hl=es&v=zItNOfzbrqVGbb4QFYpPpcrw&theme=dark&size=normal&cb=19z4nanjwlu" width="304" height="78" role="presentation" name="a-s7me84fdbal4" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox"></iframe></div><textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid rgb(193, 193, 193); margin: 10px 25px; padding: 0px; resize: none; display: none;"></textarea></div><iframe style="display: none;"></iframe></div></div>
<p>
Access to this page has been denied because we believe you are using automation tools to browse the
website.
</p>
<p>
This may happen as a result of the following:
</p>
<ul>
<li>
Javascript is disabled or blocked by an extension (ad blockers for example)
</li>
<li>
Your browser does not support cookies
</li>
</ul>
<p>
Please make sure that Javascript and cookies are enabled on your browser and that you are not blocking
them from loading.
</p>
<p>
Reference ID: #ad14b320-8116-11ea-9d99-a1ff7eeb44e0
</p>
</div>
</div>
<div class="page-footer-wrapper">
<div class="page-footer">
<p>
Powered by
<a href="https://www.perimeterx.com/whywasiblocked">PerimeterX</a>
, Inc.
</p>
</div>
</div>
</section>
<!-- Px -->
<script>
window._pxAppId = 'PXZ5wgH7n9';
window._pxJsClientSrc = '/Z5wgH7n9/init.js';
window._pxFirstPartyEnabled = true;
window._pxVid = '';
window._pxUuid = 'ad14b320-8116-11ea-9d99-a1ff7eeb44e0';
window._pxHostUrl = '/Z5wgH7n9/xhr';
</script>
<script>
var s = document.createElement('script');
s.src = '/Z5wgH7n9/captcha/captcha.js?a=c&u=ad14b320-8116-11ea-9d99-a1ff7eeb44e0&v=&m=0';
var p = document.getElementsByTagName('head')[0];
p.insertBefore(s, null);
if (true) {
s.onerror = function () {
s = document.createElement('script');
var suffixIndex = '/Z5wgH7n9/captcha/captcha.js?a=c&u=ad14b320-8116-11ea-9d99-a1ff7eeb44e0&v=&m=0'.indexOf('captcha.js');
var temperedBlockScript = '/Z5wgH7n9/captcha/captcha.js?a=c&u=ad14b320-8116-11ea-9d99-a1ff7eeb44e0&v=&m=0'.substring(suffixIndex);
s.src = '//captcha.px-cdn.net/PXZ5wgH7n9/' + temperedBlockScript;
p.parentNode.insertBefore(s, p);
};
}
</script>
<!-- Custom Script -->
</body></html>
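Clearly, the site is able to recognize your bot. Since I don't know which site you are trying to scrape, I can't say for sure whether this particular approach will work.
Try changing the user agent. By default, the user agent of a chromedriver-driven browser can differ from that of regular Chrome (in headless mode, for example, it contains "HeadlessChrome").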
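from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Override the default user agent with one copied from a regular Chrome browser.
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36")
# `options=` replaces the deprecated `chrome_options=` keyword.
# Assumes chromedriver is on PATH (newer Selenium releases can locate it automatically).
driver = webdriver.Chrome(options=options)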
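
Changing the user agent may or may not be enough, so it helps to check whether the page you got back is still the block page. Below is a minimal sketch of such a check. The helper names `is_block_page` and `extract_reference_id` are made up for illustration, and the markers are simply strings copied from the HTML in the question; another site, or a different version of this page, may use different ones.

```python
import re

# Markers copied from the block page quoted in the question (an assumption:
# other sites or later versions of this page may use different text).
BLOCK_MARKERS = (
    "Access to this page has been denied",
    'id="px-captcha"',
)


def is_block_page(html: str) -> bool:
    """Return True if the HTML looks like the 'access denied' page."""
    return any(marker in html for marker in BLOCK_MARKERS)


def extract_reference_id(html: str):
    """Return the 'Reference ID' shown on the block page, or None if absent."""
    match = re.search(r"Reference ID:\s*#([0-9a-f-]+)", html)
    return match.group(1) if match else None
```

With Selenium you would call `is_block_page(driver.page_source)` after `driver.get(...)`. If it still returns True, the user agent alone was not what gave the bot away.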