The website is a PHP site running on a Windows server.
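I have tried everything, starting with fetching the HTML and the text values, but I need to automate formatting those values to my requirements, for example the product resolution and all features, "*sale by example.com". However, when the HTML or text is processed with Python and BeautifulSoup it shows errors, because the fetched HTML does not contain every parameter I need.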
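I have already tried the following. The product list is too large to collect by hand, so I need automation:

- the Web Scraper Chrome extension
- HTTrack (the file it fetched is attached)
- wget
- a Python script using BeautifulSoup
- every Chrome extension with "scraper" in its name

But the site has a captcha that stops the scraping, as in the attached HTML: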
`<!-- Mirrored from example.com/sizmodlist_cn.php?sizes[]=5580 by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 01 Apr 2024 15:41:32 GMT -->
<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8" /><!-- /Added by HTTrack -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
<title>您的电脑有异常访问,请输入以下验证码后继续访问!</title>
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta name="author" content="屏库-全球显示屏交易中心" />
<link rel="icon" href="favicon.ico" type="image/x-icon" />
<link rel="shortcut icon" href="favicon.ico" type="image/x-icon" />
<script src="cms-includes/js/jquery1111.js" language="javascript"></script>
<script src="../www.example.com/cms-includes/js/too.js" language="javascript"></script>
<style>
* { margin:0; padding:0; }
a { color:#333; text-decoration:none}
a:hover { color:#c4261d; text-decoration:underline}
a img { border:none}
a.more { float:right; font-weight:normal; color:#999}
ul, li { list-style-type:none}
.clearfix:after { content:"."; display:block; height:0; clear:both; visibility:hidden; }
.clearfix { display:inline-block; }
/*\*/ .clearfix {display:block;} /**/
body { font-size:14px; color:#333; text-align:center; background:#fafafa; padding-top:100px;}
.vcode { width:460px; margin:auto; text-align:left;}
.vcode .title { background:#f00 url(images/vcode_icon.gif) no-repeat left center; height:34px; line-height:34px; padding-left:40px; color:#FFF; font-size:16px;}
.vcode table {line-height:24px; border:dotted #ddd 2px; margin-top:25px;}
.vcode td { line-height:24px; padding:30px 50px 30px 50px; font-size:12px; background:#FFF}
.vcode .inpu1 { float:left; padding:2px 0 1px 6px;width:100px; height:18px;}
.vcode .img { float:left; width:100px; height:40px;}
.vcode .txt { float:left}
.vcode .but { padding:8px; min-width:140px;}
#validate {
height: 40px;
position: relative;
background-color: #e8e8e8;
overflow: hidden;
text-align: center;
user-select: none;
-moz-user-select: none;
-webkit-user-select: none;
}
#validate_bg {
position: absolute;
left: 0;
top: 0;
height: 100%;
background-color: #7AC23C;
z-index: 1;
}
#validate_label {
width: 38px;
position: absolute;
left: 0;
top: 0;
height: 38px;
line-height: 38px;
border: 1px solid #a7a6aa;
background: #fff url(cms-includes/validate/icon.gif) no-repeat 5px 5px;
z-index: 3;
cursor: move;
}
.validate_right {background: #fff url(cms-includes/validate/icon.gif) no-repeat 5px -65px !important;}
#validate_labelTip {
position: absolute;
left: 0;
width: 100%;
height: 100%;
font-size: 14px;
font-family: 'Microsoft Yahei', serif;
color: #666;
line-height: 40px;
text-align: center;
z-index: 2;
}
#validate_labelTip a { text-decoration:none;
-webkit-user-select:none;
-moz-user-select:none;
-ms-user-select:none;
user-select:none;
}
.btn-disabled { background:#fbfbfb; border:solid #e6e6e6 1px; color:#999; border-radius:3px; height:36px; line-height:36px; padding:0 35px 0 35px;}
.btn-cur { background:#7ac23c; border:solid #529a14 1px; color:#FFF; border-radius:3px; height:36px; line-height:36px; padding:0 35px 0 35px;}
</style>
<script>
function check_page_code()
{
//document.getElementById("buttom_code").disabled = true;
var vcode = $("#lbvcode").val();
$.get("ajax_neb630.html?ac=check_page_code", { vcode:vcode, rst:randomstr()}, function(data)
{
if (data == '100')
{
alert("您输入的验证码有误!");
//document.getElementById("buttom_code").disabled = false;
}
else
{
//alert("验证码正确!");
location.reload() ;
//window.history.back(-1);
}
});
return false;
}
</script>
</head>
<body>
<form action="https://example.com/sizmodlist_cn.php?" method="get" onsubmit="return check_page_code()">
<div class="vcode">
<div class="title">您的电脑有异常访问,请完成验证后继续访问!</div>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="clearfix">
<script type="text/javascript" src="cms-includes/validate/jquery.slideunlock23b5.js?s=2"></script>
<div id="validate">
<div id="validate_bg"></div>
<span id="validate_label"></span> <span id="validate_labelTip"><a id="vcode_a" href="javascript:;">请按住滑块,拖动到最右边</a></span>
</div>
<input type="hidden" value="4440" id="lbvcode" name="vcode" />
<script>
$(function () {
var slider = new SliderUnlock("#validate",{
successLabelTip : "验证成功"
},function(){
$("#validate_label").addClass("validate_right");
$("#buttom_code").attr("disabled",false);
$("#buttom_code").addClass("btn-cur");
//check_vcode();
});
slider.init();
})
</script>
</td>
</tr>
<tr>
<td style="padding-top:0px; text-align:right"><input id="buttom_code" disabled="disabled" class="btn-disabled" type="button" onclick="check_page_code()" name="button" value="确认" /></td>
</tr>
</table>
</div>
</form>
</body>
<!-- Mirrored from example.com/sizmodlist_cn.php?sizes[]=5580 by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 01 Apr 2024 15:41:32 GMT -->
</html>`
I need some help with how to bypass the reCAPTCHA.
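When you are dealing with captchas and a complex site structure, basic tools may not be enough, and you need a more capable approach. Consider using Selenium with Python for the scraping: it automates a real browser and mimics human interaction. This can help you get past the verification page and load dynamic content, which makes it easier to scrape the data accurately. For reformatting and parsing the data, keep using BeautifulSoup, but combine it with Selenium. Remember to always check the site's `robots.txt` to make sure you comply with its crawling policy and avoid legal problems.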
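
A minimal sketch of how these pieces might fit together, assuming Selenium 4 with Chrome and the `beautifulsoup4` package are installed. The helper names (`allowed_by_robots`, `is_verification_page`, `fetch_rendered_html`, `extract_rows`) are hypothetical. The verification check is based only on the HTML dump in the question, and the row extraction is a guess, because the real product-list markup is not shown. It does not solve the slider automatically; it pauses so that a person can complete it in the opened browser.

```python
from urllib.robotparser import RobotFileParser

from bs4 import BeautifulSoup

# Part of the <title> of the verification page in the question's HTML dump.
VERIFICATION_MARKER = "请输入以下验证码后继续访问"


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def is_verification_page(html: str) -> bool:
    """Detect the slider-verification page served instead of real content."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text() if soup.title else ""
    # Assumption: the real product pages have no element with id="validate".
    return VERIFICATION_MARKER in title or soup.find(id="validate") is not None


def fetch_rendered_html(url: str) -> str:
    """Load url in a real browser and wait for a human if verification appears."""
    # Imported lazily so the parsing helpers work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        while is_verification_page(driver.page_source):
            input("Complete the verification in the browser, then press Enter...")
        return driver.page_source
    finally:
        driver.quit()


def extract_rows(html: str) -> list:
    """Collect the cell text of every table row (placeholder selectors)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows
```

A run could look like `rows = extract_rows(fetch_rendered_html("https://example.com/sizmodlist_cn.php?sizes[]=5580"))`, after downloading the site's `robots.txt` and confirming with `allowed_by_robots` that the URL may be fetched. The `find_all("tr")` loop would need to be replaced with selectors that match the actual product table.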