Scraping an e-commerce website built with PHP

Problem description

The website runs PHP on a Windows server.

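I have tried everything, starting with fetching the HTML and the text values, but I need to automate formatting those values to my requirements, for example the product resolution and all the features (*sale by example.com). When I process the HTML or text with Python and BeautifulSoup, it raises errors, because the fetched HTML does not contain every parameter I need.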

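The product list is too large to collect by hand, so I need automation. I have already tried the following:

- the Web Scraper Chrome extension
- HTTrack (the file it fetched is attached)
- wget
- a Python script using BeautifulSoup
- every Chrome extension with "scraper" in its name

However, the site has a CAPTCHA that stops the scraping, as shown in the attached HTML: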

`<!-- Mirrored from example.com/sizmodlist_cn.php?sizes[]=5580 by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 01 Apr 2024 15:41:32 GMT -->
<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8" /><!-- /Added by HTTrack -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
<title>您的电脑有异常访问,请输入以下验证码后继续访问!</title>
<meta name="keywords" content="" />
<meta name="description" content="" />
<meta name="author" content="屏库-全球显示屏交易中心" />
<link rel="icon" href="favicon.ico" type="image/x-icon" />
<link rel="shortcut icon" href="favicon.ico" type="image/x-icon" />
<script src="cms-includes/js/jquery1111.js" language="javascript"></script>
<script src="../www.example.com/cms-includes/js/too.js" language="javascript"></script>
<style>
* { margin:0; padding:0; }
a { color:#333; text-decoration:none}
a:hover { color:#c4261d; text-decoration:underline}
a img { border:none}
a.more { float:right; font-weight:normal; color:#999}

ul, li { list-style-type:none}

.clearfix:after { content:"."; display:block; height:0; clear:both; visibility:hidden; }
.clearfix { display:inline-block; }
/*\*/ .clearfix {display:block;} /**/

body { font-size:14px; color:#333; text-align:center; background:#fafafa; padding-top:100px;}


.vcode { width:460px; margin:auto; text-align:left;}


.vcode .title { background:#f00 url(images/vcode_icon.gif) no-repeat left center; height:34px; line-height:34px; padding-left:40px; color:#FFF; font-size:16px;}
.vcode table {line-height:24px; border:dotted #ddd 2px; margin-top:25px;}
.vcode td { line-height:24px; padding:30px 50px 30px 50px; font-size:12px; background:#FFF}
.vcode .inpu1 { float:left; padding:2px 0 1px 6px;width:100px; height:18px;}
.vcode .img { float:left; width:100px; height:40px;}
.vcode .txt { float:left}
.vcode .but { padding:8px; min-width:140px;}



#validate {
  height: 40px;
  position: relative;
  background-color: #e8e8e8;
  overflow: hidden;
  text-align: center;
  user-select: none;
  -moz-user-select: none;
  -webkit-user-select: none;
}

#validate_bg {
  position: absolute;
  left: 0;
  top: 0;
  height: 100%;
  background-color: #7AC23C;
  z-index: 1;
}

#validate_label {
  width: 38px;
  position: absolute;
  left: 0;
  top: 0;
  height: 38px;
  line-height: 38px;
  border: 1px solid #a7a6aa;
  background: #fff url(cms-includes/validate/icon.gif) no-repeat 5px 5px;
  z-index: 3;
  cursor: move;
}

.validate_right {background: #fff url(cms-includes/validate/icon.gif) no-repeat 5px -65px !important;}

#validate_labelTip {
  position: absolute;
  left: 0;
  width: 100%;
  height: 100%;
  font-size: 14px;
  font-family: 'Microsoft Yahei', serif;
  color: #666;
  line-height: 40px;
  text-align: center;
  z-index: 2;
}
#validate_labelTip a { text-decoration:none;
-webkit-user-select:none;

   -moz-user-select:none;

   -ms-user-select:none;

   user-select:none;
}
.btn-disabled { background:#fbfbfb; border:solid #e6e6e6 1px; color:#999; border-radius:3px; height:36px; line-height:36px; padding:0 35px 0 35px;}
.btn-cur { background:#7ac23c; border:solid #529a14 1px; color:#FFF; border-radius:3px; height:36px; line-height:36px; padding:0 35px 0 35px;}
</style>
<script>

function check_page_code()
{
    //document.getElementById("buttom_code").disabled = true;
    var vcode = $("#lbvcode").val();
    $.get("ajax_neb630.html?ac=check_page_code", { vcode:vcode, rst:randomstr()}, function(data)
    {
        if (data == '100')
        {
            alert("您输入的验证码有误!");
            //document.getElementById("buttom_code").disabled = false;
        }
        else
        {
            //alert("验证码正确!");
            location.reload() ;
            //window.history.back(-1);
        }
    }); 
    
    return false;
}
</script>
</head>
<body>
    
    <form action="https://example.com/sizmodlist_cn.php?" method="get" onsubmit="return check_page_code()">
    <div class="vcode">
        <div class="title">您的电脑有异常访问,请完成验证后继续访问!</div>
        <table width="100%" border="0" cellspacing="0" cellpadding="0">
            <tr>
                <td class="clearfix">

                    
                <script type="text/javascript" src="cms-includes/validate/jquery.slideunlock23b5.js?s=2"></script> 
                
                
                <div id="validate">
                    <div id="validate_bg"></div>
                    <span id="validate_label"></span> <span id="validate_labelTip"><a id="vcode_a" href="javascript:;">请按住滑块,拖动到最右边</a></span>
                </div>
                
                <input type="hidden" value="4440" id="lbvcode" name="vcode" />
                
                <script>
                $(function () {
                    var slider = new SliderUnlock("#validate",{
                            successLabelTip : "验证成功"
                        },function(){
                            $("#validate_label").addClass("validate_right");
                            $("#buttom_code").attr("disabled",false);
                            $("#buttom_code").addClass("btn-cur");
                            //check_vcode();
                        });
                    slider.init();
                })
                </script> 
                    
                    
                    
                </td>
            </tr>
            <tr>
                <td style="padding-top:0px; text-align:right"><input id="buttom_code" disabled="disabled" class="btn-disabled" type="button" onclick="check_page_code()" name="button" value="确认"  /></td>
            </tr>
        </table>
    </div>
    </form>
    
</body>

<!-- Mirrored from example.com/sizmodlist_cn.php?sizes[]=5580 by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 01 Apr 2024 15:41:32 GMT -->
</html>`


I need some help with how to get past the reCAPTCHA.

web-scraping screen-scraping scrapinghub robotic-scraping
1 Answer

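When you are dealing with CAPTCHAs and complex site structures, basic tools may not be enough, and you need a more sophisticated approach. Consider using Selenium with Python for web scraping: it automates browser actions and mimics human interaction. This approach can help you get past CAPTCHAs and load dynamic content, which makes it easier to scrape the data accurately. For reformatting and parsing the data, keep using BeautifulSoup, but combine it with Selenium. Remember to always check the site's `robots.txt` to make sure you comply with its scraping policy and avoid legal issues.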
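As a rough sketch of the checks a script could run before it parses anything, the snippet below uses only the Python standard library. It does two things: it tests a URL against the rules in a `robots.txt` body, and it recognises the verification page from the question so the script can stop instead of feeding that page to BeautifulSoup. The element ids `validate` and `lbvcode` are taken from the HTML posted in the question. The function names `allowed_by_robots` and `is_verification_page` are hypothetical, and the sample `robots.txt` rules are made up for illustration. It does not solve the slider.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _IdCollector(HTMLParser):
    """Collect every id attribute that appears in the document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def is_verification_page(html):
    """Return True if the page contains the slider-verification elements.

    Assumption: the block page always carries the ids "validate" and
    "lbvcode", as in the HTML dump posted in the question.
    """
    collector = _IdCollector()
    collector.feed(html)
    return {"validate", "lbvcode"} <= collector.ids


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up rules, only to show the call; fetch the real file from the site.
    sample_rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(sample_rules, "https://example.com/private/x"))

    blocked = '<div id="validate"></div><input type="hidden" id="lbvcode" />'
    print(is_verification_page(blocked))
```

If `is_verification_page` returns True for a response, the HTML holds no product data at all, which would explain why BeautifulSoup cannot find the fields you expect.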
