Scraping a website that uses a dynamic wdtNonce parameter

Problem description

I am mostly self-taught when it comes to web scraping, and I do not have a deep understanding of how web pages work internally.

Even so, I have been able to scrape every website I have come across.

Until I tried this one.

My goal is to be able to select a date and download the corresponding prices.

By inspecting the network traffic, I was able to replicate the HTTP request that produces the desired response in JSON format.

The payload of that request looks like this:

    {
    "draw": "5",
    "columns[0][data]": "0",
    "columns[0][name]": "wdt_ID",
    "columns[0][searchable]": "true",
    "columns[0][orderable]": "false",
    "columns[0][search][value]": "",
    "columns[0][search][regex]": "false",
    "columns[1][data]": "1",
    "columns[1][name]": "date",
    "columns[1][searchable]": "true",
    "columns[1][orderable]": "false",
    "columns[1][search][value]": "26+Feb+2024|26+Feb+2024",
    "columns[1][search][regex]": "false",
    "columns[2][data]": "2",
    "columns[2][name]": "mtu",
    "columns[2][searchable]": "true",
    "columns[2][orderable]": "false",
    "columns[2][search][value]": "|",
    "columns[2][search][regex]": "false",
    "columns[3][data]": "3",
    "columns[3][name]": "almcpmwh",
    "columns[3][searchable]": "true",
    "columns[3][orderable]": "false",
    "columns[3][search][value]": "",
    "columns[3][search][regex]": "false",
    "columns[4][data]": "4",
    "columns[4][name]": "alvolumemwh",
    "columns[4][searchable]": "true",
    "columns[4][orderable]": "false",
    "columns[4][search][value]": "",
    "columns[4][search][regex]": "false",
    "columns[5][data]": "5",
    "columns[5][name]": "alnetpositionmwh",
    "columns[5][searchable]": "true",
    "columns[5][orderable]": "false",
    "columns[5][search][value]": "",
    "columns[5][search][regex]": "false",
    "columns[6][data]": "6",
    "columns[6][name]": "ksmcpmwh",
    "columns[6][searchable]": "true",
    "columns[6][orderable]": "false",
    "columns[6][search][value]": "",
    "columns[6][search][regex]": "false",
    "columns[7][data]": "7",
    "columns[7][name]": "ksvolumemwh",
    "columns[7][searchable]": "true",
    "columns[7][orderable]": "false",
    "columns[7][search][value]": "",
    "columns[7][search][regex]": "false",
    "columns[8][data]": "8",
    "columns[8][name]": "ksnetpositionmwh",
    "columns[8][searchable]": "true",
    "columns[8][orderable]": "false",
    "columns[8][search][value]": "",
    "columns[8][search][regex]": "false",
    "columns[9][data]": "9",
    "columns[9][name]": "datetime",
    "columns[9][searchable]": "true",
    "columns[9][orderable]": "false",
    "columns[9][search][value]": "|",
    "columns[9][search][regex]": "false",
    "start": "0",
    "length": "25",
    "search[value]": "",
    "search[regex]": "false",
    "sumColumns[]": [
        "alvolumemwh",
        "ksvolumemwh",
        "alnetpositionmwh",
        "ksnetpositionmwh"
    ],
    "avgColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "minColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "maxColumns[]": [
        "almcpmwh",
        "ksmcpmwh"
    ],
    "wdtNonce": "c201b4ccc3"
    }

So far so good. Everything works: I can select a date and download the data I want.
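
Since the question does not name the site or include the asker's code, the following is only a minimal sketch of how a captured request of this shape can be replayed with Python's requests library. The endpoint URL and table id are placeholders; the wdtNonce / wdt_ID field names suggest the wpDataTables WordPress plugin, whose tables are usually served through admin-ajax.php, but the real URL should be copied from the browser's network tab.

    import requests

    # Placeholder endpoint -- copy the real URL from the captured request.
    # Sites using the wpDataTables plugin typically post to something like
    # /wp-admin/admin-ajax.php?action=get_wdtable&table_id=<n>.
    DATA_URL = "https://example.com/wp-admin/admin-ajax.php?action=get_wdtable&table_id=1"

    payload = {
        "draw": "5",
        # The "+" signs in the captured payload are just form-encoding for spaces.
        "columns[1][search][value]": "26 Feb 2024|26 Feb 2024",
        "start": "0",
        "length": "25",
        "wdtNonce": "c201b4ccc3",  # the value that eventually stops working
        # ... plus the remaining columns[...]/sumColumns[]/avgColumns[]/... fields
        #     exactly as captured from the browser.
    }

    response = requests.post(DATA_URL, data=payload)
    response.raise_for_status()
    table = response.json()
    print(table.get("data", [])[:3])  # DataTables-style responses keep the rows under "data"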

However, the value of this parameter,

"wdtNonce": "c201b4ccc3"

appears to be dynamic: after a while the hard-coded value I am using is no longer valid and the request returns no data.

Is there a way to make it persistent?

Is there a way to automatically refresh the parameter to a valid value?

Is there a way to bypass it altogether?

How does my browser "know" in advance which value to use for this parameter?

Is this a built-in feature designed to prevent scraping?

I am not posting my code because the code itself works without any problems. Thanks in advance!

Tags: web-scraping, web, dynamic, xmlhttprequest, nonce
1 Answer

Identify the logic behind wdtNonce generation: This might involve inspecting the network traffic or website code to understand how the server generates new wdtNonce values. If you can identify a pattern, you could potentially create a mechanism to generate new nonces when the current one becomes invalid. However, this approach can be fragile and break easily if the website changes its logic.
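
In this case the name wdtNonce points at the wpDataTables WordPress plugin. WordPress nonces are generated server-side and embedded in the rendered page, which is how the browser "knows" the value in advance, and by default they expire after roughly 12 to 24 hours, which would explain why a hard-coded value stops working. Assuming that is what the site uses, a pragmatic refresh mechanism is to reload the page that hosts the table and pull the current nonce out of the HTML before each scraping run. Below is a minimal sketch; the page URL is a placeholder and the regular expression is a guess that should be adjusted to whatever the page source actually contains.

    import re
    import requests

    # Placeholder: the public page that renders the price table (not the AJAX endpoint).
    PAGE_URL = "https://example.com/market-prices/"


    def fetch_wdt_nonce(session: requests.Session) -> str:
        """Reload the table page and extract the current wdtNonce from its HTML.

        Assumes the nonce is a 10-character hex token appearing shortly after the
        string "wdtNonce" somewhere in the page source (hidden input, inline
        script, etc.). Adjust the pattern if the page is structured differently.
        """
        html = session.get(PAGE_URL).text
        match = re.search(r"wdtNonce.{0,200}?([0-9a-f]{10})", html, re.DOTALL)
        if not match:
            raise RuntimeError("No wdtNonce found in the page HTML")
        return match.group(1)


    # Reusing one Session keeps cookies consistent between the page load and the data request.
    session = requests.Session()
    print(fetch_wdt_nonce(session))

Calling fetch_wdt_nonce() before each data request and dropping the result into payload["wdtNonce"] removes the need for a persistent value. If it is indeed a WordPress nonce, it is a standard CSRF-protection token rather than a feature built specifically to block scraping, although it has that side effect.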

Check for API availability: While the website might not provide an official API for data access, there might be an undocumented or hidden one. Explore the website's documentation or search online communities to see if anyone has discovered an unofficial API. Using a documented API is always the recommended approach, as it adheres to rate limits and avoids potential security risks.

Respect robots.txt and terms of service: Before attempting any scraping, always check the website's robots.txt file and terms of service. Scraping against their guidelines is unethical and can be illegal. If scraping is not allowed, respect their decision and explore alternative methods of data acquisition.
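
The robots.txt check can itself be automated. Here is a small sketch using Python's standard-library urllib.robotparser, with placeholder URLs standing in for the real site and endpoint:

    from urllib import robotparser

    # Placeholder URLs -- substitute the real site and the path you intend to request.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    target = "https://example.com/wp-admin/admin-ajax.php"
    print("Allowed by robots.txt:", rp.can_fetch("MyScraper/1.0", target))

Note that can_fetch() only reflects robots.txt; the site's terms of service still need to be read separately.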

Consider alternative data sources: If scraping this specific website is not feasible due to ethical or technical reasons, look for alternative sources that provide the data you need. There might be public datasets, government reports, or official APIs from other organizations that offer similar information.