HTML source differs when fetching a Marketwatch.com page with curl compared with Chrome


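When I use curl (Windows Command Prompt) to fetch the HTML source of https://www.marketwatch.com/tools/screener/market?exchange=nasdaq&subreport=largestpercentgainreport, the HTML it returns differs from what I see in Chrome with "View page source". I am not sure whether MarketWatch uses dynamic pages or something else. Until a week ago, curl worked perfectly with MarketWatch.com.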

I have tried a proxy, custom headers, Python Selenium with the Chrome WebDriver, and so on, but I still get HTML source that is "not up to date".

selenium-webdriver curl dynamic webdriver
1 Answer

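The error must be in your code. You need to mimic the browser's headers, cookies, and so on to get the same result; otherwise the site will detect that the request does not come from a browser and block it.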
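To make "mimicking the browser" concrete, here is a minimal sketch that lists which browser headers a bare curl call leaves out. The helper name `missing_headers` and all sample values are hypothetical and for illustration only; the curl side assumes only curl's usual defaults (Host, a `curl/<version>` User-Agent, and `Accept: */*`).

```python
def missing_headers(sent, browser):
    """Return names of browser headers absent from the sent request (case-insensitive)."""
    sent_names = {name.lower() for name in sent}
    return sorted(name for name in browser if name.lower() not in sent_names)


# Headers a default curl invocation sends (the version string is illustrative).
curl_default = {
    "Host": "www.marketwatch.com",
    "User-Agent": "curl/8.0.0",
    "Accept": "*/*",
}

# A subset of headers copied from a browser session (illustrative values).
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    "Sec-Fetch-Mode": "navigate",
    "Cookie": "refresh=off",
}

# Lists the header names the curl request would need to add.
print(missing_headers(curl_default, browser_headers))
```

Headers present in both requests can still differ in value (the User-Agent here, for instance), so a matching name alone does not mean the request looks like a browser's.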

I have written code that scrapes the data from the site you provided:

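import requests

# A Session reuses the connection and keeps any cookies the server sets.
session = requests.Session()

url = "https://www.marketwatch.com/tools/screener/market?exchange=nasdaq&subreport=largestpercentgainreport"
# Cookies copied from a browser session; they expire, so replace them with current values.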
cookie = {"refresh": "off", "letsGetMikey": "enabled", "mw_loc": "%7B%22Region%22%3A%22UP%22%2C%22Country%22%3A%22IN%22%2C%22Continent%22%3A%22AS%22%2C%22ApplicablePrivacy%22%3A0%7D", "gdprApplies": "false", "ab_uuid": "8fa3f2c4-b2d3-4625-9f96-fd4b7e110f2b", "fullcss-tools": "tools-f8d1c41686.min.css", "icons-loaded": "true", "dnsDisplayed": "undefined", "ccpaApplies": "false", "signedLspa": "undefined", "_pubcid": "3bb84c02-3387-4891-a942-e50396310063", "_pubcid_cst": "kSylLAssaw%3D%3D", "_sp_su": "false", "_lr_geo_location_state": "UP", "_lr_geo_location": "IN", "utag_main": "v_id:018dec2c0837001eac143645f8c102074006406c00bd0$_sn:1$_ss:1$_st:1709066125181$ses_id:1709064325181%3Bexp-session$_pn:1%3Bexp-session$_prevpage:MW_Market%20Screener%3Bexp-1709067925206$vapi_domain:marketwatch.com", "ccpaUUID": "11791f26-e578-4ca6-b5a3-600e241542e9", "AMCVS_CB68E4BA55144CAA0A4C98A5%40AdobeOrg": "1", "s_tp": "2299", "s_ppv": "MW_Market%2520Screener%2C41%2C41%2C949", "s_cc": "true", "_pcid": "%7B%22browserId%22%3A%22lt4sru756ntfpymn%22%7D", "cX_P": "lt4sru756ntfpymn", "_pctx": "%7Bu%7DN4IgrgzgpgThIC4B2YA2qA05owMoBcBDfSREQpAeyRCwgEt8oBJAEzIE4AmHgZi4CsvAIwB2DqIAMADkHTRvEAF8gA", "ajs_anonymous_id": "a7d0fb6a-d098-420d-8703-20f58f2ea9f5", "_fbp": "fb.1.1709064327744.1251317668", "_meta_facebookTag_sync": "1709064327745", "_ncg_sp_ses.f57d": "*", "_ncg_domain_id_": "ff845496-85ea-48cf-bdce-556c76ce800c.1.1709064326416.1772136326416", "_ncg_id_": "0c9448a2-57b0-4f55-aead-10cd2a426805", "_dj_ses.cff7": "*", "_dj_id.cff7": ".1709064328.1.1709064328.1709064328.978ba062-eff6-4eb0-89b9-8df2cc49491a", "AMCV_CB68E4BA55144CAA0A4C98A5%40AdobeOrg": "1585540135%7CMCIDTS%7C19781%7CMCMID%7C35653410761297282722755192239542865746%7CMCAID%7CNONE%7CMCOPTOUT-1709071526s%7CNONE%7CMCAAMLH-1709669126%7C12%7CMCAAMB-1709669126%7Cj8Odv6LonN4r3an7LhD3WZrU1bUpAkFkkiY1ncBR96t2PTI%7CMCSYNCSOP%7C411-19788%7CvVersion%7C4.4.0", "_gcl_au": "1.1.783692186.1709064328", "_fbp": "fb.1.1709064327744.1251317668", 
"_ncg_sp_id.f57d": "0c9448a2-57b0-4f55-aead-10cd2a426805.1709064328.1.1709064329..18968c22-4fac-429b-b3a1-40fcc16b4262..391a8aec-2454-4583-9d49-91e1c11e2adb.1709064327953.2", "_rdt_uuid": "1709064329153.db192c38-bb98-4ff6-85f3-ceb992fc2520", "_parsely_session": "{%22sid%22:1%2C%22surl%22:%22https://www.marketwatch.com/tools/screener/market?exchange=nasdaq&subreport=largestpercentgainreport%22%2C%22sref%22:%22%22%2C%22sts%22:1709064329294%2C%22slts%22:0}", "_parsely_visitor": "{%22id%22:%22pid=8df694d3-e55f-4bf2-b5b0-dec237b1f3c4%22%2C%22session_count%22:1%2C%22last_session_ts%22:1709064329294}", "_ncg_g_id_": "63a0184e-16f8-4fd1-9c8f-99ba8d2aaa5b.1.1709064329.1772136326416", "_dj_sp_id": "29d97fc4-58db-4cef-80de-2a37073ff3dd", "cX_G": "cx%3A27k3fauluigsv130jvlcsml7f7%3Akkyk484n7g5h"}
# Browser-like headers. "br" is left out of Accept-Encoding because requests can only decode
# Brotli when the optional brotli package is installed; Sec-Ch-Ua-Platform matches the Windows User-Agent.
header = {"Sec-Ch-Ua": "\"Chromium\";v=\"121\", \"Not A(Brand\";v=\"99\"", "Sec-Ch-Ua-Mobile": "?0", "Sec-Ch-Ua-Platform": "\"Windows\"", "Upgrade-Insecure-Requests": "1", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.160 Safari/537.36", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7", "Sec-Fetch-Site": "none", "Sec-Fetch-Mode": "navigate", "Sec-Fetch-User": "?1", "Sec-Fetch-Dest": "document", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8", "Priority": "u=0, i"}
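resp = session.get(url, headers=header, cookies=cookie, timeout=30)
resp.raise_for_status()  # fail on 4xx/5xx (for example, a block page) instead of printing it
print(resp.text)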

Rendering resp.text produces the same page as https://www.marketwatch.com/tools/screener/market?exchange=nasdaq&subreport=largestpercentgainreport, so I believe it works correctly.
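Instead of comparing the pages by eye, the check can be roughed out in code. The sketch below counts table rows in an HTML string using only the standard library. It assumes the screener results appear as a plain HTML `<table>` in the page source, which the post does not confirm; `TableRowCounter`, `count_table_rows`, and both sample strings are hypothetical.

```python
from html.parser import HTMLParser


class TableRowCounter(HTMLParser):
    """Count <tr> elements that appear inside any <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # how many <table> elements are currently open
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "tr" and self.table_depth > 0:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def count_table_rows(html):
    """Return the number of table rows found in an HTML string."""
    parser = TableRowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows


# Stand-ins for a real response body and a block page (both invented for this sketch).
sample = "<html><body><table><tr><th>Symbol</th></tr><tr><td>ABC</td></tr></table></body></html>"
blocked = "<html><body><p>Access denied</p></body></html>"

print(count_table_rows(sample))
print(count_table_rows(blocked))
```

With the answer's code, `count_table_rows(resp.text)` could be compared against the row count of the source saved from Chrome. A count of zero suggests either a block page or a table that is filled in by JavaScript after the page loads.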
