Getting the status codes of multiple web URLs with Python's Requests, urllib3, or Selenium modules

Question (votes: 1, answers: 1)

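I am trying to write a Python script that fetches the HTTP status code and response for about 200 URLs. The final output should be an HTML report that lists each URL by name together with its status code, response message, errors (if any), and a screenshot of the page. I tried building the script with the requests and urllib modules, but whenever an HTTPException occurs my code breaks without capturing the status code and response message for that particular URL. As an alternative, I wrote another Python script using the selenium module, in which I am trying to capture the performance log for a URL, specifically the "Network.responseReceived" entries.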

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Enable the browser's performance log
d = DesiredCapabilities.CHROME
d['loggingPrefs'] = {'performance': 'ALL'}
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=options, executable_path="C:\\chromedriver_win32\\chromedriver.exe")
#driver = webdriver.Ie(executable_path="C:\\IE_driver\\MicrosoftWebDriver.exe")
driver.get("https://www.google.com")
#driver.get('https://www.google.com/nonexistant')

print(driver.title)
performance_log = driver.get_log('performance')

# Dump every raw log entry
for entry in performance_log:
    print(type(entry))
    print(entry)
    print("================================================")
    print(" ")
    print(" ")

driver.close()

This is the output I get.

Google
<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.loadingFinished","params":{"encodedDataLength":0,"requestId":"D99D380DD024B8928B5EAAC76E447956","shouldReportCorbBlocking":false,"timestamp":528401.402473}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228343}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.frameNavigated","params":{"frame":{"id":"8DBAE0AE8594201DC3D129C819A696C8","loaderId":"D99D380DD024B8928B5EAAC76E447956","mimeType":"text/plain","securityOrigin":"://","url":"data:,"}}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228343}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.loadEventFired","params":{"timestamp":528401.409908}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228344}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.frameStoppedLoading","params":{"frameId":"8DBAE0AE8594201DC3D129C819A696C8"}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228346}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Page.domContentEventFired","params":{"timestamp":528401.41067}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228347}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.requestWillBeSent","params":{"documentURL":"https://www.google.com/","frameId":"8DBAE0AE8594201DC3D129C819A696C8","hasUserGesture":false,"initiator":{"type":"other"},"loaderId":"16D0090B144D4D0D6DB68B993CE5DE12","request":{"headers":{"Upgrade-Insecure-Requests":"1","User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/72.0.3626.109 Safari/537.36"},"initialPriority":"VeryHigh","method":"GET","mixedContentType":"none","referrerPolicy":"no-referrer-when-downgrade","url":"https://www.google.com/"},"requestId":"16D0090B144D4D0D6DB68B993CE5DE12","timestamp":528401.455107,"type":"Document","wallTime":1554297228.37452}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297228378}
================================================


<class 'dict'>
{'level': 'INFO', 'message': '{"message":{"method":"Network.responseReceived","params":{"frameId":"8DBAE0AE8594201DC3D129C819A696C8","loaderId":"16D0090B144D4D0D6DB68B993CE5DE12","requestId":"16D0090B144D4D0D6DB68B993CE5DE12","response":{"connectionId":17,"connectionReused":false,"encodedDataLength":6681,"fromDiskCache":false,"fromServiceWorker":false,"headers":{"alt-svc":"quic=\\":443\\"; ma=2592000; v=\\"46,44,43,39\\"","cache-control":"private, max-age=0","content-encoding":"gzip","content-length":"65219","content-type":"text/html; charset=UTF-8","date":"Wed, 03 Apr 2019 13:13:52 GMT","expires":"-1","p3p":"CP=\\"This is not a P3P policy! See g.co/p3phelp for more info.\\"","server":"gws","set-cookie":"1P_JAR=2019-04-03-13; expires=Fri, 03-May-2019 13:13:52 GMT; path=/; domain=.google.com\\nNID=180=fV81eC5C8adCVzltTPlJnIxiDUi4bSEzqRVHIQwx7z5S75opd6k3fmtLeGNOllEqRlpcQ-X31RSveq0FgdL5e0GBcVZxYZjzI9g2Bgn_Wepj5RfErPoo5re54HFO-sgiXV5vqNftY7JHm60YxVYQXJqp9HhpdbpB0cJ3HLOCguo; expires=Thu, 03-Oct-2019 13:13:52 GMT; path=/; domain=.google.com; HttpOnly","status":"200","x-frame-options":"SAMEORIGIN","x-xss-protection":"0"},"mimeType":"text/html","protocol":"h2","remoteIPAddress":"172.217.168.196","remotePort":443,"requestHeaders":{":authority":"www.google.com",":method":"GET",":path":"/",":scheme":"https","accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8","accept-encoding":"gzip, deflate, br","upgrade-insecure-requests":"1","user-agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/72.0.3626.109 Safari/537.36"},"securityDetails":{"certificateId":0,"certificateTransparencyCompliance":"unknown","cipher":"AES_128_GCM","issuer":"Google Internet Authority G3","keyExchange":"","keyExchangeGroup":"X25519","protocol":"TLS 1.3","sanList":["www.google.com"],"signedCertificateTimestampList":[],"subjectName":"www.google.com","validFrom":1551433595,"validTo":1558689900},"securityState":"secure","status":200,"statusText":"","timing":{"connectEnd":3683.223,"connectStart":2467.054,"dnsEnd":2467.054,"dnsStart":2352.226,"proxyEnd":2351.998,"proxyStart":86.464,"pushEnd":0,"pushStart":0,"receiveHeadersEnd":3976.231,"requestTime":528401.456284,"sendEnd":3687.307,"sendStart":3685.241,"sslEnd":3683.104,"sslStart":2620.349,"workerReady":-1,"workerStart":-1},"url":"https://www.google.com/"},"timestamp":528405.434789,"type":"Document"}},"webview":"8DBAE0AE8594201DC3D129C819A696C8"}', 'timestamp': 1554297232388}
================================================



I need to parse the Network.responseReceived details, because that entry contains everything I need. How can I parse the details out of the Network.responseReceived log entry?

python-3.x selenium python-requests httpresponse http-status-codes
1 Answer
0 votes

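Convert the value stored under each entry's "message" key into a Python dict, then extract the attributes you need.

At the top of the script, add an import of the json library; then, inside the loop over performance_log: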

for entry in performance_log:
    message = json.loads(entry['message'])

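The variable message is now a plain Python dictionary from which you can read any attribute you need. For a Network.responseReceived entry, for example, this is the status code: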

print(message['message']['params']['response']['status'])

And this is the target URL:

print(message['message']['params']['response']['url'])
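
Only Network.responseReceived events carry a 'response' object, so the two lookups above raise KeyError on other entries such as Page.loadEventFired. The following is a minimal sketch of one way to skip those entries. The names extract_responses and sample_log are illustrative, and the sample entries are abbreviated from the log output shown in the question rather than captured from a live browser.

```python
import json


def extract_responses(performance_log):
    """Return (url, status) pairs for every Network.responseReceived entry."""
    results = []
    for entry in performance_log:
        # entry['message'] is a JSON string; the event sits under its 'message' key.
        message = json.loads(entry['message'])['message']
        # Other events (Page.loadEventFired, Network.loadingFinished, ...)
        # have no 'response' object, so skip them.
        if message.get('method') != 'Network.responseReceived':
            continue
        response = message['params']['response']
        results.append((response['url'], response['status']))
    return results


# Abbreviated entries modelled on the output in the question (assumed shape).
sample_log = [
    {'level': 'INFO',
     'message': '{"message":{"method":"Page.loadEventFired","params":{"timestamp":528401.409908}}}',
     'timestamp': 1554297228344},
    {'level': 'INFO',
     'message': '{"message":{"method":"Network.responseReceived","params":{"response":{"status":200,"statusText":"","url":"https://www.google.com/"},"type":"Document"}}}',
     'timestamp': 1554297232388},
]

print(extract_responses(sample_log))
```

In the real script, the list returned by driver.get_log('performance') would be passed in place of sample_log.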

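Keep in mind that you will get an entry for every resource request made by the browser or the HTML page, so you will probably want to filter down to the top-level request for the domain.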
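
One possible way to do that filtering is sketched below. It relies on the "type" field that appears in the Network.responseReceived entry in the question ("Document" for the page itself) and assumes that the first Document response in the log belongs to the top-level page rather than to an iframe, which may not hold for every site. The function name first_document_response and the sample entries, including the image URL, are hypothetical.

```python
import json


def first_document_response(performance_log):
    """Return the 'response' dict of the first Document-type response, or None."""
    for entry in performance_log:
        message = json.loads(entry['message'])['message']
        if message.get('method') != 'Network.responseReceived':
            continue
        params = message['params']
        # Images, scripts and stylesheets are reported with other resource
        # types; the HTML page itself is reported as "Document".
        if params.get('type') == 'Document':
            return params['response']
    return None


def make_entry(method, params):
    """Build a log entry in the same shape as driver.get_log('performance')."""
    return {'level': 'INFO',
            'message': json.dumps({'message': {'method': method, 'params': params}})}


# Hypothetical sample data; only the Document entry mirrors the question's log.
sample_log = [
    make_entry('Page.loadEventFired', {'timestamp': 528401.409908}),
    make_entry('Network.responseReceived',
               {'type': 'Document',
                'response': {'status': 200, 'url': 'https://www.google.com/'}}),
    make_entry('Network.responseReceived',
               {'type': 'Image',
                'response': {'status': 200, 'url': 'https://www.google.com/logo.png'}}),
]

main = first_document_response(sample_log)
print(main['url'], main['status'])
```

Matching on the exact URL passed to driver.get() is less reliable, since the browser may normalise it: the question requests https://www.google.com but the log reports https://www.google.com/.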
