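I want to use Selenium with Python to capture the traffic of the sites I'm browsing. Since the traffic will be HTTPS, using a proxy won't get me very far.
My idea was to run PhantomJS through Selenium and have PhantomJS execute a script (not on the page via webdriver.execute_script(), but on PhantomJS itself). I was thinking of the netlog.js script (from here: https://github.com/ariya/phantomjs/blob/master/examples/netlog.js).
Since it works like this on the command line: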
phantomjs --cookies-file=/tmp/foo netlog.js https://google.com
there must be a similar way to do this with Selenium?
Thanks in advance.
Update:
Solved it with browsermob-proxy.
pip3 install browsermob-proxy
Python 3 code:
from selenium import webdriver
from browsermobproxy import Server
server = Server("<path to browsermob-proxy>")  # placeholder: path to the browsermob-proxy binary
server.start()
proxy = server.create_proxy({'captureHeaders': True, 'captureContent': True, 'captureBinaryContent': True})
service_args = ["--proxy=%s" % proxy.proxy, '--ignore-ssl-errors=yes']
driver = webdriver.PhantomJS(service_args=service_args)
proxy.new_har()
driver.get('https://google.com')
print(proxy.har) # this is the archive
# for example:
all_requests = [entry['request']['url'] for entry in proxy.har['log']['entries']]
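Since a HAR is plain JSON, the archive can also be written to disk and inspected after the browser session is gone. A minimal sketch of that idea follows; `save_har`, `load_request_urls`, and the sample dictionary are hypothetical names made up for illustration, and `sample_har` merely stands in for `proxy.har`.

```python
import json
from pathlib import Path


def save_har(har, path):
    """Write a HAR dict to `path` as UTF-8 JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(har, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def load_request_urls(path):
    """Read a saved HAR file back and list the requested URLs."""
    har = json.loads(Path(path).read_text(encoding="utf-8"))
    return [entry["request"]["url"] for entry in har["log"]["entries"]]


# Hand-written stand-in for proxy.har (not real captured traffic).
sample_har = {"log": {"entries": [{"request": {"url": "https://google.com/"}}]}}
```

In the session above this would be called as `save_har(proxy.har, "session.har")` before quitting the driver.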
I use a proxy for this:
from selenium import webdriver
from browsermobproxy import Server
server = Server(environment.b_mob_proxy_path)
server.start()
proxy = server.create_proxy()
service_args = ["--proxy=%s" % proxy.proxy]  # PhantomJS expects --proxy; --proxy-server is the Chrome flag
driver = webdriver.PhantomJS(service_args=service_args)
proxy.new_har()
driver.get('url_to_open')
print(proxy.har)  # this is the archive
# for example:
all_requests = [entry['request']['url'] for entry in proxy.har['log']['entries']]
The HAR (HTTP Archive format) contains a lot of other information about the requests and responses, which is very useful to me.
Installing on Linux:
pip install browsermob-proxy
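To give an idea of what else the HAR holds beyond URLs, here is a small sketch that pulls the method, URL, status code, and MIME type out of each entry. The field names follow the HAR 1.2 layout; `summarize_har` and the sample data are hypothetical and only stand in for `proxy.har`.

```python
def summarize_har(har):
    """Return one (method, url, status, mime_type) tuple per HAR entry."""
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        # Be lenient in case an entry has no response recorded.
        response = entry.get("response", {})
        rows.append((
            request["method"],
            request["url"],
            response.get("status"),
            response.get("content", {}).get("mimeType"),
        ))
    return rows


# Hand-written stand-in for proxy.har (not real captured traffic).
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/api"},
                "response": {"status": 200, "content": {"mimeType": "application/json"}},
            },
            {"request": {"method": "POST", "url": "https://example.com/submit"}},
        ]
    }
}

for row in summarize_har(sample_har):
    print(row)
```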
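If anyone here is looking for a pure Selenium/Python solution, the following snippet might help. It uses Chrome to log all requests and, as an example, prints all JSON requests together with their corresponding responses.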
import json
from time import sleep
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
# make chrome log requests
capabilities = DesiredCapabilities.CHROME
capabilities["loggingPrefs"] = {"performance": "ALL"} # chromedriver < ~75
# capabilities["goog:loggingPrefs"] = {"performance": "ALL"} # chromedriver 75+
driver = webdriver.Chrome(
    desired_capabilities=capabilities, executable_path="./chromedriver"
)
# fetch a site that does xhr requests
driver.get("https://sitewithajaxorsomething.com")
sleep(5) # wait for the requests to take place
# extract requests from logs
logs_raw = driver.get_log("performance")
logs = [json.loads(lr["message"])["message"] for lr in logs_raw]
def log_filter(log_):
    return (
        # is an actual response
        log_["method"] == "Network.responseReceived"
        # and json
        and "json" in log_["params"]["response"]["mimeType"]
    )
for log in filter(log_filter, logs):
    request_id = log["params"]["requestId"]
    resp_url = log["params"]["response"]["url"]
    print(f"Caught {resp_url}")
    print(driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id}))
Gist: https://gist.github.com/lorey/079c5e178c9c9d3c30ad87df7f70491d
I used a solution without a proxy server for this. I modified the Selenium source code according to the link below to add the executePhantomJS function.
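The performance log contains request events as well as response events, and both carry the same `requestId`. As a rough sketch of how the two could be matched up, the helper below works on the already-parsed messages (the `logs` list in the snippet above). `pair_requests_with_responses` and the sample events are hypothetical; the event and field names are the ones used by the DevTools `Network` domain.

```python
def pair_requests_with_responses(events):
    """Match Network.requestWillBeSent events to Network.responseReceived by requestId."""
    requests = {}
    pairs = []
    for event in events:
        params = event.get("params", {})
        method = event.get("method")
        if method == "Network.requestWillBeSent":
            # A later request with the same id (e.g. a redirect) overwrites the earlier one.
            requests[params["requestId"]] = params["request"]
        elif method == "Network.responseReceived":
            request = requests.get(params["requestId"])
            if request is not None:
                pairs.append({
                    "request_id": params["requestId"],
                    "method": request["method"],
                    "url": request["url"],
                    "status": params["response"]["status"],
                })
    return pairs


# Hand-written sample events (not real captured traffic).
sample_events = [
    {"method": "Network.requestWillBeSent",
     "params": {"requestId": "1", "request": {"method": "GET", "url": "https://example.com/data.json"}}},
    {"method": "Page.loadEventFired", "params": {}},
    {"method": "Network.responseReceived",
     "params": {"requestId": "1", "response": {"status": 200, "url": "https://example.com/data.json"}}},
]

print(pair_requests_with_responses(sample_events))
```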
https://github.com/SeleniumHQ/selenium/pull/2331/files
Then, after getting the PhantomJS driver, I execute the following script:
from selenium.webdriver import PhantomJS
driver = PhantomJS()
script = """
var page = this;
page.onResourceRequested = function (req) {
    console.log('requested: ' + JSON.stringify(req, undefined, 4));
};
page.onResourceReceived = function (res) {
    console.log('received: ' + JSON.stringify(res, undefined, 4));
};
"""
driver.execute_phantomjs(script)
driver.get("http://ariya.github.com/js/random/")
driver.quit()
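All requests are then logged to the console (usually the ghostdriver.log file).
Hope this helps someone. I'm running Selenium with the Chromium webdriver on a Raspberry Pi and had trouble setting the capabilities. This is what finally worked for me and let me see all the network logs as they happen: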
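import json
import subprocess
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
CHROME_DRIVER_PATH = "/usr/lib/chromium-browser/chromedriver"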
# start browser
options = webdriver.ChromeOptions()
options.set_capability('goog:loggingPrefs', {"performance": "ALL"})
service = Service(executable_path=CHROME_DRIVER_PATH, log_output=subprocess.STDOUT)
driver = webdriver.Chrome(service=service, options=options)
...
perf_log = driver.get_log('performance')
def process_browser_log_entry(entry):
    response = json.loads(entry['message'])['message']
    return response
# this can be whatever you're looking for in the message, but bear in mind it is a string
events = [process_browser_log_entry(entry) for entry in perf_log if 'Network.response' in entry['message']]
...
# once you find the event you want (assuming you converted it to a dict)
driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': event["params"]["requestId"]})
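The result of `Network.getResponseBody` is a dict with a `body` string and a `base64Encoded` flag, so binary responses come back base64-encoded. A small sketch of how that result could be turned into raw bytes follows; `decode_response_body` is a hypothetical helper, and the sample dicts are hand-written rather than taken from a real session.

```python
import base64


def decode_response_body(result):
    """Return the body of a Network.getResponseBody result as bytes."""
    body = result["body"]
    if result.get("base64Encoded"):
        # Binary payloads (images, fonts, ...) are delivered base64-encoded.
        return base64.b64decode(body)
    # Text payloads are delivered as a plain string.
    return body.encode("utf-8")


# Hand-written sample results (not real captured traffic).
text_result = {"body": '{"a": 1}', "base64Encoded": False}
binary_result = {"body": base64.b64encode(b"\x89PNG").decode("ascii"), "base64Encoded": True}

print(decode_response_body(text_result))
print(decode_response_body(binary_result))
```

With the driver above, the input would be the return value of `driver.execute_cdp_cmd('Network.getResponseBody', ...)`.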