How to save all of a website's network traffic (request and response headers) using Python

Question · 0 votes · 2 answers

I'm trying to find the objects that get downloaded into the browser while a website loads.

This is the website: https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en

I'm not very well versed in web technologies.

I'm trying to save the request and response headers, along with the actual response, using only the website link.

If you look at the network traffic, you can see an object,

jobsearch.ftl?lang=en

load at the end, and you can see its response and headers.

Here is a screenshot of the network event log showing the request and response headers.

And the actual response.

These are the objects I want to save. How can I do that?

This is what I've tried:

import json
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chromepath = "~/chromedriver/chromedriver"

# enable Chrome's performance log so the CDP network events are captured
caps = DesiredCapabilities.CHROME
caps['goog:loggingPrefs'] = {'performance': 'ALL'}
driver = webdriver.Chrome(executable_path=chromepath, desired_capabilities=caps)
driver.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en')

def process_browser_log_entry(entry):
    # each performance-log entry wraps a JSON-encoded CDP message
    response = json.loads(entry['message'])['message']
    return response

browser_log = driver.get_log('performance')
events = [process_browser_log_entry(entry) for entry in browser_log]
# keep only Network.responseReceived / Network.responseReceivedExtraInfo events
events = [event for event in events if 'Network.response' in event['method']]

But I only get some of the headers, and they look like this:


{'method': 'Network.responseReceivedExtraInfo',
  'params': {'blockedCookies': [],
   'headers': {'Cache-Control': 'private',
    'Connection': 'Keep-Alive',
    'Content-Encoding': 'gzip',
    'Content-Security-Policy': "frame-ancestors 'self'",
    'Content-Type': 'text/html;charset=UTF-8',
    'Date': 'Mon, 27 Sep 2021 18:18:10 GMT',
    'Keep-Alive': 'timeout=5, max=100',
    'P3P': 'CP="CAO PSA OUR"',
    'Server': 'Taleo Web Server 8',
    'Set-Cookie': 'locale=en; path=/careersection/; secure; HttpOnly',
    'Transfer-Encoding': 'chunked',
    'Vary': 'Accept-Encoding',
    'X-Content-Type-Options': 'nosniff',
    'X-UA-Compatible': 'IE=edge',
    'X-XSS-Protection': '1'},
   'headersText': 'HTTP/1.1 200 OK\r\nDate: Mon, 27 Sep 2021 18:18:10 GMT\r\nServer: Taleo Web Server 8\r\nCache-Control: private\r\nP3P: CP="CAO PSA OUR"\r\nContent-Encoding: gzip\r\nVary: Accept-Encoding\r\nX-Content-Type-Options: nosniff\r\nSet-Cookie: locale=en; path=/careersection/; secure; HttpOnly\r\nContent-Security-Policy: frame-ancestors \'self\'\r\nX-XSS-Protection: 1\r\nX-UA-Compatible: IE=edge\r\nKeep-Alive: timeout=5, max=100\r\nConnection: Keep-Alive\r\nTransfer-Encoding: chunked\r\nContent-Type: text/html;charset=UTF-8\r\n\r\n',
   'requestId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'resourceIPAddressSpace': 'Public'}},
 {'method': 'Network.responseReceived',
  'params': {'frameId': '1624E6F3E724CA508A6D55D556CBE198',
   'loaderId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'requestId': '1E3CDDE80EE37825EF2D9C909FFFAFF3',
   'response': {'connectionId': 26,

They don't contain all of the information I can see in the web inspector in Chrome.

I want to get the entire request and response headers, plus the actual response. Is this the right approach? Is there another, better way that doesn't use Selenium at all and just uses requests?

python json selenium web-scraping
2 Answers

3 votes

You can use the selenium-wire library if you want to handle this with Selenium. However, if you only care about one specific API, then instead of Selenium you can hit that API with the requests library and print the resulting request and response headers.
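
A minimal sketch of that requests-based route (using the page URL from the question as a stand-in for whatever API you actually care about):

import requests

url = 'https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en'
resp = requests.get(url)

# headers that requests actually sent, taken from the prepared request
print('--- request headers ---')
for name, value in resp.request.headers.items():
    print(f'{name}: {value}')

# headers the server returned
print('--- response headers ---')
for name, value in resp.headers.items():
    print(f'{name}: {value}')

# and the actual response body
print(resp.text[:500])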

Since you're after the former - the Selenium way - one way to achieve this is with the selenium-wire library. Note that it gives you the results of all the background APIs/requests that are hit; you can then easily filter them once the output has been piped to a text file or the terminal itself.

Install it with

pip install selenium-wire

Install webdriver-manager with

pip install webdriver-manager

Install Selenium 4 with

pip install selenium==4.0.0b4

Use this code:

from seleniumwire import webdriver  # note: seleniumwire, not selenium
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

# webdriver-manager downloads a matching chromedriver automatically
svc = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=svc)

driver.get('https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en')

# selenium-wire records every request the browser made
for req in driver.requests:
    if req.response:
        print(
            req.url,
            req.response.status_code,
            req.headers,
            req.response.headers
        )

It gives a detailed output of all the requests - copying the relevant one here:

https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en 200 


Host: epco.taleo.net
Connection: keep-alive
sec-ch-ua: "Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8


Date: Tue, 28 Sep 2021 11:14:14 GMT
Server: Taleo Web Server 8
Cache-Control: private
P3P: CP="CAO PSA OUR"
Content-Encoding: gzip
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Set-Cookie: locale=en; path=/careersection/; secure; HttpOnly
Content-Security-Policy: frame-ancestors 'self'
X-XSS-Protection: 1
X-UA-Compatible: IE=edge
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html;charset=UTF-8
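
To save those objects instead of just printing them, a sketch along the same lines - assuming selenium-wire's documented request.headers, response.headers and response.body attributes, plus its seleniumwire.utils.decode helper for compressed bodies - could dump everything to a JSON file:

import json
from seleniumwire.utils import decode

records = []
for req in driver.requests:
    if req.response:
        # response bodies arrive compressed; decode() undoes gzip/br/deflate
        body = decode(req.response.body,
                      req.response.headers.get('Content-Encoding', 'identity'))
        records.append({
            'url': req.url,
            'status': req.response.status_code,
            'request_headers': dict(req.headers),
            'response_headers': dict(req.response.headers),
            'body': body.decode('utf-8', errors='replace'),
        })

with open('traffic.json', 'w') as f:
    json.dump(records, f, indent=2)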

1 vote

You can use JS inside Selenium, so this is easier:

var req = new XMLHttpRequest();
req.open("get", url_address_string);
req.send();
// when the data has arrived:
req.getAllResponseHeaders();

XMLHttpRequest is asynchronous, so you need some code that consumes the answer.

OK, here we go:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://stackoverflow.com")
driver.execute_script("""
var xhr = new XMLHttpRequest();

xhr.addEventListener('loadend', (ev) => {
    // assign the headers to a property on the window object so we can read them later
    window.rH = xhr.getAllResponseHeaders();
    console.log(window.rH);
});

xhr.open("get", "https://stackoverflow.com/");
xhr.send();
""")
# we need to wait because the XHR request is async; this is dirty, don't do this ;)
time.sleep(5)
# now we can pull the 'rH' property back out of window, again with JavaScript
headers = driver.execute_script("return window.rH")
# <-- "accept-ranges: bytes\r\ncache-control: private\r\ncontent-encoding: gzip\r\ncontent-security-policy: upgrade-insecure-requests; ....
print(headers)
# the headers come back as a single string whose parts are separated by \r\n,
# so split it: headers.split("\r\n") gives you a list
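
If you want that list turned into a dict, a few more lines will do it (note that getAllResponseHeaders() returns the header names in lower case):

# split the raw string into 'name: value' lines and build a dict
header_dict = {}
for line in headers.split('\r\n'):
    if ': ' in line:
        name, value = line.split(': ', 1)
        header_dict[name] = value

print(header_dict.get('content-type'))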

Edit 2: You don't actually need the headers. When your browser goes to the desired URL, one of the responses creates a variable for this page:

_ftl

When you open dev tools -> Console and type "_ftl", you will see the object. Now you want to access it. That is not so easy, though - _ftl is a deeply nested object, so you have to pick one of its properties and try to access it. Like:

a = driver.execute_script("return window._ftl._acts")

But accessing the data will be hard work: _ftl is a nested object, and Selenium's JS serializer can't handle it automatically.
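
One workaround for the serializer problem (a sketch - it assumes _ftl survives JSON.stringify, which throws if the object contains circular references and silently drops functions) is to serialize it in the browser and parse it back in Python:

import json

# stringify in the browser, parse in Python; returns None if _ftl is missing
raw = driver.execute_script("return JSON.stringify(window._ftl);")
if raw is not None:
    ftl = json.loads(raw)
    print(list(ftl.keys()))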

So, another answer:

import requests
from bs4 import BeautifulSoup

url = "https://epco.taleo.net/careersection/alljobs/jobsearch.ftl?lang=en"

g = requests.get(url)

# the _ftl data is created by the last <script> tag on the page
soup = BeautifulSoup(g.text, "html.parser")
ftl_script = soup.find_all('script')[-1]
data_you_need = ftl_script.text

But this gives you a raw string; you still have to figure out how to process it.
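
What that processing looks like depends on how the script embeds the data; purely as an illustration (the regex below is a hypothetical placeholder, not tested against this page), you could try to cut out an object literal and parse it:

import json
import re

# hypothetical: grab the first {...} blob assigned inside the script text
match = re.search(r'=\s*(\{.*\})\s*;', data_you_need, re.DOTALL)
if match:
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        data = None  # the blob is JavaScript, not strict JSON; a JS parser would be needed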
