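I'm building a bot that browses some videos on EdX and downloads them. For each response, I check the headers to determine whether the content is a video. These are the checks I have tried so far: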
async enableDownload(): Promise<void> {
  this.page.on("response", async (response: HTTPResponse) => {
    const url = new URL(response.url());
    const headers = response.headers();

    // Check whether the response is sent in chunks
    if (headers["transfer-encoding"] === "chunked") {
      console.log("Chunked transfer detected for URL:", response.url());
      // Optionally read the response body to inspect it. The body is not
      // always available (e.g. for redirects), so guard the call.
      try {
        console.log(await response.text());
      } catch (err) {
        console.log("Could not read body for URL:", response.url(), err);
      }
    }

    // Check the length in order to find big files
    if (headers["content-length"]) {
      const contentLength = parseInt(headers["content-length"], 10);
      console.log(contentLength);
      // Threshold in bytes (100000 is roughly 100 KB)
      if (contentLength > 100000) {
        console.log("Large file detected:", response.url());
      }
    }

    // Check the file extension in the URL path
    const pathname = url.pathname;
    const videoExtensions = [".mp4", ".webm", ".avi", ".mov"];
    const isVideo = videoExtensions.some((ext) => pathname.includes(ext));
    console.log(isVideo);
    console.log(pathname);

    // And the most obvious one: check the content-type
    if (headers["content-type"]?.includes("video")) {
      console.log("video");
      console.log(pathname);
    }
  });
}
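When I launch the script with headless: false, I can see the video load, but none of the checks above detect it. Am I doing something wrong?
The code that launches Puppeteer looks like this: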
const endpoint = "wsEndpoint";
const browser = await puppeteer.connect({ browserWSEndpoint: endpoint });
const page = await browser.newPage();
await page.goto(
  "https://learning.edx.org/course/course-v1:MITx+6.431x+1T2024/block-v1:MITx+6.431x+1T2024+type@sequential+block@Lec__1_Probability_models_and_axioms"
);
const downloadUtils = new DownloadUtils(page);
await downloadUtils.enableDownload();
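It looks like you're on the right track, but detecting video files can be trickier than just checking the transfer encoding or content length. Videos on platforms like EdX may be protected or loaded in non-obvious ways. Try focusing on the Content-Type header as well. Videos usually have types such as video/mp4 or video/mpeg. If the platform uses a streaming protocol (such as HLS or DASH), look for application/vnd.apple.mpegurl or application/dash+xml. These may point you toward the right way to identify the video responses. Also, consider whether any DRM might prevent a direct download.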
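As a rough sketch of the Content-Type check described above, something like the following could classify each response. The helper name classifyMediaResponse, the MediaKind type, and the extension fallback (.m3u8, .mpd) are my own assumptions for illustration, not part of Puppeteer or EdX, and application/x-mpegurl is included only because some servers use it as an alternative HLS type.

```typescript
// Hypothetical helper (not a Puppeteer API): classifies a response by its
// Content-Type header, falling back to the file extension in the URL path.
type MediaKind = "video" | "hls" | "dash";

// Assumed lists; extend them to match what the target site actually sends.
const HLS_TYPES = ["application/vnd.apple.mpegurl", "application/x-mpegurl"];
const DASH_TYPES = ["application/dash+xml"];
const VIDEO_EXTENSIONS = [".mp4", ".webm", ".avi", ".mov"];

function classifyMediaResponse(
  contentType: string | undefined,
  responseUrl: string
): MediaKind | null {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const [rawMime = ""] = (contentType ?? "").split(";");
  const mime = rawMime.trim().toLowerCase();

  if (mime.startsWith("video/")) return "video";
  if (HLS_TYPES.includes(mime)) return "hls";
  if (DASH_TYPES.includes(mime)) return "dash";

  // Fallback for servers that send a generic type such as
  // application/octet-stream: look at the end of the URL path.
  let pathname: string;
  try {
    pathname = new URL(responseUrl).pathname.toLowerCase();
  } catch {
    return null; // not a parseable absolute URL
  }
  if (pathname.endsWith(".m3u8")) return "hls";
  if (pathname.endsWith(".mpd")) return "dash";
  if (VIDEO_EXTENSIONS.some((ext) => pathname.endsWith(ext))) return "video";
  return null;
}
```

Inside the response handler from the question, this could be called as classifyMediaResponse(response.headers()["content-type"], response.url()). A result of "hls" or "dash" would mean the response is a manifest that lists the actual media segments rather than the video itself, so a single large video response may never appear.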