Unable to detect video files when inspecting responses in Puppeteer while scraping EdX videos


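I am building a bot that browses some videos on EdX and downloads them. For every response, I check the headers to determine whether the content is a video. These are the checks I have tried: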

  async enableDownload(): Promise<void> {
    this.page.on("response", async (response: HTTPResponse) => {
      const url = new URL(response.url());
      const headers = response.headers();

      // Check whether the video is loaded in chunks
      if (headers["transfer-encoding"] === "chunked") {
        console.log("Chunked transfer detected for URL:", response.url());
        // Read the response body to inspect it
        const text = await response.text();
        console.log(text);
      }

      // Check the length in order to find big files
      if (headers["content-length"]) {
        const contentLength = parseInt(headers["content-length"], 10);
        console.log(contentLength);
        // Threshold of 100,000 bytes (about 100 KB)
        if (contentLength > 100000) {
          console.log("Large file detected:", response.url());
        }
      }

      // Check for a video file extension in the path
      const pathname = url.pathname;
      const videoExtensions = [".mp4", ".webm", ".avi", ".mov"];
      const isVideo = videoExtensions.some((ext) => pathname.includes(ext));
      console.log(isVideo);
      console.log(url.pathname);

      // And the most obvious one: check the content-type
      if (headers["content-type"]?.includes("video")) {
        console.log("video");
        console.log(url.pathname);
      }
    });
  }

When I launch the script with headless: false, I can see the video load, but none of the checks above find it. Am I doing something wrong?

The code that launches Puppeteer looks like this:

 const endpoint =
    "wsEndpoint";
  const browser = await puppeteer.connect({ browserWSEndpoint: endpoint });
  const page = await browser.newPage();

  await page.goto(
    "https://learning.edx.org/course/course-v1:MITx+6.431x+1T2024/block-v1:MITx+6.431x+1T2024+type@sequential+block@Lec__1_Probability_models_and_axioms"
  );

  const downloadUtils = new DownloadUtils(page);
  await downloadUtils.enableDownload();
typescript web-scraping puppeteer screen-scraping
1 Answer

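It looks like you are on the right track, but detecting video files can be trickier than checking the transfer encoding or the content length. Videos on platforms like EdX may be protected or loaded in ways that are not straightforward. Try focusing on the Content-Type header as well. Videos usually have types such as video/mp4 or video/mpeg. If the platform uses a streaming protocol (for example HLS or DASH), look for application/vnd.apple.mpegurl or application/dash+xml. These may point you to the right way of identifying video responses. Also, consider whether any DRM might prevent direct download.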
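
As a rough illustration of the Content-Type checks described above, the classification could be pulled into a small pure function. This is only a sketch: classifyMedia, MediaKind, and VIDEO_EXTENSIONS are hypothetical names that are not part of Puppeteer, and the function assumes header names are already lowercased, as in the object returned by response.headers(). Whether EdX actually serves its videos this way has not been verified.

```typescript
// Hypothetical helper: the names and categories here are illustrative only.
type MediaKind = "video" | "hls-manifest" | "dash-manifest" | "other";

// Same extension list as in the question.
const VIDEO_EXTENSIONS = [".mp4", ".webm", ".avi", ".mov"];

function classifyMedia(rawUrl: string, headers: Record<string, string>): MediaKind {
  // Assumes lowercase header names, as Puppeteer's response.headers() provides.
  const contentType = (headers["content-type"] ?? "").toLowerCase();
  // Use only the path so query strings do not hide the extension.
  const pathname = new URL(rawUrl).pathname.toLowerCase();

  // HLS playlists: application/vnd.apple.mpegurl (or x-mpegurl), usually *.m3u8
  if (contentType.includes("mpegurl") || pathname.endsWith(".m3u8")) {
    return "hls-manifest";
  }
  // DASH manifests: application/dash+xml, usually *.mpd
  if (contentType.includes("dash+xml") || pathname.endsWith(".mpd")) {
    return "dash-manifest";
  }
  // Plain video responses such as video/mp4 or video/mpeg
  if (contentType.startsWith("video/")) {
    return "video";
  }
  // Fall back to the file extension when the header is missing or generic
  if (VIDEO_EXTENSIONS.some((ext) => pathname.endsWith(ext))) {
    return "video";
  }
  return "other";
}
```

Inside the response handler this could be called as classifyMedia(response.url(), response.headers()). Also note that a listener only sees responses that arrive after it has been attached, so in the launch code from the question it may help to call enableDownload() before page.goto() rather than after it.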
