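I am currently learning Node.js and the Puppeteer library for scraping, and I have a question about redirects. I want to scrape the URL 'https://www.facebook.com/ladbible', but Facebook redirects it to the correct URL 'https://www.facebook.com/LADbible'.
In the Network tab of the browser's developer tools, the initial request to Facebook returns status 302, and Facebook then redirects automatically to the correct URL. I have attached an image for reference.
This is what my code looks like: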
const puppeteer = require("puppeteer");

const testRun = async () => {
  const browser = await puppeteer.launch({
    headless: false,
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  });
  const page = await browser.newPage();
  const link = 'https://www.facebook.com/ladbible/';
  page.on("requestfinished", async (request) => {
    const response = request.response();
    if (response.url().includes(link) && response.status() == 200) {
      console.log(response.url());
    }
  });
  await page.goto(link, { waitUntil: "networkidle2", timeout: 0 });
};

testRun();
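Since I want to set up a listener for scraping, is it possible to wait for the final redirect and only then have the listener pick up the response?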
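You don't need a listener to get the final navigation response. Puppeteer follows the redirects itself, and page.goto() resolves with the final response of the navigation, after any redirects: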
const puppeteer = require("puppeteer");

const testRun = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  const url = 'https://www.facebook.com/ladbible/';
  console.log('initial', url);
  // page.goto() resolves with the response of the last request in the redirect chain
  const result = await page.goto(url, { waitUntil: "networkidle2", timeout: 0 });
  console.log('final', result.url()); // 'https://www.facebook.com/LADbible'
  // redirectChain() lists the earlier requests that were redirected on the way here
  console.log('redirects', result.request().redirectChain().length); // 1
  await browser.close();
};

testRun();
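If you want to see every hop rather than just the count, the response returned by page.goto() can be walked backwards through its redirect chain. The following is only a sketch: `describeRedirects` is a hypothetical helper name, not part of Puppeteer, and it relies only on `response.request()`, `request.redirectChain()`, `request.url()`, `request.response()` and `response.status()`.

```javascript
// Hypothetical helper (not a Puppeteer API): list each hop of a navigation.
// `response` is expected to look like the object page.goto() resolves with.
function describeRedirects(response) {
  const finalRequest = response.request();
  // redirectChain() holds the earlier, redirected requests, not the final one.
  const hops = finalRequest.redirectChain().map((req) => {
    const res = req.response(); // may be null if no response was received
    return { url: req.url(), status: res ? res.status() : null };
  });
  // Append the final response so the list ends with the page actually loaded.
  hops.push({ url: response.url(), status: response.status() });
  return hops;
}
```

Inside testRun() this could be called as console.table(describeRedirects(result)) right after page.goto(). Since page.goto() can resolve with null in some cases (for example, same-URL navigations that only change the hash), a null check before the call would be prudent.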
If you still want to use a listener for some reason, you can do the following:
const puppeteer = require("puppeteer");

const testRun = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  const link = 'https://www.facebook.com/ladbible/';
  let navRequest;
  page.on("requestfinished", (request) => {
    // Keep the most recent main-frame navigation request; iframe navigations
    // also count as navigation requests, so filter them out.
    if (request.isNavigationRequest() && request.frame() === page.mainFrame()) {
      navRequest = request;
    }
  });
  await page.goto(link, { waitUntil: "networkidle2", timeout: 0 });
  console.log('final', navRequest.response().url()); // 'https://www.facebook.com/LADbible'
  console.log('redirects', navRequest.redirectChain().length); // 1
  await browser.close();
};

testRun();
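As a side note, the listener in the question most likely never logs anything for two reasons: the first response has status 302 rather than 200, and `includes(link)` is case-sensitive, so the redirected URL containing 'LADbible' does not match a link containing 'ladbible'. If you need to check that a response belongs to the page you asked for, a case-insensitive comparison is one option. This is a minimal sketch; `isSamePage` is a hypothetical helper, and treating paths as case-insensitive is an assumption based on the redirect described above, not a general rule for URLs.

```javascript
// Hypothetical helper: compare two URLs by origin and path,
// ignoring letter case in the path and any trailing slashes.
// Query strings and fragments are deliberately ignored.
function isSamePage(a, b) {
  const normalize = (value) => {
    const u = new URL(value); // throws on an invalid URL
    const path = u.pathname.replace(/\/+$/, "").toLowerCase();
    return u.origin + path; // origin is already lowercased by the URL parser
  };
  return normalize(a) === normalize(b);
}
```

In the listener, the condition could then become isSamePage(response.url(), link) && response.status() === 200.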