Headless browser is not scraping as much data as it should (debugging)

I am trying to find out exactly why my headless program retrieves less data than the headed (graphical) one.

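You can find the repository here, and to run the code you must have a TikTok account. This is because loading your cookies into the browser gets rid of the pop-ups and makes the program easier to write.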

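After cloning, run

node cookieLoader.js

log in to your TikTok account, and press Enter to save your cookies.

Then run the main program with this command (headless by default):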

node index.js -m undertimeslopper

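If the repository no longer exists, or you don't want to clone it, you can follow along with these snippets. This one saves your TikTok cookies once you have logged in to the site and pressed Enter.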

const fs = require('fs');
const { exit } = require('process');
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Apply the stealth plugin to avoid being detected as a bot
puppeteer.use(StealthPlugin());

(async () => {
  // Prompt interface used to wait until the user has finished logging in
  const rl = require('node:readline').createInterface({
    input: process.stdin,
    output: process.stdout,
  });

  // Launch a visible browser so the user can log in manually
  const browser = await puppeteer.launch({ headless: false });

  // Open a new page
  const page = await browser.newPage();

  // Navigate to TikTok
  await page.goto('https://www.tiktok.com');

  rl.question('Press Enter to save your cookies\n', async () => {
    rl.close();
    const cookies = await page.cookies();
    console.log(cookies);
    // writeFileSync is synchronous, so it needs no await
    fs.writeFileSync('./cookies.json', JSON.stringify(cookies, null, 2));
    await browser.close();
    exit();
  });
})();

Then you can run the actual program with this snippet.

const chalk = require("chalk");
const fs = require("fs");
const puppeteer = require("puppeteer");
const { exit } = require("process");
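const path = require("path");

// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
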

const loadCookie = async (page) => {
    // Load the cookies saved by the cookie script so that TikTok's pop-ups stay out of the way
    const cookieJson = fs.readFileSync(path.join(__dirname, 'cookies.json'), 'utf8');
    const cookies = JSON.parse(cookieJson);
    await page.setCookie(...cookies);
}

const generateUrlProfile = (username) => {
    var baseUrl = "https://www.tiktok.com/";
    if (username.includes("@")) {
        baseUrl = `${baseUrl}${username}`;
    } else {
        baseUrl = `${baseUrl}@${username}`;
    }
    return baseUrl;
};

const getListVideoByUsername = async (username) => {

    const baseUrl = generateUrlProfile(username)
  
    const browser = await puppeteer.launch({
        headless: true,
    })
    
    const page = await browser.newPage()

    await page.setRequestInterception(true)
    // Skip image downloads to speed up page loads
    page.on('request', (request) => {
        if (request.resourceType() === 'image') request.abort()
        else request.continue()
    })

    await loadCookie(page);
    await page.setUserAgent(
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4182.0 Safari/537.36"
    );
    await page.goto(baseUrl).catch(err =>{
         console.error(err)
         exit();
    });
    await page.keyboard.press('Escape')
    const delay_milliseconds=3000+500
    const delay_after_load=1000
    
    await page.keyboard.press('Escape')

    try {

        await sleep(delay_milliseconds)
    
        const xpathSelector = "//button[contains(text(),'Refresh')]"; // Replace with your XPath
        await page.evaluate(xpath => {
            const xpathResult = document.evaluate(xpath, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
            const element = xpathResult.singleNodeValue;
            if (element) {
                element.click()
            }
        }, xpathSelector);

        await sleep(delay_after_load)
    } catch (error) {
        // The "Refresh" button is optional, so log the failure and carry on
        console.error(error)
    }
    
    await page.keyboard.press('Escape')
    var listVideo = []
    console.log(chalk.green("[*] Getting list video from: " + username))
    var loop = true
    var no_video_found=false
   
    while(loop) {
        listVideo = await page.evaluate(() => {
            const anchors = document.querySelectorAll('a');

            // Collect unique /video/ and /photo/ links, rewriting photo URLs as video URLs
            return Array.from(anchors).map(item => item.href)
                .filter(href => href.includes('/video/') || href.includes('/photo/'))
                .filter((value, index, self) => self.indexOf(value) === index)
                .map(item => item.replace('photo', 'video'));
        });
    
        console.log(chalk.green(`[*] ${listVideo.length} video found`))
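        const previousHeight = await page.evaluate("document.body.scrollHeight").catch(() => {
            // Ignore evaluation errors; the wait below then times out and ends the loop
        });
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)").catch(() => {
            // Ignore scroll errors
        })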
                          
        await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`, {timeout: 10000})
        .catch(() => {
            console.log(chalk.red("[X] No more video found"));
            console.log(chalk.green(`[*] Total video found: ${listVideo.length}`))
            loop = false
            if(listVideo.length===0){
                no_video_found=true
            }
            
            
        });
        await new Promise((resolve) => setTimeout(resolve, 1000));
    } 
    await browser.close()
    return listVideo
}

(async () => {
    await getListVideoByUsername('undertimeslopper') // or any valid TikTok username
})()

The output on my machine is

[*] Getting list video from: undertimeslopper
[*] 35 video found
[*] 69 video found
[*] 69 video found
[X] No more video found

But when I go to the puppeteer.launch call in getListVideoByUsername and change headless: true to headless: false, the output is

[*] Getting list video from: undertimeslopper
[*] 35 video found
[*] 69 video found
[*] 102 video found
[*] 137 video found
[*] 158 video found
[X] No more video found

As we can see, the headed program behaves as expected and scrapes all of the user's videos, while the headless program only gets 69.
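One way to narrow this down is to save the URL list from each run and look at exactly which links the headless run is missing. The following is only a sketch; the helper names (missingUrls, saveRun, loadRun) and the file names in the usage note are made up for illustration and are not part of the repository.

```javascript
const fs = require('fs');

// Return the URLs that appear in `full` but not in `partial`, keeping their order.
function missingUrls(full, partial) {
    const seen = new Set(partial);
    return full.filter((url) => !seen.has(url));
}

// Hypothetical helper: write one run's URL list to disk as JSON.
function saveRun(file, urls) {
    fs.writeFileSync(file, JSON.stringify(urls, null, 2));
}

// Hypothetical helper: read a previously saved URL list back.
function loadRun(file) {
    return JSON.parse(fs.readFileSync(file, 'utf8'));
}
```

For example, after calling saveRun('headed.json', listVideo) in one run and saveRun('headless.json', listVideo) in the other, missingUrls(loadRun('headed.json'), loadRun('headless.json')) would list the links that only the headed run found, which shows whether the headless run stops at a particular point in the feed.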

This is the heart of the problem: I intend to run this script headlessly on a server, and if I cannot get all of the videos it is worthless.

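You do not have to run the code to help me. Essentially, I am just looking for ways to debug and see what the headless browser is doing; I have included the instructions and output as supplementary information.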
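As a rough sketch of what such debugging could look like, the two helpers below mirror page activity into the Node console and save a screenshot plus the current HTML for later inspection. The function names (attachDebugLogging, captureSnapshot) and the output directory are assumptions for illustration; the Puppeteer calls used are page.on, page.screenshot and page.content.

```javascript
const fs = require('fs');
const path = require('path');

// Mirror what the page is doing into the Node console.
// `page` is a Puppeteer Page; `log` defaults to console.log.
function attachDebugLogging(page, log = console.log) {
    // Messages the page itself writes to its console
    page.on('console', (msg) => log(`[console.${msg.type()}] ${msg.text()}`));
    // Uncaught exceptions thrown inside the page
    page.on('pageerror', (err) => log(`[pageerror] ${err && err.message ? err.message : String(err)}`));
    // Requests that never completed (aborted, blocked, network errors)
    page.on('requestfailed', (req) => {
        const failure = req.failure();
        log(`[requestfailed] ${req.url()} ${failure ? failure.errorText : ''}`);
    });
    // Responses with an HTTP error status
    page.on('response', (res) => {
        if (res.status() >= 400) log(`[response ${res.status()}] ${res.url()}`);
    });
}

// Save a full-page screenshot and the current HTML under `dir`, named after `label`.
async function captureSnapshot(page, dir, label) {
    fs.mkdirSync(dir, { recursive: true });
    const png = path.join(dir, `${label}.png`);
    const html = path.join(dir, `${label}.html`);
    await page.screenshot({ path: png, fullPage: true });
    fs.writeFileSync(html, await page.content());
    return { png, html };
}
```

In the script above, attachDebugLogging(page) could be called right after browser.newPage(), and await captureSnapshot(page, 'debug', 'after-scroll') inside the scroll loop. Because that script aborts image requests, a requestfailed line is to be expected for every image.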

puppeteer google-chrome-headless headless-browser
1 Answer

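This problem was caused either by a caching issue in the puppeteer library or simply by the version of the library. After upgrading puppeteer by changing the dependency entry from

"puppeteer": "^13.7.0",

to

"puppeteer": "^22.5.0",

the problem was resolved and the program runs as expected.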
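Since the fix came down to the library version, it may help to confirm which version is actually installed rather than only what the dependency entry declares. The following is a minimal sketch that assumes a standard node_modules layout in the project directory; the function names are hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Extract the major version from a semver range such as "^13.7.0" (null if none is found).
function majorVersion(range) {
    const match = /(\d+)\.\d+\.\d+/.exec(String(range));
    return match ? Number(match[1]) : null;
}

// Read the version of an installed package from node_modules (null if it is not installed).
function installedVersion(pkgName, projectDir = process.cwd()) {
    const file = path.join(projectDir, 'node_modules', pkgName, 'package.json');
    try {
        return JSON.parse(fs.readFileSync(file, 'utf8')).version;
    } catch (err) {
        return null;
    }
}

console.log('installed puppeteer:', installedVersion('puppeteer'));
```

If the printed version still has the old major number after editing the dependency entry, the install step has probably not been rerun.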
