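I'm trying to pin down exactly why my headless program retrieves less data than the graphical one.
You can find the repository here. To run the code you need a TikTok account, because loading your cookies into the browser gets rid of the popups and makes the program easier to write.
After cloning, run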
node cookieLoader.js
then log in to your TikTok account and press Enter so that the main program can be run.
Then try this command (headless by default)
node index.js -m undertimeslopper
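If the repository no longer exists or you'd rather not clone it, you can follow along with these snippets. This snippet creates your TikTok cookies once you have logged in to the site and pressed Enter.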
const fs = require('fs');
const readline = require('node:readline');
const { exit } = require('process');
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Apply the stealth plugin to avoid being detected as a bot
puppeteer.use(StealthPlugin());

(async () => {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });
  // Launch a visible browser so you can log in by hand
  const browser = await puppeteer.launch({ headless: false });
  // Open a new page
  const page = await browser.newPage();
  // Navigate to your desired URL
  await page.goto('https://www.tiktok.com');
  rl.question('Press enter button to save your cookies\n', async () => {
    rl.close();
    const cookies = await page.cookies();
    console.log(cookies);
    // writeFileSync is synchronous, so it needs no await
    fs.writeFileSync('./cookies.json', JSON.stringify(cookies, null, 2));
    exit();
  });
})();
Then you can run the actual program with this snippet.
const chalk = require("chalk");
const fs = require("fs");
const path = require("path");
const puppeteer = require("puppeteer");
const { exit } = require("process");

// Promise-based delay helper used by the waits below
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Load the cookies saved by the cookie snippet into the page
const loadCookie = async (page) => {
  const cookieJson = fs.readFileSync(path.join(__dirname, "cookies.json"), "utf8");
  const cookies = JSON.parse(cookieJson);
  await page.setCookie(...cookies);
};

// Build the profile URL; the username may or may not start with "@"
const generateUrlProfile = (username) => {
  const baseUrl = "https://www.tiktok.com/";
  return username.includes("@") ? `${baseUrl}${username}` : `${baseUrl}@${username}`;
};

const getListVideoByUsername = async (username) => {
  const baseUrl = generateUrlProfile(username);
  const browser = await puppeteer.launch({
    headless: true,
  });
  const page = await browser.newPage();

  // Abort image requests to speed up page loads
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (request.resourceType() === "image") request.abort();
    else request.continue();
  });

  await loadCookie(page);
  await page.setUserAgent(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4182.0 Safari/537.36"
  );
  await page.goto(baseUrl).catch((err) => {
    console.error(err);
    exit(1);
  });

  const delay_milliseconds = 3000 + 500;
  const delay_after_load = 1000;
  // Dismiss any popup before looking for the Refresh button
  await page.keyboard.press("Escape");
  await page.keyboard.press("Escape");
  try {
    await sleep(delay_milliseconds);
    // Click the "Refresh" button if the page shows one
    const xpathSelector = "//button[contains(text(),'Refresh')]";
    await page.evaluate((xpath) => {
      const xpathResult = document.evaluate(xpath, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
      const element = xpathResult.singleNodeValue;
      if (element) {
        element.click();
      }
    }, xpathSelector);
    await sleep(delay_after_load);
  } catch (error) {
    // Best effort: a missing Refresh button is not fatal
  }
  await page.keyboard.press("Escape");

  let listVideo = [];
  console.log(chalk.green("[*] Getting list video from: " + username));
  let loop = true;
  while (loop) {
    // Collect the unique /video/ and /photo/ links currently in the DOM
    listVideo = await page.evaluate(() => {
      const anchors = document.querySelectorAll("a");
      return Array.from(anchors)
        .map((item) => item.href)
        .filter((href) => href.includes("/video/") || href.includes("/photo/"))
        .filter((value, index, self) => self.indexOf(value) === index)
        .map((item) => item.replace("photo", "video"));
    });
    console.log(chalk.green(`[*] ${listVideo.length} video found`));

    // Scroll to the bottom and wait for the page to grow; a timeout ends the loop
    const previousHeight = await page.evaluate("document.body.scrollHeight").catch(() => {});
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)").catch(() => {});
    await page
      .waitForFunction(`document.body.scrollHeight > ${previousHeight}`, { timeout: 10000 })
      .catch(() => {
        console.log(chalk.red("[X] No more video found"));
        console.log(chalk.green(`[*] Total video found: ${listVideo.length}`));
        loop = false;
      });
    await sleep(1000);
  }
  await browser.close();
  return listVideo;
};

(async () => {
  await getListVideoByUsername("undertimeslopper"); // or any valid TikTok username
})().catch((err) => {
  console.error(err);
  exit(1);
});
The output on my machine is
[*] Getting list video from: undertimeslopper
[*] 35 video found
[*] 69 video found
[*] 69 video found
[X] No more video found
But when I go to the puppeteer.launch call in getListVideoByUsername and change headless: true to headless: false, the output is
[*] Getting list video from: undertimeslopper
[*] 35 video found
[*] 69 video found
[*] 102 video found
[*] 137 video found
[*] 158 video found
[X] No more video found
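As we can see, the graphical program performs as expected, scraping all of the user's videos, whereas the headless program only gets 69.
This is the crux of the problem: I intend to run this script headlessly on a server, and it is worthless if I can't get all the videos.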
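One way to see where the two modes diverge might be to log a few page metrics on every pass of the scroll loop and compare the headless and graphical runs side by side. The sketch below is only an illustration: snapshotScrollState and formatScrollState are hypothetical helper names, not part of the repository, and page is assumed to be the Puppeteer page used in getListVideoByUsername.

```javascript
// Hypothetical helper: capture numbers that help explain why the scroll loop stops.
// `page` is assumed to be a Puppeteer Page (any object with an async evaluate() works).
async function snapshotScrollState(page) {
  return page.evaluate(() => ({
    scrollHeight: document.body.scrollHeight, // total height of the document
    scrollY: window.scrollY, // how far the page is currently scrolled
    innerHeight: window.innerHeight, // viewport height as the page sees it
    // Matching anchors, duplicates included (the scraper itself de-duplicates by href)
    links: document.querySelectorAll('a[href*="/video/"], a[href*="/photo/"]').length,
  }));
}

// Hypothetical helper: format a snapshot as one log line.
function formatScrollState(state) {
  return `height=${state.scrollHeight} y=${state.scrollY} viewport=${state.innerHeight} links=${state.links}`;
}
```

Calling console.log(formatScrollState(await snapshotScrollState(page))) right after the scrollTo call would show whether the headless page stops growing or simply reports a different viewport.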
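You don't have to run the code to help me. Essentially, I'm just looking for ways to debug and see what the headless browser is doing, but I've included the instructions and output as supplementary information.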
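For the debugging part, a minimal sketch of what I have in mind is below: forward the page's console messages, errors, and failed or rejected requests to the Node console, and save a screenshot of what the headless browser renders. The helper names attachDebugLogging and saveDebugScreenshot and the debug-<label>.png file name are assumptions made up for this example; the events and page.screenshot are standard Puppeteer Page APIs.

```javascript
// Hypothetical helper: forward what happens inside the page to a log function.
// `page` is assumed to be a Puppeteer Page.
function attachDebugLogging(page, log = console.log) {
  // Messages the page itself writes with console.log/warn/error
  page.on('console', (msg) => log(`[console] ${msg.type()}: ${msg.text()}`));
  // Uncaught exceptions thrown by the page's scripts
  page.on('pageerror', (err) => log(`[pageerror] ${err.message}`));
  // Requests that never completed (blocked, aborted, network error)
  page.on('requestfailed', (req) => {
    const failure = req.failure();
    log(`[requestfailed] ${req.url()} ${failure ? failure.errorText : ''}`);
  });
  // Responses the server answered with an error status
  page.on('response', (res) => {
    if (res.status() >= 400) log(`[response] ${res.status()} ${res.url()}`);
  });
}

// Hypothetical helper: save what the browser currently renders to debug-<label>.png.
async function saveDebugScreenshot(page, label) {
  const file = `debug-${label}.png`;
  await page.screenshot({ path: file, fullPage: true });
  return file;
}
```

attachDebugLogging(page) would go right after browser.newPage(), and saveDebugScreenshot(page, 'no-more-video') inside the timeout handler. Since the script aborts image requests on purpose, a requestfailed line for every image is expected and can be ignored.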
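This problem was caused either by a caching issue in the puppeteer library or simply by the library version. After upgrading puppeteer by changing
"puppeteer": "^13.7.0",
to
"puppeteer": "^22.5.0",
the problem was resolved and the program ran as expected.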
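To confirm which browser build a given puppeteer version actually launches, something like the following could be logged before and after the upgrade. This is a sketch only: describeBrowser is a hypothetical helper name, and browser is assumed to be the object returned by puppeteer.launch.

```javascript
// Hypothetical helper: report which browser build Puppeteer launched.
// `browser` is assumed to be a Puppeteer Browser instance.
async function describeBrowser(browser) {
  const [version, userAgent] = await Promise.all([
    browser.version(), // browser name and version string
    browser.userAgent(), // the browser's default user agent
  ]);
  return { version, userAgent };
}
```

Running console.log(await describeBrowser(browser)) right after the launch call prints both values. Note that browser.userAgent() reports the browser's default, not the string passed to page.setUserAgent.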