I've built a Crawlee scraper, but for some reason it calls the same handlers multiple times, creating lots of duplicate requests and duplicate entries in my dataset. Also: every request has its own uniqueKey, and the crawler runs with maxConcurrency: 1. Here are the relevant (simplified) files:
main.ts:
await Actor.init();

const crawler = new CheerioCrawler({
    requestHandler: router,
    sameDomainDelaySecs: 3,
    maxRequestRetries: 3,
    maxConcurrency: 1,
});

const originalAddRequestsFn = crawler.addRequests.bind(crawler);
crawler.addRequests = function (requests: Source[], options: CrawlerAddRequestsOptions) {
    if (requests.length > 1) {
        log.info(`INITIAL REQUESTS = ${ requests.length }`);
    } else {
        log.info(`${ requests[0].label } | ${ requests[0].uniqueKey || '-' } = ${ requests[0].url }`);
    }
    return originalAddRequestsFn(requests, options);
};

const requestsOptions: RequestOptions<ScrapperData>[] = [{
    uniqueKey: `ROUTE_A_${ dataset[0].startURL }`,
    url: dataset[0].startURL,
    label: RouterHandlerLabels.ROUTE_A,
    userData: { datasetIndex: 0 },
}, {
    uniqueKey: `ROUTE_A_${ dataset[1].startURL }`,
    url: dataset[1].startURL,
    label: RouterHandlerLabels.ROUTE_A,
    userData: { datasetIndex: 1 },
}];

try {
    await crawler.run(requestsOptions);
    await Dataset.exportToJSON(JSON_OUTPUT_FILE_KEY);
} finally {
    await Actor.exit();
}
router.ts:
export enum RouterHandlerLabels {
    ROUTE_A = 'route-a',
    ROUTE_B = 'route-b',
    ROUTE_C = 'route-c',
}

export const router = createCheerioRouter();

router.addHandler(RouterHandlerLabels.ROUTE_A, handlerA);
router.addHandler(RouterHandlerLabels.ROUTE_B, handlerB);
router.addHandler(RouterHandlerLabels.ROUTE_C, handlerC);
router.addDefaultHandler(async ({ log }) => {
    log.info('Default handler...');
});
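The router's only job here is to dispatch on request.label; a minimal stand-in (plain TypeScript, not Crawlee's createCheerioRouter) behaves like this:

```typescript
// Minimal label-based dispatch, mirroring how the router above is wired up.
// Illustrative only; handler signatures are simplified to plain functions.
type Handler = (label: string) => string;

class MiniRouter {
    private handlers = new Map<string, Handler>();
    private fallback: Handler = () => 'default';

    addHandler(label: string, h: Handler): void {
        this.handlers.set(label, h);
    }

    addDefaultHandler(h: Handler): void {
        this.fallback = h;
    }

    // Requests with an unknown (or missing) label fall through to the default.
    route(label?: string): string {
        const h = label !== undefined ? this.handlers.get(label) : undefined;
        return (h ?? this.fallback)(label ?? '');
    }
}

const r = new MiniRouter();
r.addHandler('route-a', () => 'A');
r.addHandler('route-b', () => 'B');
```

So each label maps to exactly one handler; nothing in the routing itself should cause repeated invocations.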
handler-a.ts:
export async function handlerA({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
    const { datasetIndex } = request.userData;
    log.info(`A. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);
    const pageHTML = $('body').html() || '';
    const nextURL = findLinkToB(pageHTML);
    if (!nextURL) return;
    log.info('A. Call addRequests(...)');
    await crawler.addRequests([{
        uniqueKey: `ROUTE_B_${ nextURL }`,
        url: nextURL,
        headers: DEFAULT_REQUEST_HEADERS,
        label: RouterHandlerLabels.ROUTE_B,
        userData: request.userData,
    }]);
}
handler-b.ts:
export async function handlerB({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
    const { datasetIndex } = request.userData;
    log.info(`B. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);
    const pageHTML = $('body').html() || '';
    const nextURL = findLinkToC(pageHTML);
    if (!nextURL) return;
    log.info('B. Call addRequests(...)');
    await crawler.addRequests([{
        uniqueKey: `ROUTE_C_${ nextURL }`,
        url: nextURL,
        headers: DEFAULT_REQUEST_HEADERS,
        label: RouterHandlerLabels.ROUTE_C,
        userData: request.userData,
    }]);
}
handler-c.ts:
export async function handlerC({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
    const { datasetIndex } = request.userData;
    log.info(`C. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);
    const pageHTML = $('body').html() || '';
    const extractedData = findDataInPageC(pageHTML);
    if (!extractedData) return;
    log.info(`C. Saving data for ${ datasetIndex }`);
    await pushData({ ...extractedData, datasetIndex });
}
These are the logs I get:
INFO System info {"apifyVersion":"3.1.12","apifyClientVersion":"2.8.1","crawleeVersion":"3.5.8","osType":"Linux","nodeVersion":"v20.8.1"}
INFO INITIAL REQUESTS = 2
INFO CheerioCrawler: Starting the crawler.
INFO CheerioCrawler: A. 0: https://example.com/page-a/user-0
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-0 = https://example.com/page-b/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO Statistics: CheerioCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5599,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":50388,"requestsTotal":9,"crawlerRuntimeMillis":61279,"retryHistogram":[9]}
INFO CheerioCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":true,"limitRatio":0.7,"actualRatio":0.858},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO CheerioCrawler: Final request statistics: {"requestsFinished":19,"requestsFailed":0,"retryHistogram":[19],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5150,"requestsFinishedPerMinute":10,"requestsFailedPerMinute":0,"requestTotalDurationMillis":97844,"requestsTotal":19,"crawlerRuntimeMillis":115660}
INFO CheerioCrawler: Finished! Total 19 requests: 19 succeeded, 0 failed. {"terminal":true}
In this run it produced 7 results in total: 4 for the first dataset entry and 3 for the second (it should really produce exactly one per dataset entry, i.e. 2 results in total).
Line 13 of the log is the first one that makes no sense:
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
By that point, both requests to page-a (one for user-0 and one for user-1) had already been handled (lines 4 and 7 of the log, respectively). I also tried adding only 1 initial request (when calling crawler.run(...)), but some handlers were still invoked multiple times for the same request. I'm using crawlee 3.5.8.
With some help from Discord, it turns out this is a known bug:

"It looks like this issue happens specifically when the sameDomainDelaySecs feature is used in combination with [email protected]. Interestingly, when we use the same feature with [email protected], we don't experience this issue. Hence, we suspect it may be related to fix #2045."

The bug appeared when sameDomainDelaySecs was used starting with version 3.5.4, so there are two workarounds:

1. Downgrade to 3.5.2.
2. Remove sameDomainDelaySecs. Alternatively, you can call await sleep(delayInMs) at the end of each handler:
export async function sleep(delayInMs: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, delayInMs));
}