Crawlee scraper calls the same handler multiple times

Problem description · Votes: 0 · Answers: 1

I've built a Crawlee scraper, but for some reason it calls the same handlers multiple times, creating lots of duplicate requests and duplicate entries in my dataset. Also:

  • I have already tried setting a uniqueKey manually on all of my requests.
  • I have also tried setting maxConcurrency: 1 on the crawler.
  • As you can see from the logs below, the problem is not that I add the same request multiple times. It is Crawlee calling the handlers multiple times with the same request (a quick way to verify this is sketched right after this list).
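For reference, this is the kind of one-liner that can confirm it (request.id, uniqueKey and retryCount are standard crawlee 3.x Request properties):

// At the top of each handler: if the same id shows up repeatedly with
// retryCount 0, Crawlee is re-processing a request it has already handled.
log.info(`id=${ request.id || '?' } | key=${ request.uniqueKey } | retries=${ request.retryCount }`);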

Here are the relevant (simplified) files:

main.ts

import { Actor } from 'apify';
import {
  CheerioCrawler, Dataset, log,
  type CrawlerAddRequestsOptions, type RequestOptions, type Source,
} from 'crawlee';
import { RouterHandlerLabels, router } from './router';

await Actor.init();

const crawler = new CheerioCrawler({
  requestHandler: router,
  sameDomainDelaySecs: 3,
  maxRequestRetries: 3,
  maxConcurrency: 1,
});

// Debugging aid: wrap addRequests() so that everything that gets enqueued is logged.
const originalAddRequestsFn = crawler.addRequests.bind(crawler);

crawler.addRequests = function(requests: Source[], options?: CrawlerAddRequestsOptions) {
  if (requests.length > 1) {
    log.info(`INITIAL REQUESTS = ${ requests.length }`);
  } else {
    log.info(`${ requests[0].label } | ${ requests[0].uniqueKey || '-' } = ${ requests[0].url }`);
  }

  return originalAddRequestsFn(requests, options);
};

// `dataset` is this scraper's own input (entries with a startURL) and
// `ScrapperData` is its userData type; both are defined elsewhere.
const requestsOptions: RequestOptions<ScrapperData>[] = [{
  uniqueKey: `ROUTE_A_${ dataset[0].startURL }`,
  url: dataset[0].startURL,
  label: RouterHandlerLabels.ROUTE_A,
  userData: { datasetIndex: 0 },
}, {
  uniqueKey: `ROUTE_A_${ dataset[1].startURL }`,
  url: dataset[1].startURL,
  label: RouterHandlerLabels.ROUTE_A,
  userData: { datasetIndex: 1 },
}];

try {
  await crawler.run(requestsOptions);
  await Dataset.exportToJSON(JSON_OUTPUT_FILE_KEY);
} finally {
  await Actor.exit();
}

router.ts

import { createCheerioRouter } from 'crawlee';

import { handlerA } from './handler-a';
import { handlerB } from './handler-b';
import { handlerC } from './handler-c';

export enum RouterHandlerLabels {
  ROUTE_A = 'route-a',
  ROUTE_B = 'route-b',
  ROUTE_C = 'route-c',
}

export const router = createCheerioRouter();

router.addHandler(RouterHandlerLabels.ROUTE_A, handlerA);
router.addHandler(RouterHandlerLabels.ROUTE_B, handlerB);
router.addHandler(RouterHandlerLabels.ROUTE_C, handlerC);

router.addDefaultHandler(async ({ log }) => {
  log.info('Default handler...');
});

handler-a.ts

export async function handlerA({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;

  log.info(`A. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const nextURL = findLinkToB(pageHTML);

  if (!nextURL) return;

  log.info('A. Call addRequests(...)');

  await crawler.addRequests([{
    uniqueKey: `ROUTE_B_${ nextURL }`,
    url: nextURL,
    headers: DEFAULT_REQUEST_HEADERS,
    label: RouterHandlerLabels.ROUTE_B,
    userData: request.userData,
  }]);
}

handler-b.ts

export async function handlerB({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;

  log.info(`B. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const nextURL = findLinkToC(pageHTML);

  if (!nextURL) return;

  log.info('B. Call addRequests(...)');

  await crawler.addRequests([{
    uniqueKey: `ROUTE_C_${ nextURL }`,
    url: nextURL,
    headers: DEFAULT_REQUEST_HEADERS,
    label: RouterHandlerLabels.ROUTE_C,
    userData: request.userData,
  }]);
}

handler-c.ts

export async function handlerC({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;

  log.info(`C. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const extractedData = findDataInPageC(pageHTML);

  if (!extractedData) return;

  log.info(`C. Saving data for ${ datasetIndex }`);

  await pushData({ ...extractedData, datasetIndex });
}

These are the logs I get:

INFO  System info {"apifyVersion":"3.1.12","apifyClientVersion":"2.8.1","crawleeVersion":"3.5.8","osType":"Linux","nodeVersion":"v20.8.1"}
INFO  INITIAL REQUESTS = 2
INFO  CheerioCrawler: Starting the crawler.
INFO  CheerioCrawler: A. 0: https://example.com/page-a/user-0
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-0 = https://example.com/page-b/user-0
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  Statistics: CheerioCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5599,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":50388,"requestsTotal":9,"crawlerRuntimeMillis":61279,"retryHistogram":[9]}
INFO  CheerioCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":true,"limitRatio":0.7,"actualRatio":0.858},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
INFO  CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO  CheerioCrawler: C. Saving data for 0
INFO  CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO  CheerioCrawler: C. Saving data for 0
INFO  CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO  CheerioCrawler: C. Saving data for 0
INFO  CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO  CheerioCrawler: C. Saving data for 0
INFO  CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO  CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO  CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO  CheerioCrawler: C. Saving data for 1
INFO  CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO  CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO  CheerioCrawler: C. Saving data for 1
INFO  CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO  CheerioCrawler: C. Saving data for 1
INFO  CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  CheerioCrawler: Final request statistics: {"requestsFinished":19,"requestsFailed":0,"retryHistogram":[19],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5150,"requestsFinishedPerMinute":10,"requestsFailedPerMinute":0,"requestTotalDurationMillis":97844,"requestsTotal":19,"crawlerRuntimeMillis":115660}
INFO  CheerioCrawler: Finished! Total 19 requests: 19 succeeded, 0 failed. {"terminal":true}

In this case it produced 7 results in total: 4 for the first dataset entry and 3 for the second (there should really be just one per dataset entry, i.e. 2 results in total).

Line 13 of the log is the first one that makes no sense:

INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1

By that point, both requests to page-a (one for user-0 and one for user-1) had already been handled (log lines 4 and 7, respectively).

I also tried adding just 1 initial request (when calling crawler.run(...)), but some handlers were still called multiple times for the same request.

I'm using crawlee 3.5.8.

Tags: javascript · node.js · web-crawler · apify · crawlee
1 Answer

Votes: 0
OK, so I got some help from Apify over on their Discord; this is a known bug:

This issue specifically appears when we use the sameDomainDelaySecs feature together with crawlee@3.5.8. Interestingly enough, when we use the same feature with crawlee@3.5.2, we don't run into this problem. So we suspect this may be related to this fix: #2045

So the problem seems to appear when sameDomainDelaySecs is used with version 3.5.4 or later, which leaves two solutions:

  • Pin crawlee to version 3.5.2 (e.g. npm install crawlee@3.5.2).
  • Stop using sameDomainDelaySecs. Alternatively, await a manual sleep(delayInMs) at the end of each handler (a usage sketch follows this list), using something like:

    export async function sleep(delayInMs: number): Promise<void> {
      return new Promise((resolve) => setTimeout(resolve, delayInMs));
    }
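Applied to the question's handler-a.ts, that second workaround would look roughly like this (a sketch, assuming the same fixed 3-second delay that sameDomainDelaySecs: 3 was providing; crawlee also exports its own sleep() utility that could be used instead of the helper above):

export async function handlerA({ request, $, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  // ...same scraping and addRequests() logic as in the original handler...

  // Manual replacement for `sameDomainDelaySecs: 3`: keep the handler alive
  // for 3 s so the next same-domain request is not picked up immediately
  // (works because the crawler runs with maxConcurrency: 1).
  await sleep(3000);
}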
    
    