Nothing is fetched on Nutch 1.16: bad request

Question (votes: 0, answers: 1)

I am new to apache-nutch and want to crawl a few questions on Stack Overflow. My urls/seed.txt contains the following data:

/questions/58763948/setting-a-list-item-is-converting-it-into-a-tuple
/questions/58763947/start-up-eclipse-an-error-has-occured-see-the-log-file
/questions/58763946/problem-with-the-proxy-using-zap-docker-image-gitlab
/questions/58763945/how-to-select-unique-random-data-based-on-percent-in-sql
/questions/58763943/probelm-with-using-filter-function-to-remove-missing-values-form-a-dataset
/questions/58763942/flutter-keep-data-in-textfield-after-setstate
/questions/58763941/are-receipts-generated-by-google-play-api-v2-and-the-latest-version-v3-compatibl
/questions/58763940/how-to-add-eventhandler-to-popupmenuitem-in-flutter
/questions/58763938/how-to-solve-electron-and-grpc-version-problem-in-angular-project
...

Is there any property I can set in nutch-site.xml so that https://stackoverflow.com is prepended to every URL in seed.txt? The file is large, so I don't want to change seed.txt.

nutch
1 answer
0 votes
No, there is no such configuration property. However, you can achieve this by adding the following rule to regex-normalize.xml:

<regex> <pattern>^/</pattern> <substitution>https://stackoverflow.com/</substitution> </regex>
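
For context, here is a minimal sketch of where that rule would sit in conf/regex-normalize.xml. It assumes the file keeps its usual `<regex-normalize>` root element; the comment marking the existing rules is a placeholder, not actual file content.

```xml
<?xml version="1.0"?>
<regex-normalize>

  <!-- (existing rules of your regex-normalize.xml stay here, unchanged) -->

  <!-- Assumption: every seed line starts with "/" and belongs to stackoverflow.com.
       The leading slash is replaced by the scheme and host. -->
  <regex>
    <pattern>^/</pattern>
    <substitution>https://stackoverflow.com/</substitution>
  </regex>

</regex-normalize>
```

With this rule, a seed line such as /questions/58763942/flutter-keep-data-in-textfield-after-setstate should be normalized to https://stackoverflow.com/questions/58763942/flutter-keep-data-in-textfield-after-setstate.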

Also make sure the plugin urlnormalizer-regex is included in the property plugin.includes.
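
As a rough sketch, an override of plugin.includes in conf/nutch-site.xml could look like the fragment below. The plugin list shown is an assumption modelled on what I believe the Nutch 1.x default looks like; keep whatever list your installation already uses and only check that urlnormalizer-regex is matched by it.

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed example value; the relevant part is urlnormalizer-(pass|regex|basic),
       which matches the urlnormalizer-regex plugin. -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If plugin.includes is not overridden in nutch-site.xml at all, the value from nutch-default.xml applies, so it is worth checking there first.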