I am new to apache-nutch and want to crawl a few questions on stackoverflow. My urls/seed.txt contains the following data:
/questions/58763948/setting-a-list-item-is-converting-it-into-a-tuple
/questions/58763947/start-up-eclipse-an-error-has-occured-see-the-log-file
/questions/58763946/problem-with-the-proxy-using-zap-docker-image-gitlab
/questions/58763945/how-to-select-unique-random-data-based-on-percent-in-sql
/questions/58763943/probelm-with-using-filter-function-to-remove-missing-values-form-a-dataset
/questions/58763942/flutter-keep-data-in-textfield-after-setstate
/questions/58763941/are-receipts-generated-by-google-play-api-v2-and-the-latest-version-v3-compatibl
/questions/58763940/how-to-add-eventhandler-to-popupmenuitem-in-flutter
/questions/58763938/how-to-solve-electron-and-grpc-version-problem-in-angular-project
...
Is there any property I can include in nutch-site.xml so that https://stackoverflow.com is added before every URL in seed.txt? I don't want to change seed.txt because the file is large.
You can do this by adding the following rule to regex-normalize.xml:
<regex>
<pattern>^/</pattern>
<substitution>https://stackoverflow.com/</substitution>
</regex>
Please also make sure that the plugin urlnormalizer-regex is included in the property plugin.includes.
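For reference, a plugin.includes override in nutch-site.xml might look roughly like the sketch below. The plugin list shown is only an assumed example and is not taken from the question. Keep whatever list your installation already uses, and only check that urlnormalizer-regex is matched by it.

```xml
<!-- nutch-site.xml: a minimal sketch, assuming a Nutch 1.x style setup. -->
<!-- The plugin list below is illustrative; start from your own existing value. -->
<property>
  <name>plugin.includes</name>
  <!-- urlnormalizer-(pass|regex|basic) matches urlnormalizer-regex, -->
  <!-- which is the plugin that applies the rules in regex-normalize.xml. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With the rule above in place, a seed line such as /questions/58763942/flutter-keep-data-in-textfield-after-setstate should be rewritten to https://stackoverflow.com/questions/58763942/flutter-keep-data-in-textfield-after-setstate, since the pattern ^/ only matches the leading slash.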