Apache Nutch URLs in the regex-urlfilter.txt file

Question

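I am new to web crawlers, and to Apache Nutch in particular. Configuring Apache Nutch is quite complicated. I have done a lot of research on it and found the regex-urlfilter.txt file, which is where you specify the pages to crawl and restrict the crawl. I could not find a good, simple tutorial on this, which is why I am asking here. The problem is explained below.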

Explanation

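Suppose I have a website at https://www.example.com. To crawl only this site, I know I have to edit my regex-urlfilter.txt file with a rule like +^https://www.example.com/. But what if I want to restrict the crawl further? For example, I only want to crawl certain pages within that site: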

https://www.example.com/something/details/1
https://www.example.com/something/details/2
https://www.example.com/something/details/3
https://www.example.com/something/details/4
https://www.example.com/something/details/5
.
.
.
https://www.example.com/something/details/10

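P.S.: As a new member, I may have made a lot of mistakes in writing this question. Please help me improve it rather than downvoting. I would be very grateful.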

Answer 1

If you only want to crawl https://www.example.com/something/details/ and everything below it, replace the last lines of regex-urlfilter.txt, changing them from:

# accept anything else
+.

to:

+https://www.example.com/something/details/
-.

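This accepts only URLs that contain https://www.example.com/something/details/ and ignores all other URLs. Rules are checked from top to bottom and the first matching pattern decides, so the final -. rejects everything that the + rule did not match.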
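To check which URLs a set of rules lets through before starting a crawl, the first-match-wins behaviour can be approximated in a few lines of Python. This is only a rough sketch, not Nutch code: the helper names parse_rules and accepts are made up for illustration, the test URLs other than the details page are hypothetical, and it assumes an unanchored "contains" match. Python's re module is also not identical to the Java regex engine Nutch uses, although simple patterns like these behave the same way.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter-style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for accept, pattern in rules:
        if pattern.search(url):  # unanchored search, i.e. "URL contains pattern"
            return accept
    return False  # assumption: a URL that matches no rule is dropped

# The two rules from the answer above.
RULES = parse_rules("""
+https://www.example.com/something/details/
-.
""")

if __name__ == "__main__":
    for url in [
        "https://www.example.com/something/details/1",
        "https://www.example.com/something/else",   # hypothetical URL
        "https://www.example.com/",
    ]:
        print(url, "->", "accept" if accepts(RULES, url) else "reject")
```

Note that under these rules https://www.example.com/ itself is rejected, so it is worth running the seed URLs through the same check.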
