Blocking search spiders from Rails 3 nested resources using robots.txt

Question

I'm trying to keep Google, Yahoo, and other crawlers from hitting my /products/ID/purchase page, and I'm not sure how to do that.

I currently block them from hitting the sign-in page with the following:

User-agent: *
Disallow: /sign_in

Can I do something like the following?

User-agent: *
Disallow: /products/*/purchase

Or should it be:

User-agent: *
Disallow: /purchase

Answer 1

I assume you want to block /products/ID/purchase but allow /products/ID.

Your last suggestion would only block URLs whose path starts with "/purchase":

User-agent: *
Disallow: /purchase

So this is not what you want.

You need your second suggestion:

User-agent: *
Disallow: /products/*/purchase

This blocks every URL whose path starts with /products/, followed by any characters, followed by /purchase.

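Note: this rule uses the wildcard *. In the original robots.txt "specification", this character has no special meaning. However, some search engines extended the specification and treat it as a kind of wildcard. So it should work for Google and probably some other search engines, but you can't count on it working for every other crawler or bot.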
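To make the difference concrete, here is a minimal sketch of the two interpretations described above. The function names prefix_match and wildcard_match are made up for this illustration, and the wildcard logic is only an approximation of what engines that support * do, not any crawler's actual implementation.

```python
import re

def prefix_match(rule: str, path: str) -> bool:
    """Original robots.txt behaviour: the rule is a literal path prefix."""
    return path.startswith(rule)

def wildcard_match(rule: str, path: str) -> bool:
    """Extended behaviour (assumed): '*' matches any run of characters."""
    # Escape each literal piece, then join the pieces with '.*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    # re.match anchors at the start only, so the rule still acts as a prefix.
    return re.match(pattern, path) is not None

rule = "/products/*/purchase"
for path in ("/products/42/purchase", "/products/42", "/purchase"):
    print(path, prefix_match(rule, path), wildcard_match(rule, path))
```

Under the literal reading the rule matches none of these paths, because no real URL contains the character * at that position. Under the wildcard reading only /products/42/purchase is matched.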

So your robots.txt would look like:

User-agent: *
Disallow: /sign_in
Disallow: /products/*/purchase

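Also note that some search engines (including Google) may still list a URL in their search results (without a title or snippet) even though it is blocked in robots.txt. This can happen when they find a link to the blocked page on a page they are allowed to crawl. To prevent this, you would have to noindex the page.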
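One common way to noindex a page is to send an X-Robots-Tag: noindex response header. The sketch below only shows the decision logic. The helper name robots_headers and the path pattern are assumptions made for this example, and in a Rails app the same header would be set on the response in the controller instead. Keep in mind that a crawler can only see this header on pages it is allowed to fetch.

```python
import re

# Hypothetical pattern for the purchase pages discussed in this thread.
PURCHASE_PATH = re.compile(r"^/products/[^/]+/purchase/?$")

def robots_headers(path: str) -> dict:
    """Return extra response headers for the given request path."""
    if PURCHASE_PATH.match(path):
        # Ask crawlers that honour this header not to index the page.
        return {"X-Robots-Tag": "noindex"}
    return {}

print(robots_headers("/products/42/purchase"))
print(robots_headers("/products/42"))
```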

Answer 2

According to Google, Disallow: /products/*/purchase should work. But according to robotstxt.org, it does not work.