如何使用PHP检测爬虫/蜘蛛?

问题描述 投票:7回答:2

如何使用PHP检测爬虫/蜘蛛?

我目前正在从事一个项目,需要跟踪每个爬虫的访问。我知道您应该使用HTTP_USER_AGENT,但是我不太确定如何为此设置格式,而且我知道USER AGENT可以很容易地更改,因此i还想知道是否可以添加一些更多参数以避免欺骗?

我正在尝试执行的示例代码。

<?php
$user_agent = $_SERVER['HTTP_USER_AGENT'];
if (strpos( $user_agent, 'Google') !== false)
{
echo "Googlebot is here";
}
?>

谢谢

php user-agent
2个回答
12
投票

根据Verifying Googlebot

您可以通过使用反向DNS查找,验证名称是否在googlebot.com域中,然后使用该域名进行正向DNS查找,来验证访问您服务器的漫游器确实是Googlebot(或其他Google用户代理) googlebot名称。如果您担心垃圾邮件发送者或其他麻烦制造者在声称自己是Googlebot时正在访问您的网站,则此功能非常有用。

例如:

host 66.249.66.11.66.249.66.in-addr.arpa domain name pointercrawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.comcrawl-66-249-66-1.googlebot.com has address 66.249.66.1Google不会发布公共IP地址列表,以使网站管理员可以将其列入白名单。这是因为这些IP地址范围可能会发生变化,从而给所有对其进行硬编码的网站站长都带来麻烦。识别Googlebot访问的最佳方法是使用用户代理(Googlebot)。

您可以进行反向DNS查找:

function validateGoogleBotIP($ip) {
    $hostname = gethostbyaddr($ip); //"crawl-66-249-66-1.googlebot.com"
    return preg_match('/\.googlebot\.com$/i', $hostname);
}

if (strpos($_SERVER['HTTP_USER_AGENT'], 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}

0
投票

[100%在我的网站上工作]来检测机器人,爬行器,蜘蛛和复印机。

function isBotDetected() {

    if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
    ) {
        return true; // 'Above given bots detected'
    }

    return false;

} // End :: isBotDetected()
© www.soinside.com 2019 - 2024. All rights reserved.