Detecting Bots and Web Spiders Among Our Site's Visitors

Updated Saturday, May 5, 2018

Nearly every website keeps a log of its visits in a database. After a while the data piles up, and it becomes clear that part of it is junk generated by the spiders and robots that crawl the site. These robots send distinctive user-agent strings (HTTP_USER_AGENT), which makes them easy to identify.

With this simple function, we can keep those requests from being counted as visits in our log.

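function is_bot(){

    // Substrings that identify known bots in the User-Agent header.
    $bots = array(
        'Googlebot', 'Baiduspider', 'ia_archiver',
        'R6_FeedFetcher', 'NetcraftSurveyAgent', 'Sogou web spider',
        'bingbot', 'Yahoo! Slurp', 'facebookexternalhit', 'PrintfulBot',
        'msnbot', 'Twitterbot', 'UnwindFetchor',
        'urlresolver', 'Butterfly', 'TweetmemeBot' );

    // Some clients send no User-Agent header at all; treat that as an empty string.
    $ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';

    foreach($bots as $b){

        // stripos() is case-insensitive and returns false when there is no match.
        if( stripos( $ua, $b ) !== false ) return true;

    }

    return false;

}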
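
To show how the check might be wired into a visit counter, here is a minimal sketch. The names is_bot_ua() and record_visit() are hypothetical and not part of the original function, and a plain array stands in for the database table. The variant takes the user-agent string as an argument so it can be tried without a real HTTP request. It also assumes that a request with no user agent is not treated as a bot.

```php
<?php
// Hypothetical variant of is_bot() that receives the user-agent string as an
// argument instead of reading $_SERVER directly.
function is_bot_ua(?string $userAgent): bool
{
    // Same keyword list as is_bot().
    $bots = array(
        'Googlebot', 'Baiduspider', 'ia_archiver',
        'R6_FeedFetcher', 'NetcraftSurveyAgent', 'Sogou web spider',
        'bingbot', 'Yahoo! Slurp', 'facebookexternalhit', 'PrintfulBot',
        'msnbot', 'Twitterbot', 'UnwindFetchor',
        'urlresolver', 'Butterfly', 'TweetmemeBot');

    // Assumption: a missing or empty user agent is not counted as a bot.
    if ($userAgent === null || $userAgent === '') {
        return false;
    }

    foreach ($bots as $b) {
        if (stripos($userAgent, $b) !== false) {
            return true;
        }
    }
    return false;
}

// Hypothetical visit counter: $visits maps a page path to its visit count.
// Returns true when the visit was counted, false when it was skipped as a bot.
function record_visit(array &$visits, string $page, ?string $userAgent): bool
{
    if (is_bot_ua($userAgent)) {
        return false;
    }
    $visits[$page] = ($visits[$page] ?? 0) + 1;
    return true;
}

$visits = array();
record_visit($visits, '/index.php', 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
record_visit($visits, '/index.php', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0');
echo $visits['/index.php'], "\n"; // prints 1: only the Firefox request was counted
```

In a real page the third argument would come from $_SERVER['HTTP_USER_AGENT'] (or null when the header is absent), and the array update would be replaced by whatever INSERT or UPDATE the site's visit log already uses.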

User-agent strings of the best-known search engines, spiders, and crawlers:

  • Baiduspider+(+http://www.baidu.com/search/spider.htm)
  • Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
  • Moreoverbot/5.1 (+http://w.moreover.com; webmaster@moreover.com) Mozilla/5.0
  • UnwindFetchor/1.0 (+http://www.gnip.com/)
  • Voyager/1.0
  • PostRank/2.0 (postrank.com)
  • R6_FeedFetcher(www.radian6.com/crawler)
  • R6_CommentReader(www.radian6.com/crawler)
  • radian6_default_(www.radian6.com/crawler)
  • Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)
  • ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
  • Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Mozilla/5.0 (en-us) AppleWebKit/525.13 (KHTML, like Gecko; Google Web Preview) Version/3.1 Safari/525.13
  • Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  • Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  • Twitterbot/0.1
  • LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com)
  • bitlybot
  • MetaURI API/2.0 +metauri.com
  • Mozilla/5.0 (compatible; Birubot/1.0) Gecko/2009032608 Firefox/3.0.8
  • Mozilla/5.0 (compatible; PrintfulBot/1.0; +http://printful.com/bot.html)
  • Mozilla/5.0 (compatible; PaperLiBot/2.1)
  • Summify (Summify/1.0.1; +http://summify.com)
  • Mozilla/5.0 (compatible; TweetedTimes Bot/1.0; +http://tweetedtimes.com)
  • PycURL/7.18.2
  • facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
  • Python-urllib/2.6
  • Python-httplib2/$Rev$
  • AppEngine-Google; (+http://code.google.com/appengine; appid: lookingglass-server)
  • Wget/1.9+cvs-stable (Red Hat modified)
  • Mozilla/5.0 (compatible; redditbot/1.0; +http://www.reddit.com/feedback)
  • Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot
  • Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)
  • Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
  • Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.sitebot.org/robot/)
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1 + FairShare-http://fairshare.cc)
  • HTTP_Request2/2.0.0beta3 (http://pear.php.net/package/http_request2) PHP/5.3.2
  • Mozilla/5.0 (compatible; Embedly/0.2; +http://support.embed.ly/)
  • magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)
  • (TalkTalk Virus Alerts Scanning Engine)
  • Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
  • Googlebot/2.1 )
  • msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
  • msnbot/2.0b (+http://search.msn.com/msnbot.htm)
  • msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
  • Mozilla/5.0 (compatible; oBot/2.3.1; +http://www-935.ibm.com/services/us/index.wss/detail/iss/a1029077?cntxt=a1027244)
  • Sosospider+(+http://help.soso.com/webspider.htm)
  • COMODOspider/Nutch-1.0
  • trunk.ly spider contact@trunk.ly
  • Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)
  • Mozilla/5.0 (compatible; MJ12bot/v1.4.0; http://www.majestic12.co.uk/bot.php?+)
  • knowaboutBot 0.01
  • Showyoubot )
  • Flamingo_SearchEngine (+http://www.flamingosearch.com/bot)
  • MLBot (www.metadatalabs.com/mlbot)
  • my-robot/0.1
  • Mozilla/5.0 (compatible; woriobot support at worio dot com +http://worio.com)
  • Mozilla/5.0 (compatible; YoudaoBot/1.0; ; )
  • chilitweets.com
  • Mozilla/5.0 (TweetBeagle;
  • OctoBot/2.1 (OctoBot/2.1.0; +http://www.octofinder.com/octobot.html?2.1)
  • Mozilla/5.0 (compatible; FriendFeedBot/0.1; +Http://friendfeed.com/about/bot)
  • Mozilla/5.0 (compatible; WASALive Bot ; https://udger.com/resources/ua-list/bot-detail?bot=WASALive-Bot
  • Mozilla/5.0 (compatible; Apercite; +http://www.apercite.fr/robot/index.html)
  • urlfan-bot/1.0; +http://www.urlfan.com/site/bot/350.html
  • SeznamBot/3.0 (+http://fulltext.sblog.cz/)
  • Yeti/1.0 (NHN Corp.;
  • Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.0; trendictionbot0.4.2; trendiction media ssppiiddeerr; http://www.trendiction.com/bot/; please let us know of any problems; ssppiiddeerr at trendiction.com) Gecko/20071127 Firefox/2.0.0.11
  • yacybot (freeworld/global; amd64 Linux 2.6.35-24-generic; java 1.6.0_20; Asia/en) http://yacy.net/bot.html
  • Mozilla/5.0 (compatible; suggybot v0.01a,
  • ssearch_bot (sSearch Crawler; http://www.semantissimo.de)
  • Mozilla/5.0 (compatible; Linux; Socialradarbot/2.0; en-US; crawler@infegy.com)
  • wikiwix-bot-3.0
  • Mozilla/5.0 (compatible; AhrefsBot/1.0; +http://ahrefs.com/robot/)
  • Mozilla/5.0 (compatible; DotBot/1.1; , crawler@dotnetdotcom.org)
  • GarlikCrawler/1.1 (http://garlik.com/, crawler@garik.com)
  • Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/)
  • Mozilla/5.0 (compatible; 008/0.83; Gecko/2008032620
  • PostPost/1.0 (+http://postpo.st/crawlers)
  • Aghaven/Nutch-1.2 (www.aghaven.com)
  • SBIder/Nutch-1.0-dev (http://www.sitesell.com/sbider.html)
  • Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)
  • Soup/2011-05-11Z11-51-38–soup–production-2-g251c1f9d/251c1f9d6cdff8491e0b49f4ba3288ec7f3de903 (http://soup.io/)
  • Trapit/1.1
  • Jakarta Commons-HttpClient/3.1
  • Readability/0.1
  • kame-rt (support@backtype.com)
  • Mozilla/5.0 (compatible; Topix.net;
  • Megite2.0 https://techcrunch.com/tag/megite/)
  • SkyGrid/1.0 (+http://skygrid.com/partners)
  • Netvibes (http://www.netvibes.com)
  • Zemanta Aggregator/0.7 +http://www.zemanta.com
  • Owlin.com/1.3 (http://owlin.com/)
  • Mozilla/5.0 (compatible; Twitturls; +http://twitturls.com)
  • Tumblr/1.0 RSS syndication (+http://www.tumblr.com/) (support@tumblr.com)
  • Mozilla/4.0 (compatible; www.euro-directory.com; urlchecker1.0)
  • Covario-IDS/1.0 (Covario; ; support at covario dot com)
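
The list above is much longer than the keyword array used by is_bot(), so several of these agents would still be counted as visits. The sketch below shows one way to check that and to extend the keywords. The helper ua_matches_any() and the choice of extra keywords are illustrative assumptions, picked from entries in the list rather than taken from the original function.

```php
<?php
// Hypothetical helper: true when any keyword occurs in the user agent
// (case-insensitive substring match, the same test is_bot() performs).
function ua_matches_any(string $userAgent, array $keywords): bool
{
    foreach ($keywords as $keyword) {
        if (stripos($userAgent, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

// Keyword list used by is_bot().
$baseKeywords = array(
    'Googlebot', 'Baiduspider', 'ia_archiver',
    'R6_FeedFetcher', 'NetcraftSurveyAgent', 'Sogou web spider',
    'bingbot', 'Yahoo! Slurp', 'facebookexternalhit', 'PrintfulBot',
    'msnbot', 'Twitterbot', 'UnwindFetchor',
    'urlresolver', 'Butterfly', 'TweetmemeBot');

// Illustrative additions taken from user agents in the list above.
$extraKeywords = array('YandexBot', 'AhrefsBot', 'MJ12bot', 'Exabot', 'Wget');
$allKeywords = array_merge($baseKeywords, $extraKeywords);

// A few sample strings copied from the list.
$samples = array(
    'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)',
    'Mozilla/5.0 (compatible; AhrefsBot/1.0; +http://ahrefs.com/robot/)',
    'msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)',
);

foreach ($samples as $ua) {
    $base = ua_matches_any($ua, $baseKeywords) ? 'yes' : 'no';
    $extended = ua_matches_any($ua, $allKeywords) ? 'yes' : 'no';
    echo "base=$base extended=$extended  $ua\n";
}
```

Of the three samples, only the msnbot-media string is caught by the base keywords, while the extended array catches all three. Any keyword list of this kind needs occasional updating as new crawlers appear.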

Source: http://www.phacks.net