Ever since we started running websites, crawlers that bulk-scrape our content have been a problem, and defending against scraping is a long-term task. Five years ago I wrote the post "Blocking IP Addresses and URLs in Apache to Stop Scraping". Beyond that, you can also identify and block some scrapers by their User-Agent. An example of the Apache configuration:

# Return 403 Forbidden when the User-Agent contains any of these scraper signatures
RewriteCond %{HTTP_USER_AGENT} (DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot)
RewriteRule .* - [F,L]

  Code to block requests with an empty User-Agent:

RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]

  Code to block requests where both the Referer and the User-Agent are empty:

RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]

  Below, for reference, are some User-Agent keywords characteristic of common scraping tools and bots that can be blocked:

  • User-Agent
  • DTS Agent
  • HttpClient
  • Owlin
  • Kazehakase
  • Creative AutoUpdate
  • HTTrack
  • YisouSpider
  • baiduboxapp
  • Python-urllib
  • python-requests
  • SemrushBot
  • SearchmetricsBot
  • MegaIndex
  • Scrapy
  • EMail Exractor
  • 007ac9
  • ltx71
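
  As a sketch, the keywords above could be combined into a single rule (spaces in a pattern must be escaped, and the alternation should be trimmed to whatever actually shows up in your logs):

```apache
# Sketch: block any User-Agent containing one of the listed keywords.
# [NC] makes the match case-insensitive; adjust the list to your needs.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (DTS\ Agent|HttpClient|Owlin|Kazehakase|Creative\ AutoUpdate|HTTrack|YisouSpider|Python-urllib|python-requests|SemrushBot|SearchmetricsBot|MegaIndex|Scrapy|007ac9|ltx71) [NC]
RewriteRule .* - [F,L]
```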

  Others that may also be worth blocking:

  • Mail.RU_Bot: http://go.mail.ru/help/robots
  • Feedly
  • ZumBot
  • Pcore-HTTP
  • Daum
  • your-server
  • Mobile/12A4345d
  • PhantomJS/2.1.1
  • archive.org_bot
  • AcooBrowser
  • Go-http-client
  • Jakarta Commons-HttpClient
  • Apache-HttpClient
  • BDCbot
  • ECCP
  • Nutch
  • cr4nk
  • MJ12bot
  • MOT-MPx220
  • Y!OASIS/TEST
  • libwww-perl

  Signatures of mainstream search engines that generally should not be blocked:

  • Google
  • Baidu
  • Yahoo
  • Slurp
  • yandex
  • YandexBot
  • MSN
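
  When broad patterns such as "bot" or "spider" are used, the engines above can be exempted with a negative condition placed first. A minimal sketch (the tokens here are the commonly seen ones; genuine engine traffic is better verified by reverse DNS than by User-Agent alone):

```apache
# Skip blocking for major search engines, then block generic crawler tokens.
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Baiduspider|Slurp|YandexBot|bingbot) [NC]
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
RewriteRule .* - [F,L]
```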

  Some common browser or generic tokens that likewise should not be blocked lightly:

  • FireFox
  • Apple
  • PC
  • Chrome
  • Microsoft
  • Android
  • Mail
  • Windows
  • Mozilla
  • Safari
  • Macintosh

  Sometimes a scraper sets its own custom User-Agent; after analyzing the logs, those can be blocked too, for example:

# Block ad-hoc User-Agents seen in logs, e.g. values wrapped in stray
# single quotes or odd fixed browser-version strings
RewriteCond %{HTTP_USER_AGENT} ^(.*)(\'Mozilla\/5\.0|\'Mozilla\'|\'Moz\'|\'Mozil\'|\'(.+)\'|Mobile\/13G34|Chrome\/53\.0\.2785\.143)(.*)$
RewriteRule .* - [F,L]

  Or combine HTTP_USER_AGENT with other factors for a joint check and block, for example:

RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{HTTP_USER_AGENT} ^(.*)(Firefox\/44\.0|Safari\/537\.36)(.*)$
RewriteCond %{REQUEST_URI} ^(.*)\/comment\/reply\/(.*)$
RewriteRule .* - [F,L]

  The rules above were written for a case of repeated POST comment submissions; the block is based on the combined signature.

  Here is some other code found online, listed for reference:

RewriteCond %{HTTP_USER_AGENT} (^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms) [NC]
RewriteRule ^(.*)$ - [F]

  Besides editing the .htaccess file, this can also be done in the httpd.conf configuration file:

DocumentRoot /home/wwwroot/xxx
<Directory "/home/wwwroot/xxx">
    SetEnvIfNoCase User-Agent ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
    Order allow,deny
    Allow from all
    Deny from env=BADBOT
</Directory>
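
  The Order/Allow/Deny directives above are the Apache 2.2 style and are deprecated on Apache 2.4, where the same block can be written with mod_authz_core's Require syntax, roughly like this (keyword list shortened here for readability):

```apache
<Directory "/home/wwwroot/xxx">
    SetEnvIfNoCase User-Agent "(FeedDemon|AhrefsBot|MJ12bot|Python-urllib|HttpClient)" BADBOT
    <RequireAll>
        # Allow everyone except requests tagged BADBOT above
        Require all granted
        Require not env BADBOT
    </RequireAll>
</Directory>
```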

  After this change, Apache needs to be restarted. Signatures that others have listed for blocking:

  • FeedDemon             content scraping
  • BOT/0.1 (BOT for JCE) SQL injection
  • CrawlDaddy            SQL injection
  • Java                  content scraping
  • Jullo                 content scraping
  • Feedly                content scraping
  • UniversalFeedParser   content scraping
  • ApacheBench           CC (HTTP flood) attack tool
  • Swiftbot              useless crawler
  • YandexBot             useless crawler
  • AhrefsBot             useless crawler
  • YisouSpider           useless crawler (since acquired by UC's Shenma Search; this spider can be unblocked!)
  • MJ12bot               useless crawler
  • ZmEu phpmyadmin       vulnerability scanning
  • WinHttp               scraping / CC attacks
  • EasouSpider           useless crawler
  • HttpClient            TCP attacks
  • Microsoft URL Control scanning
  • YYSpider              useless crawler
  • jaunty                WordPress brute-force scanner
  • oBot                  useless crawler
  • Python-urllib         content scraping
  • Indy Library          scanning
  • FlightDeckReports Bot useless crawler
  • Linguee Bot           useless crawler

  A further addition:

WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot
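
  That alternation string appears intended for direct use inside a single condition; as a sketch, it could be plugged in like this (shortened here for readability):

```apache
# Insert the full alternation from the list above in place of this excerpt
RewriteCond %{HTTP_USER_AGENT} (WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider) [NC]
RewriteRule .* - [F,L]
```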

  And also:

Aboundex
80legs
^Java
^Cogentbot
^Alexibot
^asterias
^attach
^BackDoorBot
^BackWeb
Bandit
^BatchFTP
^Bigfoot
^Black.Hole
^BlackWidow
^BlowFish
^BotALot
Buddy
^BuiltBotTough
^Bullseye
^BunnySlippers
^Cegbfeieh
^CheeseBot
^CherryPicker
^ChinaClaw
Collector
Copier
^CopyRightCheck
^cosmos
^Crescent
^Custo
^AIBOT
^DISCo
^DIIbot
^DittoSpyder
^Download\ Demon
^Download\ Devil
^Download\ Wonder
^dragonfly
^Drip
^eCatch
^EasyDL
^ebingbong
^EirGrabber
^EmailCollector
^EmailSiphon
^EmailWolf
^EroCrawler
^Exabot
^Express\ WebPictures
Extractor
^EyeNetIE
^Foobot
^flunky
^FrontPage
^Go-Ahead-Got-It
^gotit
^GrabNet
^Grafula
^Harvest
^hloader
^HMView
^HTTrack
^humanlinks
^IlseBot
^Image\ Stripper
^Image\ Sucker
Indy\ Library
^InfoNaviRobot
^InfoTekies
^Intelliseek
^InterGET
^Internet\ Ninja
^Iria
^Jakarta
^JennyBot
^JetCar
^JOC
^JustView
^Jyxobot
^Kenjin.Spider
^Keyword.Density
^larbin
^LexiBot
^lftp
^libWeb/clsHTTP
^likse
^LinkextractorPro
^LinkScan/8.1a.Unix
^LNSpiderguy
^LinkWalker
^lwp-trivial
^LWP::Simple
^Magnet
^Mag-Net
^MarkWatch
^Mass\ Downloader
^Mata.Hari
^Memo
^Microsoft.URL
^Microsoft\ URL\ Control
^MIDown\ tool
^MIIxpc
^Mirror
^Missigua\ Locator
^Mister\ PiX
^moget
^Mozilla/3.Mozilla/2.01
^Mozilla.*NEWT
^NAMEPROTECT
^Navroad
^NearSite
^NetAnts
^Netcraft
^NetMechanic
^NetSpider
^Net\ Vampire
^NetZIP
^NextGenSearchBot
^NG
^NICErsPRO
^niki-bot
^NimbleCrawler
^Ninja
^NPbot
^Octopus
^Offline\ Explorer
^Offline\ Navigator
^Openfind
^OutfoxBot
^PageGrabber
^Papa\ Foto
^pavuk
^pcBrowser
^PHP\ version\ tracker
^Pockey
^ProPowerBot/2.14
^ProWebWalker
^psbot
^Pump
^QueryN.Metasearch
^RealDownload
Reaper
Recorder
^ReGet
^RepoMonkey
^RMA
Siphon
^SiteSnagger
^SlySearch
^SmartDownload
^Snake
^Snapbot
^Snoopy
^sogou
^SpaceBison
^SpankBot
^spanner
^Sqworm
Stripper
Sucker
^SuperBot
^SuperHTTP
^Surfbot
^suzuran
^Szukacz/1.4
^tAkeOut
^Teleport
^Telesoft
^TurnitinBot/1.5
^The.Intraformant
^TheNomad
^TightTwatBot
^Titan
^True_Robot
^turingos
^TurnitinBot
^URLy.Warning
^Vacuum
^VCI
^VoidEYE
^Web\ Image\ Collector
^Web\ Sucker
^WebAuto
^WebBandit
^Webclipping.com
^WebCopier
^WebEMailExtrac.*
^WebEnhancer
^WebFetch
^WebGo\ IS
^Web.Image.Collector
^WebLeacher
^WebmasterWorldForumBot
^WebReaper
^WebSauger
^Website\ eXtractor
^Website\ Quester
^Webster
^WebStripper
^WebWhacker
^WebZIP
Whacker
^Widow
^WISENutbot
^WWWOFFLE
^WWW-Collector-E
^Xaldon
^Xenu
^Zeus
ZmEu
^Zyborg
Acunetix
FHscan
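
  The entries above are individual regular expressions (a leading ^ anchors the match to the start of the User-Agent). To use them with mod_rewrite, each becomes its own condition chained with [OR]; a short sketch using a few of the entries:

```apache
# Chain per-pattern conditions with [OR]; the last one has no [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ZmEu [NC]
RewriteRule .* - [F,L]
```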

  Code for a temporary block (returning a 503 error) rather than a permanent one:

RewriteCond %{HTTP_USER_AGENT} ^.*(bot|crawl|spider).*$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [R=503,L]

Reposted from https://jamesqi.com/%E5%8D%9A%E5%AE%A2/%E8%AF%86%E5%88%ABUser_Agent%E5%B1%8F%E8%94%BD%E4%B8%80%E4%BA%9BWeb%E7%88%AC%E8%99%AB%E9%98%B2%E9%87%87%E9%9B%86