Friday, December 02, 2005

Stopping Site Scrapers with a Honeypot

Running with some ideas I got from Brett on WebmasterWorld, my rogue bot trap is almost complete. His suggestion of completely disabling robots.txt actually gave me a great idea: since rogue bots don't bother following the robots.txt rules in the first place, I've been working on ways to automatically identify and block them by setting multiple types of honeypot traps in my web pages, traps that well-behaved bots will ignore.
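To illustrate the general idea (this is a common pattern, not necessarily the exact trap used here): disallow a decoy path in robots.txt and link to it invisibly from your pages. Polite crawlers honor the Disallow and never touch it; anything that requests the decoy has outed itself. The path and markup below are made-up examples.

```
# robots.txt -- well-behaved bots will never fetch anything under the decoy path
User-agent: *
Disallow: /trap/

<!-- hidden link embedded in a normal page; only a crawler blindly
     following links (or a nosy human reading the source) will find it -->
<a href="/trap/decoy.html" style="display:none">&nbsp;</a>
```

Any IP that then requests /trap/decoy.html can be flagged or banned on the spot.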

The basic concepts I'm using to catch rogue bots and scrapers are simple:

  • Bogus pages with fake links for bots to follow, which well-behaved bots will ignore
  • Tracking the frequency of page accesses to detect rapid downloads
  • Tracking total pages accessed in a 24-hour period to detect high-volume downloads
  • Some secret herbs and spices I won't divulge
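The first three concepts above can be sketched in a few lines of code. This is a minimal illustration, not the actual implementation; the thresholds (30 requests per minute, 500 pages per day) are assumptions I've picked for the example, and the class and method names are mine.

```python
import time
from collections import defaultdict, deque

class BotTrap:
    """Sketch of a scraper trap: ban IPs that hit a honeypot URL,
    request pages too quickly, or fetch too many pages in 24 hours.
    Thresholds here are illustrative, not the author's real values."""

    def __init__(self, max_per_minute=30, max_per_day=500):
        self.max_per_minute = max_per_minute
        self.max_per_day = max_per_day
        self.requests = defaultdict(deque)  # ip -> timestamps of recent hits
        self.banned = set()

    def hit_honeypot(self, ip):
        # Any client that follows the hidden, robots.txt-disallowed
        # link is assumed to be a rogue bot and banned outright.
        self.banned.add(ip)

    def record_request(self, ip, now=None):
        """Log one page view; return True if the request should be blocked."""
        if ip in self.banned:
            return True
        now = time.time() if now is None else now
        hits = self.requests[ip]
        hits.append(now)
        # Keep only the last 24 hours of timestamps for this IP.
        while hits and now - hits[0] > 86400:
            hits.popleft()
        # Frequency check: hits in the last 60 seconds.
        recent = sum(1 for t in hits if now - t <= 60)
        if recent > self.max_per_minute or len(hits) > self.max_per_day:
            self.banned.add(ip)  # need not be permanent; a timed ban works too
            return True
        return False
```

In a real deployment the ban set would feed a firewall rule or an .htaccess deny list rather than living in memory, and bans could expire automatically since repeat offenders will just fall into the trap again.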
Sadly, nosy people poking through my page source and accessing the honeypot pages may automatically ban themselves from my server, but that's just the price curiosity may cost.

The neat thing is I don't have to put a permanent ban on the rogue bots that get trapped as they'll just fall into another honeypot when they return or switch to a different proxy server.

This should be a very interesting experiment. I'll keep you all posted after testing it over a weekend to see how effective it is - maybe I'll set up a honeypot service if this works!
