Tuesday, May 24, 2011

Whitelisting, Not Blacklisting, to Stop Bots!

Really getting sick of repeating myself because people just don't seem to get it when it comes to blocking bots: blacklisting doesn't fucking work. Blacklisting means wasting time chasing bots through access logs and maintaining huge ass .htaccess files that slow Apache and hurt server performance, and it's easily bypassed by changing a single character in the user agent name.
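
For the record, here's the kind of thing a blacklist turns into; the patterns below are just a few common examples, and notice how little it takes to slip past them:

    # A typical blacklist fragment (a few well-known scraper names shown
    # for illustration). Every new scraper means another line, and Apache
    # has to test every request against every line.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} "libwww-perl" [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} "HTTrack" [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} "WebCopier" [NC]
    RewriteRule .* - [F]
    # A scraper calling itself "libwww+perl" or "HTTraq" walks straight
    # past all of it.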

Whitelisting, on the other hand, only tells the server what can pass; everything else bounces. Whitelists are usually short: Googlebot, Slurp, Bingbot, valid browsers, and nothing else. That's a fast list to process, and it doesn't slow Apache down whatsoever.
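
Here's roughly what that looks like in .htaccess, as a minimal sketch assuming Apache 2.2 style access control with mod_setenvif; the exact patterns are up to you:

    # Flag the user agents that are allowed in.
    SetEnvIfNoCase User-Agent "Googlebot" whitelisted
    SetEnvIfNoCase User-Agent "Slurp" whitelisted
    SetEnvIfNoCase User-Agent "Bingbot" whitelisted
    # Real browsers all send a Mozilla/ or Opera token; scrapers that
    # fake one still have to get past the spider traps described below.
    SetEnvIfNoCase User-Agent "^Mozilla/" whitelisted
    SetEnvIfNoCase User-Agent "^Opera/" whitelisted

    # Bounce everything that didn't get flagged.
    Order Deny,Allow
    Deny from all
    Allow from env=whitelisted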

Then install a script to monitor for things that access robots.txt, spider trap pages, and natural spider traps like your legal and privacy pages, plus speedy or greedy accesses, and you've pretty much solved your scraper problems.
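
The Apache side of that can be as simple as the sketch below, assuming a hypothetical /bot-trap/ URL that's disallowed in robots.txt and linked invisibly from your pages, so only bots that ignore the rules ever request it:

    RewriteEngine On

    # Hand anything that hits the trap an immediate 403.
    RewriteRule ^bot-trap/ - [F,L]

    # The monitoring script (not shown) watches the access log for hits
    # on robots.txt, the trap, and the legal/privacy pages, plus
    # abnormally fast request rates, and appends "Deny from <ip>" lines
    # to a file Included from httpd.conf so offenders stay blocked.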

But for fuck's sake, use your goddamn brain and WHITELIST or you're just wasting your fucking time and inviting scrapers, not blocking them.

2 comments:

Bruce (PMToolsThatWork.com) said...

I've been looking around for something that helps implement whitelisting.

The only thing that looks close is robotcop (robotcop.org), but it's also about 10 years old. I've not tried it out yet.

Any other references?

Anonymous said...

Problem with whitelisting is how would you implement it? Via your domain's .htaccess or server-wide via Apache?

Or could you use ModSecurity to 406 every non-Google, non-Bing, etc. request?

We have a ton of requests from MSIE 5.0 user agents that are really just scrapers.

How would you whitelist only recent browsers from Chrome, Safari, FF effectively?

I have not seen a realistic example of an effective whitelist; if you have one, could you point us to it?