Thursday, November 10, 2011

Bot Blacklisting vs. Whitelisting, are you a convert yet?

I'm still shocked that after all these years people are not only still practicing the ancient black art of blacklisting, but I'm even more shocked to see several so-called website content security products recently released that rely on blacklisting as their primary defense.

Are they fucking kidding?

Do people really pay good money to chase an endless supply of bots?

Let's explore the blacklisting dam vs. whitelisting dam metaphor to get a simple grasp on this issue. For those not familiar with the problem, blacklisting is like building a dam on a river with a big gaping hole in the middle of that damn dam. While it holds back some water, or bad bots in this instance, the damn blacklisting dam still lets most of it spill through, a total waste of time and money. Whitelisting, on the other hand, is like a real dam that holds everything back except for the controlled spill, aka the whitelisted items, which are the only things allowed to pass. Therefore, just like damming a river, common sense dictates you build a solid dam with whitelisting to control all those bots and do it right the first time.

Blacklisting is a pretty futile methodology, obviously the choice of masochistic webmasters. Look at the amount of time and resources wasted maintaining a blacklist: tons of bot entries, lots of log analysis, and plenty of processing power just to keep up with them. Heck, all the bad bots have to do to defeat your blacklist is change their user agent name every single time they access your site.

Simply combine any two random words from the dictionary and you've just got a new bot name that can bypass any blacklist. Hell, just pick almost any single word from the dictionary and you'll defeat the blacklist; two words is overkill, really. Some bots merely send a couple of random strings of gibberish as a user agent, which works perfectly to defeat silly tactics like blacklisting.
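To see just how trivial it is, here's a minimal sketch in Python of what a bad bot does. The blacklist entries, word list, and URL are made up for illustration, but the point stands: a fresh user agent on every request sails right past any name-based blacklist.

import random
import urllib.request

# Hypothetical blacklist a webmaster might maintain -- names only.
BLACKLIST = {"BadBot", "EvilCrawler", "ScraperPro"}

# A tiny "dictionary" is all the bot needs.
WORDS = ["cloud", "data", "web", "spider", "fetch", "index", "rank", "meta"]

def random_user_agent():
    # Two random words plus a version number: a brand-new bot name every time.
    return "{}{}/{}.0".format(random.choice(WORDS).title(),
                              random.choice(WORDS).title(),
                              random.randint(1, 9))

ua = random_user_agent()
print("Blocked by blacklist?", ua in BLACKLIST)  # almost always False

# The bot just sets that header and scrapes away.
req = urllib.request.Request("http://example.com/", headers={"User-Agent": ua})
# urllib.request.urlopen(req)  # left commented out; illustration only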

Now examine the simple implementation of a whitelist. There aren't that many beneficial things that crawl your site, and most sites can thrive with a whitelist of fewer than 20 entries, maybe 100 max, instead of the hundreds or thousands of items in a blacklist. Small lists, easy to maintain, and negligible processing required to validate the list in real time, so the impact on server load is low.
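For comparison, the whitelist check itself is nothing more than a lookup against a short list of known-good names. A minimal sketch follows; the tokens and the helper function are illustrative assumptions, not a definitive list, since every site's whitelist will differ.

# Illustrative whitelist -- substring tokens for the handful of crawlers
# and browsers a typical site actually wants to let in.
WHITELIST_TOKENS = (
    "Googlebot", "Bingbot", "Slurp",   # search engine crawlers
    "Mozilla",                         # ordinary browsers
)

def is_whitelisted(user_agent):
    # Default deny: anything not matching a whitelisted token gets kicked to the curb.
    return any(token in user_agent for token in WHITELIST_TOKENS)

print(is_whitelisted("Mozilla/5.0 (Windows NT 10.0) Gecko/20100101 Firefox/115.0"))  # True
print(is_whitelisted("CloudData/3.0"))                                               # False

A handful of substring tests per request is about as cheap as server-side logic gets, which is why the real-time validation barely registers on server load.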

Using any raw logfile analysis program, it's easy to identify what should be whitelisted in mere minutes. The best thing is that whitelisting means you can spend your spare time actually working on your site instead of chasing bad bots to blacklist, since everything not whitelisted is automatically kicked to the curb by default with no extra effort on the webmaster's part.
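As a rough illustration of how quickly the whitelist candidates surface, a few lines like the following will tally the user agents hitting your site so you can pick out the handful worth keeping. This assumes the common Apache/Nginx combined log format, where the user agent is the last quoted field, and the log path is just a placeholder.

import re
from collections import Counter

# Combined log format: the user agent is the last quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open("access.log") as log:   # path is a placeholder
    for line in log:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

# The top of this list is usually all you need to build the whitelist.
for user_agent, hits in counts.most_common(25):
    print(f"{hits:6d}  {user_agent}")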

Those that I've actually convinced to convert to whitelisting in the past have done nothing but sing its praises.

Compare that to those still blacklisting: they don't have any spare time to sing.

Tuesday, November 08, 2011

Tracking Domain Intel Site Bots

I've taken a recent interest in tracking down the shitload of crawlers from the domain intel sites out there that scrape your homepage, keywords, etc., and even display your AdSense and Google Analytics IDs.

Fucking asshats.

Posted a bunch of them on WebmasterWorld so drop in there to get details about domainsoutlook.com, statshow.com, urbandata.com, zitetrendz.com, hostnology.in, dawhois.com, clearwebstats.com, whoare.us, w3who.net, diigo.com, domainspyer.com, spyrush.com, aboutthedomain.com, seeallweb.org, webdetail.org and a minor update on domaintools.com.

Yes, I got busy :)

Wouldn't mind some feedback on my post about Spider Tracking Links - Examining 2 Methods - that would be nice!