Monday, May 12, 2008

Impact On Your Bandwidth Will Be Minimal My Ass

How often do we see that happy line of horse shit spread by every new startup that crawls the web about how minimal its impact will be?

Every fucking one of them claims it, but when you add them all together the bot traffic is quickly exceeding the human traffic.

Who the fuck am I kidding? On most sites the bots clearly outnumber the humans in pages read on a daily basis.

First we put the big search engines on top of the heap with Google, Yahoo and MSN crawling the crap out of your servers daily. Just the three of these guys can easily read as many pages as 10K visitors a day. Then throw in the wannabe search engines like Ask, Gigablast, Snap, Fast, etc. ad nauseam and it's over the top.

Now expand that list to include the international search engines like Baidu, Sogou, Orange's VoilaBot, Majestic12, Yodao, and on and on, tons of 'em.

Then we have all the spybots that feel entitled to crawl your site like Picscout, Cyveillance, Monitor110, Picmole, RTGI, and on and on.

Next add up all the specialty niche bots like Become, Pronto, OptionCarriere, ShopWiki, and all sorts of shit too numerous to mention.

Pile on top of this all the free fucking tools that every little shithead and make believe company uses to scrounge the 'net for god knows what, and god's not telling, like Nutch and Heritrix, plus the web downloaders, offline readers, and more.

Don't forget, many of these so-called search engines and shit now want screen shots as well so after they crawl your page they send a copy of Firefox or something to your site to download every page again plus every fucking image, never cached, over and over and over.

Did I forget to mention directories?

They'll want to link check you and get screen shots as well, don't leave them out or they'll feel fucking neglected.

Wait, there's more: those social sites like Eurekster, Jeteye, etc. that let people link to your shit and then come back banging on your site all the time to make sure that shit's still valid.

Then add up all the RSS feed readers and aggregators that pull down your RSS feeds that nobody ever fucking reads. Not to mention the RSS feed finders like IEAutodiscovery that run amok on your site just looking for RSS feeds ... FUCK!

If you run affiliate programs you have the CJ quality bot or some shit hitting your site, and if you run ads then the Google quality bot. It's always something.

Don't forget the assholes running the dark underbelly of the web with all the scrapers, spam harvesters, forum, blog and wiki spammers, botnets and other malicious shit pounding on our sites daily.

Add on top of all this shit Firefox, Google Web Accelerator and now AVG's toolbar, all pre-fetching pages that will most likely never be read, and holy shit, we're being swamped!
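
At least some of that pre-fetch crap announces itself: Firefox's link prefetching sends an extra "X-moz: prefetch" request header (and as I recall Google Web Accelerator does something similar), so if you're tired of paying to serve pages nobody will ever read you can simply refuse those hits. A minimal sketch, assuming a Python WSGI stack and a pre-fetcher polite enough to send that header, which the shadier ones won't be:

# Minimal WSGI middleware sketch: refuse requests that announce themselves as
# pre-fetches via the "X-moz: prefetch" header (sent by Firefox link prefetching;
# other pre-fetchers may not be this honest).

def block_prefetch(app):
    def middleware(environ, start_response):
        # WSGI exposes the X-moz header as HTTP_X_MOZ.
        if environ.get("HTTP_X_MOZ", "").lower() == "prefetch":
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Prefetch requests are not served here.\n"]
        return app(environ, start_response)
    return middleware

if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    def site(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello, presumably human visitor.\n"]

    make_server("localhost", 8080, block_prefetch(site)).serve_forever()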

OK, now that we've identified all this bot traffic, where's all the fucking people?

Of course you think all those hits from MSIE and Firefox are people, right?

Hell no!

Are you out of your fucking mind?

Those hits are the scrapers, screen shot makers and companies like Cyveillance and Picscout that don't want you to stop them from crawling your site, so they just pretend to be humans to get past the bot blockers.

Well guess what?

There are no fucking people on your site. The internet is now run for and used exclusively by bots.

Apparently you missed the memo.
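
Don't take my word for it, go dig through your own logs. You can't trust the user agent string, but you can look at behavior: a "browser" that reads hundreds of pages and never once requests an image or a stylesheet probably isn't a person. Here's a rough log-grubbing sketch of that idea; the combined log format, the thresholds and the assumption that real browsers also fetch page assets are exactly that, assumptions, not a bot-proof test:

# Rough heuristic sketch: flag "browser" user agents that behave like bots.
# Assumes an Apache/Nginx combined log format and that real browsers also
# request images/CSS/JS. Neither is guaranteed; treat hits as leads, not proof.
import re
from collections import defaultdict

LOG_RE = re.compile(r'(\S+) \S+ \S+ \[.*?\] "(?:GET|POST|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')
ASSETS = (".gif", ".jpg", ".png", ".css", ".js", ".ico")

def suspicious_ips(log_lines, min_pages=50):
    pages = defaultdict(int)    # page requests per IP
    assets = defaultdict(int)   # image/CSS/JS requests per IP
    agents = {}
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, path, ua = m.groups()
        agents[ip] = ua
        if path.lower().split("?")[0].endswith(ASSETS):
            assets[ip] += 1
        else:
            pages[ip] += 1
    for ip, count in pages.items():
        ua = agents.get(ip, "")
        claims_to_be_a_browser = "MSIE" in ua or "Firefox" in ua
        # Lots of pages, zero assets, browser user agent: probably not human.
        if claims_to_be_a_browser and count >= min_pages and assets[ip] == 0:
            yield ip, count, ua

if __name__ == "__main__":
    with open("access.log") as fh:
        for ip, count, ua in suspicious_ips(fh):
            print(f"{ip}\t{count} pages, no assets\t{ua}")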

Comparing Effectiveness of Anti-Virus Web Protection Methods

There are three basic methods being used at the moment to protect web surfers from potential dangers: static (stale), active and passive.

Static Web Protection

Various companies use the static method, which relies on crawling the web in advance to find vulnerabilities and then attempting to warn visitors about those problems as they're about to visit a web site. McAfee's SiteAdvisor and Google both take this approach, and it's obviously only as good as your last scan, since malware can easily be cloaked and hidden from these somewhat obvious crawlers. Besides being easily fooled with cloaking, the data is always stale, meaning sites that were clean even 10 minutes ago could now be infested with malware and sites previously infested could have been cleaned.

This method isn't optimal for anyone, and for websites tagged as bad it can be a nightmare to get off the warning list, assuming they ever find out they're on it in the first place while their business goes down in flames from traffic going elsewhere.
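
Just so we're clear on why cloaking makes mincemeat of the advance-scan approach: the bad guys simply look at who's asking before deciding what to serve. Here's a toy illustration of the server side of that game, with a harmless placeholder where the payload would go; the scanner substrings are made up for illustration, and real cloakers key off known scanner IP ranges as well:

# Toy illustration of user-agent cloaking: serve a squeaky-clean page to anything
# that looks like a security crawler and the "real" page (a harmless placeholder
# here) to everyone else. The scanner hints below are illustrative only.
from wsgiref.simple_server import make_server

SCANNER_HINTS = ("siteadvisor", "googlebot", "scanner")   # made-up examples

def cloaked_site(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "").lower()
    start_response("200 OK", [("Content-Type", "text/html")])
    if any(hint in ua for hint in SCANNER_HINTS):
        # What the advance scan sees: a perfectly innocent page.
        return [b"<html><body>Nothing to see here, move along.</body></html>"]
    # What the visitor sees ten minutes later: whatever the bad guys want.
    return [b"<html><body>[exploit or spam payload would go here]</body></html>"]

if __name__ == "__main__":
    make_server("localhost", 8080, cloaked_site).serve_forever()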

Active Web Protection

The latest AVG 8 includes a Link Scanner and AVG Search-Shield, which aggressively check pages in Google search results that you're about to visit, in real time, to help protect the surfer. Unfortunately, AVG made several mistakes, some that could be deemed fatal flaws, which allow this technology to be easily identified so that malware and phishing sites can easily cloak to avoid AVG's detection. Even worse for webmasters, AVG pre-fetches pages in search results, and as adoption of this latest AVG toolbar increases, it is quickly turning into a potential DoS attack on popular sites that show up at the top of Google's most popular searches.

While I think AVG's intentions were good, their current implementation easily identifies every customer using their product and causes webmasters needless bandwidth problems that AVG could easily resolve on their end with a cache server.
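
If you're the webmaster footing the bill in the meantime, it's worth at least measuring what those scout requests are costing you before deciding whether to block them. A quick-and-dirty tally script; the user agent fragments below are guesses based on what's been reported about the LinkScanner scout, so verify them against your own logs before acting on anything:

# Quick-and-dirty tally of how much traffic the suspected AVG scout requests are
# costing you. The user agent fragments are assumptions, not gospel -- confirm
# them against your own access logs first.
import re
import sys
from collections import Counter

SUSPECT_FRAGMENTS = (";1813)", "AVG")   # illustrative guesses
LOG_RE = re.compile(r'"[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"\s*$')

hits, bytes_out = Counter(), Counter()
for line in sys.stdin:
    m = LOG_RE.search(line)
    if not m:
        continue
    _status, size, ua = m.groups()
    for frag in SUSPECT_FRAGMENTS:
        if frag in ua:
            hits[frag] += 1
            bytes_out[frag] += 0 if size == "-" else int(size)

for frag in SUSPECT_FRAGMENTS:
    print(f"{frag!r}: {hits[frag]} hits, {bytes_out[frag] / 1048576:.1f} MB")

Feed it the access log on stdin (python avg_tally.py < access.log) and it prints hits and megabytes per suspected fragment.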

Passive Web Protection

The method used by Avast's anti-virus is a transparent HTTP proxy, meaning all of your HTTP requests pass through an invisible intermediate proxy service that scans the data stream for potential problems in real time. The data is always fresh, checked in real time, the user agent doesn't change and there are no pre-fetches or needless redundant hits on websites.

The only downside is that you don't know a site is bad in advance, but that can just as easily be the case with static protection, due to stale data and/or cloaking, and with active protection, due to cloaking.
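
To make the idea concrete, here's a stripped-down sketch of a scanning proxy in Python. It's an explicit proxy you'd point the browser at rather than a truly transparent one like Avast's, it handles plain HTTP only, and the "signatures" are a couple of made-up strings standing in for a real anti-virus engine:

# Minimal sketch of the passive approach: a forward HTTP proxy that fetches each
# page, scans the body for known-bad signatures and blocks or passes it through.
# Plain HTTP only; the signature list is a toy stand-in for a real AV engine.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

BAD_SIGNATURES = [b"eval(unescape(", b"document.write(unescape("]   # illustrative

class ScanningProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # A forward proxy receives the absolute URL in the request line.
        req = Request(self.path, headers={"User-Agent": self.headers.get("User-Agent", "")})
        try:
            with urlopen(req, timeout=10) as upstream:
                body = upstream.read()
                ctype = upstream.headers.get("Content-Type", "application/octet-stream")
        except Exception as exc:
            self.send_error(502, f"Upstream fetch failed: {exc}")
            return
        if any(sig in body for sig in BAD_SIGNATURES):
            body = b"<html><body>Blocked: page matched a malware signature.</body></html>"
            ctype = "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Point the browser's HTTP proxy setting at localhost:8118 to try it out.
    ThreadingHTTPServer(("localhost", 8118), ScanningProxy).serve_forever()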


The Best of All

While all three approaches have their potential problems, it appears a combination of all three is probably the way to go.

Bad Site Database:
The SiteAdvisor/Google type database approach is good for logging all known bad sites so they don't get a second chance to fool the other methods with cloaking once they're caught. This cuts down on redundantly checking known bad sites until the webmaster cleans up and requests a review to clear the site's bad name.

Perhaps the Bad Site Database concept needs to become a non-profit dot org so that all of the anti-virus companies can freely feed and use this database for the greater good, without all the corporate walls built up around ownership of the data, something like a SpamHaus type of thing, or perhaps merged into SpamHaus.
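
Mechanically, the lookup could work just like the SpamHaus queries mail servers already fire off all day long: a DNS query against a shared zone where getting any answer back means "listed." A sketch of that idea; the zone name is completely made up and no such public service is implied to exist:

# Sketch of querying a shared Bad Site Database DNSBL-style, the way mail
# servers already query blocklists. The zone name is hypothetical.
import socket

BADSITE_ZONE = "badsites.example.org"   # hypothetical shared blocklist zone

def is_listed(domain):
    """Return True if the domain appears in the (hypothetical) bad-site zone."""
    try:
        socket.gethostbyname(f"{domain}.{BADSITE_ZONE}")   # any A record = listed
        return True
    except socket.gaierror:
        return False   # NXDOMAIN: not listed (or the zone doesn't exist)

if __name__ == "__main__":
    for site in ("example.com", "known-bad.example.net"):
        print(site, "LISTED" if is_listed(site) else "not listed")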

Optimized Pre-Screening:
The AVG approach of pre-screening a site could be optimized by fixing the toolbar's user agent so it's not detectable and by using a shared cache server so it doesn't behave like a DoS attack on popular websites. The beauty is that the collective mind of all these toolbars, with an undetectable user agent, avoids the cloaking used to thwart detection of known crawlers. If the toolbars fed the bad sites they find into the Bad Site Database, there's a win-win for everyone.
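
The toolbar's side of that could be as simple as asking the shared cache for a verdict before ever touching the site itself, and only fetching and reporting back on a cache miss. A sketch under those assumptions; the cache endpoint and its response format are entirely hypothetical:

# Sketch of the "shared cache server" idea for a pre-screening toolbar: ask a
# shared verdict cache first so a million toolbars don't all hammer the same
# popular search result. The endpoint and JSON shape are hypothetical.
import hashlib
import json
from urllib.error import URLError
from urllib.request import urlopen

CACHE_ENDPOINT = "http://cache.example.org/verdict/"   # hypothetical service

def screen_url(url):
    """Return a verdict string: 'clean', 'bad' or 'unknown'."""
    key = hashlib.sha1(url.encode()).hexdigest()
    try:
        with urlopen(CACHE_ENDPOINT + key, timeout=3) as resp:
            return json.load(resp).get("verdict", "unknown")
    except (URLError, ValueError):
        # Cache miss or cache down: the toolbar would fall back to fetching and
        # scanning the page itself (with an ordinary browser user agent), then
        # report its verdict to the cache and the Bad Site Database.
        return "unknown"

if __name__ == "__main__":
    print(screen_url("http://example.com/some-popular-result"))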

Transparent Screening:
The final approach, used by Avast, should still be performed: the HTTP proxy screening, so that any site that manages to stay out of the Bad Site Database and still eludes the active pre-screening of pages would hopefully get snared as the page loads onto the machine.

Summary

When you pile up all of this security the chances of failure still exist, but the end user is protected from, and informed about, the threats present as much as humanly possible.

It would certainly be nice to see some of the anti-virus providers combine their efforts as outlined above to make the internet a safer place to visit.