Friday, April 21, 2006

Where are all the kick ass PHP programmers?

Now that I'm ready to actually build my bot blocker for commercial purposes, it seems next to impossible to find any quality PHP programmers to help build the damn thing, at least none that aren't already booked thru 2010.

For those thinking it's going to be some simple piddly-assed fire-and-forget script, think again, as we're talking about a product of much larger scope, including a centralized server component. The scope is more than just the "bot blocking" that I rant about here, as it will be a complete crawler management and protection system. It will allow webmasters to manage access to their sites without having to deal with the technical aspects or keep up with the latest crawlers. Additionally, there is a security component that will harden web sites against a variety of exploits.

I was prepared to just pay someone to convert it to PHP, or even take them on as an equity partner if they wanted some of the back end action, but it's starting to look like I may just have to roll up my sleeves and do it all myself. That's going to throw a nasty monkey wrench into the time frame, as there is still R&D pending.

So much to be done, so little time, and a planned summer BETA release not looking so good at the moment.

Monday, April 17, 2006

Blocked Spiders DO NOT Go Away

There are a few bold, albeit naive, claims by other so-called "bot blockers" that scrapers just go away after you deny them a few pages, which is complete and utter BULLSHIT!

Some of the scrapers being blocked on my server have been set to BANNED for months now and haven't gotten a single page of value, yet they just keep coming back over and over, attempting to get pages they remember regardless of the outcome.

Most bot blockers I've reviewed just set speed traps or page limits and then throw a captcha in the scraper's face to make it go away for a brief period of time, maybe a few hours, maybe a day or two. But many of them will come back over and over and grab another chunk of pages each time they return. The stakes are high and the scrapers want your content badly, so putting silly little bandages on your website as a short term solution does not cure the long term problem.
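
To show why that's a bandage, here's roughly what those speed traps boil down to, as a made-up PHP sketch (the names and numbers are mine, not anyone's actual product code):

    <?php
    // The typical naive speed trap: count pages per IP in a time window,
    // throw a captcha when the limit trips, then forget everything.
    define('WINDOW_SECONDS', 60);   // length of the counting window
    define('PAGE_LIMIT', 30);       // pages allowed per window

    function naive_speed_trap(&$hits, $ip, $now)
    {
        // Start a fresh window if this IP is new or its window expired --
        // note the slate gets wiped clean, which is the whole problem.
        if (!isset($hits[$ip]) || $now - $hits[$ip]['start'] > WINDOW_SECONDS) {
            $hits[$ip] = array('start' => $now, 'count' => 0);
        }
        $hits[$ip]['count']++;
        return ($hits[$ip]['count'] > PAGE_LIMIT) ? 'captcha' : 'allow';
    }
    ?>

Notice that once the window rolls over the slate is wiped clean, which is exactly why the scraper just waits it out and comes back for another chunk.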

The only way to truly stop them is to profile their behavior over time, which is why my bot blocker throws all first-time suspected bot IPs into QUARANTINE. Once an IP lands in quarantine it is suspended for 24 hours, then challenged the moment it returns to the web site after that. This stops repeat offenders from getting any pages whatsoever when they return, and it also protects against accidentally permanently blocking a DHCP address that was only used to scrape once. After a couple of repeated scrape attempts without breaking thru the challenge, which a human can easily do, the IP is escalated from quarantine to BANNED, which no longer presents challenges and just serves error messages on repeat visits.
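
For the technically inclined, here's a rough PHP sketch of that escalation. The names, structure, and storage are invented for illustration, since the real product code obviously isn't public:

    <?php
    // Rough sketch of the quarantine/ban escalation -- illustrative only.
    define('STATE_QUARANTINE', 1);         // suspected bot, suspended 24 hours
    define('STATE_BANNED',     2);         // repeat offender, error pages only

    define('SUSPEND_SECONDS', 24 * 3600);  // length of the initial suspension
    define('MAX_FAILED_CHALLENGES', 2);    // strikes before escalating to BANNED

    // Called when behavior looks bot-like; puts a first-time IP in quarantine.
    function flag_suspect(&$record, $now)
    {
        if (!isset($record['state'])) {
            $record['state']             = STATE_QUARANTINE;
            $record['flagged_at']        = $now;
            $record['failed_challenges'] = 0;
        }
    }

    // Decide what to do with a visitor based on its stored record.
    function handle_visitor(&$record, $now)
    {
        if (!isset($record['state'])) {
            return 'allow';                // never flagged, business as usual
        }
        if ($record['state'] == STATE_BANNED) {
            return 'error';                // no challenge, just an error page
        }
        if ($now - $record['flagged_at'] < SUSPEND_SECONDS) {
            return 'suspend';              // still inside the 24 hour timeout
        }
        if ($record['failed_challenges'] >= MAX_FAILED_CHALLENGES) {
            $record['state'] = STATE_BANNED;  // kept scraping, kept failing
            return 'error';
        }
        return 'challenge';                // a human passes; a scraper strikes out
    }

    // Called when a challenge is failed or ignored.
    function record_failed_challenge(&$record)
    {
        $record['failed_challenges']++;
    }
    ?>

The record would be persisted per IP in a database or similar, but the point is the state only ever moves one way: quarantine, then banned.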

Not rocket science, but it has a lot more finesse than the simplistic methods others employ, and it better hardens the site against repeated scraping attempts.

Sunday, April 16, 2006

Gosh Goes Wild

I blocked one IP address and they came back with a new IP address, 72.51.37.210, doing their sneaky masked crawl bullshit.

Go away little crawler, stay away little crawler, or this WILL get ugly when I start cloaking 'YO MAMMA' snaps into your fucking search engine for major keywords like: "Pool Cleaning Services - Yo Mamma so stupid she saw shit in the pool and thought the anal porn queen was trying to teach her baby to swim!"

BTW, fix your dumb fucking crawler so it can properly parse a goddamn web page, as hitting my server for "GET / <font" is about as bright as asking for 30 pages in a minute while claiming to be "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)".
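
Both of those tells are trivial to check for, by the way. A couple of throwaway PHP checks along these lines (patterns and thresholds invented for illustration) would catch this clown:

    <?php
    // Two dead giveaways from this crawler, as throwaway checks.

    // A real browser never requests a URL with raw markup in it,
    // so "GET / <font" screams broken HTML parser.
    function bogus_request_line($uri)
    {
        return (bool) preg_match('/[<>"]/', $uri);
    }

    // And nobody hand-driving MSIE pulls 30 pages in a single minute.
    function impossible_browser_speed($pages_this_minute, $user_agent)
    {
        return $pages_this_minute >= 30
            && strpos($user_agent, 'MSIE') !== false;
    }
    ?>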

Stupid fucking shit really gets on my nerves.