Friday, May 12, 2006

My Family is CRAWLING?

Call me confused, but why is MyFamily.com crawling my site?

Yup, the genealogy people, that's them.

66.43.16.199 [nat.myfamilyinc.com] has requested 110 pages as Mozilla/4.0 (compatible; MyFamilyBot/1.0; +http://www.myfamilyinc.com)
Even claims to be them, looked at robots.txt like they were supposed to, yada yada.

Still went home with a pile of crap instead of my pages.

Put that on your family tree and smoke it.

John has an interesting theory

I'm not sure if any of you have ever seen John Gabriel's Greater Internet Fuckwad Theory but take a gander, quite amusing.

Just in case one of you thinks "Well, that explains Bill in a nutshell!" then think again as I've always been very vocal, blunt, and to the point.

Doesn't mean I'm not a fuckwad, but the internet had nothing to do with that!

Snared a Layered proxy web host

Yup, that's right, someone decided we need to hide on the net so much they actually created a Proxy Web Host and it appears scrapers are using them, what a shocker.

I found them as AdSense was trying to crawl thru the proxy:

BAD_AGENT: 72.232.31.226 [prx1.proxywebhost.com.] requested 3 pages as "Mediapartners-Google/2.1"
So who are these people using for their provider?
Layered Technologies
72.232.0.0 - 72.232.255.255
That's a HUGE block of IPs to just block out of hand, so how much abuse has been coming from this range? Let's search on "72.232." and see what pops up.

First, it appears I already banned a c-block over there running a multi-IP scraper trying random user agents:
BANNED=72.232.67.222 yrkqi3jrmnbrsk3mUpnrwung
BANNED=72.232.67.219 vhsflbuwvLsbwmyp8xse8hvpdpplLdxdx
BANNED=72.232.67.219 utgkm gylmugtdblyppqqu
BANNED=72.232.67.220 pt tkglswaqatq k rfxqolbtqbygxlhvS0qqv
BANNED=72.232.67.221 djpqaegrbxpfbqnkxvqeniqfogyb rnt
BANNED=72.232.67.221 wbdprvjiqbw jbsvqse7
BANNED=72.232.67.220 upehrsqqqevdljtwrgkkbthk e
BANNED=72.232.67.220 sjxgohdtum3yybmbfyembisbxibei
BANNED=72.232.67.219 7jrxquabdwlgn wyjnoxtyxdryvffjbVdjw
BANNED=72.232.67.219 umesjoxmwrwdvjeqmfsreYfenxqel6d
BANNED=72.232.67.219 kdxiqiyu3yicfupymhimbp nlb v oghtqre
BANNED=72.232.67.222 henlvvdiranneq0cddlfdiXeivbwylon bxic
BANNED=72.232.67.222 vdpPPvxlkwmwpPyy8gpshni8y dwe q8lewlhfl
BANNED=72.232.67.219 didII6ye6It wermhvcx 6jmwcblyxj
BANNED=72.232.67.219 jevltpcioxefrooirvcumd
BANNED=72.232.67.222 r8nawcyepuDfymmbdi8xdsfah8sfqkwhuy eu
Why were they banned?

On a different day they claimed to be this:
72.232.67.221 "FAST-WebCrawler/2.2.5 - Lycos/Alltheweb/Fast"
72.232.67.219 "FAST-WebCrawler/2.2.5 - Lycos/Alltheweb/Fast"
72.232.67.220 "FAST-WebCrawler/2.2.5 - Lycos/Alltheweb/Fast"
72.232.67.222 "FAST-WebCrawler/2.2.5 - Lycos/Alltheweb/Fast"
Reverse DNS claimed it was galaxy-webhosting.co.uk which is in Layered's IP block.

Now, for your amusement, here's the same IP within an hour trying more than one user agent:
00:14:06 72.232.67.222 "FAST-WebCrawler/2.2.5 - Lycos/Alltheweb/Fast"
00:52:07 72.232.67.222 "plblilwkchhs2qfkv rbXbgveu xsxwsxauspuX"
Hey, if one user agent doesn't work, spin the roulette wheel, right?

Sorry idiot, NO user agents work on my site, so let's move along.

OK, at least this one was creative, someone decided to explain it was a User-Agent:
72.232.77.34 "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0;"
Just a few garden variety scrapings:
72.232.52.58 "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)"
72.232.185.170 "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"
Then another random attempt on .58 from above:
72.232.52.58 "qyerqcuylypknmarpuoudyeawwft"
72.232.52.58 "blcrnhbulYypqfbtasciqc"
72.232.52.58 "pwifascemkr4abimihq4ybhbusv"
72.232.52.58 "hbcudylrrturxtxwtMhoqq9sMsr uw pfM"
72.232.52.58 "brn jxvcgitdurvqhivtrhthtknu"
72.232.52.58 "fxddbq qxduqghdpbdgnptqrCtioive"
72.232.52.58 "jni0 kjn0flJxuenr0oek0b0rpjx"
It's just so cute that they fucking don't get it, random user agents or valid user agents, you just keep knocking but you can't come in and play so piss off.

Another proxy event that I banned:
72.232.20.146 "Mediapartners-Google/2.1"
Legit spiders crawling outside their range just scream "BLOCK ME! PROXY!", gotta love it.
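
For anyone wanting to automate that tell, here's a minimal Python sketch, not the bot blocker's actual code. It compares the crawler name in the user agent against the reverse DNS name already shown in the log lines above, rather than against an IP range. The EXPECTED_HOSTS table and the is_impostor name are my own assumptions for illustration, and a real check would also confirm that the hostname resolves back to the same IP.

```python
# Assumed table: reverse DNS suffixes a crawler name is expected to come from.
# The suffixes listed here are an illustration, not an authoritative list.
EXPECTED_HOSTS = {
    "Mediapartners-Google": (".googlebot.com", ".google.com"),
}

def is_impostor(user_agent, reverse_dns):
    """True when a known crawler name shows up from an unexpected host."""
    host = reverse_dns.rstrip(".").lower()   # log lines carry a trailing dot
    agent = user_agent.lower()
    for name, suffixes in EXPECTED_HOSTS.items():
        if name.lower() in agent:
            # str.endswith accepts a tuple of candidate suffixes
            return not host.endswith(suffixes)
    return False  # not claiming to be a crawler we know about
```

Fed the proxy hit from above, is_impostor("Mediapartners-Google/2.1", "prx1.proxywebhost.com.") comes back True.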

Then this idiot thought no user agent would work, WRONG!
72.232.13.2 ""
... and a bunch more IPs doing stupid shit, but I'm too lazy to list 'em all here

Word to the wise, it looks like a scraper haven over there so consider blocking it.

According to their web site it looks like all server hosting so probably safe to block the whole range, but they have provided some amusement with their vaudeville scraper show thus far so maybe I'll just keep an eye on them for now and see if they come up with something new to toss at the bot blocker.
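
If you do decide to block or just watch the whole range, here's a rough Python sketch of turning the start/end addresses listed above into CIDR blocks and testing an IP against them. It uses only the standard ipaddress module. The function names and the LAYERED constant are mine, made up for illustration.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Collapse an inclusive first-last IP range into CIDR networks."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))

def in_range(ip, cidrs):
    """True when ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in cidrs)

# The Layered Technologies range quoted earlier in this post.
LAYERED = range_to_cidrs("72.232.0.0", "72.232.255.255")
```

That range collapses to a single block, 72.232.0.0/16, so it's one rule whether you ban it or only log it.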

Circle Jerk Scraping

Not sure what's going on with some of the scrapers exactly but it seems my "CHALLENGE" page, which has about 5 links on the page, is sending them in circles asking for the same pages over and over.

Lately I'm seeing a lot of lost scraper bots doing something like this:

GET /
GET /legal.html
GET /privacy.html
GET /about.html
GET /
GET /legal.html
GET /privacy.html
GET /about.html
GET /
GET /legal.html
GET /privacy.html
GET /about.html
GET /
GET /legal.html
... and so on and so forth...

Sometimes they'll ask for hundreds of pages like this before breaking out of that loop.

Quite amusing that I seem to have put a kink in their gears.
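
Spotting that kind of loop in a request log is easy enough to sketch. This is only an illustration in Python, not what the bot blocker does; the function name and the min_repeats default are assumptions.

```python
def repeating_cycle(paths, min_repeats=3):
    """Return the shortest cycle the request list ends with, repeated
    at least min_repeats times in a row, or None if there isn't one."""
    n = len(paths)
    for period in range(1, n // min_repeats + 1):
        tail = paths[n - period * min_repeats:]
        cycle = tail[:period]
        # every request in the tail must match its slot in the cycle
        if all(tail[i] == cycle[i % period] for i in range(len(tail))):
            return cycle
    return None
```

Because it only looks at the tail of the list, a bot that wandered around before getting stuck still gets caught, though the cycle may come back rotated depending on where the log was cut off.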

Thursday, May 11, 2006

Scraping is fucking COMCASTIC!

That's right, most of the assholes abusing my website are using COMCAST!

Let's explore what happened today:

  • CHALLENGE: 68.44.91.83 [c-68-44-91-83.hsd1.nj.comcast.net.] requested 229 pages as "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"
  • SPIDER: 70.88.200.205 [70-88-200-205-measured-progress-ne-ma.hfc.comcastbusiness.net.] requested 75 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
  • CHALLENGE: 68.83.187.18 [c-68-83-187-18.hsd1.nj.comcast.net.] requested 59 pages as "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"
  • CHALLENGE: 24.63.11.72 [c-24-63-11-72.hsd1.ma.comcast.net.] requested 25 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
  • CHALLENGE: 24.0.49.82 [c-24-0-49-82.hsd1.tx.comcast.net.] requested 44 pages as "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+98;+Win+9x+4.90)"
  • SPEED: 68.53.108.148 [c-68-53-108-148.hsd1.tn.comcast.net.] requested 477 pages as "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
Let's explain why each one was blocked, which is the FIRST WORD shown before their information.

CHALLENGE means that after 20 pages downloaded there was a challenge put out that a human could answer but the computer didn't, and it just kept on crawling getting challenges page after page.

SPIDER means they stepped in a spider trap or read robots.txt, something humans don't do.

SPEED means just that, the pages were downloaded faster than Superman could click on them, even with a Kryptonite hangover.
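
Those three reasons boil down to something like the following Python sketch. It is a simplification under my own assumptions, not the real bot blocker: the 20 page challenge threshold comes from the description above, while the Visitor fields, the one second per page speed limit, and the order of the checks are made up for illustration.

```python
from dataclasses import dataclass

CHALLENGE_AFTER_PAGES = 20   # from the description above
MIN_SECONDS_PER_PAGE = 1.0   # assumption: anything faster isn't a human

@dataclass
class Visitor:
    pages: int                       # pages requested so far
    seconds: float                   # time spent requesting them
    answered_challenge: bool = False
    hit_spider_trap: bool = False
    read_robots_txt: bool = False

def block_reason(v):
    """Return SPIDER, SPEED, CHALLENGE, or None for a visitor."""
    if v.hit_spider_trap or v.read_robots_txt:
        return "SPIDER"
    if v.pages > 1 and v.seconds / v.pages < MIN_SECONDS_PER_PAGE:
        return "SPEED"
    if v.pages > CHALLENGE_AFTER_PAGES and not v.answered_challenge:
        return "CHALLENGE"
    return None
```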

Add it all up and that's 909 pages of COMCASTIC scraping!

That's 1/3 of the abuse to my website coming from a single internet service provider.

Fucking lovely, is it Scraping OnDemand too or is it High Definition Scraping?

Just take your fiber optic cables and give yourself a colonoscopy.

Omni-Crawler now USEFUL?

I was just looking around to see if there was any change in the status of a few old crawlers that never seemed to do anything useful, and suddenly Omni-Crawler's web site claims they actually have a customer!

Holy crap, someone uses it?

They say that their little crawler fuels Vast which looks like a pale Oodle wannabe.

Well guess what?

I don't have any classifieds on my site so leave the shields up and set Omni-Crawler on condition red.

Would hosts care?

What if every time you had an attempted scraping your bot blocker sent an email detailing the specifics to the abuse dept. of the company hosting the scraper?

Would they even care if it weren't for the sudden flood of automated messages?

I'm thinking it might be amusing to give it a shot and see if they block any of these idiots.

Wednesday, May 10, 2006

Scraping My Big Pipe

And I'm not talking about shaving my cock either, I'm talking about some scrapers running via bigpipeinc.com. From their web site it all looks like pro pipelines but I'd hate to whack an entire network just because of one business connection behaving badly.

Anyone else know anything about these guys?

Can't disable it even to upgrade, this is CRAZY!

To easily perform some upgrades I took my bot blocker offline for a couple of hours. This way I could update the tracking database and software without worrying about accidentally killing the site. Then I completed installing the latest changes and turned the bot blocker back on and the site was running slower than shit.

My first reaction was panic that I'd busted something big time but my local version was faster than hell, so I looked into the current server accesses .... holy crap!

Guess what, while the bot blocker was offline for less than 2 hours, some asshole came and overloaded the fucking server when I wasn't watching. The running task list was enormous, the number of concurrent page requests was HUGE!

I punted the asshole's current tasks off the server and the bot blocker took over from there and restored calm to the website.

Seriously though, it's pretty scary when I can't even disable the bot blocker for a brief time without all hell breaking loose. Kind of reminds me of someone I knew that got hacked in less than 5 minutes while trying to replace a firewall with the machine still stupidly connected to the net.

Guess next time I'll either have to work faster or come up with an update plan that has no down time.

More work, less blogging, less forums

Yup, it had to happen, and people obviously noticed as I'm getting PM's here and there "nice to see you posting again" or "where have you been hiding?". Not like I ever went away, just doing work on trying to make my stupid bot blocker actually see the light of day and doing some R&D with a couple of new concepts that have been very promising.

So if you think I'm slowing down, it's just the opposite, as things are seriously HEATING UP!

If you catch me posting too much, smack me and tell me to get back to work ;)

Tuesday, May 09, 2006

WARNING - User Agent Script Vulnerability

Don't know if anyone has ever seen this script vulnerability before but it just happened to me and I almost shit myself.

Who would ever consider that pulling up your log analyzer summary page could fuck you over?

Luckily for me, it was just a simple redirect:

203.214.25.21 "<SCRIPT>window.location='http://www.syncrisis.com'</script> (compatible; MSIE 6.0; Windows NT 5.1; SV1; iP-CMPQ-2003; .NET CLR 1.1.4322)"
These fuckers better watch out cause this is fucking BULLSHIT and you're on my shitlist now you cocksuckers.

Something new to add to the bot blocker and just bounce assholes that do this shit.
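
The underlying fix on the log analyzer side is to escape anything a visitor sent before writing it into an HTML page. Here's a minimal Python sketch using the standard html module. The function names and the table row layout are hypothetical, since I'm not showing the real analyzer here.

```python
import html

def render_agent_row(ip, user_agent):
    """Build one HTML table row with the visitor-supplied text escaped."""
    return "<tr><td>{}</td><td>{}</td></tr>".format(
        html.escape(ip), html.escape(user_agent))

def looks_like_markup(user_agent):
    """Crude tell for the blocker: normal user agents carry no angle brackets."""
    return "<" in user_agent or ">" in user_agent
```

With the escaping in place the script shows up as harmless text in the summary page instead of running, and the second check gives the bot blocker a reason to bounce the request outright.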

Monday, May 08, 2006

First sighting of predicted next evolution in bots.

Remember a while back when I predicted that bots would evolve to use random IPs and random user agent strings? Well, it's already happening. Good thing I had already planned for this contingency and was ahead of the curve waiting for them when they tried this trick.

Not 100% positive, but it looks like it all kind of started with this very prolific bot called T8Abot/v0.0.7-beta (3724461@gmail.com) which seems to get around.

The most I can find about it without wasting too much time is this reference:

unknown bot, hosted by FDC Servers, fdcservers.net, US. Massive operation using many IP addresses (66.90.110.199 ... 66.90.110.254)
That bot seems to have gone away, at least it doesn't visit my site anymore, maybe they noticed it was being blocked by user agent string.

However, it didn't go away and they have changed their tactics as I'm seeing the following [small sample] from the same IPs:
06:44:13 66.90.95.225 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
06:44:32 66.90.95.249 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; InfoPath.1)"
06:45:44 66.90.95.243 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
06:45:49 66.90.95.249 "Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
06:45:53 66.90.95.249 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
06:45:54 66.90.103.69 "Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1"
06:45:59 66.90.95.225 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Opera 7.54 [en]"
06:46:07 66.90.95.245 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; InfoPath.1)"
06:46:10 66.90.95.245 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; InfoPath.1)"
06:46:19 66.90.103.69 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Media Center PC 3.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"
06:46:27 66.90.95.245 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.50"
06:46:28 66.90.95.243 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
06:46:37 66.90.95.245 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; InfoPath.1)"
06:46:39 66.90.103.69 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
06:46:40 66.90.95.216 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
06:46:41 66.90.95.211 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 7.60"
06:46:43 66.90.95.211 "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Opera 7.54 [en]"

Lots of requests per minute coming from a dedicated hosting operation, doesn't appear to be dial-ups or broadband, and the same IPs used to claim to be T8Abot, so I think they unleashed this random IP/random user agent trick that they thought would fly under my radar.

Too fucking bad, I built a better mousetrap, you STILL got caught, try again.

I would block everything in the fdcservers.net range: 66.90.64.0 - 66.90.127.255

The important thing with this particular exercise is that my long term strategy of extended profiling of bad neighborhoods is paying off. Once IPs start to set off a few traps it becomes obvious where the problem children are as they unwittingly expose themselves.
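
One tell in the sample above is a single IP showing up with several different user agents inside a couple of minutes. A rough Python sketch of counting that, purely as an illustration: the function names and the max_agents cutoff are assumptions, and this is not the profiling the bot blocker actually does.

```python
from collections import defaultdict

def agents_per_ip(hits):
    """Map each IP to the set of distinct user agents it presented."""
    seen = defaultdict(set)
    for ip, agent in hits:
        seen[ip].add(agent)
    return seen

def rotating_ips(hits, max_agents=2):
    """IPs that presented more than max_agents different user agents."""
    return sorted(ip for ip, agents in agents_per_ip(hits).items()
                  if len(agents) > max_agents)
```

On its own this would also trip on shared proxies and NAT gateways, so it works better as one signal among several than as a ban trigger.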

Scrape-a-thon endurance test

Just for giggles I tweaked the parameters of the bot blocker a few weeks ago to see if I could snare any super slow crawlers trying to fly under the radar.

So far, the grand prize winner has been scraping a few pages daily for 13 days now for a grand total of 1567 pages.

What a loser, sheesh.

Unfortunately, it looks like this latest change has snared about 20-30 innocents so we'll be retooling it today, but I learned a lot about slow scrapers during this process and it was quite an eye opener.
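
The basic idea behind catching a slow crawler is to keep a long window total per IP on top of the usual daily limit. A minimal Python sketch follows. Both limits and the function name are invented for illustration and are not the parameters I tweaked, and as noted above, limits like these can snare innocents if set too tight.

```python
from collections import defaultdict

def slow_crawlers(daily_counts, daily_limit=200, window_limit=1000):
    """daily_counts holds (ip, pages_that_day) pairs for the window.
    Flags IPs that stay under the daily limit every single day
    yet still pile up more than window_limit pages overall."""
    totals = defaultdict(int)
    peaks = defaultdict(int)
    for ip, pages in daily_counts:
        totals[ip] += pages
        peaks[ip] = max(peaks[ip], pages)
    return sorted(ip for ip in totals
                  if peaks[ip] <= daily_limit and totals[ip] > window_limit)
```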

Sunday, May 07, 2006

DDoS against rogue bots and scrapers!

Don't know if anyone has been following the trials and tribulations of Blue Security and their anti-spam tool the Blue Frog with over 400K members, but their technique could be useful against scrapers. Basically what Blue Frog does is automatically send opt-out requests to spammers from every member in their network which overloads the spammers' servers and sends them crashing or maybe even blocks their ability to spam while the pipeline is choked with inbound requests.

So this concept stewed in my brain for a couple of days and suddenly it occurs to me that bots could be stopped the same way with a large enough network of anti-bot members. The concept is simple in that any active crawlers caught in the act could simply be pinged to death for a brief period to stop them with brute force. That's right, we could aim thousands of servers at unwanted active crawlers and literally DDoS them into leaving us alone.

Worst case, they aren't crawling when their pipeline is choked.

Best case, they take a hint and stop hitting member sites.

That method in itself COULD make an entire community of networked webmasters deal death blows to scrapers, blog spammers, and all sorts of nasty vermin.

Hell, why not fight fire with fire and use the tools of the underground against them?

If you can locate a scraper's site we could also just deploy spiders to simply crawl them offline. That's right, do the same thing they do and crawl them so hard and fast the damn server can't even respond to requests. Hit them with so many different servers at the same time they can't even identify who's doing it in time to stop it.

Remember, if visitors can't get to your web site then the purpose of scraping to build the website becomes meaningless.

Here's where I think this strategy might pay off as scrapers on shared servers just might get the boot when the host figures out that they're the reason for the attacks. Likewise, colo facilities might even boot dedicated servers when a network of unhappy webmasters retaliates and chokes a major host's data pipe repeatedly.

Does it sound like vigilantism?

HELL YES IT DOES!

Why shouldn't we do it?

Nobody else is helping us, as neither the scrapers' web hosts, the copyright laws, Google AdSense, nor Yahoo Publisher Network seem to be interested in helping us stop this plague, so maybe it's time to deploy full blown internet warfare tactics like Blue Security in order to stop the madness!

In the immortal words of Mel the cook on Alice:
"The best DEFENSE is a good OFFENSE!"