Saturday, July 22, 2006

Scrapers Switching Bot Software

Here's another example of a scraper that switched software. This time the scraper went from something using random user agents to some other bot that's easily blockable thanks to the "compatible ;" flaw in its user agent (note the space before the semicolon).

07/13/2006 12.30.91.69 "4jkjhm4lgkualkihgfm nSriexn"
07/13/2006 12.30.91.69 "gu EeoiEmwbrqufoktaitgirh3sso3Ex"
07/13/2006 12.30.91.69 "yjiu k4smuqlvj4qreip l em4vjngywrv"
07/13/2006 12.30.91.69 "tfdgevSbyefsSoevhrrr"
07/14/2006 12.30.91.69 "c xqxrwgt kfrod oUwxmqxbooewtUrxgUplr"
07/14/2006 12.30.91.69 "agnesmynnlihdiunsutxxn5skoY5jsgmx"
07/14/2006 12.30.91.69 "aqtcc2dfklcrdymQQlhqclpcx2km2"
07/14/2006 12.30.91.69 "nBnwsmracuwd7ovdnmgnora"
07/14/2006 12.30.91.69 "vuqWfxtekvxi8relwfx8ejrto"
07/14/2006 12.30.91.69 "9fxebywuwjbpdbfnesfvpqygondkiqtrfdkaskj"
07/14/2006 12.30.91.69 "mst xfwkpktrkfymy2owm wu2"
07/14/2006 12.30.91.69 "xdgdmncxnhjqrvudftxnyrqwqyfiecdclqpmg"
07/14/2006 12.30.91.69 "5sieornr5ksjimfykxyoimyyfuedthnyuuijeb"
07/14/2006 12.30.91.69 "vhryqjhtkmpysfwhmrfcfotgkkkvQdjvtdgyr h"
07/17/2006 12.30.91.69 "ahBuu0xghlmhxaketqo0jjuyxxqxugilvtciso"
07/17/2006 12.30.91.69 "modJjnbqrprdhbwJcpohw prj4"
07/22/2006 12.30.91.69 "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"
07/22/2006 12.30.91.69 "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"
This is a prime example of why I advocate tracking where the little scumbags operate, as opposed to just blocking by user agent: this cable client might upgrade to something even better and slide under my radar next time.
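For anyone who wants to flag the "compatible ;" flaw in their own logs, here's a minimal sketch in Python. It assumes log lines shaped like the excerpt above (date, IP, quoted user agent). The function names are made up for illustration and this is not the blocker I actually run.

```python
import re

# Matches the telltale whitespace before the semicolon in "(compatible ; ...".
COMPATIBLE_SPACE_FLAW = re.compile(r"\(compatible\s+;")


def has_compatible_flaw(user_agent: str) -> bool:
    """Return True when whitespace sits between 'compatible' and ';'."""
    return bool(COMPATIBLE_SPACE_FLAW.search(user_agent))


def flagged_ips(log_lines):
    """Collect the IPs that sent a flawed user agent, so the address can be
    tracked even after the bot changes its user agent again."""
    ips = set()
    for line in log_lines:
        # Assumed line shape: date, IP, then the quoted user agent.
        parts = line.split(None, 2)
        if len(parts) == 3 and has_compatible_flaw(parts[2].strip().strip('"')):
            ips.add(parts[1])
    return ips
```

The point of returning IPs rather than just a yes/no is the same as above: the user agent is disposable, the address is where they operate from.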

Another one I spotted switched from something in Java to the "compatible ;" bot:
62.194.21.116 Java/1.4.1_04
62.194.21.116 "Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)"
I'm sure there are more. I'm still compiling the results, which should be interesting.
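Compiling a list of switchers like these is mostly a matter of grouping by IP. A rough sketch, assuming the log has already been parsed into (ip, user_agent) pairs; the function names and the threshold are hypothetical:

```python
from collections import defaultdict


def agents_by_ip(records):
    """Map each IP to the distinct user agents it sent, in first-seen order."""
    seen = defaultdict(list)
    for ip, user_agent in records:
        if user_agent not in seen[ip]:
            seen[ip].append(user_agent)
    return seen


def switchers(records, min_agents=2):
    """Return the IPs that used at least min_agents distinct user agents."""
    return {
        ip: agents
        for ip, agents in agents_by_ip(records).items()
        if len(agents) >= min_agents
    }
```

Shared proxies and NAT gateways will show up here too, so treat the output as a list of suspects to eyeball, not a block list.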

Friday, July 21, 2006

Huge Yahoo SERP Upheavals or just a Tweak?

The other day I was posting about my position in Yahoo for the silly term "Lava Pockets" and suddenly today it shot all the way down to #133. I see our old snack food friend Hot Pockets is still holding strong at #20 but they dropped the hammer on my ass.

Could it be they got pissed at my post and gave me a hand job?

Probably not, because I started looking around and noticed someone on WebmasterWorld posted today that many blogs appear to have plunged suddenly in Yahoo, and others suggested it was a tweak. I'm not sure it's just a tweak, as some of my SERPs not related to any blogs moved up and down all over the place.

Looks like Yahoo has installed the new, much-anticipated "amusement park" algorithm, as that's the only explanation I can find for the constant rollercoaster results this month.

Stay tuned as Yahoo's next algorithm is rumored to be tied to their stock price.

Thursday, July 20, 2006

Robots, Spiders and Dumbass Delphi Clients, OH MY!

Never heard of this one, but out of the blue someone in France hit me with TALWinHttpClient, which is some Delphi plug-in. The French must be the only ones still clinging to Delphi, as nobody worth a hooker's time still uses that shit.

86.202.105.134 [ANice-152-1-99-134.w86-202.abo.wanadoo.fr.] requested 1 pages as "Mozilla/3.0 (compatible; TALWinHttpClient)"
Figures, something comes crawling from France and puts up a white flag after just one page.

C'est la vie.

VIVA LA GUERRE DES BOTS!

Double Agent Trapped by Bot Blocker

The title wouldn't be so funny except this double agent originated from the Defense Contract Management Agency (DCMA).

Here's the user agent string that caused the problem:

144.183.226.xxx [xxx.dcma.mil.] requested 2 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Mozilla/4.0 (compatible; MSIE 3.3; Windows NT; DCMA 5.5 r.2.0); .NET CLR 1.0.3705; .NET CLR 1.1.4322; InfoPath.1)"
Anyone notice the double agent in the string?

It was the second "Windows NT;" without the version number, which is technically invalid, that caused them to get snared.

Oopsy, military intelligence strikes again.

Just hope whoever did that doesn't contribute code to the missile defense systems!
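For the curious, here's roughly what a check for this kind of string could look like. It's a sketch, not my actual bot blocker: the missing-version test mirrors what snared this visitor, while the nested "Mozilla/" test is an extra assumption I threw in for illustration, and the function name is made up.

```python
import re

# "Windows NT" followed directly by ";" or ")" means the version number is missing.
NT_WITHOUT_VERSION = re.compile(r"Windows NT\s*[;)]")


def double_agent_reasons(user_agent: str):
    """Return a list of reasons the user agent looks malformed (empty if none)."""
    reasons = []
    # A second "Mozilla/" token means one user agent was stuffed inside another.
    if user_agent.count("Mozilla/") > 1:
        reasons.append("nested Mozilla/ token")
    if NT_WITHOUT_VERSION.search(user_agent):
        reasons.append("Windows NT without version")
    return reasons
```

Returning the reasons instead of a bare True/False makes it a lot easier to figure out later why some poor government desktop got bounced.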

MetagerBot powered by Exalead - First Look

Looks like I'm on the bleeding edge with MetagerBot as there's almost a void of information about this bot that comes calling from the University of Hannover.

130.75.2.12 [kursix.rrzn.uni-hannover.de.] requested 1 pages as "MetagerBot/0.8-dev (MetagerBot; http://metager.de; )"
The part that confuses me is why they need their own bot when they claim to be a meta search engine powered by Exalead.

Some references indicated they were using [big shock] nutch:
metagerbot/0.8-dev-mnebel-20050827 (mgbot; http://metager.de/; __nutch-agent@metager.de)
Played around with their search for a bit and it did cough up a few new things of interest I hadn't encountered, but it was slower than shit most of the time and waiting on results can be a bit tedious.

So does this mean we should let Exalead crawl?

No clue.

StupidHijacking

Here's a simple example of how the proxy sites hijack listings.

72.51.33.237 [server1.stupidcensorship.com.] requested 2 pages as "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Yes, that was really Googlebot that tried to crawl my server via the proxy hosted on that server.

Amazingly, you won't see any links that would let Google crawl my site, or any other site, via this proxy server. That's why I always assume a cloaked list of URLs is being fed to Google, as this happens all the time from all sorts of proxy sites of this type.
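One common way to catch this is a forward-confirmed reverse DNS check: anything claiming to be Googlebot has to come from an IP whose hostname belongs to Google and resolves back to the same IP. Here's a minimal sketch under those assumptions. The function names and the suffix list are mine, and the resolver arguments exist only so the check can be exercised without touching the network.

```python
import socket

# Assumed hostname suffixes for genuine Googlebot addresses.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: the IP must map to a Google hostname
    and that hostname must resolve back to the same IP."""
    try:
        host = reverse(ip)[0].rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(host)[2]
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) count as unverified.
        return False


def is_hijack_suspect(ip, user_agent, **resolvers):
    """Flag requests that claim to be Googlebot but arrive from elsewhere."""
    return "Googlebot" in user_agent and not is_verified_googlebot(ip, **resolvers)
```

With a check like this, a Googlebot request relayed through a proxy fails because the connecting IP reverse-resolves to the proxy's hostname, not Google's, even though the crawler on the far side of the proxy is real.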

Cell Phone ALMOST passed the test

With the myriad of cellphone user agents it's almost impossible to track them all, but this one ALMOST got thru, with one minor exception:

The user agent was:

"Dopod818Pro/Mozilla/4.0 (compatible; MSIE 4.01; Windows CE; PPC; 240x320; Dopod818 Pro)"
If they hadn't redundantly put "Dopod818Pro/" on the front it would've passed thru to the website.

Oh well, better luck next time.

ActiveTouristBot - First Look

Welcome to the wonderful world of dumb fucking robots that haven't a goddamn clue.

Maybe ActiveTourist was attempting referrer spam as I have no idea what this digital dungheap was actually doing.

See for yourself:

80.229.187.190 - "GET /robots.txt" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /robots.txt" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /robots.txt" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /robots.txt" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /AnActualFuckingPage.html HTTP/1.1" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /AnotherFuckingPage.html" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /TheFuckOffMyServer.html" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /YourHeadOutOfYourAss.html" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /robots.txt" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /robots.txt" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /robots.txt" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /MoreRobotsTxtFilesYummy.html" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /MoreShitYouArentGetting.html" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
80.229.187.190 - "GET /AFuckingClue.html" "Mozilla/4.0 (ActiveTouristBot V1.2 ;http://www.activetourist.com)"
This spastic fucking bot eats robots.txt files like they're candy.

Didn't find any specific information on that tragedy they call a website about what they even support in robots.txt, but it doesn't really matter as they got nothing from me except a place of honor on my blog.
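If you want to spot robots.txt gluttons like this one in your own logs, counting fetches per IP is enough. A quick sketch, assuming lines shaped like the trimmed excerpt above (IP, dash, quoted request); the regex and function name are illustrative only:

```python
import re
from collections import Counter

# Assumed line shape, modeled on the trimmed excerpt: IP, dash, quoted request.
# The path group stops at whitespace or a quote so the closing quote is excluded.
REQUEST = re.compile(r'^(\S+)\s+-\s*"GET\s+([^\s"]+)')


def robots_txt_hits(log_lines):
    """Count how many times each IP requested /robots.txt."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST.match(line)
        if match and match.group(2) == "/robots.txt":
            hits[match.group(1)] += 1
    return hits
```

Any address that pulls robots.txt more than a handful of times in one visit is either broken or badly written, and either way it's worth a closer look.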


Sunday, July 16, 2006

Hooters Casino Crawling or Breached?

This is the same exact scenario that I saw with the Tireman site and several others.

All dedicated servers, yet the SAME crawler with the exact same behavior is operating from this server:

24.120.182.161 [mail.hooterslv.com.] requested 29 pages as "Java/1.4.2_06"
I don't think this is a coincidence, as it's starting to look like a botnet of some sort, perhaps spambot scrapers. No clue what they're looking for at this point.