Friday, April 04, 2008

Discovery Engine's Discobot Discovered My Bot Blocker

I found this little Discobot from Discovery Engine trying to dance around on my server but the bot blocker bouncer at the door was already keeping him behind the velvet ropes.

Here's a sample of what I saw on my site:

208.96.54.74 "GET /robots.txt"
"Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)"

208.96.54.68
"Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)"

It does honor robots.txt, just like they said it would, but it cached the file for about 48 hours between visits.

They were nice enough to provide the range of IPs it uses:
208.96.54.67 - 208.96.54.96

Those IPs are from Servepath, which I already block.
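
If you don't already block that data center and want to match the published range yourself, here's a minimal sketch in Python. The function names are made up for illustration, and it assumes the range is inclusive on both ends; how you feed the result into your own blocker is up to you.

```python
import ipaddress

# Range published by Discovery Engine for discobot, as quoted in the post.
DISCOBOT_FIRST = ipaddress.ip_address("208.96.54.67")
DISCOBOT_LAST = ipaddress.ip_address("208.96.54.96")


def in_discobot_range(ip: str) -> bool:
    """Return True if ip is an IPv4 address inside the published range (inclusive)."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        return False  # the published range is IPv4 only
    return DISCOBOT_FIRST <= addr <= DISCOBOT_LAST


def discobot_cidrs() -> list:
    """Collapse the first-last range into CIDR blocks for a deny list."""
    nets = ipaddress.summarize_address_range(DISCOBOT_FIRST, DISCOBOT_LAST)
    return [str(net) for net in nets]
```

Both IPs from the log sample above (208.96.54.74 and 208.96.54.68) land inside the range, and `discobot_cidrs()` hands back the handful of CIDR blocks a firewall or deny list usually wants instead of a first-last pair.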

Between whitelisting allowed bots and blocking more data centers than I'd care to admit, this poor little Discobot didn't stand a chance to discover anything.

Call back when you're all grown up and ready to send traffic.


Persaibot - The Rude Crawler

I saw this little Persaibot hit my site today without even looking at robots.txt and their website has the balls to say:

Persai uses this bot to crawl the web. It's probably the nicest bot with the greatest personality in the world. Seriously, give it some attention.

Exactly how nice can a bot be when it doesn't read robots.txt?

Did you read it and cache it some other day?

Doesn't matter, that was more than 24 hours ago, read it again.

I checked my logs from yesterday; it didn't read it then either, and Persai hadn't visited my site for about a month before that.

I'm sorry, you have committed the biggest faux pas in robot rudeness.
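
The log check described above is easy to automate. Here's a rough sketch in Python, assuming your log parser can hand you (ip, path, user_agent) tuples; that tuple shape and the function name are hypothetical, not something any particular log format gives you for free.

```python
from collections import defaultdict


def agents_skipping_robots(entries):
    """Return user agents that made requests but never fetched /robots.txt.

    entries is an iterable of (ip, path, user_agent) tuples. That shape is a
    hypothetical stand-in for whatever your own log parser produces.
    """
    paths_by_agent = defaultdict(set)
    for _ip, path, agent in entries:
        paths_by_agent[agent].add(path)
    # Sorted so the result is stable from run to run.
    return sorted(
        agent
        for agent, paths in paths_by_agent.items()
        if "/robots.txt" not in paths
    )


# Made-up sample entries for illustration; the page path "/" is invented.
sample = [
    ("208.96.54.74", "/robots.txt", "discobot/1.0"),
    ("208.96.54.68", "/", "discobot/1.0"),
    ("67.202.55.205", "/", "Persaibot/2.71828183"),
]
print(agents_skipping_robots(sample))
```

Run it over one day of logs at a time and any bot on the list either cached robots.txt from an earlier visit or never bothered to ask.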

Here's the intel I have on this little bot:
71.204.131.68 [c-71-204-131-68.hsd1.ca.comcast.net]
"PersaiBot/2.1-dev3a (Persai web crawler; http://www.persai.com/bot.html; bot at persai dot com)"

67.202.55.205 [ec2-67-202-55-205.compute-1.amazonaws.com]
"Mozilla/5.0 (compatible; Persaibot/2.71828183; +http://www.persai.com/bot.html)"

76.102.193.127 [c-76-102-193-127.hsd1.ca.comcast.net]
"Mozilla/5.0 (compatible; Persaibot/2.71828183; +http://www.persai.com/bot.html)"

Now the true irony here is that the CEO of Persai posted on his blog complaining about another search engine, Spock, scraping every little bit of data about him. At least Spock claims to honor robots.txt.

Must be a karma thing ;)

DART Agent - Another Annoying Distributed Tool

This little annoying DART thing that keeps bouncing off my web site appears to be written by CRS4, the Center for Advanced Studies, Research and Development in Sardinia.

It would appear DART stands for "Distributed Agent-based Retrieval Tools", and they even held a workshop in '06 about this damn thing, touted as "The Future of Search Engines' Technologies", that had people from Yahoo!, Google, Quaero and Ask attending.

Here's a sample of some IPs it operates from and the shitload of versions this thing has:

212.123.91.18 "DART Agent, version 1.2 (build 14062007)"
212.123.91.78 "DART Agent, version 1.2.7 (build 27062007)"
212.123.91.78 "DART Agent, version 1.4 (build 17102007)"
156.148.18.62 "DART Agent, version 1.4 (build 29102007)"
156.148.18.62 "DART Agent, version 1.4.1 (build 05112007)"
156.148.18.62 "DART Agent, version 1.4.2 (build 08112007)"
212.123.91.78 "DART Agent, version 1.4.3 (build 15112007)"
212.123.91.78 "DART Agent, version 1.4.3 (build 19112007)"
212.123.91.78 "DART Agent, version 1.4.4 (build 05122007)"
212.123.91.78 "DART Agent, version 1.4.5 (build 06122007)"
212.123.91.78 "DART Agent, version 1.4.6 (build 14012008)"
156.148.18.62 "DART Agent, version 1.4.6 (build 14012008)"
212.123.91.78 "DART Agent, version 1.4.7 (build 24012008)"
212.123.91.78 "DART Agent, version 1.4.8 (build 04022008)"
212.123.91.78 "DART Agent, version 1.5 (build 08022008)"
212.123.91.78 "DART Agent, version 1.5.1 (build 14022008)"
212.123.91.78 "DART Agent, version 1.5.2 (build 18022008)"
212.123.91.78 "DART Agent, version 1.5.5 (build 27022008)"
156.148.18.62 "DART Agent, version 1.5.6 (build 28022008)"
212.123.91.78 "DART Agent, version 1.5.6 (build 28022008)"
212.123.91.78 "DART Agent, version 1.5.1 (build 14022008)"
212.123.91.78 "DART Agent, version 1.5.7 (build 05032008)"
82.85.70.40 "DART Agent, version 1.5.2 (build 18022008)"
212.123.91.78 "DART Agent, version 1.5.8 (build 06032008)"
156.148.18.62 "DART Agent, version 1.5.8 (build 06032008)"
82.85.70.42 "DART Agent, version 1.5.8 (build 06032008)"
212.123.91.78 "DART Agent, version 1.5.9 (build 19032008)"
212.123.91.78 "DART Agent, version 1.5.8 (build 06032008)"
212.123.91.78 "DART Agent, version 1.5.9 (build 20032008)"
213.205.44.51 "DART Agent, version 1.5.8 (build 06032008)"
213.205.44.52 "DART Agent, version 1.5.8 (build 06032008)"
212.123.91.78 "DART Agent, version 1.6 (build 02042008)"
213.205.44.52 "DART Agent, version 1.5.8 (build 06032008)"
156.148.18.62 "DART Agent, version 1.6.0 (build 02042008)"

So far it looks like it's only operating out of Italy. They're nice enough to provide reverse DNS when it runs off their own servers, such as "dartcn01.crs4.it", and even from another source, "dart02.itsm.tiscali.com", so those crawlers can be verified. Other sources, such as "82-85-70-40.b2b.tiscali.it", can't be verified, so it's going to be a problem child for anyone who wants to let it play but needs to make sure it's not being spoofed.
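
For anyone who does want to let it play, the usual way to verify a crawler is a reverse lookup followed by a forward lookup that has to come back to the same IP. Here's a minimal sketch in Python. The suffix list is lifted from the hostnames above, and treating those two suffixes as the complete trusted set is my assumption, not anything CRS4 has published. The function names are made up too.

```python
import socket

# Suffixes taken from the reverse DNS names seen in the logs.
# Assumption: these two are the only hostnames worth trusting.
TRUSTED_SUFFIXES = (".crs4.it", ".itsm.tiscali.com")


def reverse_lookup(ip):
    """PTR lookup; returns the hostname, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host):
    """Forward lookup; returns the host's IPv4 addresses (empty on failure)."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def is_verified_dart(ip, reverse=reverse_lookup, forward=forward_lookup):
    """True only if the PTR name is trusted AND resolves back to the same IP."""
    host = reverse(ip)
    if not host:
        return False
    host = host.lower().rstrip(".")
    if not host.endswith(TRUSTED_SUFFIXES):
        return False  # e.g. a generic "b2b.tiscali.it" customer hostname
    # The forward lookup stops someone faking a PTR record on their own IP space.
    return ip in forward(host)
```

The `reverse` and `forward` parameters are only there so the lookups can be swapped for fakes when testing. A name like "82-85-70-40.b2b.tiscali.it" fails the suffix check, which is exactly the unverifiable case: the request might be DART, or it might be anyone else on that network borrowing the user agent.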

Just what the web needs, more distributed web technology to bug the fuck out of webmasters just trying to scratch out a living on the internet.

Oh well, it can't play on my server so what the hell do I care anyway!