Saturday, January 05, 2008

Why The Hell Is Bloglines Crawling?

Let's start this investigation by noting that Bloglines themselves now claim to be a crawler when you run reverse DNS on their IP address.
This is what Bloglines is supposed to do, read your RSS feed: "GET /rss_feed.xml" "-" "Bloglines/3.1 (;XXX subscribers)"
However, they've stepped off the RSS path and started coloring outside the lines!

The first odd thing I noticed was that it asked for robots.txt without any user agent defined: "GET /robots.txt" "-" "-"
So I dug a little deeper, and it appears they're running Firefox Minefield, which was asking for a bunch of my images as they appear on 3rd-party websites: "GET /myimage.gif" "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
Finally, I found them requesting some web pages that are NOT in any RSS feed, what the fuck? "GET /anyoldpage.html" "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
So, anyone have a clue what they're doing?
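For the curious, sifting a day's log for this kind of off-pattern behavior is a one-regex job. Here's a minimal sketch, assuming stripped-down log lines shaped like the excerpts above; the `10.0.0.` prefix and the IPs are placeholders for illustration, not Bloglines' actual addresses:

```python
import re

# Matches stripped-down log excerpts like the ones quoted above:
# IP, request, referrer, user agent.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) "GET (?P<path>\S+)" "(?P<ref>[^"]*)" "(?P<agent>[^"]*)"$')

def odd_requests(lines, ip_prefix):
    """Return (path, agent) for hits from a given IP block whose user
    agent is NOT the advertised Bloglines feed reader."""
    odd = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("ip").startswith(ip_prefix) \
                and "Bloglines" not in m.group("agent"):
            odd.append((m.group("path"), m.group("agent")))
    return odd
```

Run that over a day's hits from their block and everything that isn't the feed reader falls right out.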


Yes, they're making screen shots that appear on Ask!

I looked up a few pages from one of my sites in ASK and sure enough, instead of screen shots of the actual web pages there were screen shots of error messages with the Bloglines IP address in big bold numbers.

The reason I figured that out so easily was that I recently decided to block everything claiming to come from Linux, just to see what turned up, which is why they got an error page instead of a screen shot. Sure, I'm probably blocking a few innocent Linux users as well, but they account for an insignificant part of my traffic and overlap with the same tools that servers use, so sacrifices were made.

Anyway, what we've learned is that Ask is using Bloglines' IPs to make screenshots and look at your robots.txt file, yet they don't disclose what they're even looking for in it.

Wasn't that fun?

Friday, January 04, 2008

Does Covenant Eyes Divulge Their Members?

While monitoring activity from Covenant Eyes on one of my servers, it became obvious that many of the pages being accessed were fairly unique and not very popular, which easily allowed me to figure out the actual customer Covenant Eyes was watching.

To test my theory I checked the log file for one unique page Covenant Eyes requested and sure enough only a single IP had accessed that file during the course of the day.

Then I pulled a list of all files that this visitor's IP had viewed and compared it to all the files Covenant Eyes requested: an exact match, in the exact same order of access, without any obfuscation. A 100% match without a doubt.

I've been monitoring this situation for several days now and it's always the same.

The visitor comes and views some pages and about 90-120 minutes later Covenant Eyes comes and asks for the exact same pages in the exact same order.

Here's a sample of a visitor's access: "justapage.html" "anyoldpage.html" "justanotherpage.html" "veryspecialpage.html" "anotherrandompage.html"
A while later Covenant Eyes asks for the same pages in the same order:
69.41.14.x "justapage.html"
69.41.14.x "anyoldpage.html"
69.41.14.x "justanotherpage.html"
69.41.14.x "veryspecialpage.html"
69.41.14.x "anotherrandompage.html"
Same pages, same order: a definite match, anchored by a unique page like "veryspecialpage.html" that nobody else visits on the same day. Additionally, they appear to request each monitored customer's files very quickly in a batch, so it's easy to see that those files belong to a single visitor, making identification even simpler.

Now, with a simple script, I can find out who they were monitoring with extreme accuracy, as long as the visitor either looked at more than one page or looked at a single page that nobody else viewed that day.
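That script needn't be anything fancy. Here's a sketch of the core comparison, using hypothetical IPs and the made-up page names from the sample above:

```python
def matching_visitors(visitor_logs, monitor_pages):
    """visitor_logs maps each visitor IP to the ordered list of pages it
    viewed; monitor_pages is the ordered sequence the monitoring service
    requested later.  Returns every IP whose sequence is an exact match."""
    return [ip for ip, pages in visitor_logs.items()
            if pages == monitor_pages]
```

If exactly one IP comes back, you've identified the monitored customer.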

Making it harder to identify which visitor they're monitoring wouldn't be that difficult: just stagger and randomize the page requests over the course of the day. However, I still don't see how you could protect a customer's identity if that customer was the only one to access the web site that day, unless you throw in a few bogus page requests to lead the webmaster off the trail. Even with randomization and fake requests, you'd still have a problem if the customer was the only one to access a specific page, as mentioned above, but at least it would be a start toward making the monitoring activity a little more covert and possibly less traceable.

The site of mine where I ran this experiment, which isn't this blog, gets 20K-40K visitors daily, so if I can easily find a needle in that big haystack, it would be trivial on a low-traffic site.

Tuesday, January 01, 2008

Romanian Scrapers Go Apeshit on New Years Day

The stealth scrapers attempting to hit my site have been really laid back lately but on Jan 1 '08 the Romanian scrapers went apeshit, or at least tried, followed by a few others.

Needless to say, the bot trap was very busy today.

So far today this is what the little Romanian fuckers tried:
requested 333 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
requested 336 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
requested 337 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
requested 336 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
Then someone from Vietnam tried to join the fun:
requested 340 pages as "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
A quick visit from the Ivory Coast:
requested 339 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"
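Spotting runs like these in the raw logs is just a counting exercise. A sketch, with made-up IP labels and an arbitrary threshold you'd tune to your own traffic:

```python
from collections import Counter

def flag_scrapers(hits, threshold=300):
    """hits: iterable of source IPs, one entry per page request for the day.
    Flags any IP pulling more pages than a human plausibly would; the
    default threshold of 300 is illustrative, not a magic number."""
    counts = Counter(hits)
    return {ip: n for ip, n in counts.items() if n >= threshold}
```

Anything flagged here goes straight into the bot trap's block list.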
Then maybe a human with issues...

Someone from Venezuela paid a quick visit with what appeared to be a broken browser that asked for a bunch of pages the visitor probably wasn't aware of:
requested 153 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
Every time the browser asked for a page it would then request the home page about 5-10 times in just a few seconds; what the fuck is up with that?
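Whatever the cause, that kind of home-page hammering is easy to flag automatically. A sketch, assuming you have per-visitor request timestamps; the window and repeat count are arbitrary illustrative values:

```python
def homepage_burst(hits, window=10.0, repeats=5):
    """hits: (timestamp_seconds, path) pairs for one visitor, in order.
    True if the home page is requested `repeats` or more times within
    `window` seconds, like the broken browser described above."""
    stamps = [t for t, path in hits if path == "/"]
    return any(stamps[i + repeats - 1] - stamps[i] <= window
               for i in range(len(stamps) - repeats + 1))
```

A True here means you treat the visitor as an automated attack, just like I did.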

Anyway, it was considered an automated attack, fuck it.

Anyone else have a wild scrape attack today?

How to Identify Screen Shot Makers

Have you ever wondered how I figure out where screen shots originate from?

My trick of the trade is the SPARE DOMAIN!

All my unused domain does is print out information about whoever or whatever just visited the site with the IP address in REALLY BIG BOLD LETTERS so it's easy to read on a small screen shot thumbnail.

Therefore, if someone makes a screen shot I can tell who's doing it just by looking at the screen shot and block them from doing it a second time if I don't like what they're doing with thumbnails of my site.
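The page the spare domain serves is trivial; here's a sketch of the kind of markup it could render, with made-up styling, that any CGI script or request handler could return:

```python
def visitor_page(ip, agent):
    """Render the spare-domain page: the visitor's IP huge enough to read
    on a small thumbnail, with the user agent below.  The font size and
    layout are illustrative, not anyone's actual implementation."""
    return ('<html><body>'
            '<div style="font-size:120px;font-weight:bold">%s</div>'
            '<div>%s</div>'
            '</body></html>' % (ip, agent))
```

Wire that to whatever serves the unused domain and every thumbnail made of it becomes a confession.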

DomainTools Whois and AboutUs Site Accesses Revealed

The DomainTools Whois is now collecting and displaying more information than ever about our web sites. Their Whois display used to be limited mostly to public registration information: the Whois record, the IP address, where you host, and the basics. Then DomainTools expanded Whois a while back and started taking data straight from our domains without permission, without even checking robots.txt to see if we want to participate. The screen shots were no big deal, but then they added some SEO text browser that lets people snoop on your site, and who knows what's next.

If that wasn't enough, then along came their Wiki companion site, which scraped off some data as well. AboutUs does seem to use robots.txt (see the backwards robots access below), but by the time you find out about the bot it's too late: your domain's AboutUs Wiki page already has your scraped content on it.

Enough is enough, it's official, I'm annoyed.

Since I could find no way to "opt-out" of all the new toys on DomainTools Whois I decided it was time to opt-out the old fashioned way and just block 'em.

If they had just identified themselves in the User Agent this would've been easy, because those are all monitored automatically on my main site. However, it appears DomainTools either doesn't know how to put their information in the User Agent field for the tools they use, or they really don't want to get snared and stopped easily, because they use standard Firefox and MSIE user agents when accessing your site.

However, note that the referrer does claim it's coming from DomainTools, so you can at least use that as an indication it's them, although the User Agent field would've been preferred since that's the standard for this sort of thing.

Here's a sample of DomainTools SEO Text Browser hitting your server: "GET /" "" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20071127 Firefox/"
The SEO Text Browser looks like it might be telling webmasters who's snooping on their sites, because I caught it claiming to be a proxy forwarding information for my IP address while I was looking at the site, so using it is far from anonymous!
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20071127 Firefox/" "/" Proxy Detected -> VIA=1.1 FORWARD=aaa.bbb.ccc.ddd

Of course your average webmaster would never see this proxy information because it's not in your default log file, but I log proxy details and a whole lot more.
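Capturing those extra proxy details is straightforward if your server hands you the raw request headers. A sketch, assuming a simple dict of header names to values:

```python
def proxy_details(headers):
    """Pull the proxy headers a default access log never records.
    headers: dict of request headers as sent; lookup is made
    case-insensitive here since header names can vary in case."""
    h = {k.lower(): v for k, v in headers.items()}
    via = h.get("via")
    forwarded = h.get("x-forwarded-for")
    if via or forwarded:
        return "Proxy Detected -> VIA=%s FORWARD=%s" % (
            via or "-", forwarded or "-")
    return None
```

Append the non-None results to your own log and you'll see what the default log hides.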

This is the DomainTools screen shot thumbnail generator hitting your site: "GET /" "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; {E12EDDF0-EE40-C76D-85D0-8861BDE2E7AE}; SV1; .NET CLR 1.1.4322)"

Here's their companion site, which claims it uses robots.txt but didn't bother to check whether I allowed them on my site until AFTER they had already been there; the accesses came in exactly the order shown below.
"GET /" "-" "Mozilla/5.0 (compatible; AboutUsBot/0.9; +"
"GET /robots.txt" "" "Mozilla/5.0 (compatible; AboutUsBot/0.9; +"
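Catching this backwards behavior in a log is simple enough. A sketch over hypothetical (ip, path) pairs in log order:

```python
def backwards_robots(requests):
    """requests: (ip, path) pairs in log order.  Returns the IPs that
    fetched content BEFORE their first robots.txt check."""
    fetched_content = set()
    offenders = set()
    for ip, path in requests:
        if path == "/robots.txt":
            if ip in fetched_content:
                offenders.add(ip)
        else:
            fetched_content.add(ip)
    return offenders
```

Any bot that shows up in the result read your pages first and asked permission second.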

You might want to block AboutUsBot unless you want them to freely license whatever shit they scrape off your site under the claim at the bottom of their site:
All content is available under the terms of the GFDL and/or the CC By-SA License
If you want to keep them from snooping your site the IPs I'm currently blocking are:
64.246.165.* (screen shots)
So that's all I know at this time. You have robots.txt and htaccess files, you know what to do.
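If you'd rather test IPs in a script than in htaccess, matching wildcard blocks like the one listed above is nearly a one-liner. A sketch:

```python
def is_blocked(ip, patterns=("64.246.165.*",)):
    """Match an IP against simple wildcard blocks like "64.246.165.*"."""
    for pat in patterns:
        if pat.endswith(".*"):
            # Keep the trailing dot so "64.246.16.x" can't match "64.246.165.*".
            if ip.startswith(pat[:-1]):
                return True
        elif ip == pat:
            return True
    return False
```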


They are also running screenshots from 216.145.16.*
Wonder how many other blocks of IPs they're using?


I was accused of confusing DomainTools and AboutUs!

I was never confused, but whoever posted that I was apparently is. I lumped them together because they operate from the same IP space, their Whois records show the same address, and they share some data, such as the thumbnails.

AboutUs uses the thumbnails from DomainTools, and DomainTools Whois links every domain to an "AboutUs: Wiki article on ..." page, so what would you call them?

I never said they were the same company, totally not confused, but whatever makes you happy.


I knew the connections would be spelled out somewhere on the 'net when I had a little more time to do some snooping on the site.
One of the questions posed was about our connection with Name Intelligence. Jay Westerdal, CEO of Name Intelligence, in fact, recently stepped down as AboutUs CTO...
Confuse THAT!

Monday, December 31, 2007

How Much Nutch is TOO MUCH Nutch Revisited

To date there have been 585 unique IPs hitting my server since I started tracking this nuisance called nutch.

Here's a list of IPs with nutch sightings to date:
If I weren't blocking nutch my server would probably be down in flames from the nutch DDoS.
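Tallying those sightings is as simple as it sounds. A sketch over hypothetical (ip, user agent) pairs from the logs:

```python
def nutch_ips(entries):
    """entries: (ip, user_agent) pairs from the access logs.  Returns the
    set of unique IPs sending a nutch user agent, the figure the running
    tally above counts."""
    return {ip for ip, agent in entries if "nutch" in agent.lower()}
```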

Nothing dangerous about giving away code, not a thing.