Saturday, January 05, 2008

Why The Hell Is Bloglines Crawling?

Let's start this investigation by noting that Bloglines themselves claim to be a crawler now when you use reverse DNS on their IP address:

65.214.44.29 -> crawler.bloglines.com
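
A reverse lookup like that can be sketched in a few lines of Python. This is only an illustration: the `reverse_dns` helper and its `lookup` parameter are made-up names, and the only real call involved is the standard library's `socket.gethostbyaddr`.

```python
import socket

def reverse_dns(ip, lookup=socket.gethostbyaddr):
    """Return the hostname a reverse DNS lookup gives for ip, or None.

    `lookup` is injectable (hypothetical parameter) so the function can be
    exercised without touching the network.
    """
    try:
        hostname, _aliases, _addresses = lookup(ip)
    except (socket.herror, socket.gaierror):
        # No PTR record, or the resolver could not answer.
        return None
    return hostname
```

Calling `reverse_dns("65.214.44.29")` performs a live lookup, so whatever it returns today depends on the current DNS records, not on what the logs showed in 2008.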
This is what Bloglines is supposed to do, read your RSS feed:
65.214.44.29 "GET /rss_feed.xml" "-" "Bloglines/3.1 (http://www.bloglines.com;XXX subscribers)"
However, they've stepped off the RSS path and started coloring outside the lines!

The first odd thing I noticed was it asked for robots.txt without any user agent defined:
65.214.44.29 "GET /robots.txt" "-" "-"
So I dug a little deeper and it appears they are running Firefox Minefield which was asking for a bunch of images from 3rd party websites where my graphic appears:
65.214.44.29 "GET /myimage.gif" "http://someotherwebsite.com/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
Finally, I found them requesting some web pages that are NOT in any RSS feed, what the fuck?
65.214.44.29 "GET /anyoldpage.html" "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
So, anyone have a clue what they're doing?

SCREENSHOTS!

Yes, they're making screen shots that appear on ASK.com!

I looked up a few pages from one of my sites in ASK and sure enough, instead of screen shots of the actual web pages there were screen shots of error messages with the Bloglines IP address of 65.214.44.29 in big bold numbers.

The reason I figured that out so easily was I recently decided to just block everything claiming to be coming from Linux just to see what came up and that's why they got an error page instead of a screen shot. Sure, I'm probably blocking a few innocent Linux users as well but they account for an insignificant part of my traffic and overlap with the same tools that servers use so sacrifices were made.

Anyway, what we've learned is that Ask is using Bloglines' IP to make screenshots and look at your robots.txt file yet they don't disclose what they're even looking for in your robots.txt file.

Wasn't that fun?

Friday, January 04, 2008

Does Covenant Eyes Divulge Their Members?

While monitoring activity from Covenant Eyes on one of my servers it became obvious that many of the pages being accessed were fairly unique, not as popular, and easily allowed me to figure out the actual customer Covenant Eyes was watching.

To test my theory I checked the log file for one unique page Covenant Eyes requested and sure enough only a single IP had accessed that file during the course of the day.

Then I got a list of all files that this visitor's IP had viewed and compared it to all the files that Covenant Eyes requested and it was an exact match in the exact same order of access, without any obfuscation, so it was a 100% match without a doubt.

I've been monitoring this situation for several days now and it's always the same.

The visitor comes and views some pages and about 90-120 minutes later Covenant Eyes comes and asks for the exact same pages in the exact same order.

Here's a sample of a visitor's access:

127.0.0.1 "justapage.html"
127.0.0.1 "anyoldpage.html"
127.0.0.1 "justanotherpage.html"
127.0.0.1 "veryspecialpage.html"
127.0.0.1 "anotherrandompage.html"
A while later Covenant Eyes asks for the same pages in the same order:
69.41.14.x "justapage.html"
69.41.14.x "anyoldpage.html"
69.41.14.x "justanotherpage.html"
69.41.14.x "veryspecialpage.html"
69.41.14.x "anotherrandompage.html"
Same pages, same order, and a definite match thanks to a unique page like "veryspecialpage.html" that nobody else visits on the same day. Additionally, they appear to request each monitored customer's pages very quickly in a single batch, so it's pretty easy to see that those requests belong to one visitor, which makes identification even simpler.

Now with a simple script I can find out who they were monitoring with extreme accuracy, as long as the visitor looked at more than one page, or looked at a single page that was unique and nobody else requested during the day.
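
Here's a minimal sketch of what such a script could look like in Python. It is not the actual script used on the site: the function names, the `(ip, page)` tuple format, and the `monitor_prefix` parameter are all assumptions made for illustration, and it assumes each monitoring IP requests exactly one visitor's batch in the log being examined.

```python
from collections import defaultdict

def pages_by_ip(entries):
    """Group requested pages by IP, preserving the order of requests.

    `entries` is an assumed format: (ip, page) tuples in time order.
    """
    grouped = defaultdict(list)
    for ip, page in entries:
        grouped[ip].append(page)
    return dict(grouped)

def find_monitored_visitors(entries, monitor_prefix):
    """Return visitor IPs whose page sequence exactly matches a monitor's."""
    grouped = pages_by_ip(entries)
    # Page sequences requested by the monitoring service (hypothetical prefix match).
    monitor_sequences = [
        pages for ip, pages in grouped.items() if ip.startswith(monitor_prefix)
    ]
    matches = []
    for ip, pages in grouped.items():
        if ip.startswith(monitor_prefix):
            continue
        # Same pages in the same order means this visitor is the likely match.
        if pages in monitor_sequences:
            matches.append(ip)
    return matches
```

Fed the sample entries above, `find_monitored_visitors(entries, "69.41.14.")` would single out the one visitor whose pages line up with the batch. A real log would need the batches split by time first, since one monitoring IP may fetch pages for several customers.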

Making it harder to identify which visitor they're monitoring wouldn't be that difficult just by staggering and randomizing their page requests over the course of the day. However, I still don't see how you could protect the identity of your customer if that was the only customer of the day that accessed that web site unless you throw in a few bogus page requests to throw a webmaster off the trail. Even with randomization and fake page requests you would still have a problem if that customer was the only one to access a specific page as mentioned above, but at least it would be a start in making the monitoring activity just a little more covert and possibly less traceable.

The site of mine where I did this experiment, which isn't this blog, gets from 20K-40K visitors daily, so if I can easily find a needle in that big haystack then it would be trivial on a low traffic site.

Tuesday, January 01, 2008

Romanian Scrapers Go Apeshit on New Year's Day

The stealth scrapers attempting to hit my site have been really laid back lately but on Jan 1 '08 the Romanian scrapers went apeshit, or at least tried, followed by a few others.

Needless to say, the bot trap was very busy today.

So far today this is what the little Romanian fuckers tried:

89.122.29.31 requested 333 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

89.122.16.96 requested 336 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

89.122.29.35 requested 337 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

89.122.29.32 requested 336 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
Then someone from Vietnam tried to join the fun:
203.162.3.153 requested 340 pages as "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
A quick visit from the Ivory Coast:
41.207.2.87 [host-41-207-2-87.afnet.net.] requested 339 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"
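
Per-IP tallies like the ones above can be produced with a short Python sketch. The `count_requests` name and the `(ip, agent)` input format are assumptions for illustration; this is not how the bot trap itself works.

```python
from collections import Counter

def count_requests(entries):
    """Summarize how many pages each (ip, user agent) pair requested.

    `entries` is an assumed format: one (ip, agent) tuple per request.
    """
    counts = Counter(entries)
    return [
        f'{ip} requested {n} pages as "{agent}"'
        for (ip, agent), n in counts.most_common()
    ]
```

The busiest addresses come out first, which makes a burst of 300+ requests from one IP hard to miss.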
Then maybe a human with issues...

Someone from Venezuela paid a quick visit with what appeared to be a broken browser that asked for a bunch of pages, probably without the visitor even being aware of it:
201.210.138.88 [201-210-138-88.genericrev.cantv.net.] requested 153 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
Every time the browser would ask for a page it would then ask for the home page about 5-10 times in just a few seconds, what the fuck is up with that?

Anyway, it was considered an automated attack, fuck it.

Anyone else have a wild scrape attack today?

How to Identify Screen Shot Makers

Have you ever wondered how I figure out where screen shots originate from?

My trick of the trade is the SPARE DOMAIN!

All my unused domain does is print out information about whoever or whatever just visited the site with the IP address in REALLY BIG BOLD LETTERS so it's easy to read on a small screen shot thumbnail.

Therefore, if someone makes a screen shot I can tell who's doing it just by looking at the screen shot and block them from doing it a second time if I don't like what they're doing with thumbnails of my site.
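
As a rough idea of how little such a spare domain needs, here's a minimal WSGI sketch in Python. It is not the code running on the actual spare domain: the `spare_domain_app` name, the font size, and the choice to also print the user agent are assumptions.

```python
from html import escape

def spare_domain_app(environ, start_response):
    """Answer every request with the visitor's IP in huge bold text."""
    ip = environ.get("REMOTE_ADDR", "unknown")
    agent = environ.get("HTTP_USER_AGENT", "-")
    body = (
        "<html><body>"
        # Big enough to stay readable in a small screen shot thumbnail.
        f"<h1 style='font-size:120px;font-weight:bold'>{escape(ip)}</h1>"
        f"<p>{escape(agent)}</p>"
        "</body></html>"
    ).encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local try, the standard library's `wsgiref.simple_server.make_server("", 8000, spare_domain_app).serve_forever()` will serve it, though a real deployment would sit behind a proper web server.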

DomainTools Whois and AboutUs Site Accesses Revealed

The DomainTools Whois is now collecting and displaying more information than ever about our web sites. Their Whois display used to be limited mostly to public registration information such as Whois, the IP address, where you host and the basics. Then DomainTools expanded Whois a while back and started taking data straight from our domains without permission, and they don't even look at robots.txt to see if we want to participate. The screen shots were no big deal but then they added some SEO text browser that allows people to snoop on your site and who knows what's next.

If that wasn't enough, then along came their Wiki companion site AboutUs.org, which scraped off some data as well. AboutUs does seem to use robots.txt, see backwards robots access below, but by the time you find out about the bot it's too late because you already have scraped content on your domain's AboutUs Wiki page.

Enough is enough, it's official, I'm annoyed.

Since I could find no way to "opt-out" of all the new toys on DomainTools Whois I decided it was time to opt-out the old fashioned way and just block 'em.

If they had just identified themselves in the User Agent this would've been easy because those are all monitored on my main site automatically. However, it appears that DomainTools either doesn't know how to put their information in the User Agent field for the tools they use, or they really don't want to get snared and stopped easily, because they use standard Firefox and MSIE user agents for accessing your site.

However, note that the referrer does claim that it's coming from DomainTools so you can at least use that as an indication it's them although the User Agent field would've been preferred since it is the standard for this sort of thing.

Here's a sample of DomainTools SEO Text Browser hitting your server:

66.249.16.212 "GET /" "http://whois.domaintools.com/somedomainname.com" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
The SEO Text Browser thing looks like it might be telling the webmaster who's snooping on their site because I caught it claiming to be a proxy that was forwarding information for my IP address when I was looking at the site so using it is far from anonymous!

66.249.16.211 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11" "/" Proxy Detected -> VIA=1.1 www.domaintools.com FORWARD=aaa.bbb.ccc.ddd

Of course your average webmaster would never see this proxy information because it's not in your default log file, but I log proxy details and a whole lot more.
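
Logging that kind of proxy detail can be sketched as below. This assumes the "VIA" and "FORWARD" values come from the `Via` and `X-Forwarded-For` request headers, which is a guess based on the log line above, and the `proxy_details` name is made up for illustration.

```python
def proxy_details(environ):
    """Return a log fragment if proxy headers are present, else None.

    `environ` is a CGI/WSGI-style dict where request headers appear as
    HTTP_* keys (for example HTTP_VIA for the Via header).
    """
    via = environ.get("HTTP_VIA")
    forwarded = environ.get("HTTP_X_FORWARDED_FOR")
    if not via and not forwarded:
        return None
    return f"Proxy Detected -> VIA={via or '-'} FORWARD={forwarded or '-'}"
```

Keep in mind that both headers are supplied by the client or the proxy, so they are a hint about who is behind the request, not proof.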

This is the DomainTools screen shot thumbnail generator hitting your site:

64.246.165.237 "GET /" "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; {E12EDDF0-EE40-C76D-85D0-8861BDE2E7AE}; SV1; .NET CLR 1.1.4322)"

Here's their companion site AboutUs.org which claims it uses robots.txt but didn't bother to check if I allowed them on my site until AFTER they had already been to the site as the access was in exactly the order shown below.

66.249.16.207 "GET /" "-" "Mozilla/5.0 (compatible; AboutUsBot/0.9; +http://www.aboutus.org/AboutUsBot)"

66.249.16.207 "GET /robots.txt" "http://www.somedomainname.com/" "Mozilla/5.0 (compatible; AboutUsBot/0.9; +http://www.aboutus.org/AboutUsBot)"
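
Catching that backwards order automatically is simple enough. The sketch below is one possible approach, with a made-up function name and an assumed input of `(ip, path)` tuples in time order; it flags any IP that asks for robots.txt only after it has already requested something else.

```python
def fetched_before_robots(requests):
    """Return IPs that requested a page first and /robots.txt afterwards.

    `requests` is an assumed format: (ip, path) tuples in time order.
    """
    first_path = {}
    offenders = []
    for ip, path in requests:
        if ip not in first_path:
            # Remember what this IP asked for first.
            first_path[ip] = path
        elif (
            path == "/robots.txt"
            and first_path[ip] != "/robots.txt"
            and ip not in offenders
        ):
            offenders.append(ip)
    return offenders
```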

You might want to block AboutUsBot unless you want them to freely license whatever shit they scrape off your site with the claims on the bottom of their site:
All content is available under the terms of the GFDL and/or the CC By-SA License
If you want to keep them from snooping your site the IPs I'm currently blocking are:
66.249.16.*
66.249.17.*
64.246.165.* (screen shots)
So there's all I know at this time, you have robots.txt and htaccess files, you know what to do.
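
If you'd rather do the check in application code than in htaccess, here's a minimal Python sketch using the standard `ipaddress` module. It assumes each `*` wildcard above means a full /24 block; the `BLOCKED_NETWORKS` and `is_blocked` names are just for illustration.

```python
import ipaddress

# The three wildcard ranges listed above, written as /24 networks (assumption).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(block)
    for block in ("66.249.16.0/24", "66.249.17.0/24", "64.246.165.0/24")
]

def is_blocked(ip):
    """Return True if ip falls inside any of the blocked ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in BLOCKED_NETWORKS)
```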

UPDATE:

They are also running screenshots from 216.145.16.*
Wonder how many other blocks of IPs they're using?

UPDATE UPDATE:

I was accused of confusing DomainTools and AboutUs.org!

I was never confused, but whoever posted that apparently is, just because I lumped them together. I did that because they operate from the same IP space, their whois records have the same address, and they share some data, such as the thumbnails.

AboutUs uses the thumbnails from DomainTools, and DomainTools Whois has a link from every domain to "AboutUs: Wiki article on ...", so what would you call them?

I never said they were the same company, totally not confused, but whatever makes you happy.

UPDATE UPDATE UPDATE:

I knew the connections would be spelled out somewhere on the 'net when I had a little more time to do some snooping on the site.
One of the questions posed was about our connection with Name Intellignece. Jay Westerdal, CEO of NameIntelligence.com, in fact, recently stepped down as AboutUs CTO...
Confuse THAT!

Monday, December 31, 2007

How Much Nutch is TOO MUCH Nutch Revisited

To date there have been 585 unique IPs hitting my server since I started tracking this nuisance called nutch.

Here's a list of IPs with nutch sightings to date:

12.47.49.97
13.1.137.86
13.1.139.202
13.1.139.205
13.1.139.206
13.1.139.211
13.1.139.212
13.1.139.213
15.203.249.124
24.12.140.54
24.222.153.250
24.231.207.219
24.247.204.244
24.5.71.1
24.6.168.184
24.94.62.119
35.10.2.90
58.186.61.164
58.187.12.236
58.187.22.230
58.215.74.242
58.215.74.253
58.215.75.2
58.68.42.138
58.87.139.90
59.160.240.115
59.160.240.116
59.160.240.183
59.160.240.184
59.160.240.185
59.176.10.136
60.248.9.114
61.135.151.175
61.246.2.241
61.8.140.20
62.129.132.47
62.168.188.151
62.192.109.66
62.192.11.2
62.40.33.173
62.40.36.87
62.54.4.138
63.133.162.98
63.246.7.209
63.82.23.2
64.105.36.210
64.106.247.178
64.18.197.136
64.209.138.200
64.229.206.25
64.229.222.170
64.229.226.126
64.229.33.51
64.231.233.162
64.236.128.27
64.241.242.18
64.242.88.10
64.242.88.60
64.34.172.78
64.34.180.167
64.38.10.26
64.47.51.158
64.71.164.125
65.120.64.146
65.220.67.9
65.92.160.39
65.95.155.163
66.132.240.180
66.132.249.23
66.135.44.34
66.135.44.35
66.135.44.36
66.135.44.37
66.135.44.38
66.135.44.39
66.135.44.40
66.135.44.41
66.135.44.42
66.135.44.43
66.135.44.44
66.135.44.46
66.135.44.48
66.135.44.49
66.135.44.50
66.135.44.51
66.135.44.52
66.135.44.53
66.15.68.234
66.207.120.226
66.24.192.59
66.24.198.171
66.24.199.39
66.24.240.206
66.243.31.34
66.30.10.222
66.92.153.138
67.110.56.45
67.110.58.2
67.111.28.139
67.184.246.61
67.202.20.30
67.202.49.49
67.202.6.11
67.52.101.242
67.68.42.2
67.70.155.226
67.71.89.27
67.95.51.86
68.178.171.109
68.178.202.79
68.205.124.164
68.205.127.94
68.228.72.198
68.97.222.117
69.248.26.83
69.36.233.8
69.55.233.28
69.60.125.233
69.90.45.7
69.93.236.178
70.143.79.234
70.187.130.253
70.197.81.79
70.21.122.162
70.48.46.56
70.50.75.8
70.56.66.216
70.62.103.114
70.85.198.178
70.87.14.34
70.90.188.18
70.96.99.254
71.216.0.210
71.217.33.149
71.241.153.125
71.35.163.79
71.98.182.170
72.0.207.162
72.2.25.66
72.2.25.67
72.2.25.71
72.21.6.146
72.21.6.147
72.21.6.148
72.232.202.50
72.232.223.234
72.232.228.58
72.233.38.194
72.233.38.195
72.233.38.196
72.233.38.197
72.36.114.145
72.36.114.147
72.36.115.42
72.36.115.45
72.36.115.47
72.36.115.52
72.36.115.53
72.36.115.54
72.36.115.56
72.36.115.57
72.36.115.59
72.36.115.64
72.36.115.65
72.36.115.68
72.36.115.69
72.36.115.70
72.36.115.72
72.36.115.73
72.36.115.74
72.36.115.77
72.36.115.79
72.36.115.80
72.36.94.100
72.36.94.106
72.36.94.107
72.36.94.109
72.36.94.110
72.36.94.112
72.36.94.113
72.36.94.118
72.36.94.119
72.36.94.121
72.36.94.122
72.36.94.123
72.36.94.124
72.36.94.169
72.36.94.173
72.36.94.176
72.36.94.179
72.36.94.181
72.36.94.182
72.36.94.20
72.36.94.201
72.36.94.203
72.36.94.243
72.36.94.38
72.36.94.39
72.36.94.48
72.36.94.50
72.36.94.52
72.36.94.54
72.36.94.56
72.36.94.60
72.36.94.61
72.36.94.68
72.36.94.90
72.36.94.92
72.36.94.96
72.36.94.99
72.36.95.12
72.36.95.131
72.36.95.134
72.36.95.145
72.36.95.146
72.36.95.147
72.36.95.148
72.36.95.149
72.36.95.150
72.36.95.152
72.36.95.154
72.36.95.155
72.36.95.156
72.36.95.157
72.36.95.158
72.36.95.160
72.36.95.161
72.36.95.162
72.36.95.165
72.36.95.166
72.36.95.167
72.36.95.168
72.36.95.170
72.36.95.173
72.36.95.176
72.36.95.177
72.36.95.178
72.36.95.179
72.36.95.183
72.36.95.185
72.36.95.207
72.36.95.209
72.36.95.212
72.36.95.214
72.36.95.217
72.36.95.218
72.36.95.226
72.36.95.227
72.36.95.230
72.36.95.231
72.36.95.232
72.36.95.236
72.36.95.237
72.36.95.238
72.36.95.239
72.36.95.251
72.44.58.104
72.44.58.167
72.44.58.173
72.44.58.244
72.44.58.252
72.44.62.107
72.44.62.122
72.44.62.124
72.44.62.151
72.44.62.162
72.44.62.166
72.44.62.197
72.44.62.199
72.44.62.208
72.44.62.245
72.5.173.12
72.5.173.22
72.51.37.148
72.84.30.230
74.111.22.20
74.111.7.226
74.208.11.120
74.39.192.237
74.52.54.130
74.69.164.2
74.98.30.178
74.98.32.176
75.126.142.100
75.126.204.194
75.44.225.44
80.38.119.131
80.79.35.55
81.173.148.94
81.173.155.210
81.203.142.109
81.67.169.232
81.93.168.211
82.150.138.138
82.150.138.139
82.16.40.198
83.149.77.7
83.246.79.28
84.101.58.177
84.101.58.70
84.191.111.92
84.231.72.32
84.231.74.47
84.57.138.191
85.117.62.114
85.145.108.135
85.17.184.39
85.17.184.41
85.177.142.252
85.179.194.32
85.179.196.134
85.18.14.22
85.214.83.174
85.52.193.36
85.88.35.34
85.88.35.35
85.88.35.37
85.88.35.41
87.139.106.60
87.233.142.106
87.242.77.169
87.69.22.130
87.98.222.116
88.191.23.109
88.198.212.50
88.74.95.48
89.149.208.224
89.31.118.248
123.113.184.253
124.157.145.165
124.32.246.36
124.32.246.45
128.174.240.249
128.174.240.251
128.174.241.130
128.174.245.163
128.208.1.160
128.208.3.173
128.208.4.10
128.208.6.125
128.208.6.200
128.208.6.207
128.208.6.226
128.208.6.227
128.208.6.232
128.208.6.75
128.208.6.77
128.238.35.93
128.95.1.189
128.97.88.68
128.97.88.70
129.242.19.138
129.34.20.19
129.78.64.106
131.112.125.102
131.112.125.103
131.112.125.104
131.112.125.106
131.112.16.220
131.211.84.21
132.178.248.36
132.178.248.47
133.30.112.143
140.247.62.79
140.247.62.80
141.30.193.12
141.30.193.5
141.30.193.6
144.92.194.22
145.99.243.67
147.202.73.2
147.202.74.2
147.202.76.2
147.202.81.2
147.202.90.2
159.226.5.82
164.67.195.201
164.67.195.245
164.67.195.26
164.67.195.27
164.67.195.67
164.67.195.68
164.67.195.86
166.214.93.76
192.17.240.18
192.17.240.19
192.17.240.20
192.17.240.21
192.17.240.22
192.17.240.25
192.17.240.26
192.17.240.27
192.17.240.28
192.17.240.29
192.17.240.30
192.17.240.32
192.17.240.33
192.17.240.34
192.17.240.36
192.17.240.41
192.17.240.42
192.17.240.43
192.17.240.44
192.17.240.45
192.17.240.46
192.17.240.47
192.17.240.48
192.17.240.50
192.17.240.52
192.17.240.53
192.17.240.54
192.17.240.55
192.17.240.56
192.17.240.57
192.17.240.58
192.17.240.59
192.17.240.60
192.17.240.62
192.17.240.65
192.17.240.71
192.17.240.73
192.17.240.74
192.17.240.76
192.17.240.79
192.17.240.81
193.138.250.141
193.138.250.237
193.145.45.68
193.203.240.117
193.203.240.118
193.203.240.119
193.203.240.120
193.203.240.121
193.203.240.122
193.203.240.135
193.205.213.166
193.252.148.51
193.42.229.3
193.42.84.5
194.153.145.119
194.153.145.15
195.250.53.25
195.72.131.70
195.72.131.71
195.72.131.72
195.72.131.73
195.72.131.74
195.72.131.75
195.72.131.76
195.72.131.77
195.72.131.78
195.72.131.79
195.72.131.80
195.72.131.81
195.72.131.82
195.72.131.85
195.72.131.86
195.72.131.87
195.72.131.88
195.72.131.89
195.72.131.90
195.72.131.91
195.72.131.92
195.72.131.93
196.203.50.219
198.87.235.130
198.87.235.142
199.4.160.10
200.152.240.214
202.10.82.98
202.174.61.198
202.20.190.235
202.20.192.195
202.69.141.20
202.98.1.120
203.113.130.205
203.147.0.44
203.199.83.162
203.244.218.1
204.123.46.105
204.123.47.91
204.228.230.38
204.228.230.43
206.222.21.2
206.222.9.122
207.115.108.202
207.176.224.241
207.176.224.244
207.176.224.245
207.214.93.42
208.109.126.135
208.64.57.65
208.96.10.200
208.96.10.201
208.96.54.71
208.96.54.72
208.96.54.73
208.96.54.76
208.96.54.77
208.96.54.79
208.96.54.80
208.96.54.81
208.96.54.82
208.96.54.83
208.96.54.84
208.96.54.85
208.96.54.86
208.96.54.88
208.96.54.89
208.96.54.90
208.96.54.91
208.96.54.95
209.139.209.220
209.139.209.224
209.51.212.10
209.51.212.18
209.51.212.26
209.85.62.159
209.85.62.162
209.85.88.150
210.174.3.130
210.196.73.193
210.245.31.15
210.245.31.18
211.152.34.34
212.101.97.63
212.12.114.238
212.137.33.140
212.156.230.210
212.166.192.129
212.174.130.121
212.174.130.122
212.58.116.72
213.132.171.245
213.132.175.101
213.157.204.141
213.219.170.12
213.251.133.12
216.163.188.200
216.163.188.201
216.182.225.186
216.182.229.37
216.182.229.39
216.182.229.91
216.182.230.40
216.182.230.54
216.182.230.75
216.182.236.46
216.182.236.77
216.182.237.45
216.182.238.83
216.231.36.92
216.24.131.152
216.58.87.217
216.93.185.12
217.10.144.242
217.106.233.192
217.153.59.26
217.31.51.128
217.80.112.146
218.25.39.81
220.130.191.231
220.130.191.232
220.130.191.233
220.130.191.234
220.130.191.235
220.130.191.236
220.130.191.237
220.130.191.238
220.130.191.239
220.130.191.240
220.226.195.162
220.226.195.163
220.226.195.165
220.226.195.166
220.226.195.167
220.226.195.168
221.114.253.210
221.116.237.114
221.221.140.114
221.221.237.35
222.173.249.33
222.210.196.26
222.46.17.43
222.46.17.47
If I weren't blocking nutch my server would probably be down in flames from the nutch DDoS.

Nothing dangerous about giving away code, not a thing.