Saturday, November 10, 2007

Websense Stealth Crawler Bypassing Security?

What I find amusing are security companies that claim to be protecting the web while violating access control measures on web servers all over the world.

Here's what I see coming from WebSense that's obvious:

208.80.193.29 Mozilla/5.0 (compatible; Konqueror/3.0-rc2; i686 Linux; 20020108)
208.80.193.30 Mozilla/5.0 (compatible; Konqueror/3.0-rc4; i686 Linux; 20020418)
208.80.193.33 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312462)
208.80.193.34 Mozilla/5.0 (compatible; Konqueror/3.1; i686 Linux; 20020213)
208.80.193.36 Mozilla/5.0 (compatible; Konqueror/3.0-rc1; i686 Linux; 20020328)
208.80.193.37 Mozilla/5.0 (compatible; Konqueror/3.1-rc4; i686 Linux; 20020520)
208.80.193.41 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312466)
208.80.193.42 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312462)
208.80.193.51 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312460)
208.80.193.52 Mozilla/5.0 (compatible; Konqueror/3.0-rc6; i686 Linux; 20020204)
It makes me wonder whether deliberately bypassing the security measures designed to keep robots like WebSense off a server, such as robots.txt, .htaccess and other access controls, might violate the "Computer Hacking and Unauthorized Access Laws".

Proving they've been busily sneaking around on lots of servers won't be too hard either.

Maybe WebSense should just declare any site that blocks them off limits, since we obviously don't want them on our servers, instead of trying to circumvent our security measures.

That would make too much sense, wouldn't it?

Of course someone could claim that bad sites would just cloak clean content if they know it's WebSense. However, I'd rather give explicit permission for WebSense and then it wouldn't bother me so much if they crawled in stealth from a different IP address knowing that I gave permission in the first place.

Here's some of their known IP ranges:
Websense 66.194.6.0 - 66.194.6.255
Websense 74.211.167.208 - 74.211.167.215
Websense, Inc 208.80.192.0 - 208.80.199.255
Not sure these are the same company as a couple are in Canada and the other is in a different city, but what the heck, make up your own mind on these:
Websense Inc 67.117.201.128 - 67.117.201.143
Websense Systems Inc. 64.69.80.104 - 64.69.80.111
Websense Systems Inc. 64.69.80.96 - 64.69.80.103
There you go, some good bot blocking to go with your morning coffee should start off a fine Monday!
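
If your firewall or .htaccess wants CIDR blocks instead of first-last ranges, here's a minimal sketch of the conversion using Python's standard `ipaddress` module. The `range_to_cidrs` helper name is mine, and the `deny from` output assumes Apache 2.2-style access rules, so adjust the print line for whatever you actually run:

```python
import ipaddress

def range_to_cidrs(first, last):
    """Return the CIDR blocks that exactly cover an inclusive IPv4 range."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]

# Ranges copied from the first Websense list in this post.
WEBSENSE_RANGES = [
    ("66.194.6.0", "66.194.6.255"),
    ("74.211.167.208", "74.211.167.215"),
    ("208.80.192.0", "208.80.199.255"),
]

if __name__ == "__main__":
    for first, last in WEBSENSE_RANGES:
        for cidr in range_to_cidrs(first, last):
            # Assumed output format: Apache 2.2-style "deny from" lines.
            print("deny from " + cidr)
```

A range that doesn't line up on a power-of-two boundary comes back as several CIDR blocks, which is why the helper returns a list.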

Thursday, November 08, 2007

How to Super Charge Your Link Checker

Most external link checkers people use can only detect the simple problems with your links such as servers being offline, missing pages (404 errors), or some other type of server error making your outbound link technically broken. These old school link checkers don't know how to detect the myriad of soft 404 errors that send a "200 OK" as a result. Worse yet, traditional link checkers aren't smart enough to detect whether your outbound links have changed hands and are possibly in a domain park, converted to a porn site, or possibly contain malware.

Here are a few tips for those of you who want to super charge your link checker to detect domains that have transitioned into domain parks or parked pages and to catch those soft 404 errors.

1. Do a full trip DNS check on your domain names.

Example of a full trip DNS check: somedomain.com -> ip address -> somedomain.com

For some parked sites, the full trip DNS lookup returns domains like these:

landing.hitfarm.com.
sedoparking.com.
ddwww.tucows.com.
information.com.
Parked pages on GoDaddy are a bit more complex because it's a combination of parkwebwin + secureserver.net but not too terrible to interpret:
parkwebwin-v03.prod.mesa1.secureserver.net
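
A rough sketch of the full trip check, assuming Python and nothing but the standard `socket` module. The function names and the `PARK_HOSTS` tuple are mine for illustration; the host names in it are just the examples listed above, so treat it as a starting list and not a complete one:

```python
import socket

# Reverse DNS names of known domain parks, taken from the examples above.
PARK_HOSTS = (
    "landing.hitfarm.com",
    "sedoparking.com",
    "ddwww.tucows.com",
    "information.com",
)

def looks_parked(ptr_name):
    """Return True when a reverse DNS name matches a known parking host."""
    name = ptr_name.rstrip(".").lower()
    # GoDaddy parked pages: parkwebwin-* somewhere under secureserver.net.
    if name.startswith("parkwebwin") and name.endswith(".secureserver.net"):
        return True
    return any(name == host or name.endswith("." + host) for host in PARK_HOSTS)

def full_trip_dns(domain):
    """domain -> IP address -> reverse DNS name, or None if a lookup fails."""
    try:
        ip = socket.gethostbyname(domain)
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.gaierror and socket.herror
        return None

def is_parked(domain):
    ptr_name = full_trip_dns(domain)
    return ptr_name is not None and looks_parked(ptr_name)
```

A failed reverse lookup returns None instead of flagging the link, since plenty of perfectly good sites have no reverse DNS at all.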
2. Whois Lookup for more detailed information.

If the full trip DNS fails to uncover anything useful then getting the WHOIS information about the domain name and/or IP address might yield interesting results. You might find the site is hosted at Thoughtconvergence.com which runs trafficz.com, a domain park, or is hosted at Parked.com (duh!) or shows DNS servers such as NS1.PARKED.COM.
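
Here's a hedged sketch of the name server part of that check. It assumes you already have the raw WHOIS text from somewhere (fetching it is left out), and that name servers show up on lines like `Name Server: NS1.PARKED.COM`, which is common but far from universal since WHOIS formats vary by registry. `PARK_MARKERS` and both function names are made up for the example:

```python
import re

# Domains mentioned above as belonging to domain parks.
PARK_MARKERS = ("parked.com", "trafficz.com", "thoughtconvergence.com")

# Assumed line formats: "Name Server: host" or "nserver: host".
NS_LINE = re.compile(
    r"^\s*(?:name server|nserver)\s*:\s*(\S+)",
    re.IGNORECASE | re.MULTILINE,
)

def name_servers(whois_text):
    """Return the lowercased name servers found in raw WHOIS text."""
    return sorted({ns.rstrip(".").lower() for ns in NS_LINE.findall(whois_text)})

def whois_park_hits(whois_text):
    """Return the name servers that belong to a known domain park."""
    return [
        ns
        for ns in name_servers(whois_text)
        if any(ns == marker or ns.endswith("." + marker) for marker in PARK_MARKERS)
    ]

# Hypothetical WHOIS fragment, only here to show the call.
SAMPLE = """Domain Name: EXAMPLE.COM
Name Server: NS1.PARKED.COM
Name Server: NS2.PARKED.COM
"""

if __name__ == "__main__":
    print(whois_park_hits(SAMPLE))
```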

3. Examine the redirects and landing page names.

When you request the URL, assuming you process your own redirects, you can observe that certain types of soft 404 errors redirect to the home page of some servers or to a standard default page served up by admin control panels. Additionally, some parked pages go through intermediate redirects whose URLs clearly show the request is being sent to a landing page, and those can be trapped too.

Some sites return a "200 OK" but land on a page with a name like "404error.html" or "404.asp", and there is a long list of these. Unfortunately, just looking for any page with "404" in its name will kick out many false positives, but recording the names you run into will quickly build up a good list of them.

Some samples of various types of 404 pages and URLs you might find:
http_404_filenot_found.htm
erreur404.asp
decommissioned.php
/suspended.page/
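
As a sketch of the landing page check, assuming your link checker already followed the redirects itself and knows the final URL: the `ERROR_PAGE_HINTS` tuple and the verdict strings are invented for the example, and the hints come straight from the sample names above, so expect the same false positives on "404" that I warned about:

```python
from urllib.parse import urlsplit

# Substrings seen in soft 404 page names; grow this list as you find more.
ERROR_PAGE_HINTS = ("404", "not_found", "decommissioned", "suspended")

def landing_page_verdict(requested_url, final_url):
    """Classify where a request ended up after redirects were followed."""
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    final_path = final.path.lower()
    if any(hint in final_path for hint in ERROR_PAGE_HINTS):
        return "error-page-name"
    # A deep link that lands on the bare home page is a likely soft 404.
    if requested.path.strip("/") and not final.path.strip("/"):
        return "redirected-to-home"
    return "ok"

if __name__ == "__main__":
    print(landing_page_verdict(
        "http://example.com/article/42",
        "http://example.com/erreur404.asp",
    ))
```

Anything other than "ok" is only a candidate for a manual look, not proof the link is dead.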
4. Examine the page content.

The least accurate method is to process the content of the landing page and look for fingerprints that identify a site gone bad. Simple phrases such as "this site is temporarily not available" or "this web site coming soon" can spot sites that are no longer active. The problem with this method is that the text fingerprints can easily be changed and can generate false positives. However, it's often the last resort for detecting hundreds of bad pages, so you just keep updating your list of fingerprints as you find them and manually double check these types of broken links for false positives.
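
A minimal sketch of fingerprint matching, assuming you've already downloaded the page HTML. The two phrases are the ones quoted above; the tag stripping is deliberately crude and the function names are mine:

```python
import re

# Phrases that suggest a dead or placeholder site; extend as you find more.
FINGERPRINTS = (
    "this site is temporarily not available",
    "this web site coming soon",
)

def normalize(html):
    """Strip tags crudely, collapse whitespace, and lowercase the text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).lower()

def matched_fingerprints(html):
    """Return every known fingerprint phrase found in the page."""
    text = normalize(html)
    return [phrase for phrase in FINGERPRINTS if phrase in text]

if __name__ == "__main__":
    page = "<html><body><h1>This  Web Site\nComing Soon</h1></body></html>"
    print(matched_fingerprints(page))
```

Returning the matched phrases, and not just a yes or no, makes the manual double check faster because you can see which fingerprint fired.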

5. Compare the previous WHOIS profile.

Save copies of all the WHOIS information you get during link checking and use it in future link checks to detect ownership changes. Assuming the link checker passes the site after all of the above profile checks, compare the current WHOIS information to what you saved the last time you checked the site. Odds are that if the site has changed hands it no longer contains the content you originally linked to, and it may be a link you want to remove.
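
One way to sketch that comparison, assuming the saved and current WHOIS records are plain "Key: value" text. The field names in `OWNERSHIP_FIELDS` are my guess at what matters and differ between registries, and the sample records are made up:

```python
# Fields assumed to signal ownership; names vary by registry, so adjust them.
OWNERSHIP_FIELDS = (
    "registrar",
    "registrant name",
    "registrant organization",
    "name server",
)

def whois_profile(whois_text):
    """Collect the ownership-related fields from raw WHOIS text."""
    profile = {}
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip().lower(), value.strip().lower()
        if sep and value and key in OWNERSHIP_FIELDS:
            profile.setdefault(key, set()).add(value)
    return profile

def changed_fields(old_text, new_text):
    """Return the ownership fields that differ between two WHOIS snapshots."""
    old, new = whois_profile(old_text), whois_profile(new_text)
    return sorted(k for k in set(old) | set(new) if old.get(k) != new.get(k))

# Hypothetical snapshots from two link-check runs.
OLD = "Registrar: Example Registrar\nRegistrant Name: Alice\nName Server: ns1.example.net\nUpdated Date: 2007-01-01\n"
NEW = "Registrar: Example Registrar\nRegistrant Name: Bob\nName Server: NS1.PARKED.COM\nUpdated Date: 2007-11-01\n"

if __name__ == "__main__":
    print(changed_fields(OLD, NEW))
```

Dates are left out of the comparison on purpose, since they change on every renewal and would flag sites that never changed hands.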

Summary

Now you know all of my basic ingredients for building a super charged link checker and should have some ideas on how to spruce up your own link checker. Building the ultimate link checker can't be accomplished in a day, nor does work on it ever stop, because the internet is constantly changing. However, if you have a ton of outbound links or run a large directory, a super charged link checker is the only practical way to check links, and time spent building it is far better spent than checking tens of thousands of links by hand.

Another Stealth Crawler via Extended Host

Here we go with another stealth crawler operating from Extended Host:

194.110.162.19 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.225 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.227 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.228 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.231 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.84 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.85 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.86 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.87 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.88 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.89 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.92 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.93 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.94 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
194.110.162.96 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Here's the Extended Host IP range:
inetnum: 194.110.160.0 - 194.110.163.255
netname: EXTHOST-NET
descr: Extended Host
They just keep coming and I just keep closing more holes they slither through.
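
One way to surface clusters like the one above, and not necessarily how any of these were caught, is to group log hits by network and user agent and flag any group with lots of distinct IPs. This sketch assumes you've already parsed your logs into (ip, user agent) pairs; the /24 grouping and the threshold of 5 are arbitrary picks for illustration:

```python
import ipaddress
from collections import defaultdict

def suspicious_clusters(hits, min_ips=5, prefix=24):
    """Group (ip, user_agent) hits by network and flag the crowded groups."""
    groups = defaultdict(set)
    for ip, user_agent in hits:
        # strict=False lets a host address stand in for its network.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[(str(network), user_agent)].add(ip)
    return {
        key: sorted(ips, key=ipaddress.ip_address)
        for key, ips in groups.items()
        if len(ips) >= min_ips
    }

if __name__ == "__main__":
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    hits = [("194.110.162.%d" % n, ua) for n in (19, 84, 85, 86, 87)]
    for (network, agent), ips in suspicious_clusters(hits).items():
        print(network, len(ips), agent)
```

A shared office or ISP proxy can trip this too, so check who owns the range before blocking it.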

Tuesday, November 06, 2007

Even MORE Stealth Crawling Hosted at Corporate Colo

Here's yet another stealth crawler that came from Corporate Colocation's IP range:

74.124.192.137 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.138 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.161 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.162 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.175 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.181 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.183 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.195 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.198 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
74.124.192.215 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)
Here's their list of IP ranges:
Corporate Colocation Inc. MZIMA01-CUST-CORPCOLO02
64.235.225.8 - 64.235.225.15

Corporate Colocation Inc. MZIMA02-CUST-CORPCOLO04
216.193.219.0 - 216.193.219.255

Corporate Colocation Inc. MZIMA02-CUST-CORPCOLO01
216.193.197.0 - 216.193.197.255

Corporate Colocation Inc. MZIMA02-CUST-CORPCOLO03
216.193.208.0 - 216.193.208.63

Corporate Colocation Inc. CORPCOLO-206-62-132-0-22
206.62.132.0 - 206.62.135.255

Corporate Colocation Inc. CORPCOLO-206-62-144-0-23
206.62.144.0 - 206.62.145.255

Corporate Colocation Inc. CORPCOLO-206-62-146-0-22
206.62.146.0 - 206.62.149.255

Corporate Colocation Inc. MZIMA02-CUST-CORPCOLO05
216.193.251.0 - 216.193.251.255

Corporate Colocation Inc. NET-216-152-242-0-24
216.152.242.0 - 216.152.242.255

Corporate Colocation Inc. CORPCOLO-NET
205.134.224.0 - 205.134.255.255

Corporate Colocation Inc. NET-216-151-149-0-24
216.151.149.0 - 216.151.149.255

Corporate Colocation Inc. MZIMA02-CUST-CORPCOLO10
72.37.152.0 - 72.37.152.255

Corporate Colocation Inc. MZIMA03-CUST-CORPCOLO09
72.37.131.80 - 72.37.131.87

Corporate Colocation Inc. CORPCOLO-NET02
66.117.0.0 - 66.117.15.255

Corporate Colocation Inc. CORPCOLO-NET03
74.124.192.0 - 74.124.223.255

Corporate Colocation MZIMA01-CUST-CORPCOLO05
64.235.225.224 - 64.235.225.239

Corporate Colocation MZIMA01-CUST-CORPCOLO06
64.235.227.96 - 64.235.227.111

Corporate Colocation MZIMA01-CUST-CORPCOLO08
64.235.238.224 - 64.235.238.231

Corporate Colocation MZIMA01-CUST-CORPCOLO07
64.235.237.64 - 64.235.237.71

That little list of IPs should give you all some fun adding to your firewalls and .htaccess files.

Enjoy.

More Stealth Crawling Hosted at OC3Networks

No clue who or what this crawler is but it's coming from the OC3Networks datacenter.

Here's the IPs and the user agent coming from their network:

72.11.155.106 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.112 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.113 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.125 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.131 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.137 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.154 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.197 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.204 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.211 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.219 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.223 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.228 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.236 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.237 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.246 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.34 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.37 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.45 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.5 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.57 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.61 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.63 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.64 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.67 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
72.11.155.90 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET CLR 1.1.4322)
Here's their ranges of IPs:
OC3 Networks & Web Solutions, LLC OC3-NETWORKS
66.63.160.0 - 66.63.191.255

OC3 Networks & Web Solutions, LLC OC3-NETWORKS
72.11.128.0 - 72.11.159.255

OC3 Networks ISWT-207-178-200-0
207.178.200.0 - 207.178.200.63

OC3 Networks OC3-NETWORKS-DSLUSERS
66.63.163.0 - 66.63.164.255

OC3 Networks OC3-NETWORKS--DEDICATED-SERVERS-RANGE
66.63.176.0 - 66.63.176.255

OC3 Networks OC3-NETWORKS--DEDICATED-SERVERS-RANGE
66.63.179.0 - 66.63.179.255

OC3 Networks OC3-NETWORKS---COLOCATIONS-VOIP
66.63.178.0 - 66.63.178.127
Not sure I'd block the DSLUSERS range but the rest look like fair game.
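
If you'd rather check addresses in code than paste ranges into a firewall, here's a minimal sketch that blocks the top-level OC3 ranges listed above while letting the DSLUSERS range through. The list and function names are mine, and the two-list setup is just one way to express "block everything except":

```python
import ipaddress

# Ranges copied from the OC3 Networks listing above.
BLOCKED_RANGES = [
    ("66.63.160.0", "66.63.191.255"),
    ("72.11.128.0", "72.11.159.255"),
    ("207.178.200.0", "207.178.200.63"),
]
# The DSLUSERS range, carved out so real people on DSL aren't blocked.
ALLOWED_RANGES = [
    ("66.63.163.0", "66.63.164.255"),
]

def in_any_range(ip, ranges):
    """Return True when ip falls inside any inclusive first-last range."""
    addr = ipaddress.IPv4Address(ip)
    return any(
        ipaddress.IPv4Address(first) <= addr <= ipaddress.IPv4Address(last)
        for first, last in ranges
    )

def should_block(ip):
    return in_any_range(ip, BLOCKED_RANGES) and not in_any_range(ip, ALLOWED_RANGES)

if __name__ == "__main__":
    for ip in ("72.11.155.106", "66.63.163.10", "66.63.176.5"):
        print(ip, should_block(ip))
```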

Enjoy.

Munax Stealth Crawler

Stumbled upon a stealth crawler hitting my site from multiple IPs and it turned out to belong to Munax, who claim right up front that they haven't named their crawler and that it fakes being a legit user, which is pretty damned scummy.

My guess would be they figured out they couldn't access sites with good security so they decided to get around it without a bot name, but here's some bullshit excuse they use:

Our crawler does not have a "name", yet. Instead it announces itself to be a standard web browser, a "Mozilla 4.0" kind-of-browser compatible with the browser Microsoft Internet Explorer 6.0, running on the Windows NT 5.1 operating system. The reasons for this are: (a) Today, web servers are intelligent enough to react on the type of user agent. If our crawlers had a name, say MunaxRob or something like that, many web servers would not know about it and would return junk or maybe nothing at all. (b) We want the web server to return a page to us where the page looks as close as possible to a page that can be viewed with a standard web browser. This, to create the best possible indexing in our database and a WYSIWYG experience for anybody that is visiting our search engine.
Well listen up fuckheads, there's a reason we would return junk or nothing at all: we don't want your goddamn spider crawling our fucking website!

What part of FUCK OFF! don't you understand that drives you to bypass our security and crawl regardless of whether we want you or not?

Amazingly they admit their IP range:
Your site might have been visited by our crawlers, with network addresses in the range of 82.99.30.2 - 82.99.30.73. Here is a short FAQ answering some of the questions you might have:
I've confirmed this crawl range in my logs:
82.99.30.15 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.17 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.21 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.25 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.26 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.30 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.33 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.37 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.45 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.54 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
82.99.30.67 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Well, this fucking crawler is now blocked.

Bunch of bullshit....