Saturday, December 08, 2007

Validate Link Integrity Using DNSBLs like SpamHaus ZEN

People tend to think that lists from sites like SpamHaus are only good for blocking spam from coming into your servers, but that's just the tip of the iceberg if you're open to some creative thinking.

Since Google penalizes sites that link out to bad neighborhoods, one potential use for SpamHaus ZEN is to automatically identify bad sites and remove the links to them. For people who run directories or have massive amounts of outbound links, this means you can use zen.spamhaus.org to protect your visitors, as well as your reputation in Google and other places, by eliminating links to IPs associated with spammers, 3rd party exploits, proxies, worms and trojans!

How's that for a kick ass way to clean up your site?

Keep in mind that on a shared server a single IP address may represent multiple domains. That means any domain on that server that is spamming or otherwise compromised will impact all domains associated with that IP, so many people may be affected who don't know there's a problem. However, since that server can be a hazard to the population at large, it's best to err on the side of caution and suspend your association with all sites on that server until the problem is resolved.

Since most sites don't even know that they've been infected I merely quarantine those links until they are no longer being reported as hostile and then enable them again after they have been confirmed to be clean.

Not that everything will be listed in SpamHaus ZEN as much of the malicious activity I see isn't in their index, but it's a good reference for known bad sites.

Here's an example of how to check an IP address in SpamHaus using a spammer's IP currently in the DNSBL.

Take the IP address 64.151.120.13, reverse the octets to get 13.120.151.64, and then append zen.spamhaus.org like this: 13.120.151.64.zen.spamhaus.org.

Using any DNS checking tool, query the DNSBL for the existence of 13.120.151.64.zen.spamhaus.org.

If the IP is currently in the DNSBL you'll get a result like this:

host 13.120.151.64.zen.spamhaus.org
13.120.151.64.zen.spamhaus.org has address 127.0.0.2

If the IP address is not in the DNSBL you'll get a response like this:

host 13.120.151.123.zen.spamhaus.org
Host 13.120.151.123.zen.spamhaus.org not found: 3(NXDOMAIN)

The result codes from SpamHaus are as follows:

127.0.0.2 - SpamHaus Block List (SBL)
127.0.0.4-8 - Exploits Block List (XBL)
127.0.0.10-11 - Policy Block List (PBL)

The last list, the PBL, is something I wouldn't auto-block with a link checker, or for any use other than anti-spam, unless I had reviewed what it was blocking first. Those results, if they ever come up, are only treated as "warnings" in my current implementation.
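The lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not my actual link checker: the function names (`dnsbl_query_name`, `classify_zen_code`, `check_ip`, `link_action`) are made up for the example, and the "quarantine" versus "warn" policy is just one way to encode the SBL/XBL/PBL handling described in this post.

```python
import ipaddress
import socket

ZEN_ZONE = "zen.spamhaus.org"


def dnsbl_query_name(ip, zone=ZEN_ZONE):
    """Reverse the IPv4 octets and append the DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def classify_zen_code(answer):
    """Map a returned 127.0.0.x address to the list names given in the post."""
    prefix, _, last = answer.rpartition(".")
    if prefix != "127.0.0" or not last.isdigit():
        return "unknown"
    code = int(last)
    if code == 2:
        return "SBL"
    if 4 <= code <= 8:
        return "XBL"
    if code in (10, 11):
        return "PBL"
    return "unknown"


def check_ip(ip, resolve=socket.gethostbyname):
    """Return the list name for a listed IP, or None if the name does not resolve.

    Note: gethostbyname returns a single address and raises socket.gaierror
    for any resolution failure, so this sketch cannot tell "not listed" apart
    from a resolver problem, and it only sees one return code per lookup.
    """
    try:
        answer = resolve(dnsbl_query_name(ip))
    except socket.gaierror:
        return None
    return classify_zen_code(answer)


def link_action(listing):
    """Hypothetical policy: quarantine SBL/XBL hits, only warn on PBL or unknown."""
    if listing is None:
        return "ok"
    if listing in ("SBL", "XBL"):
        return "quarantine"
    return "warn"
```

Calling `link_action(check_ip("64.151.120.13"))` performs a live DNS query, so the result depends on whatever the DNSBL returns at the time. The `resolve` parameter exists only so the logic can be exercised without touching the network.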

Thursday, December 06, 2007

Bad Behavior Needs Behavior Modification

WebGeek recently reported on Bad Behavior Behaving Badly where he got locked out of all his own blogs and was listed as an enemy of the state and put on the FBI's 10 most wanted geek list and all sorts of things.

OK, I'm exaggerating but read his post and it's close enough.

Anyway, there was something he mentioned about being concerned with:

"If left unattended in this state for a long time, a site could lose valuable search engine rankings, after the spiders of the Big 3 (Google, Yahoo, and MSN) find that they are locked out repeatedly with 403 errors."
Since he mentioned it: I've looked over the source code for Bad Behavior before, and how it validates robots isn't something I'd put on my website. It relies solely on IP ranges, and those ranges are incomplete based on raw information I've collected from the crawlers themselves.

The search engines have clearly stated that they may expand into new IP ranges at any time without notice. The only official way to validate their main crawlers, Googlebot for instance, is full round trip DNS checking, with IP ranges as a backup just in case they make a mistake.

So this code could easily be obsolete at any time:

if (stripos($ua, "Googlebot") !== FALSE || stripos($ua, "Mediapartners-Google") !== FALSE) {
    require_once(BB2_CORE . "/google.inc.php");
}

// Analyze user agents claiming to be Googlebot
function bb2_google($package)
{
    if (match_cidr($package['ip'], "66.249.64.0/19") === FALSE && match_cidr($package['ip'], "64.233.160.0/19") === FALSE) {
        return "f1182195";
    }
    return false;
}

Even more importantly, I've tracked Google crawlers in the following IP ranges, which is 2 more ranges than Bad Behavior has in its code!
64.233.160.0 - 64.233.191.255
66.249.64.0 - 66.249.95.255
72.14.192.0 - 72.14.239.255
216.239.32.0 - 216.239.63.255
The same criticism applies to validating the other bots: Bad Behavior needs a little more robustness in the validation code so that it isn't accidentally blocking valid robots from indexing web pages. Unless I'm missing something, I don't even see where Yahoo crawlers are specifically validated (I'm tracking 11 IP ranges for Yahoo), and MSNBOT was missing the 131.107.0.0/16 CIDR range, etc.

As it stands, the code doesn't have all the IP ranges that I've seen used for any of the major search engines so there is some risk, albeit not a big risk, that some legitimate search engine traffic is being bounced.
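To make the gap concrete, here is a rough Python sketch that compares the two CIDR blocks in the Bad Behavior snippet against the four ranges listed above. The helper names (`to_networks`, `ip_in_networks`) are invented for the example, and the test address 72.14.200.1 is simply an arbitrary address inside the third range, not a confirmed crawler IP.

```python
import ipaddress

# The two CIDR blocks hard-coded in the Bad Behavior snippet quoted above.
BAD_BEHAVIOR_NETS = [
    ipaddress.ip_network("66.249.64.0/19"),
    ipaddress.ip_network("64.233.160.0/19"),
]

# The four start/end ranges listed in this post.
OBSERVED_GOOGLE_RANGES = [
    ("64.233.160.0", "64.233.191.255"),
    ("66.249.64.0", "66.249.95.255"),
    ("72.14.192.0", "72.14.239.255"),
    ("216.239.32.0", "216.239.63.255"),
]


def to_networks(ranges):
    """Convert (first, last) address pairs into a list of CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(
            ipaddress.summarize_address_range(
                ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
            )
        )
    return networks


def ip_in_networks(ip, networks):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.IPv4Address(ip)
    return any(addr in net for net in networks)


OBSERVED_NETS = to_networks(OBSERVED_GOOGLE_RANGES)
```

With these tables, `ip_in_networks("72.14.200.1", OBSERVED_NETS)` is true while `ip_in_networks("72.14.200.1", BAD_BEHAVIOR_NETS)` is false, which is exactly the kind of request that would get bounced. Note that the 72.14.192.0 - 72.14.239.255 range does not fit a single CIDR block, so it expands into two networks.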

Not only that, but the MSIE validation is full of holes and most of the stealth crawlers I block will zip right through Bad Behavior and scrape the blog.

I think WebGeek is right: I would disable the add-in until those issues are resolved.
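For reference, the full round trip DNS check mentioned earlier looks roughly like this in Python: reverse-resolve the visitor's IP, confirm the hostname belongs to the crawler's domain, then forward-resolve that hostname and confirm it maps back to the same IP. This is a sketch under assumptions: `verify_crawler` and its parameters are names made up here, and the suffix list is my assumption about which domains to accept for Googlebot, so check it against what the search engine currently documents.

```python
import socket

# Assumed hostname suffixes for Google crawlers; verify before relying on them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_dns(ip):
    """PTR lookup: return the primary hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_dns(hostname):
    """A lookup: return the list of IPv4 addresses for a hostname."""
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler(ip, suffixes=GOOGLE_SUFFIXES,
                   reverse=reverse_dns, forward=forward_dns):
    """Round trip DNS check: IP -> hostname -> IP must come back to the same IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
        # The hostname must END with an accepted suffix; a substring match
        # would let something like googlebot.com.evil.example through.
        if not host.endswith(suffixes):
            return False
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

A user agent claiming to be Googlebot would only be trusted when `verify_crawler(ip)` returns true, with the IP range table as the fallback. Each call makes two DNS lookups, so in practice the result would need to be cached per IP.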

LiteFinder REALLY Go Fuck Yourself Now

In my opinion this whole LiteFinder Network Crawler is completely bogus.

Yesterday I commented on their crawler, which now just appears to be a ruse to lure people to their web site which is nothing but a big front for affiliate links.

Go to the LiteFinder home page and take a look at the main topics: Adult: Penis Enlargement, Online Gambling or the popular searches for "Phentermine" or "Breast Enlargement Pill".

Riiiiight.

This site is so spammy it would make Sanford Wallace blush.

The so-called search feature doesn't search shit, it just spits up a bunch of bullshit links.

Here are the results for a query on PLUMBING:

Shop
Browse and compare a great selection of .
www.somesite.com

Save up to 95% - diamond jewelry, engagement rings, designer watches, and much more. Live auctions starting at one dollar
somedomain.com

Gold, and Silver Jewelry
Great selection of jewelry including Rings, Necklaces, Bracelets, Pendants, Earrings, Body Jewelry, and Spazio watches.
somejewelry.com

Bored? Check Out the Sumo!
Viral video mayhem. Games Galore. Sucker free music. Bangin' Hotties. Animation for your fascination. Go to the Sumo, live large and never be disappointed by a weak video website again.
www.somesite.com

Etc. you get the idea...

What purpose does a crawler serve if it doesn't feed a search engine?

You've got it, it's a lure, we've been had.

This LiteFinder Network Crawler thing just needs to be blocked, that's all there is to it.

Wednesday, December 05, 2007

LiteFinder Network Crawler Go Fuck Yourself

I don't get too riled up until I read some self-serving pompous bullshit like this that just makes the hair stand up on the back of my neck:

"Can I learn the IP addresses, which LiteFinder Network Crawler comes from?
Unfortunately, You can't since it is against the rules of our company."

The user agent for this mess is:

"Mozilla/5.0 (compatible; LiteFinder/1.0; +http://www.litefinder.net/about.html)"

Since they don't feel like sharing the IP addresses, let me do the honors since it's not against MY company policy:
208.101.44.3 -> mybluewine.net.
209.160.65.42 -> hopone.net.
209.62.109.178 -> ev1s-209-62-109-178.ev1servers.net.
216.40.220.34 -> ev1s-216-40-220-34.ev1servers.net.
216.40.222.50 -> ev1s-216-40-222-50.ev1servers.net.
216.40.222.66 -> ev1s-216-40-222-66.ev1servers.net.
216.40.222.82 -> ev1s-216-40-222-82.ev1servers.net.
216.40.222.98 -> ev1s-216-40-222-98.ev1servers.net.
67.19.114.226 -> w103.networkharmony.com.
67.19.250.26 -> 1a.fa.1343.static.theplanet.com.
70.85.113.242 -> f2.71.5546.static.theplanet.com.
74.53.243.226 -> e2.f3.354a.static.theplanet.com.
74.53.243.242 -> f2.f3.354a.static.theplanet.com.
74.53.244.18 -> 12.f4.354a.static.theplanet.com.
74.53.249.34 -> 22.f9.354a.static.theplanet.com.
74.86.209.74 -> templatestill.com.
74.86.249.98 -> westhoste.net.
75.125.18.178 -> ev1s-75-125-18-178.ev1servers.net.
75.125.47.162 -> ev1s-75-125-47-162.ev1servers.net.
75.125.52.146 -> ev1s-75-125-52-146.ev1servers.net.
84.19.176.208 -> ns.km22118.keymachine.de.
87.118.118.111 -> ns.km31417.keymachine.de.
87.118.98.57 -> ns.km22427.keymachine.de.
87.118.98.62 -> ns.km22426.keymachine.de.

There you go, all the IPs I've seen them use and they can shove the rules of their company where the sun doesn't shine.
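If you want to act on that list, a block check can be as simple as the Python sketch below. The function name `is_litefinder` is made up for the example, and matching on either the user agent token or the IP is my assumption about a reasonable policy. Bear in mind these are hosting-company addresses that may be reassigned, so the IP half of the check will go stale.

```python
# User agent token taken from the LiteFinder user agent string quoted above.
LITEFINDER_UA_TOKEN = "litefinder"

# The IP addresses listed in this post.
LITEFINDER_IPS = frozenset([
    "208.101.44.3", "209.160.65.42", "209.62.109.178", "216.40.220.34",
    "216.40.222.50", "216.40.222.66", "216.40.222.82", "216.40.222.98",
    "67.19.114.226", "67.19.250.26", "70.85.113.242", "74.53.243.226",
    "74.53.243.242", "74.53.244.18", "74.53.249.34", "74.86.209.74",
    "74.86.249.98", "75.125.18.178", "75.125.47.162", "75.125.52.146",
    "84.19.176.208", "87.118.118.111", "87.118.98.57", "87.118.98.62",
])


def is_litefinder(user_agent, ip):
    """True if the request names LiteFinder in its user agent or comes from a listed IP."""
    return LITEFINDER_UA_TOKEN in user_agent.lower() or ip in LITEFINDER_IPS
```

Checking the IP as well as the user agent catches requests from those addresses even if the crawler stops announcing itself.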

Surge Protection - Get it before it's TOO LATE!

I know many of you think surge protection is a bunch of hype but the father of a good friend just found out a few days ago that surge protection is a must have. Lightning apparently zapped their house and took out every single appliance, TVs, radios, computers and a nice big Wurlitzer organ all in one shot totaling over $20K in damages.

That was just enough to make me get off my ass and double check that all of our most expensive gear, like my computer, printers, big screen TV, DVRs, etc., was plugged into the proper place on the UPS/surge protector since the rainy season is starting in California.

For those of you that still have doubts about surge protection, and the odds that lightning will never hit your house, let me tell you about an old buddy of mine from Kansas City. He had a computer that got hit by lightning on the power line, fried the box. He went out and got a new computer and a surge protector for the electrical line. Then about a year later lightning hit the phone line and blew his computer apart when it came in via the modem. Again, he replaced the computer and this time put a surge protector on his phone line as well. Unfortunately, God didn't want him to have a computer and the 3rd time lightning shot in through the window and blew the computer off his desk. Last time I checked they don't make surge protectors for windows.

Anyway, if you don't have a surge protector for your electrical, phone and cable it's time to install one and move the computer away from the window so lightning can't easily blast it off your desk just to show you who's boss.

GEO Targeting Issues with Sprint Wireless Broadband

Testing my new Sprint Wireless Broadband turned up something I didn't quite expect with regard to Geo targeting: the IP addresses used are all attributed to Southern California and I'm in Northern California.

I understand that privacy is a concern and you don't want people to know exactly where you are, but being off by 600 miles is a bit much. Nothing that tries to Geo target works right, and some things become downright annoying, such as AdSense showing you ads for local shit in Irvine, California.

Nothing show-stopping, just annoying.