Saturday, July 15, 2006

SEO'd myself into being an expert geologist

According to the search engines (Google #1, MSN #1, and Yahoo #2 at the time of this posting), I'm an expert geologist on the topic of Lava Pockets.

Guess this pretty much blows the theory about search engines picking authority sites on a topic based on trust, as I don't know shit about lava yet rank high for something that was a completely silly rant and off topic for geology. More amusing is that I mentioned Hot Pockets in that blog post and now it's been dragged into the listings, currently at #3 on Google, #2 on MSN and #20 on Yahoo.

Just goes to show how easily you can get attention on some less fought-over keywords without really trying, and you can even bring along friends like Hot Pockets just by referencing them.

FWIW, I'm still amused that MSN still finds me the number one resource on sex education, and they send constant inquiries from the Middle East, which brings up other issues, but that's for another rant. Not like I tried to get those rankings in the first place, and I've lost some of my positions on those lovely terms, but a few of them are still in the top 10.

My mother would be so proud...

Not.

Friday, July 14, 2006

Scraping via Google Translator

The other day I was tightening up bot blocker security just a bit to verify not only that requests are coming from the Google IP range but also which specific bots are asking for information, instead of the carte blanche approach that "If it's from Google, it must be good," which was a bullshit assumption.

Sure enough, I found something crawling my site at a pretty good pace today and it was someone using the Google Translator to scrape AND translate my site all at the same time.

Isn't that amusing!

Pretty sure it wasn't any type of Googlebot as it didn't ask for robots.txt and requested things like "/#top" which Google doesn't try to crawl, nor would a human in a browser send that request, so it's a bad bot using a loophole.

So follow along, kiddies, with what I've done to date:

  • Locked Googlebot access by known ranges of Google IPs to stop Googlebot spoofing
  • Installed NOARCHIVE to stop scraping via Google's cache index
  • Blocked PROXY servers when Google comes crawling through one to avoid page hijacking
  • Tightened security to specifically look for Googlebot or Mediapartners only to avoid nonsense via the web accelerator or other nonsense services they provide

Then, after ALL THAT, I find Google has yet another vulnerability, which is the translator. It has probably been used to scrape me for months now, and they don't seem to care when someone is asking for pages at 1 second or less per page either.
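
For anyone who wants to try the same lockdown, here is a minimal sketch of the checks in the list above. It is not my actual bot blocker: the network range is a placeholder picked for illustration (Google doesn't publish its crawler ranges), and the names `ASSUMED_GOOGLE_NETWORKS`, `ALLOWED_BOT_TOKENS` and `is_trusted_google_bot` are made up for this example.

```python
import ipaddress

# Placeholder range for illustration only. This is an assumption, not an
# official list of Google crawler IPs; you would maintain your own.
ASSUMED_GOOGLE_NETWORKS = [ipaddress.ip_network("64.233.160.0/19")]

# Only these crawler names are let through, matched against the user agent.
ALLOWED_BOT_TOKENS = ("Googlebot", "Mediapartners")


def is_trusted_google_bot(remote_ip, user_agent, via_header=None):
    """Trust a request only if it names a known crawler, comes from a
    known range, and did not arrive through a proxy (no Via header)."""
    ip = ipaddress.ip_address(remote_ip)
    in_range = any(ip in net for net in ASSUMED_GOOGLE_NETWORKS)
    named_bot = any(token in user_agent for token in ALLOWED_BOT_TOKENS)
    came_via_proxy = bool(via_header)
    return in_range and named_bot and not came_via_proxy
```

The point of the three-way check is that being inside a Google range is no longer enough on its own: a translator request from a Google IP with a browser user agent and a Via header fails two of the three tests.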

What a joke Google, what a joke...

This is why I keep ranting about PROXY servers being bad: this is yet ANOTHER example of how any type of proxy, which in effect is what Google Translator is, can be exploited.

How can I prove to you it's a bot?

When bad behavior is detected my bot blocker will CHALLENGE the requests with a captcha of some sort, might be a simple one, might be a hard one, but this crawler via the translator asked for 159 pages which, up to a point, were all unanswered captchas, then messages about being blocked for bad behavior, and it still kept going asking for different pages one after the other at a rapid pace.
CHALLENGE: 64.233.178.136 [hs-out-f136.google.com.] requested 159 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1),gzip(gfe) (via translate.google.com)"

Now some of you might point out that it could've been a lot of people going thru the proxy server at the same time trying to translate pages. That's easy for me to refute: I track the proxy information, if present, when I log bogus page requests, and most of them came from the same IP address in Brazil.

CHALLENGE 64.233.178.136 "Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-BR; rv:1.7.8) Gecko/20050511 Firefox/1.0.4,gzip(gfe) (via translate.google.com)"

Proxy Detected -> VIA=1.0 translate.google.com (TWS/0.9), 1.0 proxy.google.com:80 (squid)

FORWARD=201.35.249.163

nslookup 201.35.249.163
name = 201-35-249-163.bnut3703.dsl.brasiltelecom.net.br.
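
If you want to run the same kind of tally on your own logs, a rough sketch follows. It assumes the VIA and FORWARD values shown above come from the standard `Via` and `X-Forwarded-For` request headers and that each logged request is available as a plain dict of headers; the function names are invented for this example.

```python
from collections import Counter


def forwarded_client(headers):
    """Return the first address in X-Forwarded-For, or None if absent."""
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    return first or None


def count_by_forwarded_ip(requests):
    """Count proxied requests per original client address."""
    counts = Counter()
    for headers in requests:
        client = forwarded_client(headers)
        # Only count requests that actually declared a proxy hop.
        if client is not None and headers.get("Via"):
            counts[client] += 1
    return counts
```

If one forwarded address dominates the counts, it's one client hammering you through the proxy, not a crowd of people translating pages.
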
On top of all that, it looks like Google jacks up my javascript in the captcha when they run it thru the translator, so if a legitimate visitor, unlike the crawling asshole from Brazil, does something that invokes a challenge, they're just fucked as they can't break out.

Oh joy, more shit to debug.

Thank you Google.

You know what's real fucking hysterical about Google breaking my javascript captcha?

The cheap ass CGI proxy servers run by kiddies trying to get to MySpace from school don't even break my javascript, so this is truly some PhD worthy software that broke my shit.

FYI, I asked Matt Cutts to pony up the actual IPs of Googlebot so I could be more precise and his answer was:
IncrediBILL, I don’t think we’ve done so in the past because it changes from time to time, and we didn’t want to give bad/stale information.
Earth to Google: just post the damn IP list for all your crawlers and those of us using it for security will worry about updating our sites. Maybe you should include new IPs with a lead time, like 7 days in advance, to give everyone a chance to update. Put the list in an XML file and we can automate updating our security, not a problem, really, as it's better than letting idiots scrape my site via the swiss cheese security on your translator!
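
To show how little work the automation would be, here is a sketch that reads such a list. No such file exists, so the whole XML layout (the `crawlers`, `crawler` and `range` elements and the `cidr` and `effective` attributes) is purely hypothetical, and the addresses are reserved documentation ranges, not real crawler IPs.

```python
import ipaddress
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical file format; the second range is announced ahead of time.
SAMPLE_XML = """
<crawlers>
  <crawler name="Googlebot">
    <range cidr="192.0.2.0/24" effective="2006-07-01"/>
    <range cidr="198.51.100.0/24" effective="2006-07-22"/>
  </crawler>
</crawlers>
"""


def load_ranges(xml_text, today):
    """Return {crawler name: [networks]} for ranges already in effect."""
    root = ET.fromstring(xml_text)
    active = {}
    for crawler in root.findall("crawler"):
        networks = []
        for rng in crawler.findall("range"):
            effective = date.fromisoformat(rng.get("effective"))
            if effective <= today:
                networks.append(ipaddress.ip_network(rng.get("cidr")))
        active[crawler.get("name")] = networks
    return active
```

Calling `load_ranges(SAMPLE_XML, date(2006, 7, 15))` would pick up only the first range; the second one is already in the file a week early and kicks in on its effective date, which is the whole point of the lead time.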

Thursday, July 13, 2006

DIT - Dissident Internet Technology

Don't get me wrong, I have no problem with the Chinese people having open access to the internet, but not via sneaky assed cached DIT proxies and not on my server.

If you want internet freedom you can a) move elsewhere like many of my Chinese friends and neighbors or b) overthrow your damned government and stop poking around in firewalls like children trying to bypass NetNanny.

Not like we have any internet freedom as the fucknuts in Washington state passed a law about online gambling that can have your ass tossed in prison just for WRITING about online gambling, not even doing it, so we have our own internet problems and politicians tossing out the 1st Amendment that need overthrowing right here at home.

The proxy details:

66.98.206.97
[ev1s-66-98-206-97.ev1servers.net.]
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1), DynaWeb http://www.dit-inc.us/disclaimer.php"

The amusing part was, unless it was spoofed, the IP being forwarded was 65.59.219.179 which belongs to Level 3 Communications, which certainly isn't in China unless it was yet another proxy server relaying from there.

Sorry people, I know it's harsh, but bad shit goes down via proxies and Chinese IPs have knocked my server offline for as long as 90 minutes at a time asking for hundreds of pages a second.

Not that I don't have sympathy for their situation, but I have NO fucking sympathy for anything that has potential to enable my server going offline.

A Daily Rant from a kindred spirit

Looks like somewhere out in the internet someone else has A Daily Rant that is also pissed about bots. For those of you doing it the old fashioned way and still blocking bots one at a time, he provides snippets of code for your antiquated .htaccess files and has a big list of bots to block ready to install.

Maybe some day he'll have an epiphany and come around to the whitelist method and stop wasting his time with the endless list of bots that just don't stop with a new bot found almost daily.
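
For the record, the difference between the two methods fits in a few lines. This is only a sketch with made-up token lists, not anyone's production rules: the blacklist lets through anything it hasn't heard of yet, while the whitelist blocks it by default.

```python
# Illustrative token lists only; a real setup would be much longer.
BLOCKED_TOKENS = ("Gaisbot", "Rightscripts", "Python-urllib")
ALLOWED_TOKENS = ("Googlebot", "Mediapartners", "MSIE", "Firefox")


def blacklist_allows(user_agent):
    """Blacklist: allow everything except the names already on the list."""
    return not any(token in user_agent for token in BLOCKED_TOKENS)


def whitelist_allows(user_agent):
    """Whitelist: block everything except the names already on the list."""
    return any(token in user_agent for token in ALLOWED_TOKENS)
```

A brand new bot calling itself something like "BrandNewScraper/0.1" sails past the blacklist until somebody adds it by hand, but never gets past the whitelist. Same goes for a blank user agent.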

The site has some good information so I think you'll all enjoy it but he needs more cursing, it's pretty tame ;)

1&1 Web Host Goes Spamming

When you want some bad publicity just send some spam and you'll get all you need.

For starters, I'm not in any opt-in, double opt-in or any other kind of bullshit spam list unless someone stuck my name in there. I despise spam so whatever excuse they make about how they sent me this spam is a FUCKING LIE whatever it is so don't even bother trying to post excuses if you read this as I NEVER signed up, no way, no how, not possible, and you spamming morons can kiss my big fat white hairy ass.

It appears this email came from some SPAM operation called DesertWalls hosted on Interland.

Return-Path:
Received: from desertwalls.com (66.33.17.22)
From: "eMarketing" <SPAMMING_FUCKWADS@desertwalls.com>
Message-ID: <fuck_you_desert_walls@desertwalls.com>
Subject: Exclusively from the World's Largest Web Host
X-Complaints-To: SPAMMING_ASSHOLES@desertwalls.com

This is 1&1's bullshit ad:



The ad was originally loaded from:
http://69.59.147.20/1and1ad4.jpg
When you clicked the ad it went to a tracking script:
http://desertwalls.com/t/xxxxx/xxxxx
Which was redirected here:
http://order.1and1.com/xml/order/HostingHome
Which is owned by 1&1:
domain: 1and1resources.com
registrant-firstname: Thomas
registrant-lastname: Vollrath
registrant-organization: 1&1 Internet Inc.
registrant-street1: 701 Lee RD
registrant-street2: Suite 300
registrant-pcode: 19087
registrant-state: PA
registrant-city: Wayne
registrant-ccode: US
registrant-email: support@1and1.com
Don't know if this was one of their affiliates and I really don't fucking care but it looks like 1&1 sent the email as this was at the bottom:
If you no longer wish to receive messages from 1&1 Internet, please send a blank email with the single word "Unsubscribe" in the subject line to remove@1and1resources.com. Make your mark on the web with 1&1, the only web hosting company trusted by over 5 million people worldwide.
If 1&1 didn't send it then it would be too obvious they had an issue with the affiliate when all the unsubscribe requests started pounding their server, so it's probably them.

So take your webhosting service and shove it up your spamming ass.

Thank you.

Wednesday, July 12, 2006

Honey Pots are History - Fix Your Site or Block Bots!

There's a really cool anti-spam site up called Project Honey Pot that seems to be snaring a ton of spambots and identifying where the spammers crawl from, spam from and what they are spamming. However, the problem with this valiant but slightly misguided effort is that it assumes you can stop spambots from harvesting if you can out them, which blew up in Blue Security's face.

The true way to stop spambots isn't trying to feed them honey, it's fixing your website so there aren't any email addresses for them to harvest in the first place. Wow, isn't that a revelation: no email address on the page, nothing to harvest, end of spambots.

Two simple ways to accomplish the end of spambots are CONTACT FORMS and BOT BLOCKERS. If you purge your site of all email addresses and use a secure form with a captcha to stop spammers, then you should eliminate email harvesting and form spamming in one shot. Combine this with a bot blocker that stops all the vermin from crawling your site in the first place and you've got a 2-pronged strategy that should stop them cold.
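
Purging a site starts with finding what's exposed. Below is a small sketch that scans a page's HTML for anything that looks like an email address. The regular expression is a deliberately loose approximation (it won't catch obfuscated addresses and may flag a few false positives), and `find_exposed_emails` is just a name made up for this example.

```python
import re

# Loose pattern: local part, "@", domain, dot, top-level domain.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_exposed_emails(html):
    """Return the unique email-like strings found in a page, sorted."""
    return sorted(set(EMAIL_PATTERN.findall(html)))
```

Run it over every page you serve; anything it returns is something a harvester can grab too, and should be replaced with a link to the contact form.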

Not that honey pots aren't cute as they snare these idiots, it just makes more sense to EDUCATE webmasters to avoid publicly displayed email addresses in the first place and stop bots from crawling to end the spambots once and forever. Anything else, like a honey pot, is just patching around the problem and not putting a permanent stop to the disease.

NO EMAIL ADDRESSES SHOULD BE DISPLAYED ON WEB PAGES EVER!

GET IT?

GOT IT?

GOOD!

Now move on.

Tuesday, July 11, 2006

Gaisbot crawls via Yahoo China

I've been waiting for an excuse to block Yahoo China and this shit tips the scales.

202.165.96.134
[r2.mk.cnb.yahoo.com.]
"Gaisbot/3.0+(robot@gais.cs.ccu.edu.tw;+http://gais.cs.ccu.edu.tw/robot.php)"

What the fuck Yahoo?

Why is GAIS Labs of National Chung Cheng University crawling from your IP block?

This is BULLSHIT!

Tireman crawling or breached?

Here's one that makes no sense to me as it appears the Tireman Auto Centers came crawling today.

72.240.44.21 [thetireman.com.] requested 29 pages as "Java/1.5.0_07"

It appears to be a dedicated server from ev1servers.net, which is a hotbed of trouble anyway.

ev1s-207-44-205-180.ev1servers.net
Another reason I block anything from ev1servers.net

Rightscripts are Leakey

Some UK script kiddie Leakey Yung has a website called Rightscripts where he sells a bunch of bullshit scripts like Extract Website, just what we need, along with some SEO tools with live demos you probably don't want aimed at your site.

Their server is:

rightscripts.com (64.202.163.153)
If they try to run the demo from their server it shows as a different IP:
64.202.165.132
wc02.inet.mesa1.secureserver.net
"" [blank user agent]
This mess runs on GoDaddy servers:
OrgName: Go Daddy Software, Inc.
NetRange: 64.202.160.0 - 64.202.191.255
Twice I've also seen them hit my site through what appears to be a proxy, using 2 different IPs, also from GoDaddy:
64.202.165.201
"Rightscripts"
Proxy VIA=1.0 wc04.inet.mesa1.secureserver.net:3128 (squid/2.5.STABLE12) FORWARD=64.202.163.153

64.202.165.131
"Rightscripts"
Proxy VIA=1.0 wc01.inet.mesa1.secureserver.net:3128 (squid/2.5.STABLE12) FORWARD=64.202.163.153
Then of course, there appears to be some asshole who actually bought this script and aimed it at my site:
65.254.32.58
[sandiego.hostingforme.com]
"Rightscripts"
Of course "hostingforme" is a bullshit shell page that says nothing, another shocker, owned by some idiot scraper from India.
hostingforme.com (72.9.235.194)

Domain Name: HOSTINGFORME.COM
Registrant:
Ankur Motreja
399, 5th cross,
Gokulam III stage
Mysore, Karnataka 570002
India
Reinforces why I blocked GoDaddy's hosting farm in the first place and lock out all these idiot scripts by default.

Juniper Networks Security Lab

Just about once a day something Juniper Networks has been playing with comes and asks for a page or two and goes away. Never asks for robots.txt or anything like that so no clue if it's a crawler or what.

208.223.208.181
[security-lab1.juniper.net.]
"Python-urllib/1.16"
Perhaps it goes away when the user agent is blocked and it gets the "go away you're bothering me" page, but it will come back again and again and has been doing so for months.

If anyone from Juniper wants to comment on what in the hell your "security lab" is doing, perhaps I'll let it pass thru if it's a worthy cause. However, considering they aren't even bright enough to change the default user agent, forget that last thought, we'll just keep blocking it.