Tuesday, July 03, 2007

Google Proxy Hijacking - Myths, Urban Legends and Raw Truths

If you aren't a regular Webmaster World reader then you probably missed the most recent incarnation of the Google Proxy Hijacking discussion, where I had to step in and correct a lot of misinformation about proxy hijacking.

Go read the following:
Proxy Server URLs Can Hijack Your Google Ranking

Lots of good information there once you wade through all the misconceptions.

If you read that entire thread and still have any questions, feel free to ask!

14 comments:

Anonymous said...

Interesting indeed... thanks for your info at the WMW forum.

Anonymous said...

Good reading Bill. The code posted at http://www.webmasterworld.com/apache/3381827.htm seems like a good basic starting point.

I use Drupal as my CMS and I'm in the process of writing a very basic module. The module logic will be something like this (a rough sketch is below):
- Randomly pick a number between 5 and 20, and once that number of pages viewed is reached the user gets a Captcha
- Request more than X pages in a given period of time and, whether or not you have already had a Captcha, you are getting a Captcha
- SE bots that pass the RDNS check won't see the Captchas

I know it's basic, but it should catch more than I'm catching now and it will save a lot of time.
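
Roughly, the per-visitor logic I have in mind looks like the Python sketch below (the real module will be PHP for Drupal, and the threshold values, session store, and bot flag here are just placeholders, not settled decisions):

import random

# Placeholder values - the real thresholds would be tuned per site.
MIN_PAGES, MAX_PAGES = 5, 20    # random window for the captcha trigger
RATE_LIMIT = 60                 # max page views allowed per window
RATE_WINDOW = 300               # window length in seconds

def should_show_captcha(session, now, is_verified_bot):
    # Verified SE bots (passed the reverse/forward DNS check) never see captchas.
    if is_verified_bot:
        return False

    # Pick the random trigger point once, then count page views.
    if "trigger" not in session:
        session["trigger"] = random.randint(MIN_PAGES, MAX_PAGES)
        session["pages"] = 0
        session["window_start"] = now
        session["window_pages"] = 0
    session["pages"] += 1

    # Rate limit: too many pages in the window means a captcha regardless.
    if now - session["window_start"] > RATE_WINDOW:
        session["window_start"] = now
        session["window_pages"] = 0
    session["window_pages"] += 1
    if session["window_pages"] > RATE_LIMIT:
        return True

    # Random page-count trigger, re-armed after every captcha.
    if session["pages"] >= session["trigger"]:
        session["pages"] = 0
        session["trigger"] = random.randint(MIN_PAGES, MAX_PAGES)
        return True

    return False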

Anonymous said...

Ugh. That will just annoy the hell out of your users, especially if every fetch counts. A single page with 16 images, a style sheet, and two JavaScript files would be unviewable, since trying to load it would keep hitting your stupid captcha instead.

There are battles that are worth fighting and battles that aren't. Iraq and the battle to exert absolute control over who accesses information you publish using what tools are two that aren't. :P

IncrediBILL said...

Ignore the last asshole. I insert random CAPTCHAs after x number of page accesses and it works very well, catching stupid bots while barely ever stopping a real human.

I find the best number of pages to use as the CAPTCHA trigger is the mean number of pages the average visitor requests, so anyone skewing past that number is probably a bot and/or a power user, someone not upset over a little test.

The one problem you'll have is that exceptions need to be made for IPs on shared services such as AOL, modem pools, Shaw Cable, etc.
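
If it helps, the exception list can be as dumb as matching the visitor's IP against known shared ranges before you bother counting pages at all. A quick Python sketch - the CIDR blocks here are made-up stand-ins, not the real AOL or Shaw ranges, so look those up yourself:

from ipaddress import ip_address, ip_network

# Stand-in ranges only - substitute the real shared-service blocks.
SHARED_SERVICE_RANGES = [
    ip_network("10.0.0.0/8"),      # pretend this is the AOL proxy farm
    ip_network("192.168.0.0/16"),  # pretend this is a dial-up modem pool
]

def is_shared_service_ip(remote_ip):
    # Many users sit behind one IP on these services, so per-IP page
    # counting would hand captchas to innocent visitors.
    addr = ip_address(remote_ip)
    return any(addr in net for net in SHARED_SERVICE_RANGES)

Visitors from those ranges either get tracked by cookie/session instead of by IP, or simply get a higher trigger count.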

Besides, people who mix topics by comparing protecting your content to Iraq are more than likely whiny-ass scrapers, and fucking assholes at a minimum, so ignore them and continue to fight the good fight online.

Anonymous said...

Those whose idea of "protecting" something involves strangling it to death in an attempt to stop anyone else doing the same are misguided. Those who do so despite the fact that nobody else actually CAN are downright paranoid.

You're not considering repeat visitors here either. I'd be annoyed if I visited a site, browsed for a bit, decided something was interesting, bookmarked it, and came back days later only to hit the captcha threshold after just a further page or two. A frequent visitor with a stable IP address would get a captcha quite often, with the cumulative page-view count ratcheting ever higher. You'd be especially annoying to your most loyal visitors, which is a terrible idea.

Captchas are for guarding posting/submission forms that can be used to generate spam. They should never appear to someone who is merely reading and browsing.

IncrediBILL said...

Strangling?

Who died and made you the CAPTCHA police?

A captcha is a test to make sure it's a human vs. a bot, and there are no rules about when or where it can be used.

Pack up your old-school mentality, as it's being done on more sites than you might imagine, and whether it annoys you or not, we really don't care. Admission is free; get over it.

Anonymous said...

I heard about this tropical vacation spot you might want to visit sometime Bill. Admission is free, and so is the ferry ride you'll need to take to cross the border and enter the, er, vacation spot. Don't mind the somewhat spooky appearance of the ferryman, as he's really, er, mostly harmless anyway. And if you have any complaints, like the air conditioning is broken, the management is sure to be helpful; just address any and all complaints to "Beelzebub".

Have a nice vacation.

IncrediBILL said...

You really need to get a job if you have that much time to write a post just to tell someone to "go to hell".

moron

Unknown said...

Hey Bill,

thanks for the great post.

What I'm currently not sure of is how you get rid of all those proxy sites that allow indexing of their cached pages.

One example is proxydust dot com, which caused some duplicate content issues for my company site.

I would appreciate it if you could get back to me on that, as those sites don't identify themselves as fake Googlebots, hence the reverse/forward DNS check doesn't work.

best, christoph

IncrediBILL said...

PROXYDUST appears to just pass the user agent through as-is; it's hard to say without seeing an actual hijacking whether they do something special with Googlebot.

Anyway, they operate out of uk2net and the easiest way to make sure you've got all their IPs is to just block the entire data center.

inetnum: 83.170.96.0 - 83.170.111.255
netname: UK2-NET
route: 83.170.96.0/20
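
If you can't do it at the firewall, a check against that route at the application level is trivial. A minimal Python sketch, assuming you already have the visitor's IP as a string:

from ipaddress import ip_address, ip_network

# The UK2-NET route from the whois record above.
BLOCKED_RANGES = [ip_network("83.170.96.0/20")]

def is_blocked_host(remote_ip):
    # True if the request comes from a blocked data center range.
    addr = ip_address(remote_ip)
    return any(addr in net for net in BLOCKED_RANGES)

print(is_blocked_host("83.170.100.25"))  # True - inside 83.170.96.0/20
print(is_blocked_host("8.8.8.8"))        # False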

Unknown said...

Hey Bill,

yeah - that was an easy job to do - right now I have one employee searching for all sorts of sites like that to block on our firewall here...

Do you have an idea of how to automate that? I mean, they appear as normal visitors.

thanks
christoph

IncrediBILL said...

Automating it is sometimes proxy- and behavior-specific, nothing I could tell you how to do in a quick post.

Some of them actually slip through the cracks for a while until they reveal themselves, so it's not 100% bulletproof.

The only way to get most of them is to simply block all hosting centers.

Unknown said...

Hey bill,

Dan Thies had a post about "Google Proxy Hacking" today explaining everything except how to get rid of those proxy sites that don't send a bot user agent... imho that's the real current problem with URL hijackers...

christoph

Pao said...

Hi incredibill

Your answer got my attention; you said:
"If you validate Googlebot (or any other crawler) with reverse/forward DNS checking the proxy hijacking simply goes away"

Please, would you tell me how?

I read the Google Proxy Hacking post, but the solutions seem complicated... =( I even found a plugin... but I don't see any change.


Thank you
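
For reference, the reverse/forward DNS check quoted above works like this: take the connecting IP, do a reverse DNS lookup, require the resulting host name to end in googlebot.com (or google.com), then do a forward lookup on that name and make sure it resolves back to the same IP. A proxy fetching pages while claiming a "Googlebot" user agent fails the check, so you can refuse it and the hijacked copy never gets fed. A minimal Python sketch of the idea (error handling kept deliberately thin):

import socket

def is_real_googlebot(remote_ip):
    # 1. Reverse lookup: IP -> host name.
    try:
        host, _aliases, _addrs = socket.gethostbyaddr(remote_ip)
    except socket.herror:
        return False

    # 2. Only Google's own hosts qualify; user agents are trivially spoofed.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False

    # 3. Forward lookup: host name -> IPs, which must include the original.
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(host)
    except socket.gaierror:
        return False
    return remote_ip in addresses

Requests claiming to be Googlebot that fail this check can be denied or served a captcha; verified crawlers skip the captcha logic entirely.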