Saturday, March 08, 2008

Jayde NicheBot Crawls for iEntry's Web of Sites

Who out there remembers the Jayde directory?

Some of us submitted our sites to Jayde way back in '96 or '97 (who knows exactly when), and now our sites are being hit by something called the "Jayde NicheBot".

"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) Jayde NicheBot"
I was curious why some site I submitted to about 10 years ago was pinging my server all these years later, so I did a little research to see what they'd been up to in the interim, and they appear to have been very prolific, almost to domain-park proportions.

Jayde is currently owned by iEntry.com, and if you have McAfee SiteAdvisor enabled in your browser the site goes RED, meaning iEntry has something negative on file with SiteAdvisor, which says the following:
Feedback from credible users suggests that this site sends either high volume or 'spammy' e-mails.
Took a look and found someone who posted one of those 'spammy emails' with a ton of iEntry's domain names listed.

On iEntry's website they claim:
iEntry properties include more than 370 Web sites and over 100 e-mail newsletters that are viewed by more than 5 million users every month.
Did a quick search for their 370 sites and Yahoo found over 170 of them.

It appears iEntry owns ExactSeek.com, sitepronews.com, webpronews.com, metawebsearch.com, seo-news.com (and its forum), and a ton of directories, a bunch of sites here, a shitload of sites there, and last but not least, here it's all tied together with ISEDN.ORG.

Google and Yahoo could find listings for my sites in a bunch of their directories, which raises the question:

Why do Google and Yahoo index all those redundant directories?

I found references to my sites in about 40 of them; there's a shock, knock me over with a feather. About 40 sites was all Google and Yahoo would easily report, and the answer to the "why are they indexed?" question appears to be that the order of the listings gets shuffled for the same content on each directory, so every copy looks unique per directory as far as the search engines are concerned. Maybe there were other changes as well; I didn't look too deeply.

However, I did check Live Search, which doesn't appear to be so gullible, as it only reported the duplicate content on 5 sites.

Hey, submit your link, it's FREE and you can advertise too!

Hope I didn't blow out anyone's sarcasm meter with that last quip.
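
If you'd rather opt out of the NicheBot crawl entirely, the same .htaccess treatment covered in the post below works here too. This pattern is just a guess based on the user agent string above, so check your own logs before trusting it:
RewriteEngine On
#punt anything claiming to be the Jayde NicheBot
RewriteCond %{HTTP_USER_AGENT} Jayde [NC]
RewriteRule ^.* - [F,L]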

Friday, March 07, 2008

Slow Down Nosy SEOs and Snooping Competitors

Most webmasters spend a lot of time and effort marketing their websites, or pay someone a lot of money to do it, yet don't do a few common-sense things that keep lazy, nosy-assed SEOs and other competitors from quickly analyzing all that hard work and simply stealing what you've done.

Not that you can completely stop them, because much of the competitive information about who links to you is already public, collected by search engines and toolbars, but you can sure as hell make it a little more difficult to get the rest of the data they want.

Since the SEO Chicks published a list of competitive research tools to help those nosy SEOs snoop, I thought it would only be fair and useful to have a nice list of ways to stop as many of those snooper tools as possible.

Block Archive.org - No need to let anyone see how your site evolved, or snoop and even scrape through archived pages without your knowledge, so block their crawler.

User-agent: ia_archiver
Disallow: /
Rumor has it that ia_archiver may crawl your site anyway, so adding it to your .htaccess file is a good precaution as well:
#mod_rewrite has to be switched on once at the top of the file
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^ia_archive
RewriteRule ^.* - [F,L]
Block Search Engine Cache - Some people cloak pages, showing the search engines raw text while showing visitors a complete page layout. Who cares? That's your business and a competitive edge you don't need to share, plus pages can be scraped from the search engine cache as well, so disable caching on all pages.

Insert the following meta tag in the <head> of all your web pages:
<meta name="robots" content="noarchive">
Block Xenu Link Sleuth - Why do you need people sleuthing your site? Screw 'em...

Add Xenu to your .htaccess file as well, extending the rule from above:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^ia_archive [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu
RewriteRule ^.* - [F,L]
Make Your Domain Registration Private - Why give the SEOs or any other competitor any clues to help them whatsoever?

Sign up with DomainsByProxy and this is all the nosy little bastards will get:
WHATEVERMYDOMAINNAME.COM
Domains by Proxy, Inc.
DomainsByProxy.com
15111 N. Hayden Rd., Ste 160, PMB 353
Scottsdale, Arizona 85260
United States
Restrict Access To Unauthorized Tools - Use .htaccess to whitelist access to your site, allowing just the major search engines and the most popular browsers, which will block many other SEO tools. If you don't understand the whitelist method and it scares you, there are a few good blacklists around too.

This is a limited sample for informational purposes only, just to give an idea of how it works; see the thread linked above for more in-depth samples by WebSavvy. Just be cautious in implementing a whitelist, as it's very restrictive:
#allow just search engines we like, we're OPT-IN only

#a catch-all for Google
BrowserMatchNoCase Google good_pass

#a couple for Yahoo
BrowserMatchNoCase Slurp good_pass
BrowserMatchNoCase Yahoo-MMCrawler good_pass

#looks like all MSN starts with MSN or Sand
BrowserMatchNoCase ^msnbot good_pass
BrowserMatchNoCase SandCrawler good_pass

#don't forget ASK/Teoma
BrowserMatchNoCase Teoma good_pass
BrowserMatchNoCase Jeeves good_pass

#allow Firefox, MSIE, Opera etc., will punt Lynx, cell phones and PDAs, don't care
BrowserMatchNoCase ^Mozilla good_pass
BrowserMatchNoCase ^Opera good_pass

#Let just the good guys in, punt everyone else to the curb
#which includes blank user agents as well
<Limit GET POST HEAD>
order deny,allow
deny from all
allow from env=good_pass
</Limit>

Disclaimer: I don't use .htaccess for much, so please don't ask for a complete file; this is just a sample, as I use a more complex real-time PHP script to control access to my site.

Block Bots and Speeding Crawlers - You can use something like the nifty PHP bot speed trap Alex Kemp has written or Robert Plank's AntiCrawl. It's just another layer of security piled on against snoops and scrapers that pretend to be MSIE or Firefox to avoid the whitelist or blacklist blocking in .htaccess.
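
Neither of those scripts is mine to reprint, but here's a bare-bones PHP sketch of the general idea; the thresholds and the temp-file scheme are made up for illustration, so tune them before leaning on it:
<?php
//crude speed trap - include at the top of every page
//the real scripts mentioned above do this with far more finesse
$ip     = $_SERVER['REMOTE_ADDR'];
$window = 10;  //seconds to watch
$limit  = 20;  //max requests allowed inside the window
$file   = sys_get_temp_dir() . '/hits_' . md5($ip);

$now  = time();
$hits = file_exists($file) ? unserialize(file_get_contents($file)) : array();
if (!is_array($hits)) $hits = array();

//throw away hits that have aged out of the window, then log this one
$fresh = array();
foreach ($hits as $t) {
    if ($t > $now - $window) $fresh[] = $t;
}
$fresh[] = $now;
file_put_contents($file, serialize($fresh));

//anything faster than a human gets the door
if (count($fresh) > $limit) {
    header('HTTP/1.1 403 Forbidden');
    exit('Slow down.');
}
?>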

Block Snoops From Robots.txt - Don't allow anyone other than your whitelisted bots to see your robots.txt file, because it has other stuff in it that SEO snoops might find interesting, and it can become a security risk. Use a dynamic robots.txt file like this perl script on WebmasterWorld and just add the rest of your allowed bots to the code next to Slurp, Googlebot, etc.
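
The WebmasterWorld version is perl, but the same trick in PHP looks something like this sketch; the bot names and the real-robots.txt filename are just placeholders for your own whitelist and rules file:
<?php
//robots.php - hand the real robots.txt only to whitelisted bots
//map it in .htaccess with: RewriteRule ^robots\.txt$ /robots.php [L]
header('Content-Type: text/plain');

$ua      = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$allowed = array('Googlebot', 'Slurp', 'msnbot', 'Teoma'); //your whitelisted bots

$trusted = false;
foreach ($allowed as $bot) {
    if (stripos($ua, $bot) !== false) {
        $trusted = true;
        break;
    }
}

if ($trusted) {
    readfile('real-robots.txt'); //the real rules, private paths and all
} else {
    echo "User-agent: *\nDisallow: /\n"; //everyone else learns nothing
}
?>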

Block DomainTools - Since SEOs use it to snoop, there's no reason to let DomainTools have access, so just block 'em.
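
For what it's worth, the DomainTools crawler has been identifying itself as SurveyBot, so a rule in the same style as the others should do it; double-check the exact string in your own logs first:
RewriteEngine On
#DomainTools' crawler - verify the current user agent in your logs
RewriteCond %{HTTP_USER_AGENT} SurveyBot [NC]
RewriteRule ^.* - [F,L]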

There are probably lots of other things you should be blocking as well, but this will give you a good start.

This list doesn't completely stop snoops from manually looking at your site, but it certainly stops all of those automated tools from ripping through your pages, the search engine or archive caches, and presenting a nice pretty report.

Heck, why should you help people take away your own money?

Start slowing them down today and stop the next up-and-comer from getting the info too easily.

UPDATE:

One more creative thing you can do to your website is cloak the meta tags so that only the search engines see them, disabling the meta tags for normal visitors. There's nothing really wrong with this, because meta tags are by definition only for the search engines, and snooping SEOs will be completely left in the dark when they can't see your meta keywords or description.

It works especially well if you combine cloaked meta tags with the NOARCHIVE option described above, so they're completely hidden from prying eyes.
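
Here's a minimal sketch of how that cloaking could work in PHP, reusing the same kind of crawler whitelist as the samples above; the bot names and tag contents are placeholders for your own:
<?php
//print meta tags only when a whitelisted search engine is asking
function is_search_engine() {
    $ua   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $bots = array('Googlebot', 'Slurp', 'msnbot', 'Teoma'); //same whitelist as before
    foreach ($bots as $bot) {
        if (stripos($ua, $bot) !== false) return true;
    }
    return false;
}
?>
<head>
<title>My Page</title>
<?php if (is_search_engine()) { ?>
<meta name="robots" content="noarchive">
<meta name="description" content="only the crawlers get to read this">
<meta name="keywords" content="your, private, keyword, list">
<?php } ?>
</head>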