Monday, November 19, 2007

Live.com's Search Spam Hysteria and Area 131.107.0.*

There are a lot of recent posts from people approaching hysteria over what appears to be Live.com scouring the 'net for black hat sites doing things like cloaking or worse.

What they're all posting about is that MS Live.com appears to be doing some stealth crawling, sending bogus query strings to look for pages that change their response based on the query. That's what cloaked web sites do: they display advertising related to the topic that brought you to the page.

However, I've seen a few thousand other mysterious page requests from that IP range that most of you probably haven't noticed. I'll share them below; they may or may not be related, it's hard to say at this point.

Sometimes, but not always, the IP address claims to be coming via a proxy such as:

1.1 SEA-PRXY-02
1.1 SEA-PRXY-01
"1.1 NET-PRXY-03, 1.1 NET-PRXY-04"
1.1 NET-PRXY-04
1.1 RED-PRXY-30
... and more
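
Those proxy names look like values of the HTTP Via header, where each hop is a protocol version followed by a host or pseudonym. Here's a minimal Python sketch, assuming that's what the bot blocker logged; the function name parse_via is made up for illustration:

```python
def parse_via(value):
    """Split a Via-style header value into (protocol_version, proxy_name) pairs."""
    hops = []
    # Strip the surrounding quotes some log formats add around the value.
    for hop in value.strip().strip('"').split(","):
        parts = hop.split()
        if len(parts) >= 2:
            # parts[0] is the protocol version, parts[1] the proxy name.
            hops.append((parts[0], parts[1]))
    return hops


# One of the chained-proxy values listed above.
print(parse_via('"1.1 NET-PRXY-03, 1.1 NET-PRXY-04"'))
```

A value with two hops, like the quoted one, means the request passed through both proxies in the order listed.
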
Maybe some of this is unrelated, maybe it's totally relevant; who knows except MS, and they aren't telling. However, starting as far back as 01/07/2007 my bot blocker started trapping what appeared to be stealth crawl activity in the 131.107.*** range:
01/07/2007 131.107.0.96
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"

01/12/2007 131.107.0.95
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"

01/15/2007 131.107.0.104
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)"
Then it appears a human responded to a bot challenge:
01/15/2007 15:56:38 RESPONSE 131.107.0.104
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)"
Then this BLANK user agent started hitting on the same day
01/15/2007 131.107.0.86 ""
Then the sudden challenges and responses on 131.107.0.104 happened again, so maybe that really was a human behind at least one of those proxies. Who knows.

The blank UA on 131.107.0.86 kept asking for thousands of pages for many weeks, including "/robot.txt", which made me giggle.

In the middle of all this there's this little nugget:
03/29/2007 131.107.0.96 "Wget/1.8.1"
Then in March there's another rash of challenges in 131.107.0.* and a single response on 131.107.0.104:
04/28/2007 RESPONSE 131.107.0.104
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
What does it all mean? No clue yet...

Suddenly, after months, the blank UAs on the 131.107.0.104 megacrawl seem to come to a close.

Then we get this little gem:
05/30/2007 131.107.0.95 "LWP::Simple/5.805"
June has a mix of challenges and a couple of responses so humans may use that IP block every now and then.

Then these nuggets pop up:
07/10/2007 131.107.0.95 "Java/1.6.0_01"
07/10/2007 131.107.0.96 "Wget/1.8.1"
07/13/2007 131.107.0.86 "" the blank UA starts crawling again.
Blank UA shows up on other IPs:
07/23/2007 131.107.0.101 ""
07/23/2007 131.107.0.104 ""
07/23/2007 131.107.0.96 ""
07/24/2007 131.107.0.73 ""
07/26/2007 131.107.0.96 ""
07/27/2007 131.107.0.95 ""
Now one IP with blank UA crawls for about three weeks:
10/16/2007 to 11/05/2007 131.107.0.104 ""
Then the Perl crawl begins:
11/15/2007 131.107.0.96 "libwww-perl/5.805"
11/16/2007 131.107.0.95 "libwww-perl/5.805"
And those last two IPs are still currently crawling as "libwww-perl/5.805" as I write this.
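
To get a feel for the mix, here's a rough Python sketch that sorts logged user agents into three buckets: blank, script tool, or browser-like. The tool prefixes and the sample rows are copied from the log excerpts above; the function name classify_ua and the bucket labels are my own invention, not anything the bot blocker actually does.

```python
from collections import Counter

# Prefixes of the script-tool user agents seen in the logs above.
TOOL_PREFIXES = ("Wget/", "LWP::Simple/", "libwww-perl/", "Java/")


def classify_ua(user_agent):
    """Return 'blank', 'tool', or 'browser-like' for one user agent string."""
    if user_agent == "":
        return "blank"
    if user_agent.startswith(TOOL_PREFIXES):
        return "tool"
    # Anything else at least claims to be a browser.
    return "browser-like"


# A few (ip, user agent) pairs taken from the entries above.
SAMPLE = [
    ("131.107.0.96", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"),
    ("131.107.0.86", ""),
    ("131.107.0.96", "Wget/1.8.1"),
    ("131.107.0.95", "LWP::Simple/5.805"),
    ("131.107.0.95", "Java/1.6.0_01"),
    ("131.107.0.96", "libwww-perl/5.805"),
]

counts = Counter(classify_ua(ua) for _, ua in SAMPLE)
print(counts)
```

A browser-like string proves nothing by itself, since any script can send one, so treat the buckets as a sorting aid and not a verdict.
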

When you add it all up, a few things come to mind: Microsoft is checking for cloaking, has some pet projects possibly being tested, and/or is checking to see how websites respond to a browser user agent vs. user agents that are normally blocked. It's probably a mix of all the above.

See the response from msndude msg#3442263 on WebmasterWorld:
First, we appreciate the concerns and issues that have been raised and apologize for any inconvenience this might have caused.

Second, we want to explain what this is all about. The traffic you are seeing is part of a quality check we run on selected pages. While we work on addressing your concerns, we would request that you do not actively block the IP addresses used by this quality check; blocking these IP addresses could prevent your site from being included in the Live Search index.

Please keep the feedback and thoughts coming as we will use this to help improve this process and make sure that it impacts your sites as little as possible.
Please tell me what gives you the right to scan thousands of pages without permission and then threaten to dump our ass if we don't let you run rampant without control over our website?

That's some pretty big balls even for Microsoft!

Since it's annoying some people for no sane reason I say go block the IP range and go back to sleep because Microsoft doesn't send enough traffic to put up with this abuse in the first place.
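
If you do decide to block, the range check itself is tiny. A minimal Python sketch, writing 131.107.0.* as the network 131.107.0.0/24; the helper name is_blocked is hypothetical, and how you wire it into your server is up to you:

```python
import ipaddress

# 131.107.0.* expressed as a CIDR network.
BLOCKED_NETWORK = ipaddress.ip_network("131.107.0.0/24")


def is_blocked(remote_addr):
    """Return True if remote_addr falls inside the blocked range."""
    try:
        return ipaddress.ip_address(remote_addr) in BLOCKED_NETWORK
    except ValueError:
        # Not a parseable IP address; leave the decision to other rules.
        return False


print(is_blocked("131.107.0.104"))
print(is_blocked("74.6.22.170"))
```
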

Besides, Microsoft has some damned explaining to do before they have any room to bully people as I've got quite the list of documented abuse from that IP range that would justify anyone blocking the bad behavior exhibited on 131.107.0.*.

That's my $0.02.

FIRST LOOK: Yahoo Crawler Using Firefox UA

Woke up this morning to find my bot blocker had bitch slapped 300+ crawl attempts by Yahoo using the following criteria:

74.6.22.170 [llf520057.crawl.yahoo.net.] requested 302 pages as
"Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071102 BonEcho/2.0.0.4"
Upon further examination it appears that this activity started on 11/17/2007. The IP address used is a Yahoo proxy, and some of the forwarded IPs were:
74.6.18.46 -> rz502516.crawl.yahoo.net.
74.6.18.160 -> rz502426.crawl.yahoo.net.
74.6.18.163 -> rz502429.crawl.yahoo.ne

a lot more 74.6.18.* IPs etc., you get the idea...
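
Those hostnames come from reverse DNS. A minimal Python sketch of that check, assuming the crawl.yahoo.net suffix seen above is what you want to match; both function names are made up for illustration:

```python
import socket


def looks_like_yahoo_crawler(hostname):
    """True if a reverse-DNS name ends in .crawl.yahoo.net (trailing dot optional)."""
    return hostname.rstrip(".").lower().endswith(".crawl.yahoo.net")


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


# Hostname taken from the log entry above; no network access needed here.
print(looks_like_yahoo_crawler("llf520057.crawl.yahoo.net."))
```

A PTR record can be set by whoever controls the IP block, so a stricter check would also resolve the hostname forward and confirm it maps back to the same IP.
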
What was curious is that the version of Firefox claimed to be Bon Echo, which if I'm not mistaken was pre-release Firefox 2 code.

Didn't look like they were making screen shots based on today's activity, unless they had already cached the images, so I'm not sure what in the hell Yahoo's up to at this point.

Take a look in your logs as I find it hard to believe I'm the only one seeing this.