Saturday, August 19, 2006

Pyrex Detonates Dinner

Tonight we had to go out to dinner because the Pyrex dish in the microwave detonated like a small bomb. It did NOT just fall apart like they claim, I heard the noise 3 rooms away when the dish went >BANG!< in the microwave and rocked it.

OK, I'll grant you that this occurrence was possibly caused by a rapid temperature change from cooking in the microwave.

However, a couple of years ago we had a 6-month-old Pyrex 11"x17" baking dish sitting in a cabinet that hadn't been used in many days, and it just exploded with such force that the door to the cabinet was blown open, shards of glass flew across the room and there was glass even on the shelf ABOVE where this dish was sitting.

The fine people of World Kitchen (scroll down to Pyrex Responds) would like to have you believe that these dishes don't explode, so perhaps they would like to explain why I had to empty a shelf above the dish that didn't explode to clean up the shards of glass, or how glass landed in the carpet almost 8' away, or sprayed across our kitchen.

For my part, I'm done with Pyrex as 2 exploding dishes in a couple of years is just too dangerous to deal with.

We'll be replacing them with metal for baking or some alternative for the microwave in every case possible.

In the meantime, we're immediately moving all of our Pyrex to lower shelves by the floor as a safety precaution as I'd rather have glass in my foot instead of glass in my face from this volatile cookware.

[Update]

After doing some research we found this tidbit:

"Pyrex kitchen products produced by World Kitchen are no longer made from borosilicate glass, and their packaging indicates that they must never be used over a flame, on stove tops, under a broiler, or in a toaster oven. Hence the exploding glassware. Pre-1998 Pyrex tends to crack into large pieces rather than shattering. Keep your pre-1998 Pyrex and treat it nicely so it will last. Anything you buy now may blow up in your face! "
The 11x17 baking dish was post-1998 but obviously because of its size wouldn't fit in the toaster oven or even the microwave, and we NEVER use these for stove top cooking.

The smaller dish that cracked up in the microwave last night we've had since about '91 so it was the old-style Pyrex and did crack into larger pieces, but did so quite loudly.

I'm just glad it broke in the closed microwave and not when my wife opened the door or was removing it from the microwave, that would've been a bad situation.

No more Pyrex for us, it's gone.

SCRAPER BUSTED #10 - Fashion Designs Made for AdSense

We have some fuckers scraping from, come on say it - your favorite source of scraping, Everyones Internet or ev1.net for those of you new to this shit.

Some real dumb fuckers too that put "User-Agent:" in front of the user agent just in case we weren't smart enough to know the user agent was a user agent which was reason enough to block all the assholes that do that shit in the first place.
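For anyone who wants to catch the same mistake, here's a minimal sketch of the check, assuming you already have the user agent value as a string. The function name is made up for illustration, it's not what my software actually calls it.

```python
def has_header_name_in_user_agent(user_agent: str) -> bool:
    """Return True if the user agent VALUE itself starts with 'User-Agent:'.

    A real browser sends only the value; the header name showing up inside
    the value means someone built the request by hand and got it wrong.
    """
    return user_agent.strip().lower().startswith("user-agent:")


# The first string is the one from the log line in this post.
bad_agent = ("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
             "rv:1.7.7) Gecko/20050414 Firefox/1.0.3")
good_agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(has_header_name_in_user_agent(bad_agent))
print(has_header_name_in_user_agent(good_agent))
```

The comparison is case-insensitive on purpose, on the assumption that the same sloppy tools may send "user-agent:" in lowercase too.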

Here's the stealth crawler:

66.98.132.68 "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3"
Here's a snippet from MSN:
Spring Fashion 2003 Check Out The Backpacker Watercolor Palette Or The ...
...by political and economic decentralisation, especially in countries with mixed and command economies. spring fashion 2003 Your IP Address: 66.98.132.68 User Agent:
devoll.roswellspringcatalog.info/spring-fashion-2003.html 8/18/2006

Stupid fuckers, did you not think we would catch your made for adsense scraper bullshit?

Friday, August 18, 2006

Stealth Crawler from HP

Something stealth came crawling from HP, perhaps it's part of that PlanetLab fiasco, asked for 11 pages and 3 images, very bizarre behavior.

Asked for the 'about' page 3 times, robots.txt 2 times, index 2 times, and the privacy page, so definitely not a human and not a terribly clever bot either.

192.6.19.203 - "GET /robots.txt " "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
192.6.19.203 - "GET / " "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
Reverse DNS says:
nslookup 192.6.19.203
name = cache2.nlanr.hpl.hp.com
What's up HP, mind sharing your intentions with this activity?

Thursday, August 17, 2006

Found ACEZ in Placez, What is it?

Whatever ACEZ is, it always precedes what appears to be an actual page view so perhaps it's a filtering program of some sort. Additionally, it seems to only come via a proxy server.

Doesn't appear to be a bot but it's very strange.

Here's a few sightings:

220.227.148.74 "ACEZ" VIA=1.0 localzeesports.localzs.com:8080 (squid/2.5.STABLE1) FORWARD=192.168.10.178

212.138.113.12 "ACEZ" VIA=1.1 proxy2 (NetCache NetApp/5.3.1R4), 1.0 cache2.ruh FORWARD=213.165.59.253

212.138.113.13 "ACEZ" VIA=1.1 proxy2 (NetCache NetApp/5.3.1R4), 1.0 cache3.ruh FORWARD=213.165.59.253

212.138.47.17 "ACEZ" VIA=1.1 proxy2 (NetCache NetApp/5.3.1R4), 1.0 cache7.ruh FORWARD=213.165.59.253

212.138.47.14 "ACEZ" VIA=1.1 proxy2 (NetCache NetApp/5.3.1R4), 1.0 cache4.ruh FORWARD=213.165.59.253

Here's a sample of how it appears in a log file before a page is read:
212.138.113.13 - "GET /page1.html" "-" "ACEZ"
212.138.113.13 - "GET /page1.html" "" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
Any additional info would be nice if anyone knows anything about it.
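If you want to look for the same pattern in your own logs, here's a rough sketch that pairs each "ACEZ" hit with the page view that immediately follows it from the same IP for the same path. It assumes the simplified log layout shown in the sample above (IP, request, referrer, user agent); the regular expression and function name are my own illustration and would need adjusting for a real access log format.

```python
import re

# Matches the simplified sample lines above:
#   IP - "GET /path" "referrer" "user agent"
LINE = re.compile(
    r'^(?P<ip>\S+) - "GET (?P<path>[^"\s]+)\s*" '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def acez_pairs(lines):
    """Return (ip, path, agent) for each page view directly preceded by an
    ACEZ request from the same IP for the same path."""
    pairs = []
    previous = None
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            previous = None  # unparseable line breaks the sequence
            continue
        hit = match.groupdict()
        if (previous is not None
                and previous["agent"] == "ACEZ"
                and hit["agent"] != "ACEZ"
                and hit["ip"] == previous["ip"]
                and hit["path"] == previous["path"]):
            pairs.append((hit["ip"], hit["path"], hit["agent"]))
        previous = hit
    return pairs


log = [
    '212.138.113.13 - "GET /page1.html" "-" "ACEZ"',
    '212.138.113.13 - "GET /page1.html" "" '
    '"Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"',
]

for ip, path, agent in acez_pairs(log):
    print(ip, path, agent)
```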

Evolving Stealth Bots

Just in the last few weeks I've been seeing some really odd hits on robots.txt from things claiming to be browsers, loading images like a browser, the whole nine yards.

Bot #1 - Post-crawl Robots.txt Reader

What I'm seeing is that instead of looking at robots.txt upfront, which is a trigger to shut down a bot, robots.txt gets read after one or two pages are read. That way, they can snoop my robots.txt file without doing it first, which avoids being stopped while they collect a safe page or two to find out what my pages are for future crawls.

That's my theory and I wouldn't have considered this the case except I've seen the exact same behavior multiple times.

Time to start setting some new traps and see who crawls with the information gathered from these probes.
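One way such a trap could work, as a minimal sketch: group hits by IP in time order and flag any IP whose first robots.txt request comes after it has already pulled other pages. The function names are hypothetical and the IPs in the sample are documentation addresses, not real visitors.

```python
def reads_robots_late(paths):
    """True if robots.txt is requested, but only after at least one other path."""
    for position, path in enumerate(paths):
        if path == "/robots.txt":
            return position > 0
    return False  # never asked for robots.txt at all


def late_robots_readers(hits):
    """hits: iterable of (ip, path) tuples in time order.
    Returns the set of IPs that fetched pages first and robots.txt afterwards."""
    paths_by_ip = {}
    for ip, path in hits:
        paths_by_ip.setdefault(ip, []).append(path)
    return {ip for ip, paths in paths_by_ip.items() if reads_robots_late(paths)}


# Made-up sample traffic for illustration only.
hits = [
    ("192.0.2.10", "/"),
    ("192.0.2.20", "/robots.txt"),   # polite: robots.txt first
    ("192.0.2.10", "/about.html"),
    ("192.0.2.20", "/"),
    ("192.0.2.10", "/robots.txt"),   # snooping after two pages
]

print(late_robots_readers(hits))
```

This only looks at ordering; it says nothing about whether the visitor claimed to be a browser, so in practice it would be combined with a user agent check.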

Bot #2 - 3 Phase Crawler

Next on the list is a stealth bot that looks like it's either taking a screen shot on the first page or downloading images just to try and trick my software into thinking it's human.

This beast does the following:

  1. Reads robots.txt with a blank user agent string
  2. Loads the home page as Linux Firefox and downloads all associated images which appears to be taking a screen shot
  3. Crawls the rest of the pages on the site disguised as Internet Explorer
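
The three phases above leave a recognizable fingerprint: one IP, a blank user agent on robots.txt, then two or more different browser identities. Here's a rough sketch of a check for that, assuming hits are available as (ip, path, user agent) tuples; the function name, the sample IP, and the exact Linux Firefox string are illustrative assumptions, not taken from my logs.

```python
BLANK_AGENTS = ("", "-")


def looks_like_three_phase(hits, ip):
    """True if this IP read robots.txt with a blank user agent and also
    presented at least two different non-blank user agents."""
    blank_robots = any(
        hit_ip == ip and path == "/robots.txt" and agent in BLANK_AGENTS
        for hit_ip, path, agent in hits
    )
    named_agents = {
        agent for hit_ip, _path, agent in hits
        if hit_ip == ip and agent not in BLANK_AGENTS
    }
    return blank_robots and len(named_agents) >= 2


# Illustrative user agent strings (assumed, not copied from the real bot).
LINUX_FIREFOX = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.6) "
                 "Gecko/20060728 Firefox/1.5.0.6")
WINDOWS_IE = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

hits = [
    ("192.0.2.30", "/robots.txt", "-"),        # phase 1: blank agent
    ("192.0.2.30", "/", LINUX_FIREFOX),        # phase 2: home page + images
    ("192.0.2.30", "/logo.gif", LINUX_FIREFOX),
    ("192.0.2.30", "/page1.html", WINDOWS_IE), # phase 3: crawl as IE
    ("192.0.2.40", "/", WINDOWS_IE),           # ordinary visitor
]

print(looks_like_three_phase(hits, "192.0.2.30"))
print(looks_like_three_phase(hits, "192.0.2.40"))
```

A shared proxy can legitimately show several user agents from one IP, so a match here is a reason to look closer, not proof on its own.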
Bot #3 - Blank User Agent Probe

Here's something amusing with what appears to be a Ukrainian spider that downloaded an image on my website, linked from another site, as Internet Explorer and 4 seconds later hit robots.txt as an anonymous user agent.
82.207.93.90 - [12:20:47] "GET /banner.gif" "http://www.someotherwebsite.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 4.0)"

82.207.93.90 - [12:20:51] "GET /robots.txt" "-" "-"
This may be related to Bot #2 above, not sure, but I've seen a few hits like this where they follow the link and peek to see what's allowed and don't go any further.

Very odd.

Tuesday, August 15, 2006

Link Checkers Don't Understand

Having a few conversations that are going nowhere with some link checker sites.

ME: "Sorry but I had to block your link checker as you're never going to find what you want as I can't allow any of you to crawl 40K pages. Would you mind just telling me what you want to find and I can tell you exactly where it is?"

Link Checkers: "Just point us to your links page with robots.txt"

ME: "The whole site is links, it's a directory, and robots.txt is EXCLUSION only, not INCLUSION, so I can't tell you where to crawl only where NOT to crawl which is impractical with 40K pages anyway."

Link Checkers: "We stop after X pages anyway."

ME: "You're still wasting my bandwidth as the odds of finding what you're looking for in the top level pages is real slim. How about telling me who you want in the referrer field and I'll just redirect your crawler to the exact page you need."

Link Checkers: "Error, does not compute, too logical, error, error, erroooooooorrrrrr...."

So there you have my current state of impasse with the link checking community.

As soon as they can come up with a compromise I'll unblock them, but until then NADA PAGE!
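
To show what I mean by EXCLUSION, here's a small sketch using Python's standard `urllib.robotparser`. The rules and the "ExampleLinkChecker" name are invented for the example. The original exclusion standard only gives you `User-agent` and `Disallow` lines, so all the file can express is where NOT to go; everything else is fair game by default.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration: one rule, and it only excludes.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Anything under /private/ is excluded...
blocked = parser.can_fetch("ExampleLinkChecker", "/private/report.html")
# ...and every other path is allowed simply because nothing excludes it.
allowed = parser.can_fetch("ExampleLinkChecker", "/category/page1.html")

print(blocked, allowed)
```

Pointing a crawler at one specific page this way would mean writing a Disallow line for every other page, which is exactly the impractical part with 40K pages.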

FIRST LOOK - GenericBot-ax 0.85 at SurfControl

It's always cool to have an EXCLUSIVE on a new bot caught fresh in the traps this morning.

This little beast was crawling from SurfControl's IP range:

195.244.16.1 "GenericBot-ax 0.85"
Here's the 411 on the IP address:
inetnum: 195.244.16.0 - 195.244.17.255
netname: SURFCONTROL
descr: SurfControl PLC
country: GB
e-mail: karl.jones@surfcontrol.com
Didn't ask for robots.txt and asked for the home page 3 times in a row, about a minute apart.

What they didn't expect was their SurfControl met MY surf control and they got a swift kick in the ass.

NO DATA FOR YOU!

Buh bye.

Multiple Scrape Attempts from Google IPs?

OK, if anyone can shed any light on this it would be nice. Web Accelerator maybe?

Had a batch of "Avant Browser" requests, none got answered because of this SNAFU request early on that tripped the bot trap, yet they just kept coming:

64.233.173.89 - "GET /#top" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Avant Browser; Avant Browser; .NET CLR 1.0.3705)"
Google didn't even respond properly to reverse DNS, sloppy shit:
nslookup 64.233.173.89

** server can't find 89.173.233.64.in-addr.arpa: NXDOMAIN
But it's certainly a Google IP:
whois 64.233.173.89

OrgName: Google Inc.
OrgID: GOGL
Address: 1600 Amphitheatre Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US

NetRange: 64.233.160.0 - 64.233.191.255
Then look at THIS one also from Google, what the hell?
72.14.194.19 - "GET /robots.txt" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6"
Same reverse DNS problem:
nslookup 72.14.194.19

Non-authoritative answer:

*** Can't find 19.194.14.72.in-addr.arpa.: No answer
Just to make sure it wasn't my servers, I checked DNSSTUFF.com, same result.

Yet, it's Google:
whois 72.14.194.19

OrgName: Google Inc.
OrgID: GOGL
Address: 1600 Amphitheatre Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US

NetRange: 72.14.192.0 - 72.14.255.255
OK, someone from Google got a clue what in the hell is going on?

Anyone?

This is unacceptable whatever it is!
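
For anyone repeating these checks, here's a minimal sketch with Python's standard `ipaddress` module that needs no network access: it builds the in-addr.arpa name that nslookup was asking for, and tests whether an IP falls inside the NetRange that whois reported. The helper names are my own; the IPs and ranges are the ones quoted above.

```python
import ipaddress


def reverse_name(ip: str) -> str:
    """Return the PTR lookup name for an IP, e.g. the in-addr.arpa form."""
    return ipaddress.ip_address(ip).reverse_pointer


def in_net_range(ip: str, first: str, last: str) -> bool:
    """True if ip lies within the inclusive range first..last."""
    address = ipaddress.ip_address(ip)
    return ipaddress.ip_address(first) <= address <= ipaddress.ip_address(last)


print(reverse_name("64.233.173.89"))
print(in_net_range("64.233.173.89", "64.233.160.0", "64.233.191.255"))
print(in_net_range("72.14.194.19", "72.14.192.0", "72.14.255.255"))
```

This only confirms the arithmetic on the published ranges; it doesn't do the actual DNS or whois queries, which still need nslookup and whois as shown.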

Monday, August 14, 2006

Another Yahoo Proxy Hijacking

Since our old buddy John thinks I have a bad attitude about proxy sites and that they shouldn't be blocked, we'll use him as an example and replace the actual data found in Yahoo with John's website.

John, how would you like your site being Hijacked in Yahoo like this?

  1. ... Yahoo has crawled via proxy IP 74.52.14.138 to hijack your site John, deal with it!
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) ...
    gizliweb.com/g/o.web/010010A/http:/www.johnon.com
I doubt this will change his mind about proxies but those Google Ads on the top of his page sure look pretty!

Spam Gilad, It's All His Fault

As the website proclaims:

Gilad, this is all your fault!!!

Technically, just leaving the unattended guestbook online full of nothing but spam was Eli's fault.

This wouldn't even be so amusing except the guestbook is full of spam links pointing to our favorite scraper malware sites.

Take a peek in Yahoo to see the scope of the spamming just for one of the domains such as xanax-shop.info.

I let Yahoo know about the list of dirty scraper malware dogs last week so it will be amusing to see how long they remain in the index, especially since many of the domains in the list try to do harm to surfers.

Thanks to Olliver for pointing out this link titled GooglePray over at Spamhuntress' site.