Saturday, November 04, 2006

Hunting PicScout, the Copyright Crawler Getty Uses

Everyone knows Getty Images uses PicScout, but nobody seems to know anything about PicScout's crawler: no user agent information, no IPs where they crawl from, nothing. When someone asked me if I knew anything about them, I did a little research and found nothing related ANYWHERE, not even anything initially obvious in my bot blocker log files. Based on my initial observations, PicScout actually seemed to be hiding better than all the other corporate crawlers I've researched to date, but maybe we can shed some light on this.

Not that I advocate copyright violation. As a matter of fact, I'm a staunch copyright defender.

However, attempting to crawl under the radar, refusing to honor robots.txt files or identify your bot in any fashion, and bypassing website security measures gets under my skin more than anything, so I picked up the gauntlet and tried to find signs of PicScout activity.

After the usual simple research methods failed, I decided to start by seeing where they were hosted.

host picscout.com
picscout.com has address 82.80.254.37

host 82.80.254.37
37.254.80.82.in-addr.arpa domain name pointer bzq-80-254-37.dcenter.bezeqint.net.
Ah ha!
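
For anyone who wants to repeat that reverse lookup from a script instead of the host command, here is a minimal Python sketch. The helper names reverse_pointer_name and lookup_ptr are my own, and the live lookup returns whatever your resolver says today, which may no longer match what I saw.

```python
import ipaddress
import socket


def reverse_pointer_name(ip: str) -> str:
    """Build the in-addr.arpa name that a PTR query for `ip` asks about."""
    return ipaddress.ip_address(ip).reverse_pointer


def lookup_ptr(ip: str):
    """Ask the local resolver for the PTR host name; None if there is none.

    Needs network access, so the result depends on current DNS data.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        return None


# The address picscout.com resolved to in the lookup above.
print(reverse_pointer_name("82.80.254.37"))
```

Calling lookup_ptr("82.80.254.37") performs the actual query; the first function only shows which name gets queried.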

I remember a rash of activity I shut down from bezeqint.net a while back so I looked a little deeper into this angle.
inetnum: 82.80.248.0 - 82.80.255.255
netname: BEZEQINT-HOSTING
descr: BEZEQINT-HOSTING
country: IL
Ah yes, they're the guys from Israel that were hammering one of my servers.
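
If you want to work with that inetnum range in a script, it helps to turn it into CIDR form first. This is only a sketch using Python's standard ipaddress module; the names range_to_cidrs and in_range are hypothetical helpers of mine.

```python
import ipaddress


def range_to_cidrs(first: str, last: str):
    """Convert a whois-style 'first - last' range into CIDR networks."""
    return list(
        ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)
        )
    )


def in_range(ip: str, networks) -> bool:
    """True if `ip` falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# The BEZEQINT-HOSTING range from the whois output above.
BEZEQINT_HOSTING = range_to_cidrs("82.80.248.0", "82.80.255.255")

print([str(net) for net in BEZEQINT_HOSTING])
print(in_range("82.80.254.37", BEZEQINT_HOSTING))
```

The whole range collapses into a single /21, which is a lot easier to paste into a block list than eight separate /24s.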

I found a high volume of crawling from these IPs that was trapped by the bot blocker automatically and never answered the challenges, so it was definitely bot traffic.
82.80.249.195
82.80.249.196
82.80.249.197
82.80.249.201
82.80.249.202
82.80.249.203
82.80.249.204
82.80.252.130
These IPs have only been spotted using the following two user agents:
Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)
My theory is that this is PicScout attempting to crawl under the radar.

Check your logs, people. See if you have any activity in this range; I think it's them.
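
Here is a rough Python sketch of one way to do that check. It assumes your access log puts the client IP first on each line, as the common Apache formats do. The function name count_suspect_hits is mine, and the sample lines are made up for illustration (invented paths, no real timestamps); only the IPs come from my list above.

```python
import ipaddress
from collections import Counter

# 82.80.248.0 - 82.80.255.255 (BEZEQINT-HOSTING) written as one CIDR block.
SUSPECT_NET = ipaddress.ip_network("82.80.248.0/21")


def count_suspect_hits(log_lines, network=SUSPECT_NET):
    """Count requests per client IP for IPs inside `network`.

    Assumes the first whitespace-separated field of each line is the IP.
    Lines that do not start with a valid IP are skipped.
    """
    hits = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue
        if addr in network:
            hits[str(addr)] += 1
    return hits


# Hypothetical sample lines, not copied from any real log.
sample = [
    '82.80.249.195 - - "GET /images/a.jpg HTTP/1.1" 200',
    '82.80.249.195 - - "GET /images/b.jpg HTTP/1.1" 200',
    '82.80.252.130 - - "GET /index.html HTTP/1.1" 200',
    '192.0.2.10 - - "GET /index.html HTTP/1.1" 200',
    'not a log line',
]

for ip, count in count_suspect_hits(sample).most_common():
    print(ip, count)
```

To run it against a real file, pass an open file object instead of the sample list, since the function only needs something it can iterate line by line.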

I would just block this range out of principle at this point, as the IPs crawling from it aren't honoring any internet standards, and if it is PicScout, blocking them could possibly save you a massive chunk of money if some web designer used stolen images building your website.

UPDATE:

After posting this the fine people from PicScout visited the blog and revealed more information about their facilities.

The log showed this visit:
Host Name mail.picscout.com
IP Address 62.0.8.2
Country Israel
ISP Nv-picscout
The information I found from that, including another IP block, is here:
inetnum: 62.0.8.0 - 62.0.8.255
netname: NV-PICSCOUT
descr: NV-PICSCOUT
country: IL
admin-c: OG570-RIPE
tech-c: NN105-RIPE
status: ASSIGNED PA
mnt-by: NV-MNT-RIPE
mnt-lower: NV-MNT-RIPE
source: RIPE # Filtered
So, there are a few more IPs you might want to block, but I doubt they're scanning from the office.

UPDATE: Caught Getty keeping an eye on everyone today.

My blog log showed this:
Time: 12th June 2007 12:24:53 PM
Host Name outbound.gettyimages.com
IP Address 206.28.72.1
Country United States
Region Washington
City Seattle
ISP Getty Images
Referrer: http://www.webproworld.com/graphics-design-discussion-forum/56384-invoiced-getty-images-unlawful-use-images.html

It appears they were snooping on WebProWorld and followed the link here. The user agent claimed to be MSIE 6.0 but it's possibly an automated crawler, hard to say.

Anyway, we're watching you watch us, it works both ways.

Monday, October 30, 2006

Net::Trackback Rocks D-Block

Why is it that every time someone puts some code out on the net like Net::Trackback, some asshole downloads it and then aims their new creation at my server?

This is where they attempted to hammer my server this morning:

209.9.169.66 [209-9-169-66.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.78 [209-9-169-78.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.67 [209-9-169-67.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.70 [209-9-169-70.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.71 [209-9-169-71.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.69 [209-9-169-69.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.68 [209-9-169-68.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.72 [209-9-169-72.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.73 [209-9-169-73.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.75 [209-9-169-75.sdsl.cais.net.] "Net::Trackback/1.01"
209.9.169.74 [209-9-169-74.sdsl.cais.net.] "Net::Trackback/1.01"
Of course they got nothing but error messages for their troubles, but this is still.... BULLSHIT!
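
If you keep a list in the same shape as the one above (IP, reverse DNS in brackets, user agent in quotes), a short Python sketch like this can group the IPs by user agent so you can see at a glance who is using what. The layout is just how I pasted the lines here, not a standard log format, and the names parse_line and ips_by_agent are my own.

```python
import ipaddress
import re
from collections import defaultdict

# Matches: <ip> [<reverse dns>] "<user agent>"
LINE_RE = re.compile(r'^(\S+) \[([^\]]*)\] "([^"]*)"$')


def parse_line(line):
    """Return (ip, rdns, agent) for a matching line, else None."""
    match = LINE_RE.match(line.strip())
    return match.groups() if match else None


def ips_by_agent(lines):
    """Map each user agent to its IPs, sorted numerically."""
    grouped = defaultdict(set)
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            ip, _rdns, agent = parsed
            grouped[agent].add(ip)
    return {
        agent: sorted(ips, key=ipaddress.ip_address)
        for agent, ips in grouped.items()
    }


# Two of the lines from the list above.
sample = [
    '209.9.169.78 [209-9-169-78.sdsl.cais.net.] "Net::Trackback/1.01"',
    '209.9.169.66 [209-9-169-66.sdsl.cais.net.] "Net::Trackback/1.01"',
]

for agent, ips in ips_by_agent(sample).items():
    print(agent, ips)
```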

Can't even research the source as ARIN.NET's website won't load at this moment and CAIS.NET never responds to WHOIS inquiries and just hangs like this:
[Querying whois.arin.net]
[Redirected to rwhois.cais.net:4321]
[Querying rwhois.cais.net]
Never got a response...

Bunch of BULLSHIT, that's what this is!

Sunday, October 29, 2006

Hand Spammers Waving the White Flag?

Ever since I implemented techniques to automatically moderate hand spammers (aka Indian SEOs), they seem to have noticed they aren't getting through and have gone away. The first couple of weeks it didn't seem like they were slowing down at all, but they were moderated at least, so nobody else saw them. Then I made some other changes in how I handle spammers who still do it by hand, and suddenly they are just gone.

Before the last few changes I easily had about 10 hand spams a day getting trapped as moderated posts; now nothing moderated has shown up for over a week.

Did they just give up?

We shall see, but this is promising!