Friday, April 11, 2008

Project Rialto's PRCrawler Is Data Mining?

Since I whitelist allowed bots, I've had Project Rialto blocked from the beginning, but I was curious what they were doing because they first showed up on my radar on 01/23/2008 and kept coming back over and over.

From one of their job ads:

We are designing high-performance algorithms and developing reliable, fault-tolerant and scalable real-time systems that can handle massive volume of data for in-depth analysis of user behavior to enable targeted advertising.

and...

Research and investigate academic and industrial data mining, machine learning and modeling techniques to apply to our specific business case

Oh boy!

It appears they want to crawl our sites and use that information to shove more ads in our face.

Somehow, I don't think so...

If you're going to mine data, shouldn't you get the URLs right?

The site they're attempting to "mine" is on a Linux box, where URLs are case sensitive, and my URLs all have upper and lower case in them. Yet PRCrawler only asks for those URLs in all lower case, so even if I let them crawl my site they'd get nothing but 404s.
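
To make the case problem concrete, here is a minimal Python sketch of how a server-side log check could spot requests that only miss because of case. The paths in KNOWN_PATHS and the function name classify_request are made up for illustration; they are not from my site or from any real tool.

```python
# Hypothetical set of real paths on a site with mixed-case URLs.
KNOWN_PATHS = {"/Widgets/BlueWidget.html", "/About/ContactUs.html"}

# Map each lowercased path back to its real, mixed-case spelling.
LOWER_TO_REAL = {p.lower(): p for p in KNOWN_PATHS}


def classify_request(path):
    """Return 'ok', 'wrong-case', or 'unknown' for a requested path."""
    if path in KNOWN_PATHS:
        # Exact, case-sensitive match: the page exists.
        return "ok"
    if path.lower() in LOWER_TO_REAL:
        # Matches only when case is ignored: a 404 on a case-sensitive server.
        return "wrong-case"
    return "unknown"
```

A crawler asking for /widgets/bluewidget.html would land in the "wrong-case" bucket, which is exactly the pattern described above.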

No wonder their home page says they're a "stealth company" because I'd hide too if I couldn't even get the proper case of the URLs right.

Here's their user agent:
"PRCrawler/Nutch-0.9 (data mining development project; crawler@projectrialto.com)"

They operate from the following IPs:
64.47.51.153
64.47.51.158
67.202.0.157
67.202.0.17
67.202.0.71
67.202.10.65
67.202.18.229
67.202.29.20
67.202.3.112
67.202.3.141
67.202.3.151
67.202.56.219
67.202.58.214
67.202.59.117
67.202.62.162
67.202.62.45
72.44.36.20
72.44.36.8
72.44.37.72
72.44.39.55
The first two were from masergy.com; the rest are all from compute-1.amazonaws.com.
host-64-47-51-153.masergy.com.
host-64-47-51-158.masergy.com.

I haven't seen anything from masergy.com since the initial contact, but that was only 2 months ago, so who knows.
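
For anyone who blacklists instead of whitelisting, here is a rough Python sketch of a check against the user agent and the IPs listed above. The function name is_blocked and the constant names are my own invention for illustration, and this is not how my own whitelist works. Keep in mind the Amazon addresses are cloud addresses that can be handed to a different customer later, so treat a hard-coded IP list as a stopgap.

```python
import ipaddress

# Substring to look for in the User-Agent header (compared in lower case).
BLOCKED_UA_TOKENS = ("prcrawler",)

# The addresses listed in this post.
BLOCKED_IPS = {ipaddress.ip_address(ip) for ip in (
    "64.47.51.153", "64.47.51.158",
    "67.202.0.157", "67.202.0.17", "67.202.0.71", "67.202.10.65",
    "67.202.18.229", "67.202.29.20", "67.202.3.112", "67.202.3.141",
    "67.202.3.151", "67.202.56.219", "67.202.58.214", "67.202.59.117",
    "67.202.62.162", "67.202.62.45",
    "72.44.36.20", "72.44.36.8", "72.44.37.72", "72.44.39.55",
)}


def is_blocked(remote_addr, user_agent):
    """Return True if the request matches the blocked user agent or a listed IP."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return True
    try:
        return ipaddress.ip_address(remote_addr) in BLOCKED_IPS
    except ValueError:
        # Not a parseable IP address; nothing to match against the list.
        return False
```

Checking the user agent first catches the crawler even if it moves to a new address, as long as it keeps announcing itself honestly.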

Don't know where they primed the pump for their data mining operation since they already had lots of information about my site when they attempted to crawl, but since it was all lower case it was completely useless.

I'm just curious whether they got my URLs from somewhere already in lower case or someone there slapped a tolower() around a line of code when importing the URLs into Nutch.
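
I have no idea what their import code looks like, so this is only a sketch of the difference between lowercasing a whole URL and normalizing it safely. The scheme and host name are case insensitive, but the path is not on a server like mine. The function name normalize_url and the example.com URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase only the scheme and host; leave path, query and fragment alone.

    Assumes the URL carries no user:password@ part, since that would be
    lowercased along with the host here.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,       # case sensitive on many servers, so untouched
        parts.query,
        parts.fragment,
    ))


url = "HTTP://Example.COM/Widgets/BlueWidget.html"
safe = normalize_url(url)   # path case preserved
broken = url.lower()        # the suspected tolower(): path case destroyed
```

The second form is the kind of thing that would turn every mixed-case URL into a 404.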

Don't know, don't care, it's amusing either way.

Good luck with Project Rialto, you're going to need it.

2 comments:

Anonymous said...

Just in case you didn't know, Incredibill, these, alongside Phorm, NebuAd, Front Porch, and ADzilla, have all been using Deep Packet Inspection/Interception for profit and, because of the way they data mine without consent, are potentially committing "commercial" copyright infringement.

see the Cable forum thread
http://www.cableforum.co.uk/board/12/33628733-virgin-media-phorm-webwise-adverts-updated-page-651.html#post34580706

and Alexanders Say NO! to Deep Packet Inspection
https://nodpi.org/

That is currently focused on Phorm, given that the UK "Phorm Storm" is the focus right now, but it will eventually cover all these different DPI-for-profit companies around the world.

Anonymous said...

Thanks for the info. I have also noticed strange behavior on my site from this user agent. Somehow they have gotten the URLs for sections of my site that require logging in. Seems very shady; I will also be blocking their IP addresses.