Sunday, November 12, 2006

Here a Nutch, There a Nutch, Everywhere a Nutch Nutch

Nutch usage seems to be breeding faster than cousins in Kentucky, so I figured it was time to post a sequel to the original "How Much Nutch is Too Much Nutch".

Here's a complete breakdown of every IP I've seen running Nutch with the actual word Nutch in the user agent, for a grand total of 190 IPs crawling to date. Several of them, like Cazoodle, MQBOT, and a few .EDUs, are crawling from a block of IPs, but the majority seem to be scattered all over the place.
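One rough way to see which crawlers come from a block is to bucket the addresses by network and count them. This is only a sketch: the /24 prefix is my assumption (the real allocations may be larger or smaller), the sample addresses are a handful copied from the list in this post, and `block_of` and `count_blocks` are made-up helper names.

```python
from collections import Counter
import ipaddress

# A few addresses copied from the list in this post.
SAMPLE_IPS = [
    "192.17.240.19", "192.17.240.20", "192.17.240.41",  # MQBOT
    "220.130.191.231", "220.130.191.232",               # Cazoodle
    "128.208.6.125", "128.208.6.200",                   # Nutch running at UW
    "24.222.153.250",                                   # a lone crawler
]


def block_of(ip: str, prefix: int = 24) -> str:
    """Return the enclosing network for ip, e.g. '192.17.240.0/24'.

    The /24 default is an assumption, not the real allocation size.
    """
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def count_blocks(ips, prefix: int = 24) -> Counter:
    """Count how many distinct addresses fall into each block."""
    return Counter(block_of(ip, prefix) for ip in set(ips))
```

Calling `count_blocks(SAMPLE_IPS).most_common()` puts the busiest blocks first, so the clustered crawlers float to the top and the scattered one-offs sink to the bottom.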

Here's the list of all the creepy crawling Nutches:

124.32.246.36 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

124.32.246.45 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

128.208.3.173 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; raphael@unterreuth.de)

128.208.6.125 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.200 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.207 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.226 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.227 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.232 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.75 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.208.6.77 NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)

128.95.1.189 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

128.97.88.68 ilial/Nutch-0.9-dev

128.97.88.70 ilial/Nutch-0.9-dev

129.242.19.138 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

129.34.20.19 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

129.78.64.106 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.137.86 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.202 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.205 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.206 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.211 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.212 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

13.1.139.213 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.125.102 asked/Nutch-0.8 (web crawler; http://asked.jp; epicurus at gmail dot com)

131.112.125.103 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.125.104 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.125.106 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.112.16.220 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

131.211.84.21 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

140.247.62.79 blogsearch/Nutch-0.9-dev

140.247.62.80 blogsearch/Nutch-0.9-dev

147.202.90.2 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

159.226.5.82 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.201 ilial/Nutch-0.9-dev

164.67.195.245 ilial/Nutch-0.9-dev

164.67.195.26 ilial/Nutch-0.9-dev

164.67.195.27 ilial/Nutch-0.9-dev

164.67.195.67 ilial/Nutch-0.9-dev

164.67.195.68 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

164.67.195.86 ilial/Nutch-0.9-dev

166.214.93.76 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

192.17.240.19 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.20 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.41 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.43 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.44 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.46 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.47 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.48 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.52 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.56 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.57 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.58 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.60 MQBOT/Nutch-0.9-dev (MQBOT Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.71 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.74 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

192.17.240.76 MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://falcon.cs.uiuc.edu; mqbot@cs.uiuc.edu)

193.145.45.68 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.117 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.118 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.119 HouxouCrawler/0.8-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

193.203.240.120 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.121 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.203.240.122 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.252.148.51 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

193.42.229.3 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

195.72.131.70 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

195.72.131.72 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

195.72.131.73 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

195.72.131.80 HouxouCrawler/Nutch-0.8.2-dev (houxou.com's nutch-based crawler which serves special interest on-line communities; http://www.houxou.com/crawler; crawler at houxou dot com)

203.113.130.205 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.147.0.44 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.199.83.162 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

203.244.218.1 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

207.176.224.241 Nutch/Nutch-0.8.1

207.176.224.245 Nutch/Nutch-0.8.1

207.214.93.42 MyNutch/V 0.3 (JP's Nutch Test Search Engine; jpnutch at yahoo dot com)

208.64.57.65 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.174.3.130 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.196.73.193 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

210.245.31.15 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

210.245.31.18 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

211.152.34.34 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.101.97.63 test/Nutch-0.8.1 (test; www.apache.org; test@apache.org)

212.12.114.238 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.137.33.140 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.156.230.210 BilgiBetaBot/0.8-dev (bilgi.com (Beta) ; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

212.58.116.72 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.132.175.101 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.157.204.141 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

213.251.133.12 Misterbot-Nutch/0.7.1 (Misterbot-Nutch; http://www.misterbot.fr; nutch at misterbot.fr)

216.182.225.186 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)

216.182.236.46 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)

216.182.237.45 NutchEC2Test/Nutch-0.9-dev (Testing Nutch on Amazon EC2.; http://lucene.apache.org/nutch/bot.html; ec2test at lucene.com)

216.93.185.12 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

217.153.59.26 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

217.31.51.128 Megatext/Nutch-0.8.1 (Beta; http://www.megatext.cz/; microton@microton.cz)

218.25.39.81 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

220.130.191.231 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.232 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.233 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.234 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.235 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.236 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.237 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.238 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.239 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

220.130.191.240 Cazoodle/Nutch-0.9-dev (Cazoodle Nutch Crawler; http://www.cazoodle.com; mqbot@cazoodle.com)

221.114.253.210 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.116.237.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

221.221.237.35 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

222.173.249.33 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

24.222.153.250 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

24.6.168.184 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)

58.186.61.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.187.12.236 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.215.74.242 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.215.75.2 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

58.87.139.90 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

59.160.240.115 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

59.160.240.116 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

59.160.240.183 Nutch-test/Nutch-0.9-dev

59.160.240.184 Nutch-test/Nutch-0.9-dev

59.160.240.185 Nutch-test/Nutch-0.9-dev

59.176.10.136 NutchCVS/0.01-beta (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

60.248.9.114 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

61.135.151.175 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

62.129.132.47 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.168.188.151 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.40.33.173 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

62.40.36.87 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

63.133.162.98 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

63.246.7.209 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.105.36.210 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.241.242.18 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.242.88.10 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.242.88.60 NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

64.34.172.78 BurstFind Crawler 1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; crawler@burstfind.com)

64.34.180.167 Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.38.10.26 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

64.71.164.125 Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; http://www.krugle.com/crawler/info.html; webcrawler@krugle.com)

65.220.67.9 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

65.92.160.39 JLA/Nutch-0.8.1 (beta; http://dynamic.com/index.htm; info at test.com)

66.132.240.180 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.132.249.23 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.15.68.234 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.207.120.226 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

66.243.31.34 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

67.111.28.139 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

67.184.246.61 Nutch/Nutch-0.8 (Nutch Test; none; none)

67.52.101.242 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.178.171.109 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)

68.178.202.79 test/Nutch-0.8.1 (Test robot; http://test.com; info at test.com>)

68.205.124.164 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.205.127.94 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

68.97.222.117 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.248.26.83 Comrite/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.36.233.8 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

69.55.233.28 Argus/1.1 (Nutch; http://www.simpy.com/bot.html; feedback at simpy dot com)

70.143.79.234 JPNutchTest/Nutch-0.9-dev-JP-0.1 (JP Nutch Test; jpnutch at yahoo dot com)

70.197.81.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.56.66.216 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.90.188.18 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

70.96.99.254 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.216.0.210 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.217.33.149 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.241.153.125 NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

71.35.163.79 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.0.207.162 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.2.25.66 abcxyz/Nutch-0.8 (nutchtesting; nutch; abc@xyz.com)

72.2.25.67 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

72.2.25.71 Nutch/Nutch-0.8

72.5.173.22 sdcresearchlabs-testbot/Nutch-0.9-dev (www.shopping.com/bot.html; researchbot@shopping.com)

72.51.37.148 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

72.84.30.230 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

75.44.225.44 NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)

81.173.148.94 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

81.173.155.210 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

81.203.142.109 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

81.93.168.211 TRankBot/Nutch-0.8.1 (T-Rank AS; http://www.trank.no/; robot at trank dot no)

83.246.79.28 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

84.191.111.92 NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

84.231.72.32 agent/Nutch-0.8 (http://lucene.apache.org/nutch/bot.html)

84.231.74.47 nutch/Nutch-0.8.1

85.117.62.114 NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

85.18.14.22 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

87.139.106.60 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

88.191.23.109 NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)

Wasn't that fascinating reading?

This is some crazy shit that's almost like a DoS attack by non-stop web crawlers, and I suspect it will get even worse as more people try to mine the Internet for free money.

Load up the firewall and your .htaccess filters with protection and brace for impact.
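Whatever filter you use, the match probably wants to be case-insensitive. At least one agent in the list above (the HouxouCrawler/0.8-dev one) only spells it "nutch" in lowercase. Here's a rough Python sketch of the test itself, not an actual .htaccess or firewall rule. `is_nutch` is a made-up helper name, and the last sample string is an ordinary browser agent I added for contrast; it is not from my logs.

```python
import re

# Case-insensitive, so a lowercase "nutch" is caught as well.
NUTCH_RE = re.compile(r"nutch", re.IGNORECASE)


def is_nutch(user_agent) -> bool:
    """Return True if the user agent mentions Nutch in any letter case."""
    return bool(NUTCH_RE.search(user_agent or ""))


SAMPLES = [
    # The first three are copied from the list in this post.
    "NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; "
    "nutch-agent@lucene.apache.org)",
    "HouxouCrawler/0.8-dev (houxou.com's nutch-based crawler which serves "
    "special interest on-line communities; http://www.houxou.com/crawler; "
    "crawler at houxou dot com)",
    "Argus/1.1 (Nutch; http://www.simpy.com/bot.html; feedback at simpy dot com)",
    # Assumption: a generic browser agent for contrast, not from the list.
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0",
]

for ua in SAMPLES:
    print(is_nutch(ua), ua[:40])
```

A plain case-sensitive check for "Nutch" would let the second sample slip through, which is why the sketch ignores case.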

2 comments:

Anonymous said...

Hi Bill:

What's your experience been of Nutch's respect for robots.txt?

If this has been positive,

User-agent: Nutch
Disallow: /

would be an efficient add.

As would adding the Nutch UA to your modsecurity rules, like:

SecFilterSelective HTTP_USER_AGENT "Nutch"

IncrediBILL said...

Nutch does seem to honor robots.txt but I'm running wide open just to see what's trying to crawl.

If I drop them in the filter rules I can't track who they are and what they do as they would just bounce off the server.

Therefore, I let crawlers access robots.txt and then when they try to crawl the site they get a single page (with no links to follow) telling them they aren't permitted.

It's not optimal and not standard, but it allows me to gather technical information and still stop the crawling.
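For anyone curious, the general idea can be sketched as a tiny Python WSGI app. This is not my actual setup, just an illustration under assumptions: the 403 status, the wording of the page, the wide-open robots.txt body, and the names `app`, `ROBOTS_TXT`, and `BLOCKED_PAGE` are all made up for the example.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler-trap")

NUTCH_RE = re.compile(r"nutch", re.IGNORECASE)

# Assumption: a wide-open robots.txt, so crawlers show themselves.
ROBOTS_TXT = b"User-agent: *\nDisallow:\n"

# A dead-end page: no links, so there is nothing to follow.
BLOCKED_PAGE = (
    b"<html><body><p>Automated crawling of this site is not permitted."
    b"</p></body></html>"
)


def app(environ, start_response):
    """Serve robots.txt to everyone; give Nutch agents one link-free page."""
    path = environ.get("PATH_INFO", "/")
    agent = environ.get("HTTP_USER_AGENT", "")

    if path == "/robots.txt":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [ROBOTS_TXT]

    if NUTCH_RE.search(agent):
        # Record who asked for what before turning them away.
        log.info("blocked %s %s %r",
                 environ.get("REMOTE_ADDR", "-"), path, agent)
        # Assumption: 403 is just one plausible status to send here.
        start_response("403 Forbidden", [("Content-Type", "text/html")])
        return [BLOCKED_PAGE]

    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Normal site content goes here.</body></html>"]
```

The point of the design is that the request still reaches the application, so the IP, path, and user agent get logged, while the crawler gets nothing it can use to go deeper. For a local try-out, the standard library's `wsgiref.simple_server.make_server` can host `app`.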

My concern, given the number of them crawling, is how much bandwidth the whole bunch of them is burning for people who aren't even aware Nutch exists, and it's escalating.