Monday, December 12, 2011

Nutch Crawl Command


  • -dir : the directory in which to store the crawl data.
  • -threads : the number of threads that fetch in parallel.
  • -depth : the link depth, counted from the root page, that should be crawled.
  • -topN : the maximum number of pages retrieved at each level, up to the given depth.
  bin/nutch crawl urls -dir crawl -depth 5 -topN 50

This starts a crawl from the seed URLs listed in the text file placed in the urls directory, stores the crawled data in the crawl folder (created if it does not exist), follows links to a depth of 5, and fetches at most 50 pages at each level.
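
To make the setup concrete, here is a minimal sketch of preparing the seed file and then launching the crawl command above. The file name seed.txt and the seed URL are assumptions chosen for illustration, and the script assumes it is run from the root of a Nutch 1.x install that still ships the crawl command; elsewhere it only prints the command.

```shell
#!/bin/sh
# Hypothetical seed list: one URL per line in a text file inside the urls directory.
# The file name seed.txt and the URL below are placeholders, not taken from the post.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Same options as the command in the post.
CRAWL_CMD="bin/nutch crawl urls -dir crawl -depth 5 -topN 50"

# Run the crawl only when bin/nutch is present; otherwise just show the command.
if [ -x bin/nutch ]; then
  $CRAWL_CMD
else
  echo "Would run: $CRAWL_CMD"
fi
```

Keeping the seed URLs in a plain text file means the list can be edited without touching the command itself.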

If Solr is already set up and you want to index into it, add the -solr <solrUrl> parameter to the crawl command; the crawled data is then indexed into Solr automatically.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 4
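
If the Solr address differs between machines, the command can be assembled from a variable. The sketch below is one possible way to do that: build_crawl_cmd and the SOLR_URL environment variable are hypothetical names introduced here, and the -dir crawl option is added on the assumption that you want the crawl data in the same folder as before.

```shell
#!/bin/sh
# Hypothetical helper: print the crawl command for a given Solr URL, depth and topN.
build_crawl_cmd() {
  printf 'bin/nutch crawl urls -solr %s -dir crawl -depth %s -topN %s\n' "$1" "$2" "$3"
}

# Fall back to the local Solr URL used in the post when SOLR_URL is not set.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"
CMD=$(build_crawl_cmd "$SOLR_URL" 5 4)

# Run only from a Nutch install directory; otherwise just show the command.
if [ -x bin/nutch ]; then
  $CMD
else
  echo "Would run: $CMD"
fi
```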
