Infinite Programming Tips: Nutch Crawl Command

Monday, December 12, 2011

Nutch Crawl Command

-dir dir : names the directory to put the crawl in.
-threads threads : sets the number of threads that fetch in parallel.
-depth depth : sets the link depth, counted from the root page, that should be crawled.
-topN N : sets the maximum number of pages retrieved at each level, up to the depth.
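
Since -topN caps the pages fetched per level and -depth caps the number of levels, the two together give a rough upper bound on the size of a crawl. A minimal sketch of that arithmetic, where the values 5 and 50 are only example settings and the variable names are made up for illustration:

```shell
# Example settings (assumed values, not defaults)
DEPTH=5    # value passed to -depth
TOPN=50    # value passed to -topN

# Upper bound: at most TOPN pages at each of DEPTH levels
MAX_PAGES=$((DEPTH * TOPN))
echo "At most $MAX_PAGES pages"
```

The real number is usually lower, because a level may yield fewer than N fetchable links.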

bin/nutch crawl urls -dir crawl -depth 5 -topN 50

This starts a crawl from the seed URLs listed in the text file placed in the urls directory, stores the crawled data in the crawl folder (created if it does not exist), follows links up to depth 5, and fetches at most 50 pages at each level.
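
The urls directory has to exist and contain a text file of seed URLs before the command is run. A minimal sketch of that setup, assuming one URL per line; the file name seed.txt and the URL are only placeholders:

```shell
# Create the seed directory that is passed to "bin/nutch crawl"
mkdir -p urls

# Write one seed URL per line (seed.txt is a hypothetical file name)
printf '%s\n' "http://nutch.apache.org/" > urls/seed.txt

# Show what the crawler will start from
cat urls/seed.txt
```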

If Solr is already set up and you want to index into it, add the -solr <solrUrl> parameter to the crawl command; the crawled data is then indexed in Solr automatically.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 4
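
The -solr parameter can be combined with the other options, for example -dir, so that the crawl data is kept in a named folder as well as indexed. A small wrapper sketch, assuming it is run from the Nutch install directory; the variable names are made up for illustration and the values are taken from the commands in this post:

```shell
# Settings (values from this post; variable names are hypothetical)
SOLR_URL="http://localhost:8983/solr/"
CRAWL_DIR="crawl"
DEPTH=5
TOPN=4

# Assemble the crawl command from the settings
CMD="bin/nutch crawl urls -solr $SOLR_URL -dir $CRAWL_DIR -depth $DEPTH -topN $TOPN"
echo "$CMD"

# Run it only when a Nutch launcher is present in the current directory
if [ -x bin/nutch ]; then
  $CMD
fi
```

Printing the command first makes it easy to check the settings before a long crawl starts.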
