
Monday, December 12, 2011

Nutch Crawl Command


  • -dir : names the directory to put the crawl in.
  • -threads : determines the number of threads that will fetch in parallel.
  • -depth : indicates the link depth from the root page that should be crawled.
  • -topN : determines the maximum number of pages that will be retrieved at each level, up to the depth.
bin/nutch crawl urls -dir crawl -depth 5 -topN 50

This starts a crawl from the seed URLs listed in the text file(s) inside the urls directory, puts the crawled data in the crawl folder (created if it does not exist), follows links up to a depth of 5, and fetches a maximum of 50 pages at each level.
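To prepare the seed list, the urls directory just needs to contain a plain text file with one URL per line. A minimal sketch (the file name seed.txt and the example URL are arbitrary; Nutch reads every text file in the directory):

mkdir urls
echo "http://nutch.apache.org/" > urls/seed.txt
bin/nutch crawl urls -dir crawl -depth 5 -topN 50

When the crawl finishes, the crawl directory holds the crawldb, linkdb, and segments subdirectories that Nutch uses for re-crawls and indexing.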

If we already have Solr set up and wish to index into it, we add the -solr <solrUrl> parameter to the crawl command, and Nutch will automatically index the crawled data in Solr.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 4
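Once the crawl completes, the index can be checked with a simple Solr query. A quick sketch, assuming the default Solr example setup on port 8983 (the query term nutch is just a placeholder):

curl "http://localhost:8983/solr/select?q=nutch&wt=json"

If documents come back in the response, the crawled pages were indexed successfully.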
