- -dir <dir> : names the directory to put the crawl in.
- -threads <threads> : determines the number of threads that will fetch in parallel.
- -depth <depth> : indicates the link depth from the root page that should be crawled.
- -topN <N> : determines the maximum number of pages that will be retrieved at each level up to the depth.
This starts a crawl from the text file of seed URLs in the urls directory and puts the crawled data in the crawl folder (which is created if it does not exist). The crawl goes up to depth 5 and retrieves a maximum of 50 pages at each level.
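Because -topN applies to each level, the total number of pages fetched is bounded by depth times topN. The snippet below is only a quick arithmetic sketch using the values from the crawl described above; the variable names are illustrative and are not part of Nutch.

```shell
# Values taken from the crawl described above (illustrative variable names).
depth=5
top_n=50

# Upper bound on fetched pages: at most top_n pages in each of depth rounds.
max_pages=$((depth * top_n))
echo "at most $max_pages pages"
```

A real crawl usually fetches fewer pages than this bound, since a level may yield fewer than topN new links.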
If you already have Solr set up and want to index to it, add the -solr <solrUrl> parameter to your crawl command. The crawled data will then be indexed in Solr automatically.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 4
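If you run crawls with different settings often, a small wrapper can assemble the command line from the options described above. This is a minimal sketch, not part of Nutch: the function name build_crawl_cmd and its argument order are hypothetical, and it only prints the command so you can check it before running it.

```shell
# Hypothetical helper (not part of Nutch): prints a crawl command built from
# the options covered in this post, without running it.
# Arguments: seed dir, crawl dir, depth, topN, optional Solr URL.
build_crawl_cmd() {
  seed_dir=$1
  crawl_dir=$2
  depth=$3
  top_n=$4
  solr_url=${5:-}

  cmd="bin/nutch crawl $seed_dir -dir $crawl_dir -depth $depth -topN $top_n"

  # Append -solr only when a Solr URL was supplied.
  if [ -n "$solr_url" ]; then
    cmd="$cmd -solr $solr_url"
  fi

  printf '%s\n' "$cmd"
}

# Dry run: print the command with and without Solr indexing.
build_crawl_cmd urls crawl 5 50
build_crawl_cmd urls crawl 5 50 http://localhost:8983/solr/
```

Once the printed command looks right, you can paste it into the shell from the Nutch installation directory.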