Source : http://wiki.apache.org/nutch/RunNutchInEclipse
Before you start
Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line. However, being able to debug Nutch in Eclipse is very useful, and it is also extremely useful when applying and testing patches, because it lets you see them working in a larger context. That said, you will still benefit greatly from looking at the hadoop.log output.
This tutorial covers a fully internal Eclipse/Nutch set up, using only Eclipse tools and associated plugins.
Prerequisites
- Grab the newest version of Eclipse, available here.
- All of the following should be available from the Eclipse Marketplace. If not, you can download them through Eclipse as follows.
- Once you've set up Eclipse, download Subclipse as per here. N.B. If you experience an error with the 1.8.x release, try 1.6.x. This tends to solve compatibility problems.
- Grab IvyDE plugin for Eclipse as here.
- Grab m2e plugin for Eclipse here
Steps
Install Nutch
Use the Subclipse plugin to check out the latest Nutch Trunk development.
- File > New > Project > SVN > Checkout Projects from SVN
- Create new repository location > https://svn.apache.org/repos/asf/nutch/trunk
- Subclipse will ask for some additional configuration options. At this stage, check out the trunk source as a project configured using the New Project Wizard. Ensure that you're checking out the HEAD revision, then proceed to Finish.
- The Wizard will prompt you to choose a project type, so navigate to Java > Java Project > Next
- Enter your Project name (trunk) and ensure that the create separate folders for sources and class files option is activated.
- Set the Default output folder to trunk/bin > Finish. Subclipse will then set your build paths and begin checking out the Nutch trunk source from the SVN area.
- Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory and that Nutch has not built the /runtime directory. N.B. This is absolutely essential.
Establish the Eclipse environment for Nutch
- Ensure that you're in the Package Explorer > right click on Trunk Project folder.
- The only Source folder will be trunk/src > Remove this folder > Add Folder > expand trunk/src and check src/bin, src/java, src/test & src/testresources.
- In addition, we must manually add EVERY individual plugin src/java and src/test folder. Although this takes some time, it is absolutely essential that this is done.
- In the Libraries tab, click Add Class Folder and add src/conf to the classpath.
- Still in the Libraries tab add JARs > src/plugin/urlfilter-automaton/lib/automaton.jar & src/plugin/parse-swf/lib/javaswf.jar
- Remaining in the Libraries tab Add Library > IvyDE Managed Dependencies > browse to trunk/ivy/ivy.xml > ensure ALL configuration boxes are included.
- Go to "Order and Export" tab, find the entry for added "conf" folder (it will most likely be at the bottom of the list) and move it to the top (by checking it and clicking the "Top" button). This is required so Eclipse will take config (nutch-default.xml, etc.) resources from our "conf" folder and not from somewhere else.
- DO NOT add "build" to classpath
- Click the "Finish" button
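Adding every plugin source folder by hand is easy to get wrong, so it can help to print the full list first and tick the entries off in the build path dialog. The following is only a minimal sketch, not part of Nutch: the class name PluginSourceFolders is hypothetical, and it assumes the plugins live under src/plugin/<plugin>/src/java and src/plugin/<plugin>/src/test as described above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists the per-plugin source folders to add to the Eclipse build path. */
final class PluginSourceFolders {

    /**
     * Returns every existing src/java and src/test folder directly below each
     * plugin directory in pluginRoot, relative to projectRoot and sorted by name.
     */
    static List<String> list(Path projectRoot, Path pluginRoot) throws IOException {
        List<String> folders = new ArrayList<>();
        try (DirectoryStream<Path> plugins = Files.newDirectoryStream(pluginRoot)) {
            for (Path plugin : plugins) {
                if (!Files.isDirectory(plugin)) {
                    continue; // skip plain files such as build.xml
                }
                for (String sub : new String[] {"src/java", "src/test"}) {
                    Path candidate = plugin.resolve(sub);
                    if (Files.isDirectory(candidate)) {
                        // Normalise separators so the output looks the same on every platform.
                        folders.add(projectRoot.relativize(candidate).toString().replace('\\', '/'));
                    }
                }
            }
        }
        Collections.sort(folders);
        return folders;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: run from the project root, or pass the project root as the first argument.
        Path projectRoot = Paths.get(args.length > 0 ? args[0] : ".").toAbsolutePath().normalize();
        for (String folder : list(projectRoot, projectRoot.resolve("src/plugin"))) {
            System.out.println(folder);
        }
    }
}
```

Plugins that have no src/test folder are simply skipped, so the output only contains folders that can actually be added.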
Configure Nutch
- See the Tutorial and follow all configuration steps, but DO NOT undertake any crawling. The directory structure of Nutch trunk lets us edit nutch-site.xml.template, nutch-default.xml and regex-urlfilter.txt.template in our /conf directory; these properties will then be automatically built into our /runtime build folder.
- Ensure that you change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-site.xml
- Once we have ensured that Nutch trunk is correctly configured, we can progress to building within Eclipse.
Build Nutch
- We can now progress to building Nutch by simply dragging the build.xml file into the Ant view and double clicking on the build file. If you configured the project correctly, Eclipse will build Nutch for you into "bin" and you should see something similar to the following:
BUILD SUCCESSFUL Total time: 33 seconds
At this stage it is advisable to right click on the project within the Package Explorer and click on the Refresh option. This will reveal the new runtime directory. As we have already configured the various configuration settings, all we need to do is add the seed directory to our /runtime/local directory, and then we are ready to crawl.
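The seed directory can be created by hand, but a small program makes the step repeatable. This is a minimal sketch under stated assumptions: the class name SeedDirectory, the file name seed.txt and the example URL are all hypothetical choices, and only the directory name urls comes from the launcher arguments used in this tutorial.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: writes a seed list into <baseDir>/urls/seed.txt. */
final class SeedDirectory {

    /** Creates the urls directory if needed, writes one URL per line, and returns the seed file. */
    static Path create(Path baseDir, List<String> seedUrls) throws IOException {
        Path urlsDir = baseDir.resolve("urls");
        Files.createDirectories(urlsDir);
        // Assumption: the file name is arbitrary; "seed.txt" is just a convention used here.
        Path seedFile = urlsDir.resolve("seed.txt");
        return Files.write(seedFile, seedUrls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: run from the project root so that runtime/local resolves correctly.
        Path seedFile = create(Paths.get("runtime", "local"),
                Arrays.asList("http://nutch.apache.org/"));
        System.out.println("Wrote " + seedFile.toAbsolutePath());
    }
}
```

Whether a crawl started from Eclipse picks up this directory depends on the working directory of the launch configuration, because the urls argument is a relative path.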
Create Eclipse launcher
- Menu Run > "Run..."
- create "New" for "Java Application"
- set in Main class
org.apache.nutch.crawl.Crawl
- on tab Arguments, Program Arguments
urls -dir crawl -depth 3 -topN 50
- in VM arguments
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
- click on "Run"
- if all works, you should see Nutch getting busy at crawling
Debug Nutch in Eclipse
- Set breakpoints and debug a crawl
- It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints:
Fetcher [line: 1115] - run
Fetcher [line: 530] - fetch
Fetcher$FetcherThread [line: 560] - run()
Generator [line: 443] - generate
Generator$Selector [line: 108] - map
OutlinkExtractor [line: 71 & 74] - getOutlinks
If things do not work...
Yes, Nutch and Eclipse can be a difficult companionship sometimes.
Eclipse: Cannot create project content in workspace
The Nutch source code must be outside the workspace folder. Alternatively, you can download the code with Eclipse (SVN) into your workspace rather than trying to create the project from existing code; Eclipse sometimes doesn't let you create a project from source code that already sits in the workspace.
plugin directory not found
Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, in nutch-default.xml or, even better, in nutch-site.xml. Ideally, every effort should be made to keep nutch-default.xml completely intact.
<property>
  <name>plugin.folders</name>
  <value>/home/....../trunk/src/plugin</value>
</property>
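When the plugin directory is not found, it helps to see which value is actually configured and whether that path exists. The sketch below is a hypothetical diagnostic (the class name PluginFoldersCheck is not part of Nutch) that reads the property with the standard JDK XML parser. It assumes a single path in plugin.folders and only checks it against the current working directory; it does not reproduce how Nutch itself locates the folder.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical diagnostic: prints the configured plugin.folders value and whether it exists. */
final class PluginFoldersCheck {

    /** Returns the trimmed value of the named property in a Hadoop-style config file, or null. */
    static String readProperty(File configFile, String propertyName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(configFile);
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            NodeList names = property.getElementsByTagName("name");
            NodeList values = property.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && propertyName.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null; // property not present in this file
    }

    public static void main(String[] args) throws Exception {
        // Assumption: run from the project root, with the file at conf/nutch-site.xml.
        File config = new File(args.length > 0 ? args[0] : "conf/nutch-site.xml");
        String value = readProperty(config, "plugin.folders");
        if (value == null) {
            System.out.println("plugin.folders is not set in " + config);
            return;
        }
        // Assumption: a single path; relative values are resolved against the working directory.
        Path resolved = Paths.get(value).toAbsolutePath().normalize();
        System.out.println("plugin.folders = " + value);
        System.out.println("resolves to    = " + resolved);
        System.out.println("directory exists: " + Files.isDirectory(resolved));
    }
}
```

Pointing the same program at src/test/nutch-site.xml shows what the unit tests will see.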
No plugins loaded during unit tests in Eclipse
During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.
classNotFound
- open the class itself, rightclick
- refresh the build dir
debugging Hadoop classes
Sometimes (fairly often) it makes sense to also have the Hadoop classes available during debugging. This should really be second nature, as Nutch relies heavily on the underlying Hadoop infrastructure. You can therefore check out (SVN) the Hadoop sources into your Eclipse IDE and debug the two together. You can:
- Checkout the Hadoop version that should be used within Nutch trunk
- configure a Hadoop project similar to the Nutch project within your Eclipse IDE
- add the Hadoop project as a dependent project of Nutch project
- you can now also set break points within Hadoop classes like inputformat implementations etc.