Thursday, December 22, 2011

Install Hadoop

Install Hadoop on a Single Node

  1. Launch an Amazon EC2 instance running Amazon Linux

    A micro EC2 instance may not have enough memory to run Hadoop in pseudo-distributed mode
  2. Log in to the created instance
  3. Install & upgrade software needed by Hadoop
    cd
    sudo yum install rsync
    sudo yum install java-1.6.0-openjdk
  4. Download and unpack the latest stable Hadoop release
    Replace the URL with the latest stable version; a sketch of this step is shown after this list
  5. Move the installation to /home
    sudo mv hadoop-0.20.2 /home
  6. (Optional) For better security, create a new user to own the Hadoop installation
    sudo useradd -s /bin/bash -m hdp
    sudo chown -R hdp:hdp /home/hadoop-0.20.2
  7. Test the Hadoop installation (running bin/hadoop with no arguments should print the command usage)
    sudo su hdp
    cd /home/hadoop-0.20.2
    bin/hadoop
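
A possible set of commands for step 4 above (a sketch only; the mirror URL and version are examples, substitute the latest stable release):
    # download and unpack the release into the home directory
    cd
    wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
    tar xzf hadoop-0.20.2.tar.gz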
      

Hadoop Standalone Operation for Debugging

 

Standalone operation runs Hadoop and your application inside a single Java process. Files are read from and written to the local file system directly instead of the Hadoop Distributed File System. It is good for debugging your application during development.
  • Prepare a directory for the test data files
    cd /home/hadoop-0.20.2
    mkdir test
    echo "this is a first test file" > test/f1.txt
    echo "a second test file" > test/f2.txt
  • Run Hadoop with the wordcount program in hadoop-0.20.2-examples.jar
    • Use the data files in the directory test and output the result to the directory output
      bin/hadoop jar hadoop-*-examples.jar wordcount test output
  • Verify the result
    cat output/*
    a       2
    file    2
    first   1
    is      1
    second  1
    test    2
    this    1

Hadoop Pseudo-Distributed Operation

Pseudo-distributed operation runs Hadoop on a single node. Each Hadoop daemon runs in a separate Java process on the same machine.

Configure SSH for Hadoop

Hadoop uses SSH to manage its node(s); an alternative key setup for a dedicated Hadoop user is sketched after this list
  1. Copy and paste your Amazon AWS SSH private key (*.pem) into ~/.ssh/id_dsa
    vi ~/.ssh/id_dsa
  2. Change the access rights
    chmod og-rw ~/.ssh/id_dsa
  3. Test the access
    ssh localhost
    exit
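
If you created the hdp user in step 6, its authorized_keys does not contain your AWS key, so the steps above will not work for that account. An alternative sketch, assuming a passphrase-less key is acceptable for local-only SSH, is to generate a dedicated key pair while logged in as hdp:
    # create a passphrase-less DSA key and authorize it for logins to localhost
    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
Then test it with ssh localhost as above.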

Firewall Configuration

Open the following firewall ports in the instance's security group
  • Allow only the specific IPs or security groups that need access to the UI tools (a command-line sketch follows the table)
    Port   Description
    50070  UI admin tool for the NameNode
    50030  UI admin tool for the JobTracker
    50075  File browser (served by the DataNode)
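
One way to open these ports from the command line (a sketch; it assumes the AWS CLI is installed, the instance belongs to a security group named hadoop-sg, which is a hypothetical name, and 203.0.113.10/32 stands in for your own IP):
    # allow the Hadoop web UIs only from a single trusted address
    aws ec2 authorize-security-group-ingress --group-name hadoop-sg --protocol tcp --port 50070 --cidr 203.0.113.10/32
    aws ec2 authorize-security-group-ingress --group-name hadoop-sg --protocol tcp --port 50030 --cidr 203.0.113.10/32
    aws ec2 authorize-security-group-ingress --group-name hadoop-sg --protocol tcp --port 50075 --cidr 203.0.113.10/32
The same rules can also be added from the Security Groups page of the EC2 console.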

Hadoop Pseudo-Distributed Configuration

  • Edit the NameNode information in conf/core-site.xml. The NameNode is responsible for managing the metadata of the Hadoop Distributed File System (HDFS)
    <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://localhost:9000</value>
         </property>
    </configuration>
    • Use the local machine for the NameNode
  • Edit the Hadoop Distributed File System (HDFS) configuration in conf/hdfs-site.xml
    <configuration>
         <property>
             <name>dfs.replication</name>
             <value>1</value>
         </property>
    </configuration>
    • Since there is only one node, keep only one replica of each data block
  • Edit the MapReduce configuration in conf/mapred-site.xml. Set the JobTracker to localhost. The JobTracker is responsible for distributing job tasks to the node(s)/slave(s)
    <configuration>
         <property>
             <name>mapred.job.tracker</name>
             <value>localhost:9001</value>
         </property>
    </configuration>
  • Format the HDFS if it has never been done
    bin/hadoop namenode -format
    11/04/21 18:40:47 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = ip-10-15-21-1/10.15.12.30
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 0.20.2
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    Re-format filesystem in /tmp/hadoop-ec2-user/dfs/name ? (Y or N) Y
    11/04/21 18:40:57 INFO namenode.FSNamesystem: fsOwner=ec2-user,ec2-user,wheel
    11/04/21 18:40:57 INFO namenode.FSNamesystem: supergroup=supergroup
    11/04/21 18:40:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
    11/04/21 18:40:57 INFO common.Storage: Image file of size 98 saved in 0 seconds.
    11/04/21 18:40:57 INFO common.Storage: Storage directory /tmp/hadoop-ec2-user/dfs/name has been successfully formatted.
    11/04/21 18:40:57 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at ip-10-195-222-31/10.195.222.31
    ************************************************************/
    [ec2-user@ip-10-195-222-31 hadoop-0.20.2]$
    • By default, the new HDFS storage is created under /tmp/hadoop-ec2-user/dfs/name; since /tmp may be cleared on reboot, see the configuration sketch at the end of this section
  • Start all Hadoop daemons
    bin/start-all.sh
    • For debugging, check logs/* for any exceptions; a quick daemon check with jps is sketched at the end of this section
  • Check out the NameNode and JobTracker UI tools
    NameNode: http://your_public_ip:50070/
    JobTracker: http://your_public_ip:50030/
  • Copy the local test directory into HDFS as the directory input
    bin/hadoop fs -put test input
  • Test it
    bin/hadoop jar hadoop-*-examples.jar wordcount input output
  • Copy the HDFS output directory to the local file system and verify the result
    bin/hadoop fs -get output output
    cat output/*
  • Or verify the result directly on HDFS
    bin/hadoop fs -cat output/*
  • Shut down Hadoop
    bin/stop-all.sh
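
As noted above, the formatted HDFS image lives under /tmp by default, which may be cleared on reboot. A possible fix (a sketch; the path /home/hdp/hadoop-data is only an example and must be writable by the user running Hadoop) is to point hadoop.tmp.dir at a persistent directory in conf/core-site.xml and run the format step again:
     <property>
         <name>hadoop.tmp.dir</name>
         <value>/home/hdp/hadoop-data</value>
     </property>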
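
To confirm that the daemons actually started after bin/start-all.sh, jps can be used (jps ships with the JDK; on Amazon Linux it may require installing java-1.6.0-openjdk-devel, which is an assumption here):
    # lists running Java processes; expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
    jps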
