Thursday, December 22, 2011

Install Hadoop

Install Hadoop on a Single Node

  1. Launch an Amazon EC2 instance running Amazon Linux

    A micro EC2 instance may not have enough memory to run Hadoop in pseudo-distributed mode
  2. Log in to the created instance
  3. Install & upgrade software needed by Hadoop
    cd
    sudo yum install rsync
    sudo yum install java-1.6.0-openjdk
  4. Download and unpack the latest stable Hadoop release
    Replace the URL with the latest stable version; a sketch of this step is shown after this list
  5. Move the installation to /home
    sudo mv hadoop-0.20.2 /home
  6. (Optional) For better security, create a new user to own the Hadoop installation
    sudo useradd -s /bin/bash -m hdp
    sudo chown -R hdp:hdp /home/hadoop-0.20.2
  7. Test the Hadoop installation (running bin/hadoop with no arguments should print the command usage)
    sudo su hdp
    cd /home/hadoop-0.20.2
    bin/hadoop
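
A possible set of commands for step 4 above (a sketch only; the mirror URL and version are examples, substitute the latest stable release):
    # download and unpack the release into the home directory
    cd
    wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
    tar xzf hadoop-0.20.2.tar.gz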
      

Hadoop Standalone Operation for Debugging

 

Standalone operation runs Hadoop and your application inside a single Java process. Files are read from and written to the local file system directly instead of the Hadoop Distributed File System. It is good for debugging your application during development.
  • Prepare a directory for the test data files
    cd /home/hadoop-0.20.2
    mkdir test
    echo "this is a first test file" > test/f1.txt
    echo "a second test file" > test/f2.txt
  • Run Hadoop with the wordcount program in hadoop-0.20.2-examples.jar
    • Use the data files in the directory test and output the result to the directory output
      bin/hadoop jar hadoop-*-examples.jar wordcount test output
  • Verify the result
    cat output/*
    a       2
    file    2
    first   1
    is      1
    second  1
    test    2
    this    1

Hadoop Pseudo-Distributed Operation

Pseudo-distributed operation runs Hadoop on a single node. Each Hadoop daemon runs in a separate Java process on the same machine.

Configure SSH for Hadoop

Hadoop uses SSH to manage its node(s); an alternative key setup for a dedicated Hadoop user is sketched after this list
  1. Copy and paste your Amazon AWS SSH private key (*.pem) into ~/.ssh/id_dsa
    vi ~/.ssh/id_dsa
  2. Change the access rights
    chmod og-rw ~/.ssh/id_dsa
  3. Test the access
    ssh localhost
    exit
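
If you created the hdp user in step 6, its authorized_keys does not contain your AWS key, so the steps above will not work for that account. An alternative sketch, assuming a passphrase-less key is acceptable for local-only SSH, is to generate a dedicated key pair while logged in as hdp:
    # create a passphrase-less DSA key and authorize it for logins to localhost
    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
Then test it with ssh localhost as above.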

Firewall Configuration

Open the following firewall ports in the instance's security group
  • Allow only the specific IPs or security groups that need access to the UI tools (a command-line sketch follows the table)
    Port   Description
    50070  UI admin tool for the NameNode
    50030  UI admin tool for the JobTracker
    50075  File browser (served by the DataNode)
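
One way to open these ports from the command line (a sketch; it assumes the AWS CLI is installed, the instance belongs to a security group named hadoop-sg, which is a hypothetical name, and 203.0.113.10/32 stands in for your own IP):
    # allow the Hadoop web UIs only from a single trusted address
    aws ec2 authorize-security-group-ingress --group-name hadoop-sg --protocol tcp --port 50070 --cidr 203.0.113.10/32
    aws ec2 authorize-security-group-ingress --group-name hadoop-sg --protocol tcp --port 50030 --cidr 203.0.113.10/32
    aws ec2 authorize-security-group-ingress --group-name hadoop-sg --protocol tcp --port 50075 --cidr 203.0.113.10/32
The same rules can also be added from the Security Groups page of the EC2 console.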

Hadoop Pseudo-Distributed Configuration

  • Edit the NameNode information in conf/core-site.xml. The NameNode is responsible for managing the metadata of the Hadoop Distributed File System (HDFS)
    <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://localhost:9000</value>
         </property>
    </configuration>
    • Use the local machine for the NameNode
  • Edit the Hadoop Distributed File System (HDFS) configuration in conf/hdfs-site.xml
    <configuration>
         <property>
             <name>dfs.replication</name>
             <value>1</value>
         </property>
    </configuration>
    • Since there is only one node, keep only one replica of each data block
  • Edit the MapReduce configuration in conf/mapred-site.xml. Set the JobTracker to localhost. The JobTracker is responsible for distributing job tasks to the node(s)/slave(s)
    <configuration>
         <property>
             <name>mapred.job.tracker</name>
             <value>localhost:9001</value>
         </property>
    </configuration>
  • Format the HDFS if it has never been done
    bin/hadoop namenode -format
    11/04/21 18:40:47 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = ip-10-15-21-1/10.15.12.30
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 0.20.2
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    Re-format filesystem in /tmp/hadoop-ec2-user/dfs/name ? (Y or N) Y
    11/04/21 18:40:57 INFO namenode.FSNamesystem: fsOwner=ec2-user,ec2-user,wheel
    11/04/21 18:40:57 INFO namenode.FSNamesystem: supergroup=supergroup
    11/04/21 18:40:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
    11/04/21 18:40:57 INFO common.Storage: Image file of size 98 saved in 0 seconds.
    11/04/21 18:40:57 INFO common.Storage: Storage directory /tmp/hadoop-ec2-user/dfs/name has been successfully formatted.
    11/04/21 18:40:57 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at ip-10-195-222-31/10.195.222.31
    ************************************************************/
    [ec2-user@ip-10-195-222-31 hadoop-0.20.2]$
    • By default, the new HDFS storage is created under /tmp/hadoop-ec2-user/dfs/name; since /tmp may be cleared on reboot, see the configuration sketch at the end of this section
  • Start all Hadoop daemons
    bin/start-all.sh
    • For debugging, check logs/* for any exceptions; a quick daemon check with jps is sketched at the end of this section
  • Check out the NameNode and JobTracker UI tools
    NameNode: http://your_public_ip:50070/
    JobTracker: http://your_public_ip:50030/
  • Copy the local test directory into HDFS as the directory input
    bin/hadoop fs -put test input
  • Test it
    bin/hadoop jar hadoop-*-examples.jar wordcount input output
  • Copy the HDFS output directory to the local file system and verify the result
    bin/hadoop fs -get output output
    cat output/*
  • Or verify the result directly on HDFS
    bin/hadoop fs -cat output/*
  • Shut down Hadoop
    bin/stop-all.sh
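
As noted above, the formatted HDFS image lives under /tmp by default, which may be cleared on reboot. A possible fix (a sketch; the path /home/hdp/hadoop-data is only an example and must be writable by the user running Hadoop) is to point hadoop.tmp.dir at a persistent directory in conf/core-site.xml and run the format step again:
     <property>
         <name>hadoop.tmp.dir</name>
         <value>/home/hdp/hadoop-data</value>
     </property>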
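
To confirm that the daemons actually started after bin/start-all.sh, jps can be used (jps ships with the JDK; on Amazon Linux it may require installing java-1.6.0-openjdk-devel, which is an assumption here):
    # lists running Java processes; expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
    jps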
