Install Hadoop on a Single Node
- Launch an Amazon EC2 instance with Amazon Linux
A micro EC2 instance may not have enough memory to run Hadoop in pseudo-distributed mode
- Log in to the newly created instance
- Install & upgrade software needed by Hadoop
cd
sudo yum install rsync
sudo yum install java-1.6.0-openjdk
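The "upgrade" half of this step can be done with the stock package manager; a plausible command to bring the already-installed packages up to date is
sudo yum update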
- Download and install the latest stable Hadoop
Replace the URL with the latest stable version
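For example, fetching and unpacking the 0.20.2 tarball used throughout this guide (the mirror path below is one plausible source; substitute whatever URL the Hadoop releases page currently lists):
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xzf hadoop-0.20.2.tar.gz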
- Move the installation to home
sudo mv hadoop-0.20.2 /home
- (Optional) For better security, create a new user to own the Hadoop installation
sudo useradd -s /bin/bash -m hdp
sudo chown -R hdp:hdp /home/hadoop-0.20.2
- Test the Hadoop installation
sudo su hdp
cd /home/hadoop-0.20.2
bin/hadoop
Hadoop Standalone Operation for Debugging
Standalone operation runs Hadoop and your application inside a single
Java process. Files are read from and written to the local file system
directly instead of the Hadoop Distributed File System. It is good for
debugging your application during development.
- Prepare a directory for the testing data file
cd /home/hadoop-0.20.2
mkdir test
echo "this is a first test file" > test/f1.txt
echo "a second test file" > test/f2.txt
- Run Hadoop with the wordcount program in hadoop-0.20.2-examples.jar
- Use the data files in the directory test and output the result to the directory output
bin/hadoop jar hadoop-*-examples.jar wordcount test output
- Verify the result
cat output/*
a 2
file 2
first 1
is 1
second 1
test 2
this 1
Hadoop Pseudo-Distributed Operation
Pseudo-distributed operation runs Hadoop on a single node. All Hadoop
daemons run in separate Java processes on the same machine
Configure SSH for Hadoop
Hadoop uses SSH to manage its node(s)
- Copy and paste your Amazon AWS SSH private key (*.pem) into ~/.ssh/id_dsa
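For example, assuming the key was downloaded as mykey.pem (hypothetical file name):
cat mykey.pem > ~/.ssh/id_dsa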
- Change the access rights so only the owner can read the key
chmod og-rw ~/.ssh/id_dsa
- Test the access
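Hadoop must be able to reach the node over SSH without a password prompt:
ssh localhost
exit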
Firewall Configuration
Open the following firewall ports
- Open these ports only to the specific IPs or security groups that need access to the UI tools
Port  | Description
50070 | UI admin tool for the NameNode
50030 | UI admin tool for the JobTracker
50075 | File browser
Hadoop Pseudo-Distributed Configuration
- Edit the NameNode information in conf/core-site.xml. The NameNode is
responsible for managing the metadata used in the Hadoop Distributed File
System (HDFS)
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- Use the local machine for the NameNode
- Edit the Hadoop distributed file system (HDFS) in conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
- Since there is only one node, replicate each data block to one node only
- Edit the MapReduce configuration in conf/mapred-site.xml. Set the
JobTracker to localhost. The JobTracker is responsible for dispatching job
tasks to the node(s)/slave(s)
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
- Format the HDFS if it has never been done
bin/hadoop namenode -format
11/04/21 18:40:47 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ip-10-15-21-1/10.15.12.30
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
Re-format filesystem in /tmp/hadoop-ec2-user/dfs/name ? (Y or N) Y
11/04/21 18:40:57 INFO namenode.FSNamesystem: fsOwner=ec2-user,ec2-user,wheel
11/04/21 18:40:57 INFO namenode.FSNamesystem: supergroup=supergroup
11/04/21 18:40:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/04/21 18:40:57 INFO common.Storage: Image file of size 98 saved in 0 seconds.
11/04/21 18:40:57 INFO common.Storage: Storage directory /tmp/hadoop-ec2-user/dfs/name has been successfully formatted.
11/04/21 18:40:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-10-195-222-31/10.195.222.31
************************************************************/
[ec2-user@ip-10-195-222-31 hadoop-0.20.2]$
- By default, a new HDFS storage directory is created under /tmp/hadoop-ec2-user/dfs/name
- Start all Hadoop daemons
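The stock helper script starts the NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode, each in its own Java process:
bin/start-all.sh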
- For debugging, check for any exceptions in logs/*
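One quick way to scan the logs (assuming the default logs/ directory under the Hadoop home):
grep -i exception logs/*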
- Check out the NameNode (port 50070) and JobTracker (port 50030) UI tools
- Copy the local test directory to HDFS as the directory input
bin/hadoop fs -put test input
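To confirm the upload, list the new HDFS directory:
bin/hadoop fs -ls input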
- Test it
bin/hadoop jar hadoop-*-examples.jar wordcount input output
- Copy the output from HDFS to a local directory and verify the result
Remove the local output directory left over from the standalone test first
rm -rf output
bin/hadoop fs -get output output
cat output/*
- Or verify the result directly on HDFS
bin/hadoop fs -cat output/*
- Shut down the Hadoop daemons
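The matching helper script stops all of the daemons started above:
bin/stop-all.sh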