Introduction
This documentation should get you up and running quickly with a full pseudo-distributed Hadoop/HBase installation in an Ubuntu VM. I use Ubuntu because Debian package management (apt) is by far the best way to install software on a machine. You can also apply these steps to real hardware.
You will want this because the existing documentation is spread across quite a few different locations. I've already done the work of digging that information out so that you don't have to.
This documentation is intended to be read and used from top to bottom. Before you do an initial install, I suggest reading through it once first.
Reference Manuals
- https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation
- https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+in+Pseudo-Distributed+Mode
- https://ccp.cloudera.com/display/CDHDOC/ZooKeeper+Installation
- https://ccp.cloudera.com/display/CDHDOC/HBase+Installation
- http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed
- http://hbase.apache.org/book.html
- http://hbase.apache.org/pseudo-distributed.html
Create the virtual machine
The first thing you will want to do is download a copy of the Ubuntu Server 10.04 64-bit ISO image. This is the current Long Term Support (LTS) release. These instructions may work with a newer version, but I suggest the LTS because it is what I test with and also what your operations team will most likely want to run in production. Once you have the ISO, create a new virtual machine using your favorite VM manager (I like VMware Fusion on my Mac).
Unix Box Setup
Once you have logged into the box, we need to set up some resources:
    # Add the Cloudera CDH3 repository
    echo "deb http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list
    echo "deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list

    # Pre-accept the Sun Java license so the JDK installs unattended
    echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections

    # Raise the open-file and process limits for the hdfs and hbase users
    # ("-" applies the limit to both soft and hard)
    echo "hdfs - nofile 32768" >> /etc/security/limits.conf
    echo "hbase - nofile 32768" >> /etc/security/limits.conf
    echo "hdfs - nproc 32000" >> /etc/security/limits.conf
    echo "hbase - nproc 32000" >> /etc/security/limits.conf
    echo "session required pam_limits.so" >> /etc/pam.d/common-session

    # Install the base packages, import the repository key, and upgrade
    aptitude install curl wget
    curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
    aptitude update
    aptitude install openssh-server ntp
    aptitude install sun-java6-jdk
    aptitude safe-upgrade
    reboot now
You can now use ifconfig -a to find out the IP address of the virtual machine and log into it via ssh. You will want to execute most of the commands below as root.
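For example (192.168.56.101 below is just a placeholder for whatever address ifconfig reports, and since Ubuntu disables direct root SSH logins by default, log in as your normal user and then become root):

    # inside the VM console: find the address that was assigned
    ifconfig -a | grep "inet addr"

    # from your workstation: connect, then switch to root
    ssh youruser@192.168.56.101
    sudo -i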
LZO Compression
This setup provides LZO compression for your data in HBase, which greatly reduces the amount of data stored on disk. Sadly, LZO is under the GPL license, so it can't be distributed with Apache. Therefore, I'm providing a Debian package I got ahold of for you to use. On your VM:
    dpkg -i Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb
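Once the Hadoop packages from the next section are installed, you can sanity-check that this package dropped the jar and the native library into the locations that hbase-env.sh points at later in this guide:

    ls /usr/lib/hadoop/lib/cloudera-hadoop-lzo-*.jar
    ls /usr/lib/hadoop/lib/native/Linux-amd64-64/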
Hadoop / HDFS
Install some packages:
    apt-get install hadoop-0.20
    apt-get install hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker
    apt-get install hadoop-0.20-conf-pseudo
Edit some files:
/etc/hadoop/conf/hdfs-site.xml
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>
/etc/hadoop/conf/core-site.xml

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
    </property>
    <property>
      <name>io.compression.codec.lzo.class</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
/etc/hadoop/conf/mapred-site.xml

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    <property>
      <name>mapred.child.ulimit</name>
      <value>1835008</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
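These edits only take effect after the Hadoop daemons are (re)started. The same init-script loop used in the Start/Stop section at the end of this guide works here with restart:

    for service in /etc/init.d/hadoop-0.20-*; do sudo $service restart; done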
ZooKeeper
    apt-get install hadoop-zookeeper-server
/etc/zookeeper/zoo.cfg

Change localhost to 127.0.0.1 and add:

    maxClientCnxns=0
    service hadoop-zookeeper-server restart
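To confirm that ZooKeeper is answering, you can send it the standard ruok four-letter command on its default client port (2181); it should reply with imok. This assumes netcat is available on the VM:

    echo ruok | nc 127.0.0.1 2181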
HDFS/HBase Setup
Make an /hbase folder in HDFS:
    sudo -u hdfs hadoop fs -mkdir /hbase
    sudo -u hdfs hadoop fs -chown hbase /hbase

    # NOTE: only if you want to delete an existing /hbase folder -- stop HBase first!
    sudo -u hdfs hadoop fs -rmr -skipTrash /hbase
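You can confirm the folder exists and is owned by the hbase user with a plain listing:

    sudo -u hdfs hadoop fs -ls /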
HBase Installation
    apt-get install hadoop-hbase
    apt-get install hadoop-hbase-master
/etc/hbase/conf/hbase-site.xml
    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>
    <property>
      <name>hbase.rootdir</name>
      <!-- point this at the /hbase folder created above; it must match fs.default.name,
           which the CDH3 pseudo-distributed config sets to hdfs://localhost:8020 -->
      <value>hdfs://localhost:8020/hbase</value>
    </property>
/etc/hbase/conf/hbase-env.sh
    export HBASE_CLASSPATH=`ls /usr/lib/hadoop/lib/cloudera-hadoop-lzo-*.jar`
    export HBASE_MANAGES_ZK=false
    export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64
/etc/hadoop/conf/hadoop-env.sh
    export HADOOP_CLASSPATH="$HADOOP_CLASSPATH":`hbase classpath`
Now, restart the master and start the region server:
    service hadoop-hbase-master restart
    apt-get install hadoop-hbase-regionserver
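At this point it's worth checking that all of the daemons are actually running. jps (shipped with the Sun JDK installed earlier) lists Java processes; run as root it should show the daemons owned by the hdfs, hbase and zookeeper users:

    sudo jps
    # expect roughly: NameNode, DataNode, JobTracker, TaskTracker,
    # QuorumPeerMain (ZooKeeper), HMaster and HRegionServer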
Starting/Stopping everything
Start
- service hadoop-zookeeper-server start
- for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
- service hadoop-hbase-master start
- service hadoop-hbase-regionserver start
Stop
- service hadoop-hbase-regionserver stop
- service hadoop-hbase-master stop
- for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done
- service hadoop-zookeeper-server stop
HBase Shell
    su - hbase
    hbase shell
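Once you are in the shell, a small smoke test (the table and column family names below are just examples) confirms that the master, region server and HDFS are all talking to each other:

    create 'smoketest', 'cf'
    put 'smoketest', 'row1', 'cf:greeting', 'hello'
    scan 'smoketest'
    disable 'smoketest'
    drop 'smoketest'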
Ports
To confirm that everything is working, point a web browser at your VM's IP address on each of the ports below (a quick command-line check follows the list).
- HDFS: 50070
- JobTracker: 50030
- TaskTracker: 50060
- HBase Master: 60010
- HBase RegionServer: 60030
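If you prefer the command line to a browser, a quick curl loop (again, substitute your VM's address for the placeholder) should return an HTTP status code from each port:

    for port in 50070 50030 50060 60010 60030; do
      curl -s -o /dev/null -w "%{http_code}  http://192.168.56.101:$port\n" "http://192.168.56.101:$port"
    done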