Hadoop Data Balancing
Hadoop Balancer:
This is a tool provided to balance disk usage across the Hadoop cluster. Over time, some
nodes in the cluster may become overutilized or underutilized. This typically happens when
new nodes are added, since the newly added nodes start out underutilized, or when the
cluster has too few nodes, which leaves the existing ones overutilized.
Only one balancer instance can run against a cluster at a time, so starting it from a
second machine does not speed things up. To balance faster, raise the balancer bandwidth
limit, keeping in mind that this greatly increases network usage.
This tool requires administrator rights on the Hadoop cluster to run.
Syntax of the balancer:
bin/start-balancer.sh [-threshold <threshold>]
Here, start-balancer.sh resides in the bin directory of the Hadoop installation folder.
The threshold parameter sets the balancing target: it is a percentage between 0 and 100,
and defaults to 10% when no value is passed.
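If the balancer is launched from a script, it helps to validate the threshold before the command runs. The following is a minimal sketch, assuming the threshold is a percentage between 0 and 100 and that the script is started from the Hadoop folder. The helper name `balancer_command` is hypothetical and not part of Hadoop; it only builds the argument list and does not start the balancer.

```python
from typing import List, Optional


def balancer_command(threshold_pct: Optional[float] = None) -> List[str]:
    """Build the argument list for bin/start-balancer.sh (hypothetical helper).

    Assumes the threshold is a percentage in the range 0 to 100.
    Passing None omits the flag so the balancer uses its default.
    """
    cmd = ["bin/start-balancer.sh"]
    if threshold_pct is not None:
        if not 0 <= threshold_pct <= 100:
            raise ValueError("threshold must be a percentage between 0 and 100")
        # ':g' drops a trailing '.0', so 5.0 is rendered as '5'
        cmd += ["-threshold", f"{threshold_pct:g}"]
    return cmd


if __name__ == "__main__":
    print(" ".join(balancer_command(5)))
```

The returned list can be handed to `subprocess.run` from the Hadoop folder when you actually want to start balancing.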
This process transfers blocks between nodes, which generates network activity. On a
production cluster it must be used cautiously, as it can result in missing-block errors
or slow responses from the cluster.
This process can be stopped at any time, if required, using the following command:
bin/stop-balancer.sh
It can also be stopped on the machine where it is running using:
bin/hadoop-daemon.sh stop balancer
Either command can be used at any time to stop the balancing process, for example when
errors or slow responses appear. It is best to run the balancer when there are few
requests or little activity on the cluster.
A cluster is said to be
balanced if for each Datanode, the utilization of the node (ratio of used space
at the node to total capacity of the node) differs from the utilization of the
cluster (ratio of used space in the cluster to total capacity of the cluster)
by no more than the threshold value. The smaller the threshold, the more
balanced a cluster will become. It takes more time to run the balancer for
small threshold values. Also for a very small threshold the cluster may not be
able to reach the balanced state when applications write and delete files
concurrently.
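The balance condition above can be checked by hand with a few lines of code. This is a rough sketch, not the balancer's own implementation: the function names, the node names, and the (used, capacity) numbers are made up for illustration, and the threshold is assumed to be a percentage.

```python
from typing import Dict, List, Tuple

# Maps a datanode name to (used space, total capacity), in any consistent unit.
NodeUsage = Dict[str, Tuple[float, float]]


def cluster_utilization(nodes: NodeUsage) -> float:
    """Used space in the cluster divided by total capacity, as a percentage."""
    used = sum(u for u, _ in nodes.values())
    capacity = sum(c for _, c in nodes.values())
    return 100.0 * used / capacity


def unbalanced_nodes(nodes: NodeUsage, threshold_pct: float = 10.0) -> List[str]:
    """Return the nodes whose utilization differs from the cluster's
    utilization by more than threshold_pct percentage points."""
    cluster = cluster_utilization(nodes)
    return sorted(
        name
        for name, (used, capacity) in nodes.items()
        if abs(100.0 * used / capacity - cluster) > threshold_pct
    )


if __name__ == "__main__":
    # Illustrative numbers: dn4 is a newly added, almost empty node.
    nodes = {
        "dn1": (500, 1000),
        "dn2": (420, 1000),
        "dn3": (400, 1000),
        "dn4": (40, 1000),
    }
    print(cluster_utilization(nodes))   # 34.0
    print(unbalanced_nodes(nodes))      # ['dn1', 'dn4']
```

Here the cluster is 34% full, so with the default 10% threshold dn1 (50%) is overutilized and dn4 (4%) is underutilized, while dn2 and dn3 are within range. The cluster counts as balanced once the returned list is empty.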
Running the balancer always increases network activity, and if used aggressively it may
cause network congestion and degraded response times. The bandwidth it uses can therefore
be limited with the parameter “dfs.balance.bandwidthPerSec”. This parameter is defined in
hdfs-default.xml, and we can add the same property to hdfs-site.xml to override its value.
The value is given in bytes per second; the default is 1048576 (1 MB/s) and can be changed
according to the network bandwidth available.
It can be added as follows:
<property>
<name>dfs.balance.bandwidthPerSec</name>
<value>1000000</value>
</property>
If activity on the cluster is low, this value can be raised to speed up the balancing
process; if activity is high, it can be lowered to avoid network congestion and errors.
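To get a feel for how much the bandwidth value matters, here is a back-of-the-envelope estimate. It is only a sketch under simple assumptions: the bandwidth limit is the only bottleneck, the value is in bytes per second, and `concurrent_nodes` (a made-up parameter) is the number of datanodes transferring at the full limit at the same time. Real runs will be slower.

```python
def estimate_balancing_seconds(
    bytes_to_move: float,
    bandwidth_per_sec: float = 1048576,
    concurrent_nodes: int = 1,
) -> float:
    """Rough lower bound on balancing time, in seconds.

    Assumes every one of `concurrent_nodes` datanodes moves data at the
    full bandwidth limit for the whole run.
    """
    if bandwidth_per_sec <= 0 or concurrent_nodes <= 0:
        raise ValueError("bandwidth and node count must be positive")
    return bytes_to_move / (bandwidth_per_sec * concurrent_nodes)


if __name__ == "__main__":
    to_move = 100 * 1024**3  # 100 GB that has to be moved
    hours_at_1mb = estimate_balancing_seconds(to_move) / 3600
    hours_at_10mb = estimate_balancing_seconds(to_move, 10 * 1048576) / 3600
    print(round(hours_at_1mb, 1), round(hours_at_10mb, 1))
```

With a limit of 1048576 bytes per second, moving 100 GB through a single node takes more than a day, which is why the value is usually raised during quiet periods.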
Run the balancer when you:
· Add a rack to the cluster.
· Add a node to the cluster.
· Find that some disks are underutilized.
· Have changed the default bandwidth value to suit your network.
To be avoided:
· Don’t make the threshold too low or too high.
· Don’t run it from the Namenode machine; run it from a Datanode machine instead.
Note:
The balancer automatically stops and exits when:
· The cluster is already balanced or balancing has finished.
· There is no block to move, or the blocks can’t be moved.
· A block can’t be moved after 3 tries.
Happy Balancing :)