All the questions that scared me, now I am trying to scare them .. so that they can't scare others :)
Friday, January 31, 2014
Wednesday, January 29, 2014
Some optimization tricks for Hadoop & MapReduce
Here are some parameters we can use to tune Hadoop and MapReduce for better utilization.
Neither the parameters nor their values are fixed: test different settings and tune Hadoop to match the setup and the type of machines in your cluster.
io.sort.factor-->64
io.sort.mb-->254
mapred.reduce.parallel.copies
-->(number of machines * number of mappers)/2 (generally)
mapred.tasktracker.(map|reduce).tasks.maximum
-->map: less than the number of cores (with 8 cores, 5-10) (generally)
-->reduce: less than the number of mappers (4, 6, or 8) (generally)
-->number of map + reduce > number of cores (generally)
mapred.(map|reduce).tasks.speculative.execution-->true
-->lets the same task be executed on more than one machine in parallel
tasktracker.http.threads
-->there should be enough HTTP threads to support the parallel copies in the sort and shuffle phase.
Use LZO compression.
Use a combiner.
Implement a custom partitioner.
Input split
-->~64, 128, or 256 MB (size of each file or block)
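As a rough sketch, the rules of thumb above can be written as a small Python helper. The function name suggest_settings and its arguments are made up for illustration, and I am assuming that "number of mappers" means map slots per machine. Treat the result as a starting point for your own tests, not as tuned values.

```python
def suggest_settings(machines, cores, map_slots, reduce_slots):
    """Apply the rules of thumb from the list above to one cluster layout.

    machines     -- number of machines in the cluster
    cores        -- CPU cores per machine
    map_slots    -- map slots per machine (assumed meaning of "number of mappers")
    reduce_slots -- reduce slots per machine
    """
    settings = {
        "io.sort.factor": 64,
        "io.sort.mb": 254,
        # (number of machines * number of mappers) / 2
        "mapred.reduce.parallel.copies": (machines * map_slots) // 2,
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
    }

    # Collect reminders when a layout departs from the "generally" rules.
    notes = []
    if reduce_slots >= map_slots:
        notes.append("reduce slots are usually fewer than map slots")
    if map_slots + reduce_slots <= cores:
        notes.append("map + reduce slots are usually more than the core count")
    return settings, notes


if __name__ == "__main__":
    settings, notes = suggest_settings(machines=10, cores=8, map_slots=6, reduce_slots=4)
    for name, value in settings.items():
        print(f"{name} --> {value}")
    for note in notes:
        print("note:", note)
```

The notes list only flags layouts that depart from the rules; it does not decide anything for you.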
MySQL: SUBSTRING_INDEX - Select Patterns
Consider a MySQL table with values in a column like these:
SELECT location FROM geo LIMIT 3;
"location"
"India.Karnataka.Shimoga.Gopala"
"India.Karnataka.Bengaluru.BTM"
"India.Karnataka.Chikmaglore.Koppa"
My requirement is to take only the 4th value from each row (Gopala, BTM, Koppa).
I don't want to display the remaining values.
It's the same as what the 'cut' command does in Linux.
For this, we can use the SUBSTRING_INDEX function.
SELECT SUBSTRING_INDEX(location,'.',-1) from geo LIMIT 3;
"location"
"Gopala"
"BTM"
"Koppa"
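Syntax: SUBSTRING_INDEX(string,delimiter,count)
Here count is the number of delimiter occurrences to count: the function returns everything before the count-th delimiter.
A negative value counts the delimiters from the right side and returns everything after that delimiter. With -1 that is the last field, which is the 4th value here because every row has exactly four fields.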
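To make the count rule concrete, here is a minimal Python imitation of it. The function substring_index is my own illustration, not part of MySQL or of any library, and it assumes a non-empty delimiter.

```python
def substring_index(text, delimiter, count):
    """Imitate MySQL's SUBSTRING_INDEX for a non-empty delimiter."""
    if count == 0:
        # As far as I know MySQL returns an empty string for count 0.
        return ""
    parts = text.split(delimiter)
    if count > 0:
        # Everything before the count-th delimiter, counted from the left.
        return delimiter.join(parts[:count])
    # Everything after the |count|-th delimiter, counted from the right.
    return delimiter.join(parts[count:])


if __name__ == "__main__":
    row = "India.Karnataka.Shimoga.Gopala"
    print(substring_index(row, ".", -1))
    print(substring_index(row, ".", 2))
    # A field in the middle needs two calls: cut to N fields, keep the last.
    print(substring_index(substring_index(row, ".", 3), ".", -1))
```

The nested form in the last line works the same way in SQL: SUBSTRING_INDEX(SUBSTRING_INDEX(location,'.',3),'.',-1) picks the 3rd value.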
So, if I give '-2' instead of '-1':
SELECT SUBSTRING_INDEX(location,'.',-2) from geo LIMIT 3;
"location"
"Shimoga.Gopala"
"Bengaluru.BTM"
"Chikmaglore.Koppa"
Monday, January 20, 2014
Hadoop: over-utilization of HDFS
Do you face the problem of HDFS overusing the disks on a DataNode? Usage frequently reaches 100%, which results in an unbalanced cluster. To solve this, set the parameter "dfs.datanode.du.reserved". It reserves disk space (in bytes, per volume) for non-HDFS use, so HDFS always leaves some space free and no longer fills the disk.
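Because the value is given in bytes, it is easy to get the number of zeros wrong. Below is a small Python sketch that turns a size in GiB into the property block for hdfs-site.xml. The helper name du_reserved_property and the 10 GiB figure are only examples; pick a size that suits your disks.

```python
def du_reserved_property(reserved_gib):
    """Render the dfs.datanode.du.reserved property for hdfs-site.xml.

    reserved_gib -- space to keep free for non-HDFS use on each volume, in GiB
    """
    reserved_bytes = reserved_gib * 1024 ** 3  # the property expects bytes
    return (
        "<property>\n"
        "  <name>dfs.datanode.du.reserved</name>\n"
        f"  <value>{reserved_bytes}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Example only: reserve 10 GiB per volume.
    print(du_reserved_property(10))
```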
Thursday, January 9, 2014
Hadoop small files and block allocation
One of the misconceptions about Hadoop is that files smaller than the block size (64 MB by default) still use a whole block on the filesystem, so space is wasted on HDFS. This is not true. Smaller files occupy only as much disk space as they require (a 1 MB file on local disk takes roughly the same space on HDFS). But this does not mean that having many small files uses HDFS efficiently. Regardless of how much data a block holds, its metadata on the NameNode occupies the same amount of memory. As a result, a large number of small HDFS files (smaller than the block size) uses a lot of the NameNode's memory, negatively impacting HDFS scalability and performance.
So HDFS blocks are not a storage allocation unit, but a replication unit.
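A back-of-the-envelope Python sketch shows the difference. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object, and the default replication factor of 3. Both numbers, and the function name hdfs_footprint, are assumptions for illustration, not measurements.

```python
import math

BLOCK_SIZE = 64 * 1024 ** 2   # 64 MB default block size, in bytes
BYTES_PER_OBJECT = 150        # assumed NameNode memory per file/block object


def hdfs_footprint(file_size, file_count, replication=3):
    """Estimate (disk bytes, NameNode memory bytes) for equally sized files."""
    blocks_per_file = math.ceil(file_size / BLOCK_SIZE)
    # Disk usage follows the real data size, not the block size.
    disk_bytes = file_size * file_count * replication
    # The NameNode tracks one object per file plus one per block.
    namenode_objects = file_count * (1 + blocks_per_file)
    return disk_bytes, namenode_objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    mib, gib = 1024 ** 2, 1024 ** 3
    small = hdfs_footprint(file_size=1 * mib, file_count=1024)  # 1024 files of 1 MB
    large = hdfs_footprint(file_size=1 * gib, file_count=1)     # one 1 GB file
    print("disk bytes:     ", small[0], "vs", large[0])
    print("namenode bytes: ", small[1], "vs", large[1])
```

Under these assumptions both layouts take the same disk space, but the small files need far more NameNode memory.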
List of different Hadoop distributions
Cloudera CDH, Manager, and Enterprise
Based on Hadoop 2, CDH (version 4.1.2 as of this writing) includes HDFS, YARN, HBase, MapReduce, Hive, Pig, Zookeeper, Oozie, Mahout, Hue, and other open source tools (including the real-time query engine — Impala).
Hortonworks Data Platform
Sunday, January 5, 2014
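Increment a variable in the Linux shell
The following methods can be used to increment a variable in a shell script, for example inside a loop:
- j=$((i++))  (post-increment in bash: j gets the old value of i, then i itself is increased by 1)
- j=$(( i + 1 ))  (j gets i + 1, i is unchanged)
- j=`expr $i + 1`  (same result as the previous line, using the external expr command)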
Happy scripting :)