Monday, December 30, 2013

Get file permission in numeric form in linux

We can use the following command to get the file permission in numeric form instead of string form.

stat -c "%a %n" *

Where 

-c   --> specify custom format
%a --> Access rights in octal format
%n --> file name
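
For example (a quick check; the exact mode will vary per file and system):

stat -c "%a %n" /etc/passwd
644 /etc/passwd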

Sunday, November 10, 2013

MySQL: Replication - Skip sql statement from execution

In any MySQL replication setup, sooner or later we come across errors on the slave that cause replication to stop.

If the error is the result of manual intervention on replication, we may have to skip the current SQL transaction on the slave that caused the error.

In that case, these are the commands to execute:

SET GLOBAL sql_slave_skip_counter=1;
START SLAVE;



You can increase the counter value if you want to skip multiple consecutive transactions.
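
A fuller sketch of the sequence (the counter value 2 is only an example; pick the number of events you actually need to skip, and check the slave status afterwards):

STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 2;
START SLAVE;
SHOW SLAVE STATUS\G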

Friday, October 4, 2013

Prevent accidental data loss in Hadoop

Sometimes you accidentally delete a file that you were not supposed to. So what can you do in that case?

The option you have is to enable the trash (recycle bin) for Hadoop by defining fs.trash.interval,
for example to one day (1440 minutes), or more or less depending on how valuable the data you host in Hadoop is.

You can do it in the following way:

Open core-site.xml and add the following property.
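A minimal sketch of that property, using the one-day (1440-minute) value mentioned above:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Minutes between trash checkpoints; 0 disables the trash.</description>
</property>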
  

Wednesday, September 25, 2013

Starting/Stopping Hadoop and HBase in order

Hadoop Startup Process

Namenode  -->  Datanodes  -->  SecondaryNamenode  -->  JobTracker  -->  TaskTrackers

Hadoop Shutdown Process

JobTracker  -->  TaskTrackers  -->  Namenode  -->  Datanodes  -->  SecondaryNamenode

HBase Startup Process

Zookeepers  -->  HMaster  -->  RegionServers

HBase Shutdown Process

RegionServers  -->  HMaster  -->  Zookeepers
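
If you start the daemons one by one, a sketch of that order with the standard helper scripts (assuming a Hadoop 1.x / HBase 0.9x layout, HADOOP_HOME/HBASE_HOME set, and HBase managing its own ZooKeeper) could look like this:

# Hadoop
$HADOOP_HOME/bin/hadoop-daemon.sh start namenode
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode           # on each slave
$HADOOP_HOME/bin/hadoop-daemon.sh start secondarynamenode
$HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker        # on each slave

# HBase
$HBASE_HOME/bin/hbase-daemon.sh start zookeeper
$HBASE_HOME/bin/hbase-daemon.sh start master
$HBASE_HOME/bin/hbase-daemon.sh start regionserver         # on each region server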

Thursday, September 19, 2013

Too many fetch-failures, hadoop mapreduce job, hadoop mapreduce Too many fetch-failures

This may happen due to following reasons:

1. Wrong DNS entry or hosts file entries

Description:

Nodes are not able to communicate with each other, so from the nodes where this error appears, check whether you can ping or nslookup the other nodes.

2. TaskTracker HTTP threads

Description:

Check the value of this setting in mapred-site.xml; if it is low, raise it to around 100.
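
For the second case, a sketch of the property (in Hadoop 1.x the setting is tasktracker.http.threads in mapred-site.xml; the default is 40):

<property>
  <name>tasktracker.http.threads</name>
  <value>100</value>
</property>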

Tuesday, September 17, 2013

Get Directory path from a file path/Get file name from a filepath in shell Linux/Unix

Get Directory path from a file path.

Command : 

dirname /path/path1/filename

Output :

 /path/path1

Get filename from a filepath

Command:

basename /path/path1/filename

Output:

filename
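
A small sketch of how you would typically use both inside a script (the variable names are just illustrative):

file=/path/path1/filename
dir=$(dirname "$file")      # /path/path1
name=$(basename "$file")    # filename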

Friday, August 23, 2013

Getting thread count of a process using PID in Linux

Using the following command, we can get the thread count of a process in Linux using the PID of the process.

ps -o thcount -p <PID>

If we need some more detail, like the process name, PID, and the user under which the process is running, we can use the following command:

ps -o user,comm,pid,thcount -p <PID>
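
As a cross-check, you can also read the thread count straight from /proc (replace <PID> with the real process id):

grep Threads /proc/<PID>/status
ls /proc/<PID>/task | wc -l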


Tuesday, August 6, 2013

Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Table 'xxxxxx' doesn't exist

FAILED: Error in metadata: MetaException(message:Got exception: javax.jdo.JDODataStoreException Exception thrown obtaining schema column information from datastore)
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: javax.jdo.JDODataStoreException Exception thrown obtaining schema column information from datastore

Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Table 'xxxxxx' doesn't exist

Follow the steps below to solve this problem.

Friday, July 26, 2013

The path "" is not a valid path to the 3.5.0-17-generic kernel headers / VMWare can't find linux headers path

Error/Problem while installing VMware Tool in Virtual Machine

If this error falls in front of you (Ubuntu),

Throw this command in front of it :)

sudo apt-get install build-essential linux-headers-`uname -r` psmisc

-or-

sudo apt-get install linux-headers-$(uname -r)


If either of these commands executes successfully, the error will run away from you :)

Thursday, July 11, 2013

Setting HeartBeat Interval for Datanode

Setting up the following in your hdfs-site.xml will give you 1-minute timeout.
<property>
 <name>heartbeat.recheck.interval</name>
 <value>15</value>
 <description>Determines datanode heartbeat interval in seconds</description>
</property>
If above doesn't work - try the following (seems to be version-dependent):
<property>
 <name>dfs.heartbeat.recheck.interval</name>
 <value>15</value>
 <description>Determines datanode heartbeat interval in seconds.</description>
</property>

Timeout equals 2 * heartbeat.recheck.interval + 10 * heartbeat.interval. The default for heartbeat.interval is 3 seconds, so with the value above the timeout is 2 * 15 + 10 * 3 = 60 seconds, i.e. one minute.

Thursday, July 4, 2013

How to copy files from one Hadoop Cluster to another ?

Suppose you want to copy files from one Hadoop cluster to another; you have three options:

1 : Copy the files to local and then copy them from local to the other cluster, using
  
 copyToLocal and then copyFromLocal
  
 -get and -put

But this is not a good option.

So we have another option:

-cp and distcp

distcp requires MapReduce to be running; if you don't want to run MapReduce on your cluster, you have the other option, which is -cp.

Usage:
hadoop dfs -cp hdfs://<source> hdfs://<destination>

If you want a faster copy, use distcp; for that, your JobTracker and TaskTrackers must be running.

distcp usage:

hadoop distcp hdfs://<source> hdfs://<destination>
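
A concrete sketch with made-up NameNode hosts (both clusters should run compatible Hadoop versions for an hdfs:// to hdfs:// distcp):

hadoop distcp hdfs://namenode1:8020/user/data hdfs://namenode2:8020/user/data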

Thursday, June 27, 2013

List top 10/n biggest/smallest files on hadoop (size wise)

List top 10 biggest files in a directory on hadoop:

hadoop dfs -du /testfiles/hd|sort -g -r|head -n <N>   {N here is the top number of files you want to list}

hadoop dfs -du /testfiles/hd|sort -g -r|head -n 10

List top 10 biggest files on hadoop (recursively):

hadoop dfs -lsr /|awk '{print $5 "\t\t" $8}'|sort -n -r|head -n <N>   {N here is the top number of files you want to list}
hadoop dfs -lsr /|awk '{print $5 "\t\t" $8}'|sort -n -r|head -n 10

List top 10 smallest files in a directory on hadoop:

hadoop dfs -du /testfiles/hd|sort -g -r|tail -n <N>   {N here is the top number of files you want to list}

hadoop dfs -du /testfiles/hd|sort -g -r|tail -n 10

List top 10 smallest files on hadoop (recursively):

hadoop dfs -lsr /|awk '{print $5 "\t\t" $8}'|sort -n -r|tail -n <N>   {N here is the top number of files you want to list}
hadoop dfs -lsr /|awk '{print $5 "\t\t" $8}'|sort -n -r|tail -n 10

Wednesday, June 26, 2013

Hadoop version support matrix, hadoop Hbase Version compatibility


                     HBase-0.92.x   HBase-0.94.x   HBase-0.95
Hadoop-0.20.205      S              X              X
Hadoop-0.22.x        S              X              X
Hadoop-1.0.0-1.0.2   S              S              X
Hadoop-1.0.3+        S              S              S
Hadoop-1.1.x         NT             S              S
Hadoop-0.23.x        X              S              NT
Hadoop-2.x           X              S              S
HBase requires Hadoop 1.0.3 at a minimum; there is an issue where KerberosUtil cannot be found when compiling against earlier versions of Hadoop.

Where
S = supported and tested,
X = not supported,
NT = it should run, but not tested enough.  

Because HBase depends on Hadoop, it bundles an instance of the Hadoop jar under its lib directory. The bundled jar is ONLY for use in standalone mode. In distributed mode, it is critical that the version of Hadoop that is out on your cluster matches what is under HBase. Replace the hadoop jar found in the HBase lib directory with the hadoop jar you are running on your cluster to avoid version mismatch issues. Make sure you replace the jar in HBase everywhere on your cluster. Hadoop version mismatch issues have various manifestations, but often everything just looks like it is hung up.

Saturday, June 22, 2013

Keep your locate database updated in linux

We may use the find command to find a file in Linux, but it is a bit slower than locate.
locate is a command that helps us find a file efficiently and a bit faster.
It maintains a database called

"/var/lib/mlocate/mlocate.db"

and also has a utility called "updatedb" which keeps this database updated with the files
that are on the system. So when we run updatedb, it refreshes the database with newly
added files. This database is kept by the Linux system to maintain an up-to-date file list.
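
So a typical sequence looks like this (the file name is just an example):

sudo updatedb          # refresh /var/lib/mlocate/mlocate.db
locate myfile.conf     # now finds recently added files too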

Wednesday, June 19, 2013

If you want a file not to be modified by anyone, even the root user

Suppose you have a file which you never want anyone to delete, modify, or move; here is the option you have.

This is called an immutable file, which is supported on Linux ext2/ext3.

You can do it as follows:

Suppose you want to protect a file called configuration which you never want to be modified, moved, or removed; then what you can do is:

chattr +i configuration

After doing that, the file configuration can't be deleted, moved, or changed.

If you want to make it a simple file again, you can do:

chattr -i configuration
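
You can verify the flag with lsattr (the exact output format varies by e2fsprogs version):

lsattr configuration     # an 'i' in the attribute column means the file is immutable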

Wednesday, May 29, 2013

Recursively list all files in a directory Linux



find -follow -type f                              --> find all files recursively (type file)

ll -R                                             --> long list recursively

ls -LR                                            --> list all recursively

tree -l                                           --> tree structure

find /dir -type f -follow -print | xargs ls -l    --> find and list all files

Friday, May 24, 2013

Find and move a list of files spread over different locations on a Linux machine


#!/bin/sh
#----------------------------------------------------------------#
#  USAGE : sh fileFinderMover.sh <filewithfilenames> <targetpath> #
#----------------------------------------------------------------#
for i in `cat "$1"`
do
  path=`locate "$i"`
  echo "$path"
  mv "$path" "$2"
done



This script will search for each file given in the list and move it to the target location.

Wednesday, May 22, 2013

hadoop oiv : hadoop offline image viewer, get a list of all files on hadoop


Replace a word in file Linux Shell

sed -i 's/<word to replace>/<new word>/g' <filenametobereplacedin>

Eg:

sed -i 's/shashwat/shriparv/g' shashwat.txt

Better way to get a list of all files on Hadoop, using shell.





Hadoop provides an option, oiv (offline image viewer), which can read the Hadoop fsimage file and write it to an output file in human-readable format.


Syntax and usage:


bin/hadoop oiv -i <Hadoop image file name> -o <output file in human readable form>


Options with this command:


There is an additional option for the output file format, which is defined using the switch -p <format>.
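
A sketch of a typical run (the fsimage file name and location are illustrative; on Hadoop 1.x the fsimage normally lives under the dfs.name.dir "current" directory):

bin/hadoop oiv -i /data/dfs/name/current/fsimage -o /tmp/fsimage.txt
less /tmp/fsimage.txt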
 

Saturday, April 13, 2013

List Files from hdfs/Hadoop Recursively using java


import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 *
 * @author Shashwat Shriparv
 * @email  dwivedishashwat@gmail.com
 * @web    helpmetocode.blogspot.com
 */
public class RecursivelyPrintFilesOnHDFS {

    public static void main(String[] args) throws IOException, InterruptedException, URISyntaxException {
        printFilesRecursively("hdfs://master1:9000/");
    }

    public static void printFilesRecursively(String Url) throws IOException {
        try {
            // Completion sketch: list the directory and recurse into sub-directories.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(new URI(Url), conf);
            for (FileStatus file : fs.listStatus(new Path(Url))) {
                if (file.isDir()) {
                    printFilesRecursively(file.getPath().toString());
                } else {
                    System.out.println(file.getPath().toString());
                }
            }
        } catch (URISyntaxException ex) {
            Logger.getLogger(RecursivelyPrintFilesOnHDFS.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
Write file to HDFS/Hadoop Read File From HDFS/Hadoop Using Java


import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/**
 *
 * @author    Shashwat Shriparv
 * @email     dwivedishashwat@gmail.com
 * @Web       helpmetocode.blogspot.com
 */
public class WritetoHDFSReadFromHDFSWritToLocal {
    private static byte[] buffer;
    private static int bytesRead;

    public static void main(String[] args) throws IOException, InterruptedException, URISyntaxException {
        // Completion sketch: write a file to HDFS, read it back, then copy it to the local file system.
        // The HDFS URI matches the earlier example; the paths below are illustrative.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://master1:9000/"), conf);
        Path hdfsFile = new Path("/user/test/sample.txt");

        // Write to HDFS
        FSDataOutputStream out = fs.create(hdfsFile, true);
        out.writeBytes("Hello HDFS\n");
        out.close();

        // Read the file back from HDFS
        FSDataInputStream in = fs.open(hdfsFile);
        buffer = new byte[1024];
        while ((bytesRead = in.read(buffer)) > 0) {
            System.out.write(buffer, 0, bytesRead);
        }
        in.close();

        // Copy the HDFS file to the local file system
        fs.copyToLocalFile(hdfsFile, new Path(new File("sample.txt").getAbsolutePath()));
        fs.close();
    }
}

Thursday, April 4, 2013

Insert string after each N lines in a file

We can do this as follows:

awk '1;!(NR%<Number after which the line has to be inserted>){print "String to be inserted";}' originalfiletoprocess > outfilewithinsertedstring

Eg:

awk '1;!(NR%100){print "Shashwat Shriparv";}' filecontainingtxt > outputfilewithnewinsertedlines

This command will read filecontainingtxt and insert the string "Shashwat Shriparv" after every 100 lines; the output file will be outputfilewithnewinsertedlines.

Thursday, March 28, 2013

WebHDFS REST API

The HTTP REST API supports the complete FileSystem interface for HDFS.

Operations

For more, please visit the WebHDFS documentation.
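
A couple of sketch requests (assuming dfs.webhdfs.enabled is true and the NameNode web port is the default 50070; host and paths are illustrative):

# List a directory
curl -i "http://<namenode-host>:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS"

# Read a file (follow the redirect to the datanode)
curl -i -L "http://<namenode-host>:50070/webhdfs/v1/user/hadoop/file.txt?op=OPEN"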



Jobtracker API error - Call to localhost/127.0.0.1:50030 failed on local exception: java.io.EOFException

Try the port number listed in your $HADOOP_HOME/conf/mapred-site.xml under the mapred.job.tracker property. Here's my pseudo-distributed mapred-site.xml conf:

<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>

If you look at the JobTracker.getAddress(Configuration) method, you can see it uses this property if you don't explicitly specify the jobtracker host / port:

public static InetSocketAddress getAddress(Configuration conf) {
  String jobTrackerStr =
      conf.get("mapred.job.tracker", "localhost:8012");
  return NetUtils.createSocketAddr(jobTrackerStr);
}


Tuesday, March 26, 2013

See content of Tar/tar gz file without extracting it

Have you ever needed to peek inside a tar/tar.gz file in a UNIX/Linux terminal when you don't have a GUI option to do so? If so, here is the solution you will want to use :)

tar -tf tarfileyouwanttopeekin.tar
or
tar -tf tarfileyouwanttopeekin.tar.gz
or
tar -tf tarfileyouwanttopeekin.tar.bz2


If you want to look into a zip file:

unzip -l tarfileyouwanttopeekin.zip
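
If you also want sizes, owners, and timestamps, add -v for a long listing:

tar -tvf tarfileyouwanttopeekin.tar.gz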

Thursday, March 14, 2013

Adding Scheduler to Hadoop Cluster

 

As we know, when we execute tasks or jobs on Hadoop it follows FIFO scheduling, but if you are in a multi-user Hadoop environment you will need a better scheduler for the consistency and correctness of task scheduling.

Hadoop comes with other schedulers too those are:

Fair Scheduler : This defines pools, and over time each pool gets around the same amount of resources.

Capacity Scheduler : This defines queues, and each queue has a guaranteed capacity. The capacity scheduler shares compute resources allocated to a queue with other queues if those resources are not in use.

To change the scheduler you need to take your cluster offline and make some configuration changes; first make sure that the correct scheduler jar files are there. In older versions of Hadoop you had to put the jar file in the lib directory if it was not there, but from Hadoop 1 these jars are available in the lib folder, so if you are using a newer Hadoop, good news for you :)

Steps will be:
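
As a sketch of what the main configuration change looks like for the Fair Scheduler (Hadoop 1.x property names, set in mapred-site.xml):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>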

Using C++ or C to interact with hadoop

Are you a C++ or C programmer who is not willing to write Java code to interact with Hadoop/HDFS? OK, you have an option: the libhdfs native library, which enables you to write programs in C or C++ to interact with Hadoop.

Current Hadoop distributions contain the pre-compiled libhdfs libraries for 32-bit and 64-bit Linux operating systems. You may have to download the Hadoop standard distribution and compile the libhdfs library from the source code, if your operating system is not compatible with the pre-compiled libraries.

For more information read following:

http://wiki.apache.org/hadoop/MountableHDFS

https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS

http://xmodulo.com/2012/06/how-to-mount-hdfs-using-fuse.html

Writing code in C or C++ follows:

Finding out block location and block size of file on HDFS

 

Have you ever needed to find out the block locations and block size for a file which is lying on HDFS? If so, here is the command you can use to find that out.

For that we need the "fsck" command which Hadoop provides.

Here goes the command:

bin/hadoop fsck /filepath/filenameonhdfs -files -blocks -locations

This command will provide information about the block locations, i.e. which datanodes the blocks are lying on, and what the blocks for that file are.

Just go and play with the command; you will understand more. :)

Bench marking Hadoop


Have you completed setting up your Hadoop cluster? Now it's time to benchmark it, as this will help you to confirm that your Hadoop is production-ready and configured properly.
So how to proceed?

The benchmarking programs are in the hadoop-*-test.jar file, which you can call.

So let's try TestDFSIO: this will test the read/write performance of HDFS.

How can we run this?
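
A sketch of a typical run (the jar file name varies with your Hadoop version; the file count and size are illustrative):

# Write test: 10 files of 1000 MB each
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

# Read test on the same files, then clean up the test data
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -clean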

Wednesday, March 13, 2013

jps not working: jps: command not found

Either install OpenJDK

or

do following

Create an alias for JPS using following:

alias jps='/usr/lib/jvm/jdk1.6.0_33/bin/jps'
Or, if you just want to see the Java processes running, execute one of the following commands:

ps -ef | grep java
or
ps aux | grep java

Thursday, February 28, 2013

Rebalancing your hdfs, hadoop cluster

Are you stuck in a scenario where the replication factor is not correct, or not what you expect it to be?

You can go for rebalancing of HDFS; what you can do is:

Suppose your target replication factor is 2, but some files are showing a replication factor of 1, 2, or 3, and you want all of them to be 2.

Just increase the replication factor and then decrease it for your HDFS root file system recursively.

Suppose you need a replication factor of 2 but some blocks have a replication factor of 1, and Hadoop is not automatically replicating these blocks; then you can increase the replication factor to 3 and afterwards decrease it back to 2.

Use following command to increase and then decrease the replication factor.

Increasing:

hadoop dfs -setrep -R 3 /      -----> This will increase the replication factor of all files to 3 and replicate them automatically; once you have enough replicas, you can decrease the replication factor to stabilize the cluster as you need.

Decreasing:

hadoop dfs -setrep -R 2 /     ------>  This will set the replication factor to 2 recursively for your HDFS root partition.

You can apply this method to a single file or a specific folder too.

If you have over- and under-utilized nodes in the Hadoop cluster, you can run the balancer, which is in the bin directory, to make your cluster balanced.
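
A sketch of running the balancer (the threshold is the allowed deviation, in percent, of a node's usage from the cluster average; 10 is the default):

bin/start-balancer.sh -threshold 10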

NOTE: You should have enough space on your DFS for replication, because as you increase the replication factor, it will need space.

Wednesday, February 27, 2013

Import all tables from rdbms using sqoop

Import all tables from a database using Sqoop

sqoop import-all-tables --connect jdbc:mysql://<servername>/databasename
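
A slightly fuller sketch (host, database, user, and target directory are illustrative; -P prompts for the password):

sqoop import-all-tables \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --warehouse-dir /user/hive/warehouse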

The command could not be located because '/usr/bin' is not included in the PATH environment variable. error

If you are getting this error:

Symptoms :

1. You are not able to log in; the login screen keeps returning to itself.
2. If you press Ctrl+Alt+[F1-F7] and issue some command, it throws this error.

Solution:

First check the output of "echo $PATH" to see whether it has entries like:

/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/bin:/bin:/sbin

If those entries are not there in the PATH variable, there is a problem.

So let's start:

Log in with your user; on a terminal type /usr/bin/vi /etc/environment (or /usr/bin/vim /etc/environment)

and add

PATH="/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/bin:/bin:/sbin"

and restart the machine.

Sunday, February 10, 2013

Difference between MySQL INT(1) or INT(10)


Hello Guys !!

Here, I would like to discuss the differences between MySQL int(1) & int(10) ... etc 

In short, it really doesn't matter. 
I know I'm not alone in thinking that it affected the size of the data field. An unsigned INT has a max value of 4294967295 no matter if it's INT(1) or INT(10), and it will use 4 bytes of data.

So, what does the number in the brackets mean? It pretty much comes down to display; it's called the display width. The display width is a number from 1 to 255. You can set the display width if you want all of your integer values to "appear" the same width. If you enable ZEROFILL on the column, a value of 0 will be displayed as 0 for INT(1) and 0000000000 for INT(10).

There are 5 main numeric data types, and you should choose each one on its own merits. Based on the data you expect (or in some cases hope) to hold, you should use the correct data type. If you don't ever expect to use a value above 127 or below -128, you should use a TINYINT. This will only use 1 byte of data, which may not seem like much of a difference compared to the 4 used by an INT, but as soon as you start to store more and more data, the effect on speed and space will be noticeable.

Anyway, I thought I should share my new-found knowledge of the display width with everyone, because it will save me from thinking I'm optimising stuff by changing from 10 to 5, ha ha.

  Illustration :-
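
A minimal sketch of the ZEROFILL behaviour described above (table and column names are made up):

CREATE TABLE width_demo (a INT(1) ZEROFILL, b INT(10) ZEROFILL);
INSERT INTO width_demo VALUES (42, 42);
SELECT * FROM width_demo;
-- a: 42    b: 0000000042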

Friday, February 1, 2013

Phoenix: A SQL layer over HBase 'We put the SQL back in the NoSQL'

 

Phoenix is a SQL layer over HBase, delivered as a client-embedded JDBC driver, powering the HBase use cases at Salesforce.com. Phoenix targets low-latency queries (milliseconds), as opposed to batch operation via map/reduce. To see what's supported, go to our language reference guide, and read more on our wiki.

Tuesday, January 29, 2013

Reset root password in Linux

1. Reboot the system

2. Press any key while your Linux box boots up; it will bring you to the GRUB loader.
There will be an image name; select it.

3. Press 'e'

4. It will enter the GRUB boot screen

5. It will contain the following list:

  • Image name
  • Kernel version
  • MBR

6. Select the kernel line and press 'e'; it will take you to edit mode

7. Put your cursor at the end of the resulting line and type one of: single, 1, or s

8. Save it

9. After saving it, you will be back at step 5

10. Here you press 'b'; this will boot up the system (with the selected kernel line)

11. This will take you to the single user mode

12. Then type 'passwd'; it will not ask you for the old password, it will ask you to type a new password.

13. Type the new password and it's done.

View file sizes sorted, file size/folder size sorted by size

du -sm * | sort -nr --> Descending

du -sm * | sort -n  --> Ascending

du -sc * | sort -nr --> Total size of current folder

du -sm * | sort -nr | head -10 --> First 10 : Top 10 Largest Items

du -sm * | sort -nr | tail -10
--> Last 10

ls -ltrh | grep -E '[0-9]{3}[MG]'  -->  will show files with a three-digit size in MB or GB

Sunday, January 20, 2013

FAILED FSError: java.io.IOException: No Space left on device

 

FAILED FSError: java.io.IOException: No Space left on device

org.apache.hadoop.util.DiskChecker$DiskErrorException : Could not find any valid local directory for taskTracker/jobcache

When you run Hive queries and your queries fail, just go to the Hadoop JobTracker (usually on port 50030 on your Hadoop master node) and check the failed or killed jobs; you may find these errors.

The reasons behind this will be the following:

Friday, January 18, 2013

Shell script: Using FOR loop

Here's a simple shell script to print numbers using a for loop.

Just go through:


# cat test.sh

#!/bin/bash
for ((i=0;i<=10;i++))
do
  echo $i
done



Output: the numbers 0 to 10, one per line.

Monday, January 14, 2013

Automatically finding files of a specific size, type, or modified date and adding them to a tar


Refer following command for your specific need:

#All files of size greater than 1GB

find / -type f -size +1048576k | xargs tar -czf myfile.tgz

#All files modified one day before

find / -type f -mtime -1 | xargs tar -czf myfile.tgz

#All files modified one day before and size more than 1GB

find / -type f -mtime -1 -size +1048576k | xargs tar -czf myfile.tgz

------------------------ OR ------------------------------------
