All the questions that scared me, now I am trying to scare them .. so that they can't scare others :)
Monday, December 30, 2013
Get file permissions in numeric form in Linux
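A quick way to do this is with stat from GNU coreutils; %a prints the octal mode and %n the name (the file path here is just an example):
stat -c '%a %n' /etc/passwd
# prints e.g.: 644 /etc/passwd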
Sunday, November 10, 2013
MySQL: Replication - Skip an SQL statement from execution
If a replication error is due to manual intervention, we may have to skip the current SQL transaction on the slave that caused the error.
In that case, below is the command to be executed:
SET GLOBAL sql_slave_skip_counter=1;
START SLAVE;
You can increase the counter value if you want to skip multiple consecutive transactions.
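For example, to skip three consecutive bad transactions in one go (a sketch; stop the SQL thread first if it is still running):
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter=3;
START SLAVE;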
Friday, October 4, 2013
Prevent accidental data loss in Hadoop
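One standard safeguard is enabling the HDFS trash, so that deleted files are moved into a .Trash directory instead of being removed immediately. A minimal sketch for core-site.xml (the value is in minutes; 1440 keeps deleted files for one day):
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>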
Thursday, October 3, 2013
Display two files side by side in linux
You have the following options:
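A few standard commands that can do this (file names are placeholders):
sdiff file1 file2        # side-by-side diff of the two files
diff -y file1 file2      # same idea, via diff
paste file1 file2        # glue the lines together, tab-separated
pr -m -t file1 file2     # merge the files column by column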
Wednesday, September 25, 2013
Starting/Stopping Hadoop and HBase in order
Hadoop Startup Process
Namenode --> Datanodes --> SecondaryNamenode --> JobTracker --> TaskTrackers
Hadoop Shutdown Process
JobTracker --> TaskTrackers --> Namenode --> Datanodes --> SecondaryNamenode
HBase Startup Process
Zookeepers --> HMaster --> RegionServers
HBase Shutdown Process
RegionServers --> HMaster --> Zookeepers
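With the stock control scripts run on the master node, these orderings are what the helpers already do (Hadoop 1.x script names assumed):
bin/start-dfs.sh      # Namenode, Datanodes, SecondaryNamenode
bin/start-mapred.sh   # JobTracker, then TaskTrackers
bin/start-hbase.sh    # Zookeeper (if HBase manages it), HMaster, RegionServers
# and shutdown in reverse:
bin/stop-hbase.sh
bin/stop-mapred.sh
bin/stop-dfs.sh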
Thursday, September 19, 2013
Too many fetch-failures in a Hadoop MapReduce job
1. Wrong DNS entries or hosts file entries
Description:
Nodes are not able to communicate with each other; from the nodes where this error appears, check that you can ping or nslookup the other nodes.
2. TaskTracker HTTP threads
Description:
Check the value of this setting (tasktracker.http.threads) in mapred-site.xml; if it is low, raise it to somewhere around 100.
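As a sketch, the relevant property (Hadoop 1.x name) looks like this in mapred-site.xml:
<property>
<name>tasktracker.http.threads</name>
<value>100</value>
</property>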
Tuesday, September 17, 2013
Get the directory path from a file path / get the file name from a file path in shell (Linux/Unix)
Command :
dirname /path/path1/filename
Output :
/path/path1
Get filename from a filepath
Command:
basename /path/path/filename
Output:
filename
Friday, August 23, 2013
Getting thread count of a process using PID in Linux
ps -o thcount -p <PID>
If we need some more detail, such as the command name, PID, and the user the process is running under, we can use the following command:
ps -o user,comm,pid,thcount -p <PID>
Tuesday, August 6, 2013
Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Table 'xxxxxx' doesn't exist
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: javax.jdo.JDODataStoreException Exception thrown obtaining schema column information from datastore
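A common cause is a metastore schema that was never created in the backing MySQL database. One hedged fix for older Hive versions is to let DataNucleus create the missing tables, via hive-site.xml (property name assumes the DataNucleus-backed metastore):
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>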
Friday, July 26, 2013
The path "" is not a valid path to the 3.5.0-17-generic kernel headers / VMWare can't find linux headers path
If this error falls in front of you (Ubuntu),
throw this command in front of it :)
sudo apt-get install build-essential linux-headers-`uname -r` psmisc
-or-
sudo apt-get install linux-headers-$(uname -r)
If either of these commands executes successfully, the error will run away from you :)
Sunday, July 14, 2013
Hadoop, Hive, HBase, Pig configuration on Linux
Configure Hadoop and HBase on Linux
Thursday, July 11, 2013
Setting the Heartbeat Interval for the Datanode
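The knob involved is dfs.heartbeat.interval in hdfs-site.xml (the value is in seconds; 3 is the stock default). A minimal sketch:
<property>
<name>dfs.heartbeat.interval</name>
<value>3</value>
</property>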
Monday, July 8, 2013
Call to failed on local exception: java.io.EOFException
2. Check if the namenode is running.
3. Check the Hadoop version; a version mismatch between client and server commonly causes this exception.
Thursday, July 4, 2013
How to copy files from one Hadoop cluster to another?
Option 1: copy the file to the local filesystem and then push it to the other cluster, using
copyToLocal and then copyFromLocal
(that is, -get and -put).
But this is not a good option.
So we have another option:
-cp and distcp
distcp requires MapReduce to be running; if you don't want to run MapReduce on your cluster, the other option is -cp.
Usage:
hadoop dfs -cp hdfs://<source> hdfs://<destination>
If you want a faster copy, use distcp; for that, your JobTracker and TaskTrackers must be running.
distcp usage:
hadoop distcp hdfs://<source> hdfs://<destination>
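A concrete invocation might look like this (host names and the 8020 namenode port are placeholder assumptions; both namenodes must be reachable from the cluster running the job):
hadoop distcp hdfs://namenode1:8020/user/data hdfs://namenode2:8020/backup/data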
Thursday, June 27, 2013
List the top 10/N biggest/smallest files on Hadoop (by size)
List top 10 biggest files on Hadoop:
hadoop dfs -du /testfiles/hd|sort -g -r|head -n <N> {N here is the number of files you want to list}
hadoop dfs -du /testfiles/hd|sort -g -r|head -n 10
List top 10 biggest files on Hadoop (recursively):
hadoop dfs -lsr /|awk '{print $5 "\t\t" $8}'|sort -n -r|head -n <N> {N here is the number of files you want to list}
hadoop dfs -lsr /|awk '{print $5 "\t\t" $8}'|sort -n -r|head -n 10
List top 10 smallest files on Hadoop:
hadoop dfs -du /testfiles/hd|sort -g -r|tail -n <N> {N here is the number of files you want to list}
hadoop dfs -du /testfiles/hd|sort -g -r|tail -n 10
List top 10 smallest files on Hadoop (recursively):
hadoop dfs -lsr /|awk '{print $5 "\t\t" $8}'|sort -n -r|tail -n <N> {N here is the number of files you want to list}
hadoop dfs -lsr /|awk '{print $5 "\t\t" $8}'|sort -n -r|tail -n 10
Wednesday, June 26, 2013
Hadoop version support matrix / Hadoop-HBase version compatibility
Where:
S = supported and tested,
X = not supported,
NT = it should run, but not tested enough.
HBase bundles an instance of the Hadoop jar under its lib directory. The bundled jar is ONLY for use in standalone mode. In distributed mode, it is critical that the version of Hadoop that is out on your cluster matches what is under HBase. Replace the hadoop jar found in the HBase lib directory with the hadoop jar you are running on your cluster to avoid version mismatch issues. Make sure you replace the jar in HBase everywhere on your cluster. Hadoop version mismatch issues have various manifestations, but often it all just looks like things are hung up.
Saturday, June 22, 2013
Keep your locate database updated in linux
locate is a command that helps us find a file efficiently and a bit faster. It maintains a database,
"/var/lib/mlocate/mlocate.db"
and also has a utility called "updatedb" which keeps this database in sync with the files that are on the system. So when we run updatedb, it refreshes the database with newly added files. The Linux system keeps this database so locate always works from the most up-to-date file list.
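So, in practice (both commands ship with the mlocate package; the file name below is just an example):
sudo updatedb        # refresh /var/lib/mlocate/mlocate.db
locate myfile.conf   # then searches are instant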
Wednesday, June 19, 2013
If you want a file not to be modified by anyone, even the root user
This is called an immutable file, which is supported on Linux ext2/ext3,
and you can do it as follows:
Suppose you have a file called configuration that you never want to be modified, moved, or removed. What you can do is the following:
chattr +i configuration
After doing that, the file configuration can't be deleted, moved, or changed.
If you want it to be a simple file again, you can do:
chattr -i configuration
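You can verify the flag with lsattr (note that chattr itself needs root):
lsattr configuration   # an 'i' in the flags column marks the file as immutable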
Wednesday, May 29, 2013
Recursively list all files in a directory in Linux
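Two common ways (the path is a placeholder):
ls -R /path/to/dir            # recursive directory listing
find /path/to/dir -type f     # files only, one full path per line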
Friday, May 24, 2013
Find and move a list of files spread over different locations on a Linux machine
#!/bin/sh
#----------------------------------------------------------------#
# USAGE : sh fileFinderMover.sh <filewithfilenames> <targetpath> #
#----------------------------------------------------------------#
for i in `cat $1`
do
    # locate may return more than one match; all of them get moved
    path=`locate "$i"`
    mv $path "$2"
    echo "$path"
done
This script will search for each file given in the list and move it to the target location. Run updatedb first so that locate's database is fresh.
Wednesday, May 22, 2013
Replace a word in a file (Linux shell)
Eg:
sed -i 's/shashwat/shriparv/g' shashwat.txt
A better way to get a list of all files on Hadoop, using the shell.
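A sketch of what this looks like with the old dfs shell (-lsr is the recursive listing; awk pulls out just the path column):
hadoop dfs -lsr / | awk '{print $8}'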
Saturday, April 13, 2013
List files from HDFS/Hadoop recursively using Java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
/**
 * @author Shashwat Shriparv
 * @email dwivedishashwat@gmail.com
 * @web helpmetocode.blogspot.com
 */
public class RecursivelyPrintFilesOnHDFS {
    public static void main(String[] args) throws IOException, InterruptedException, URISyntaxException {
        printFilesRecursively("hdfs://master1:9000/");
    }
    public static void printFilesRecursively(String url) throws IOException {
        try {
            // The body below the try was reconstructed to match the class's intent:
            // list the given path and recurse into every directory found.
            FileSystem fs = FileSystem.get(URI.create(url), new Configuration());
            for (FileStatus status : fs.listStatus(new Path(url))) {
                System.out.println(status.getPath());
                if (status.isDir()) {
                    printFilesRecursively(status.getPath().toString());
                }
            }
        } catch (IOException ex) {
            Logger.getLogger(RecursivelyPrintFilesOnHDFS.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
Write a file to HDFS/Hadoop and read a file from HDFS/Hadoop using Java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
/**
 * @author Shashwat Shriparv
 * @email dwivedishashwat@gmail.com
 * @Web helpmetocode.blogspot.com
 */
public class WritetoHDFSReadFromHDFSWritToLocal {
    private static byte[] buffer;
    private static int bytesRead;
    public static void main(String[] args) throws IOException, InterruptedException, URISyntaxException {
        // The body below was reconstructed to match the class's intent (the
        // cluster URI and file paths are example assumptions): write a file
        // into HDFS, then stream it back out to a local file.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master1:9000/"), new Configuration());
        Path hdfsFile = new Path("/test/demo.txt");
        // Write to HDFS
        FSDataOutputStream out = fs.create(hdfsFile);
        out.write("Hello HDFS".getBytes());
        out.close();
        // Read from HDFS and copy to a local file
        FSDataInputStream in = fs.open(hdfsFile);
        FileOutputStream local = new FileOutputStream(new File("demo-local.txt"));
        buffer = new byte[4096];
        while ((bytesRead = in.read(buffer)) > 0) {
            local.write(buffer, 0, bytesRead);
        }
        local.close();
        in.close();
    }
}
Thursday, April 4, 2013
Insert a string after every N lines in a file
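One portable way is awk; this sketch prints a marker line after every 5 lines (the marker text, N=5, and the file name are placeholders):
awk '{print} NR%5==0{print "-----"}' file.txt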
Thursday, March 28, 2013
WebHDFS REST API
The HTTP REST API supports the complete FileSystem interface for HDFS.
Operations
- HTTP GET
  - OPEN (see FileSystem.open)
  - GETFILESTATUS (see FileSystem.getFileStatus)
  - LISTSTATUS (see FileSystem.listStatus)
  - GETCONTENTSUMMARY (see FileSystem.getContentSummary)
  - GETFILECHECKSUM (see FileSystem.getFileChecksum)
  - GETHOMEDIRECTORY (see FileSystem.getHomeDirectory)
  - GETDELEGATIONTOKEN (see FileSystem.getDelegationToken)
- HTTP PUT
  - CREATE (see FileSystem.create)
  - MKDIRS (see FileSystem.mkdirs)
  - RENAME (see FileSystem.rename)
  - SETREPLICATION (see FileSystem.setReplication)
  - SETOWNER (see FileSystem.setOwner)
  - SETPERMISSION (see FileSystem.setPermission)
  - SETTIMES (see FileSystem.setTimes)
  - RENEWDELEGATIONTOKEN (see DistributedFileSystem.renewDelegationToken)
  - CANCELDELEGATIONTOKEN (see DistributedFileSystem.cancelDelegationToken)
- HTTP POST
  - APPEND (see FileSystem.append)
- HTTP DELETE
  - DELETE (see FileSystem.delete)
For more, please visit the WebHDFS documentation.
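For example, listing a directory over WebHDFS with curl (the namenode host is a placeholder; 50070 is the default namenode HTTP port, and dfs.webhdfs.enabled must be set to true):
curl -i "http://<namenode>:50070/webhdfs/v1/tmp?op=LISTSTATUS"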
Jobtracker API error - Call to localhost/127.0.0.1:50030 failed on local exception: java.io.EOFException
Try the port number listed in your $HADOOP_HOME/conf/mapred-site.xml under the mapred.job.tracker
property. Here's my pseudo-distributed mapred-site.xml conf:
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
If you look at the JobTracker.getAddress(Configuration)
method, you can see it uses this property if you don't explicitly specify the jobtracker host / port:
public static InetSocketAddress getAddress(Configuration conf) {
    String jobTrackerStr = conf.get("mapred.job.tracker", "localhost:8012");
    return NetUtils.createSocketAddr(jobTrackerStr);
}
Tuesday, March 26, 2013
See the contents of a tar/tar.gz file without extracting it
tar -tf tarfileyouwanttopeekin.tar
or
tar -tzf tarfileyouwanttopeekin.tar.gz
or
tar -tjf tarfileyouwanttopeekin.tar.bz2
(GNU tar usually auto-detects compression on read, so plain -tf works for all three there.)
If you want to look into a zip file:
unzip -l tarfileyouwanttopeekin.zip
Thursday, March 14, 2013
Adding Scheduler to Hadoop Cluster
As we know, when we execute tasks or jobs on Hadoop, it follows FIFO scheduling; but if you are in a multi-user Hadoop environment, you will need a better scheduler for consistent and correct task scheduling.
Hadoop comes with other schedulers too; those are:
Fair Scheduler: This defines pools; over time, each pool gets around the same amount of resources.
Capacity Scheduler: This defines queues, and each queue has a guaranteed capacity. The capacity scheduler shares compute resources allocated to a queue with other queues if those resources are not in use.
For changing the scheduler you need to take your cluster offline and make some configuration changes. First make sure that the correct scheduler jar files are there. In older versions of Hadoop you had to put the jar file in the lib directory if it was not there, but from Hadoop 1.x these jars ship in the lib folder, so if you are using a newer Hadoop, good news for you.
Steps will be:
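As a sketch of the central configuration change (Hadoop 1.x property and class names), switching the JobTracker to the fair scheduler looks like this in mapred-site.xml:
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>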
Using C++ or C to interact with Hadoop
Are you a C or C++ programmer who is not willing to write Java code to interact with Hadoop/HDFS? OK, you have an option: the libhdfs native library, which enables you to write programs in C or C++ that interact with Hadoop.
Current Hadoop distributions contain pre-compiled libhdfs libraries for 32-bit and 64-bit Linux operating systems. You may have to download the Hadoop standard distribution and compile the libhdfs library from source code if your operating system is not compatible with the pre-compiled libraries.
For more information, read the following:
http://wiki.apache.org/hadoop/MountableHDFS
https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
http://xmodulo.com/2012/06/how-to-mount-hdfs-using-fuse.html
Writing code in C or C++ follows:
Finding out the block locations and block size of a file on HDFS
Have you ever needed to find out the block locations and block size for a file lying on HDFS? If so, here is the command you can use to find that out.
For that we need “fsck” command which hadoop provide.
Here goes the command:
bin/hadoop fsck /filepath/filenameonhdfs -files -blocks -locations
This command will tell you which blocks make up the file and on which datanodes each block lies.
Just go and play with the command and you will understand more.
Benchmarking Hadoop
Have you completed setting up your Hadoop cluster? Now it's time to benchmark it, as this will help you confirm that your Hadoop is production ready and configured properly.
So how to proceed?
The benchmarking programs are in the hadoop-*-test.jar file, which you can call.
So let's try TestDFSIO: this will test the read/write performance of HDFS.
How can we run this?
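A typical run looks like this (the exact test-jar file name varies by Hadoop version, so treat it as a placeholder; -fileSize is in MB):
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -clean    # remove the test data afterwards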
Wednesday, March 13, 2013
jps not working: jps: command not found
or
Do the following: create an alias for jps:
alias jps='/usr/lib/jvm/jdk1.6.0_33/bin/jps'
Or, if you just want to see the Java processes running, execute the following command:
ps -ef | grep java
or
ps aux | grep java
Thursday, February 28, 2013
Rebalancing your HDFS / Hadoop cluster
You can go for rebalancing of HDFS. What you can do is:
Suppose your target replication factor is 2, but some files show a replication factor of 1, 2, or 3, and you want all of them at 2.
Just increase the replication factor and then decrease it for your HDFS root filesystem recursively.
Suppose you need the replication factor to be 2 and some files are at 1, and Hadoop is not automatically replicating those blocks; then you can increase the replication factor to 3 and afterwards decrease it back to 2.
Use following command to increase and then decrease the replication factor.
Increasing:
hadoop dfs -setrep -R 3 / -----> this will increase the replication factor of all files to 3 and replicate them automatically; once you have enough replicas, you can decrease the replication factor to stabilize the cluster as you need
Decreasing:
hadoop dfs -setrep -R 2 / ------> this will set the replication factor to 2 recursively for your HDFS root partition.
This method you can apply to a single file or a specific folder too.
If you have over- and under-utilized nodes in the Hadoop cluster, you can run the balancer (found in the bin directory) to make your cluster balanced.
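For example (the threshold is the allowed percentage deviation from the average disk usage; 10 is the stock default):
bin/start-balancer.sh -threshold 10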
NOTE: you should have enough space on your DFS for replication, because as you increase the replication factor, it will need space.
Wednesday, February 27, 2013
Import all tables from an RDBMS using Sqoop
sqoop import-all-tables --connect jdbc:mysql://<servername>/databasename
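In practice you will usually also need credentials, e.g. (-P prompts for the password instead of putting it on the command line):
sqoop import-all-tables --connect jdbc:mysql://<servername>/databasename --username <user> -P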
The command could not be located because '/usr/bin' is not included in the PATH environment variable. error
Symptoms:
1. You are not able to log in; the login screen keeps returning to itself.
2. If you press Ctrl+Alt+[F1-F7] and issue some command, it throws this error.
Solution:
First check "echo $PATH" if it is having entries like
/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/bin:
/bin:/sbin
Entries are there in the pats variable, if those are not there there is a problem.
So lets starrt :
Login with your user, on terminal type /usr/bin/vi or /usr/bin/vim /etc/environment
and add
PATH="/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/bin:/bin:/sbin"
and restart the machine.
Sunday, February 10, 2013
Difference between MySQL INT(1) and INT(10)
Hello Guys !!
Here, I would like to discuss the differences between MySQL INT(1) and INT(10), etc.
In short, it really doesn't matter.
I know I'm not alone in having thought that it affected the size of the data field. An unsigned INT has a max value of 4294967295 whether it is INT(1) or INT(10), and it will use 4 bytes of storage either way.
So, what does the number in the brackets mean? It pretty much comes down to display; it's called the display width. The display width is a number from 1 to 255. You can set the display width if you want all of your integer values to “appear” at a consistent width. If you enable ZEROFILL on the column, a value of 0 will be displayed as 0 for INT(1) and 0000000000 for INT(10).
There are 5 main integer data types, and you should choose each one on its own merits. Based on the data you expect (or in some cases hope) to hold, you should use the correct data type. If you don't ever expect a value above 127 or below -128, you should use a TINYINT. This will only use 1 byte of data, which may not seem like much of a difference from the 4 used by an INT, but as soon as you start to store more and more data, the effect on speed and space will be noticeable.
Anyway, I thought I should share my new-found knowledge of the display width with everyone, because it will save me from thinking I'm optimising anything by changing from 10 to 5, ha ha.
Illustration:
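A minimal sketch in SQL (the table and column names are made up for the example):
CREATE TABLE t (a INT(1) ZEROFILL, b INT(10) ZEROFILL);
INSERT INTO t VALUES (42, 42);
SELECT a, b FROM t;   -- displays 42 for a, but 0000000042 for b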
Friday, February 1, 2013
Phoenix: A SQL layer over HBase 'We put the SQL back in the NoSQL'
Phoenix is a SQL layer over HBase, delivered as a client-embedded JDBC driver, powering the HBase use cases at Salesforce.com. Phoenix targets low-latency queries (milliseconds), as opposed to batch operation via map/reduce. To see what's supported, go to our language reference guide, and read more on our wiki.
Tuesday, January 29, 2013
Reset root password in Linux
1. Reboot the system
2. Press any key while your Linux box boots up; it will bring you to the GRUB loader.
There will be an image name; select it.
3. Press 'e'.
4. It will enter the GRUB edit screen.
5. It will contain the following list:
- Image name
- Kernel version
- MBR
6. Select the kernel line and press 'e'; it will take you to edit mode.
7. Put your cursor at the end of the resulting line and type single (or 1, or s).
8. Save it.
9. After saving it, you will be back at the screen from step 5.
10. Here press 'b' (with the kernel version line selected); this will boot up the system.
11. This will take you to single-user mode.
12. Then type 'passwd'; it will not ask you for the old password, it will ask you to type a new password.
13. Type the new password, and it's done.
View file sizes sorted / file size and folder size sorted by size
du -sm * | sort -n --> Ascending
du -sc * | sort -nr --> Total size of current folder
du -sm * | sort -nr | head -10 --> First 10 : Top 10 Largest Items
du -sm * | sort -nr | tail -10 --> Last 10
ls -ltrh | grep -E '[0-9]{3}M|[0-9]{3}G' --> will show files with a three-digit size in MB or GB (the -h flag makes ls print human-readable sizes)
Sunday, January 20, 2013
FAILED FSError: java.io.IOException: No Space left on device
org.apache.hadoop.util.DiskChecker$DiskErrorException : Could not find any valid local directory for taskTracker/jobcache
When you run Hive queries and your queries fail, just go to the JobTracker of Hadoop (mostly on port 50030 on your Hadoop nodes) and check the failed or killed jobs; you may find these errors.
The reason behind it will be the following:
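The exception itself points at the local disks that hold intermediate map/reduce data (the directories configured in mapred.local.dir); a quick way to check them, with the stock default path as an assumption:
df -h                                   # free space per partition
du -sh /tmp/hadoop-*/mapred/local       # default mapred.local.dir location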
Friday, January 18, 2013
Shell script: Using FOR loop
Just go through:
# cat test.sh
for ((i=0;i<=10;i++))
do
echo $i
done
Output: the numbers 0 through 10, one per line.
Wednesday, January 16, 2013
Do you need an easier way to view or manipulate the data saved in HBase?
Monday, January 14, 2013
Automatically finding files of a specific size, type, or modified date and adding them to a tar archive
Refer to the following commands for your specific need:
# All files of size greater than 1 GB (GNU find's G size suffix assumed)
find / -type f -size +1G | xargs tar -czf myfile.tgz
# All files modified within the last day
find / -type f -mtime -1 | xargs tar -czf myfile.tgz
# All files modified within the last day and bigger than 1 GB
find / -type f -mtime -1 -size +1G | xargs tar -czf myfile.tgz
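One caveat worth knowing: with a very long file list, xargs may invoke tar several times, and each -czf run overwrites the archive. A safer variant (GNU find and tar assumed) streams NUL-separated names into a single tar process instead:
find / -type f -size +1G -print0 | tar -czf myfile.tgz --null -T -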