HDFS: The Hadoop Distributed File System (HDFS)
is a distributed file system
designed to run on commodity hardware. It has many similarities
with existing distributed file systems. However, the differences from
other distributed file systems are significant. HDFS is highly
fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is
suitable for applications that have large data sets.
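For day-to-day use, HDFS looks much like an ordinary file system driven from the shell; a minimal session might look like this (the user and file names are placeholders):

    hadoop fs -mkdir -p /user/alice/logs         # create a directory in HDFS
    hadoop fs -put access.log /user/alice/logs   # copy a local file into HDFS
    hadoop fs -ls /user/alice/logs               # list the directory
    hadoop fs -cat /user/alice/logs/access.log   # read the file back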
MapReduce: MapReduce is a programming model for processing large data sets, and also the name of Google's implementation of that model. MapReduce is typically used for distributed computing on clusters of computers.
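As a concrete sketch of the model, here is a minimal word count written against Hadoop's standard org.apache.hadoop.mapreduce Java API; the class names and tokenization are illustrative choices, not a canonical implementation:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map phase: emit (word, 1) for every word in this task's input split.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              ctx.write(word, ONE);
            }
          }
        }
      }

      // Reduce phase: sum the counts collected for each distinct word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, such a job would typically be launched with hadoop jar wordcount.jar WordCount <input> <output>.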
HBase: Use Apache HBase when you need random, realtime read/write access to
your Big Data.
This project's goal is the hosting of very large tables -- billions
of rows X millions of columns -- atop clusters of commodity hardware.
Apache HBase is an open-source, distributed, versioned, column-oriented
store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al.
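To get a feel for the data model, a short HBase shell session creates a table with one column family and performs the kind of random reads and writes described above (the table and column names are made up):

    create 'users', 'info'                       # table with one column family
    put 'users', 'row1', 'info:name', 'Alice'    # write one cell
    put 'users', 'row1', 'info:email', 'alice@example.com'
    get 'users', 'row1'                          # random read of a single row
    scan 'users', {LIMIT => 10}                  # range scan over row keys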
Hive: Hive is a data warehouse system for Hadoop that facilitates
easy data summarization, ad-hoc queries, and the analysis of large
datasets stored in Hadoop compatible file systems. Hive provides a
mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL. At the same time this language also
allows traditional map/reduce programmers to plug in their custom
mappers and reducers when it is inconvenient or inefficient to express
this logic in HiveQL.
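A small HiveQL sketch shows what projecting structure onto existing data looks like in practice; the table, columns, and path below are hypothetical:

    -- declare a table over delimited text files already in HDFS
    CREATE TABLE page_views (user_id STRING, url STRING, ts BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- point the table at existing data, then query with familiar SQL constructs
    LOAD DATA INPATH '/data/page_views' INTO TABLE page_views;

    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;

Behind the scenes, Hive compiles such a query into one or more MapReduce jobs over the underlying files.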
PIG:
Apache Pig is a platform for analyzing large data sets
that consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these programs. The
salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very
large data sets.
At the present time, Pig's infrastructure layer consists of a compiler
that produces sequences of Map-Reduce programs, for which large-scale
parallel implementations already exist (e.g., the Hadoop subproject).
Pig's language layer currently consists of a textual language called Pig
Latin, which has the following key properties (a short example follows the list):
- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
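To make this concrete, here is a small Pig Latin sketch that computes hit counts per URL; the file paths and field names are invented for illustration:

    -- each statement names a step in the data flow
    views  = LOAD '/data/page_views' USING PigStorage('\t')
             AS (user_id:chararray, url:chararray, ts:long);
    by_url = GROUP views BY url;
    counts = FOREACH by_url GENERATE group AS url, COUNT(views) AS hits;
    top    = ORDER counts BY hits DESC;
    STORE top INTO '/data/top_urls';

The compiler turns this data flow into a sequence of MapReduce jobs, so the parallelization happens without the author writing any mapper or reducer code.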
Sqoop: Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured datastores such as
relational databases.
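Sqoop is driven from the command line; an import of one relational table into HDFS might look roughly like this (the JDBC URL, credentials, table, and paths are placeholders):

    # copy the "orders" table into HDFS using 4 parallel map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username reporter -P \
      --table orders \
      --target-dir /data/orders \
      --num-mappers 4

Under the hood, Sqoop generates a MapReduce job whose map tasks each pull a slice of the table over JDBC.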
ZooKeeper: ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in
some form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and
race conditions that are inevitable. Because of the difficulty of
implementing these kinds of services, applications initially tend to
skimp on them, which makes them brittle in the presence of change and
difficult to manage. Even when done correctly, different implementations
of these services lead to management complexity when the applications
are deployed.
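As a minimal sketch of the client API, the Java snippet below publishes a piece of configuration as a znode and reads it back; the ensemble address and znode path are examples, and error handling is elided:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
      public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (host:port list is an example).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});

        // Publish a piece of shared configuration as a persistent znode.
        zk.create("/demo-config", "max_conn=100".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can now read (and watch) the same value.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
      }
    }

Real deployments would add watches and retry logic; the point is that naming, configuration, and synchronization all reduce to operations on this shared znode tree.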
Flume: Apache Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log
data. Its main goal is to deliver data from applications to Apache
Hadoop's HDFS. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms. It
uses a simple extensible data model that allows for online analytic
applications.
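Flume agents are wired together in a properties file that names sources, channels, and sinks; a minimal sketch of an agent tailing a log file into HDFS might look like this (the agent, host, and path names are illustrative):

    # one agent (a1) with one source, one memory channel, and one HDFS sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/app-logs
    a1.sinks.k1.channel = c1

The channel decouples the source from the sink, which is where Flume's tunable reliability comes from.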
Mahout: Apache Mahout is a machine learning library that is scalable to reasonably large data sets. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, contributions are not restricted to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are also highly optimized to give good performance for non-distributed algorithms.
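As one example, the k-means driver that ships with Mahout can be launched from the command line against vectors already stored in HDFS; the invocation below is a rough sketch with placeholder paths, and the exact flags can vary across Mahout versions:

    # cluster pre-built vectors into 10 clusters, up to 20 iterations
    mahout kmeans -i /data/vectors -c /data/initial-centroids \
      -o /data/clusters -k 10 -x 20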
Hue: Hue is a web interface for Apache Hadoop that makes common Hadoop tasks, such as running MapReduce jobs, browsing HDFS, and creating Apache Oozie workflows, easier.
Oozie: Oozie is a workflow scheduler system for managing Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and Distcp) as well as system-specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable, and extensible system.
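A workflow is defined in XML as a DAG of actions; the minimal one-action sketch below runs a Pig script, with the workflow name, script, and schema version chosen purely for illustration:

    <workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
      <start to="run-pig"/>
      <action name="run-pig">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>etl.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Pig step failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>

The ${jobTracker} and ${nameNode} variables would come from an accompanying job.properties file supplied when the workflow is submitted.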