
Friday, November 11, 2011

Hadoop: Chukwa


Chukwa: Architecture and Design

Introduction

Log processing was one of the original purposes of MapReduce. Unfortunately, using Hadoop for MapReduce processing of logs is somewhat troublesome. Logs are generated incrementally across many machines, but Hadoop MapReduce works best on a small number of large files. And HDFS doesn't currently support appends, making it difficult to keep the distributed copy fresh.
Chukwa aims to provide a flexible and powerful platform for distributed data collection and rapid data processing. Our goal is to produce a system that's usable today, but that can be modified to take advantage of newer storage technologies (HDFS appends, HBase, etc.) as they mature. To maintain this flexibility, Chukwa is structured as a pipeline of collection and processing stages, with clean and narrow interfaces between stages. This will facilitate future innovation without breaking existing code.
Chukwa has four primary components:
  1. Agents that run on each machine and emit data.
  2. Collectors that receive data from the agents and write it to stable storage.
  3. MapReduce jobs for parsing and archiving the data.
  4. HICC, the Hadoop Infrastructure Care Center, a web-portal-style interface for displaying data.
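The first two stages can be sketched in miniature. The sketch below is hypothetical (it is not the actual Chukwa API; the class and method names are invented for illustration): agents on each machine emit small chunks of log data, and a collector batches them into a small number of large sink files, the shape of input that Hadoop MapReduce handles best.

```python
# Hypothetical sketch of the agent -> collector stages; names are invented,
# not taken from the real Chukwa codebase.
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    host: str        # machine the emitting agent runs on
    data_type: str   # kind of data, e.g. "syslog"
    payload: bytes   # the raw log bytes

class Agent:
    """Runs on each machine and emits data as chunks."""
    def __init__(self, host: str):
        self.host = host

    def emit(self, data_type: str, payload: bytes) -> Chunk:
        return Chunk(self.host, data_type, payload)

class Collector:
    """Receives chunks from many agents and writes them to stable storage
    as a few large sink files, rather than many small per-host files."""
    def __init__(self):
        self.buffer: List[Chunk] = []

    def collect(self, chunk: Chunk) -> None:
        self.buffer.append(chunk)

    def close_sink_file(self) -> bytes:
        # Concatenate buffered chunks into one large blob for the
        # downstream parsing/archiving MapReduce jobs.
        blob = b"".join(c.payload for c in self.buffer)
        self.buffer.clear()
        return blob

# Usage: two agents feed one collector, which rolls one large sink file.
collector = Collector()
for host in ("node1", "node2"):
    agent = Agent(host)
    collector.collect(agent.emit("syslog", f"log line from {host}\n".encode()))
sink_file = collector.close_sink_file()
print(sink_file.decode())
```

The point of the collector stage, as the text notes, is exactly this batching: it turns incrementally generated, per-machine data into large files that MapReduce can process efficiently.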
Below is a figure showing the Chukwa data pipeline, annotated with data dwell times at each stage. A more detailed figure is available at the end of this document.
[Figure: the Chukwa data pipeline]
