Infinite Programming Tips: Apache Cassandra

Friday, November 11, 2011

Apache Cassandra

Introduction

Cassandra is an advanced topic, and while work is always underway to make things easier, it can still be daunting to get up and running for the first time. This document aims to provide a few easy-to-follow steps that take the first-time user from installation to an operational Cassandra cluster.

Step 0: Prerequisites and connection to the community

Cassandra requires the most stable version of Java 1.6 you can deploy. For Sun's JVM, this means at least u19; u21 is better. Cassandra also runs on the IBM JVM, and should run on JRockit as well.
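As a rough illustration of the version requirement above, the following sketch parses the legacy `java -version` string and compares it against 1.6 update 19. The function names are made up for this example, and it assumes the older `1.6.0_21` style of version string; it is not part of Cassandra.

```python
import re


def parse_java_version(version_output):
    """Return (major, minor, update) from `java -version` style output.

    Assumes the legacy format, e.g.: java version "1.6.0_21".
    Returns None when no such version string is found.
    """
    match = re.search(r'version "(\d+)\.(\d+)\.\d+(?:_(\d+))?', version_output)
    if match is None:
        return None
    major, minor, update = match.groups()
    # A version without an update suffix is treated as update 0.
    return int(major), int(minor), int(update or 0)


def meets_minimum(version_output, minimum=(1, 6, 19)):
    """True when the parsed version is at least `minimum` (default: 1.6 u19)."""
    parsed = parse_java_version(version_output)
    return parsed is not None and parsed >= minimum


if __name__ == "__main__":
    print(meets_minimum('java version "1.6.0_21"'))  # True
    print(meets_minimum('java version "1.6.0_17"'))  # False
```

In practice the input string would come from running `java -version`, which writes its report to standard error rather than standard output.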

The best way to ensure you always have up-to-date information on the project, releases, stability, bugs, and features is to subscribe to the users mailing list (subscription required) and participate in the #cassandra channel on IRC.

Step 1: Picking a version

At any given time, there are a number of different versions available to install:

Stable releases

Cassandra stable releases are well tested and reasonably free of serious problems (or at least the problems are known and well documented). If you are setting up a production environment, a stable release is what you want.

Download links for the latest stable release can always be found on the website.

Betas and release candidates

Betas are prototype releases considered ready for user testing, and release candidates have the potential to become the next stable release. These releases represent the state of the art, so they are often the best place to start. Since APIs and on-disk storage formats can change between major versions, starting with one can also save you from an upgrade. Testing and feedback are also highly appreciated.

Nightly builds

Nightly builds represent the current state of development as of the time of the build. They contain all of the previous day's new features, fixes, and newly introduced bugs. The only guarantee they come with is that they successfully build and the unit tests pass. Nightly builds are a handy way of testing recent changes, or accessing the latest features and fixes not found in beta or release candidates, but there is some risk of them being buggy.

The most recent nightly build can be downloaded here.

Subversion

Cassandra's Subversion repository is where all active development takes place. Anyone interested in contributing to the project should use a checkout of trunk. If you do run from Subversion, be sure to update frequently, and subscribe to the mailing list to stay abreast of the latest developments.

Instructions for checking out the source code can always be found on the website.

Step 2: Running a single node

Cassandra is meant to run on a cluster of nodes, but will run equally well on a single machine. This is a handy way of getting familiar with the software while avoiding the complexities of a larger system.

Since there isn't currently an installation method per se, the easiest solution is to simply run Cassandra from an extracted archive or SVN checkout (see Step 1: Picking a version). Also, unless you've downloaded a binary distribution, you'll need to compile the software by invoking ant from the top-level directory.

The distribution's sample configuration conf/cassandra.yaml contains reasonable defaults for single node operation, but you will need to make sure that the paths exist for data_file_directories, commitlog_directory, and saved_caches_directory. Additionally, take a minute now to look over the logging configuration in conf/log4j.properties and make sure that directories exist for the configured log file(s) as well.
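The directory check described above can be scripted. Here is a minimal sketch that creates any missing directories from a list you supply. The paths in `EXAMPLE_DIRS` are assumptions for illustration only; use whatever values your own conf/cassandra.yaml and conf/log4j.properties actually contain.

```python
import os

# Hypothetical paths; copy the real ones from your conf/cassandra.yaml
# (data_file_directories, commitlog_directory, saved_caches_directory).
EXAMPLE_DIRS = [
    "/var/lib/cassandra/data",
    "/var/lib/cassandra/commitlog",
    "/var/lib/cassandra/saved_caches",
]


def ensure_directories(paths):
    """Create each missing directory; return the paths that were created."""
    created = []
    for path in paths:
        if not os.path.isdir(path):
            os.makedirs(path)  # also creates missing parent directories
            created.append(path)
    return created

# Example call (may need suitable permissions for system paths):
#     ensure_directories(EXAMPLE_DIRS)
```

The function returns only the directories it had to create, so an empty list means everything was already in place.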

Some people running OS X have trouble getting Java 6 to work. If you've kept up with Apple's updates, Java 6 should already be installed (it comes in Mac OS X 10.5 Update 1). Unfortunately, Apple does not default to using it. What you have to do is change your JAVA_HOME environment setting to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home and add /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin to the beginning of your PATH.
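If you launch Cassandra from a script rather than an interactive shell, the same two environment changes can be applied to a copy of the environment. This is only a sketch: `java6_environment` is a name invented for this example, and the Java path is the one quoted above.

```python
import os

# Path taken from the text above; adjust if your Java 6 lives elsewhere.
JAVA6_HOME = "/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"


def java6_environment(base_env, java_home=JAVA6_HOME):
    """Return a copy of base_env with JAVA_HOME set and its bin dir first in PATH."""
    env = dict(base_env)
    env["JAVA_HOME"] = java_home
    java_bin = java_home + "/bin"
    existing = env.get("PATH", "")
    # Prepend so this Java is found before any other on the PATH.
    env["PATH"] = java_bin + (os.pathsep + existing if existing else "")
    return env

# Example: pass the result as env= to subprocess.Popen when starting
# bin/cassandra -f from the extracted distribution:
#     env = java6_environment(os.environ)
```

Working on a copy leaves the calling process's own environment untouched.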

And now for the moment of truth: start up Cassandra by invoking bin/cassandra -f from the command line. The service should start in the foreground and log gratuitously to standard output. Assuming you don't see messages with scary words like "error" or "fatal", or anything that looks like a Java stack trace, then chances are you've succeeded. To be certain though, take some time to try out the examples in CassandraCli before moving on (note: if you are using Cassandra 0.7.0, you'll need to load the demo Keyspaces first using JMX, see http://wiki.apache.org/cassandra/FAQ#no_keyspaces, or even better, follow the testing instructions in the README of the installation folder). Also, if you run into problems, Don't Panic; calmly proceed to If Something Goes Wrong.
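The "scary words" check described above can be approximated with a small filter over captured startup output. This is a heuristic sketch, not a Cassandra tool: the word list comes from the text, the stack-frame pattern is an assumption about how Java traces usually look, and the sample lines are invented.

```python
import re

# Words the text calls scary, matched case-insensitively as whole words.
SCARY_WORDS = re.compile(r"\b(error|fatal)\b", re.IGNORECASE)
# Typical Java stack-trace frame: leading whitespace, "at", a dotted name, "(".
STACK_FRAME = re.compile(r"^\s+at [\w.$<>]+\(")


def suspicious_lines(log_lines):
    """Return the lines that contain a scary word or look like a stack frame."""
    return [
        line for line in log_lines
        if SCARY_WORDS.search(line) or STACK_FRAME.search(line)
    ]


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        " INFO 12:00:00 Starting up server",
        "ERROR 12:00:01 Something failed",
        "\tat org.example.Foo.bar(Foo.java:12)",
    ]
    for line in suspicious_lines(sample):
        print(line)
```

An empty result is encouraging but not proof of a healthy node, so trying the CassandraCli examples is still worthwhile.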

Users of recent Linux distributions and Mac OS X Snow Leopard should be able to start up Cassandra simply by untarring and invoking bin/cassandra -f with root privileges. Snow Leopard ships with Java 1.6.0 and does not require changing the JAVA_HOME environment variable or adding any directory to your PATH. On Linux just make sure you have a working Java JDK package installed, such as the openjdk-6-jdk on Ubuntu Lucid Lynx.

Step 3: Running a cluster

Setting up a Cassandra cluster is almost as simple as repeating Step 2 for each node in your cluster. There are a few minor exceptions though.

Cassandra nodes exchange information about one another using a mechanism called Gossip, but to get the ball rolling a newly started node needs to know of at least one other node; this is called a Seed. It's customary to pick a small number of relatively stable nodes to serve as your seeds, but there is no hard-and-fast rule here. Do make sure that each seed also knows of at least one other. Remember, the goal is to avoid a chicken-and-egg scenario and provide an avenue for all nodes in the cluster to discover one another.
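The rule that every node, seeds included, must know of at least one other node is easy to check before you start anything. Here is a small sketch; the mapping format and the addresses are assumptions made up for this example, not something Cassandra reads.

```python
def nodes_without_other_seed(seed_config):
    """Return nodes whose seed list names no node other than themselves.

    seed_config maps each node's address to the list of seed addresses
    configured on that node (a made-up format for this sketch).
    """
    return sorted(
        node
        for node, seeds in seed_config.items()
        if not any(seed != node for seed in seeds)
    )


if __name__ == "__main__":
    # Hypothetical three-node cluster; the third node lists only itself.
    config = {
        "10.0.0.1": ["10.0.0.1", "10.0.0.2"],
        "10.0.0.2": ["10.0.0.1"],
        "10.0.0.3": ["10.0.0.3"],
    }
    print(nodes_without_other_seed(config))  # ['10.0.0.3']
```

This only catches the simplest chicken-and-egg case. It does not prove that every node can eventually discover every other one.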

In addition to seeds, you'll also need to configure the IP interfaces to listen on for Gossip and Thrift (ListenAddress and ThriftAddress respectively; in conf/cassandra.yaml these settings are named listen_address and rpc_address). Use a ListenAddress that will be reachable from the ListenAddress used on all other nodes, and a ThriftAddress that will be accessible to clients.

Once everything is configured and the nodes are running, use the bin/nodetool utility to verify a properly connected cluster. For example:

eevans@achilles:~$ bin/nodetool -host 98.139.220.175 ring
Address          Status  Load      Range                                      Ring
                                   169048975998562660269742699624378098572
98.139.220.175   Up      0.02 GB   14183696824377310051808173385764689249     |<--|
98.139.169.152   Up      0.4 GB    28356863910078205288614550619314017621     |   ^
98.139.220.176   Up      0.13 GB   42530828068625072228863933889289238187     |-->|
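If you want to check the ring from a script, output shaped like the example above can be parsed with a few lines of code. This sketch assumes exactly that column layout (address first, status second); other Cassandra versions print different columns, so treat it as illustrative only.

```python
SAMPLE = """\
Address       Status     Load          Range                                      Ring
                                       169048975998562660269742699624378098572
98.139.220.175  Up         0.02 GB     14183696824377310051808173385764689249     |<--|
98.139.169.152  Up         0.4 GB      28356863910078205288614550619314017621     |   ^
98.139.220.176  Up         0.13 GB     42530828068625072228863933889289238187     |-->|
"""


def parse_ring(output):
    """Return (address, status) pairs from `nodetool ring` style output."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # Node rows start with a dotted IPv4 address, then the status column.
        if (
            len(fields) >= 2
            and fields[0].count(".") == 3
            and fields[0].replace(".", "").isdigit()
        ):
            nodes.append((fields[0], fields[1]))
    return nodes


def nodes_not_up(output):
    """Return the addresses whose status is anything other than "Up"."""
    return [address for address, status in parse_ring(output) if status != "Up"]


if __name__ == "__main__":
    print(parse_ring(SAMPLE))
    print(nodes_not_up(SAMPLE))  # []
```

The header row and the bare token line are skipped because neither starts with an IPv4 address.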

Advanced cluster management is described in Operations.

If you don't yet have access to hardware for a Cassandra cluster you can try it out on EC2 with CloudConfig.

Step 4: Write your application

The recommended way to communicate with Cassandra in your application is to use a higher-level client. These provide language-specific APIs for talking to Cassandra in a variety of programming languages. The details will vary depending on programming language and client, but in general using a higher-level client means that you have to write less code and get several features for free that you would otherwise have to write yourself.

That said, it is useful to know that Cassandra uses Thrift for its external client-facing API. Cassandra's main API/RPC/Thrift port is 9160. Thrift supports a wide variety of languages, so you can code your application to use Thrift directly if you so choose (but again, we recommend a high-level client where available).

Important note: If you intend to use Thrift directly, you need to install a version of Thrift that matches the revision that your version of Cassandra uses. See InstallThrift.

It is a common mistake for API clients to connect to the JMX port instead of the Thrift port (9160).
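Before debugging a client, it can help to confirm that something is actually listening on the port you are connecting to. The sketch below is a plain TCP reachability check using Python's standard socket module. It does not speak Thrift, so it cannot tell you whether the listener is really Cassandra's Thrift service rather than JMX. The host in the usage comment is a placeholder.

```python
import socket

THRIFT_PORT = 9160  # Cassandra's main API/RPC/Thrift port, per the text


def port_is_open(host, port=THRIFT_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Example (placeholder host):
#     port_is_open("127.0.0.1")
```

A False result means a closed port, a firewall, or a wrong address. A True result only means that some service accepted the connection.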

Checking out a demo application like Twissandra (Python + Django) will also be useful.
