Blog

Sign up for email updates

apasca's picture
Integrating Cassandra data nodes in a Hadoop cluster
By Alex, in General, R&D Insights

May. 10th 2011 0 comments

1. Networking setup

Make sure the machines are able to reach each other on the network. Also update /etc/hosts on all machines. For example in our setup we use:

# /etc/hosts on all nodes
192.168.1.96 master
192.168.1.97 slave1
192.168.1.98 slave2
192.168.1.99 slave3

2. Java

Install java 6 if not already installed.

$ sudo add-apt-repository "deb http://archive.canonical.com/ubuntu maverick partner"
$ sudo apt-get update
$ sudo apt-get install sun-java6-jre

3. Hadoop installation

Add Cloudera CDH3 apt repository.

$ sudo add-apt-repository "deb http://archive.cloudera.com/debian maverick-cdh3 contrib"
$ wget -O - http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

Install Hadoop.

On master:

$ sudo apt-get install hadoop-0.20-{namenode,datanode,jobtracker,tasktracker}

On slaves:

$ sudo apt-get install hadoop-0.20-{datanode,tasktracker}

This will create: hadoop group, hdfs and mapred users. The namenode and datanode daemons will run use hdfs user. The jobtracker and tasktracker daemons will use mapred user.

You can find Clouderaís patches for hadoop here: /usr/lib/hadoop-0.20/cloudera/patches

4. Hadoop configuration.

Configure hadoop on all machines:

Edit core-site.xml

 <property>
<name> fs.default.name </name>
<value> hdfs://master:8020 </value>
</property>

Edit hdfs-site.xml

 <property>
<name> dfs.replication </name>
<value> 2 </value>
</property>

Edit mapred-site.xml

 <property>
<name> mapred.job.tracker </name>
<value> master:8021 </value>
</property>

Stating the cluster is done is 2 steps:

Step1: Starting HDFS

The namenode daemon must be started on master:

$ sudo service hadoop-0.20-namenode start

Then datanode daemons must be started on all slaves (in our setup: master, slave1, slave2 and slave3):

$ sudo service hadoop-0.20-datanode start

Check the success or failure by inspecting the logs on master and slaves:

Check namenode log on master

$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-namenode-demo.log

Check datanode logs on all slaves

$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-datanode-demo.log

Step2: Starting MapReduce

The jobtracker daemon must be started on master:

$ sudo service hadoop-0.20-jobtracker start

Then tasktracker daemons must be started on all slaves (in our setup: master, slave1, slave2 and slave3):

$ sudo service hadoop-0.20-tasktracker start

Check the success or failure by inspecting the logs on master and slaves:

Check jobtracker log on master

$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-jobtracker-demo.log

Check tasktracker logs on all slaves

$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-tasktracker-demo.log

5. Install Cassandra

You must do this on all cassandra nodes. A common practice is to install cassandra on every hadoop datanode. So every hadoop datanode will also be a cassandra node.

Import cassandraís apt repository key

gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -

Add cassandraís apt repository entries in /etc/apt/sources.list and install cassandra.

$ sudo add-apt-repository "deb http://www.apache.org/dist/cassandra/debian unstable main"
$ sudo apt-get update
$ sudo apt-get install cassandra

6. Configure Cassandra

Edit cassandra.yaml on all nodes, replace hostname_or_ip with the hostname or the ip of the node.

listen_address: 
rpc_address:

After you configured all the nodes you can check cassandra cluster status using:

$ /usr/bin/nodetool -host  ring

Replace with the hostname or ip of a node. If you properly configured the cluster then you shall see a list of all the nodes from the cluster along with their assigned tokens.

7. Balance Cassandra Cluster

If you add nodes to your cluster your ring will be unbalanced and only way to get perfect balance is to compute new tokens for every node and assign them to each node manually by using nodetool move command.

Here's a python program which can be used to calculate new tokens for the nodes. There's more info on the subject at Ben Black's presentation at Cassandra Summit 2010. http://www.datastax.com/blog/slides-and-videos-cassandra-summit-2010

def tokens(nodes):
for x in xrange(nodes):
print 2 ** 127 / nodes * x

There's also nodetool loadbalance which is essentially a convenience over decommission + bootstrap, only instead of telling the target node where to move on the ring it will choose its location based on the same heuristic as Token selection on bootstrap. You should not use this as it doesn't rebalance the entire ring.

The status of move and balancing operations can be monitored using nodetool with the streams argument.

 /usr/bin/nodetool -host  move 

Leave a comment

News

Recognition Based on Outstanding Corporate Growth and Market LeadershipFairfax, VA – January...
Fairfax, VA (PRWeb) January 10, 2012 – Three Pillar Global, a trusted, high-growth software...
Fairfax, VA (PRWeb) January 3, 2012 – Three Pillar Global, a trusted, high-growth software...

Careers

Passionate, talented, ambitious, creative … sound familiar?

Three Pillar Global is hiring the best and brightest, and we might just be looking for you.

Check out our open positions.

Join Us