Cassandra! The Cursed Priestess

Abhishek Amralkar
5 min read · Jul 2, 2020


Image source: https://foresite.com/listening-cassandras-avoid-cybersecurity-disasters/

For a brief background on the name, please refer to Wikipedia.

We are not talking about Cassandra the priestess, though. In this post we will get to know Apache Cassandra, and yes, you guessed it right: the NoSQL database.

What is Apache Cassandra?

Wikipedia says: Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication.

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.

Cassandra was initially developed at Facebook, then open-sourced and handed over to the Apache Software Foundation.

The Architecture

Clusters:

The cluster is the outermost structure in Cassandra. It is sometimes referred to as the ring, because Cassandra assigns data to the nodes in the cluster by arranging them in a ring.

1. Keyspace

A keyspace in Cassandra is an object that holds together all the column families of a design. You can think of it as the equivalent of a database in a relational system. A keyspace has two properties:

  • Replication: specifies the replica placement strategy and the number of replicas wanted.
  • Durable Writes: the durable_writes property of a keyspace is set to true by default. It can be set to false, in which case writes to that keyspace bypass the commit log.

The basic syntax for creating a keyspace is shown below:

CREATE KEYSPACE IF NOT EXISTS cycling WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };

2. Column Family

A column family is an ordered collection of rows, and each row is an ordered collection of columns.

3. Row

A row is an ordered collection of columns.

4. Column

A column is a key/value pair: a column name and its value, stored together with a timestamp.

5. RowKey

The primary key of a row is also called the row key.

6. Compound Primary Key

A compound primary key consists of multiple columns. The first part of the key is called the partition key, and the rest is called the clustering key.

7. Partition Key

Data in Cassandra is spread across the nodes. The purpose of the partition key is to identify the node that stores the row being asked for. A function called the partitioner computes a hash of the partition key (the token) when the row is written, and that token determines which node owns the row.
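To make the idea concrete, here is a minimal Python sketch of how a partition key could be mapped to nodes on a ring. It is only an illustration: the node names and tokens are made up, and MD5 is used as a stand-in for Cassandra's real partitioner.

```python
import bisect
import hashlib
from typing import Dict, List


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 is a stand-in, not the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """A toy token ring: each node owns the token range that ends at its own token."""

    def __init__(self, node_tokens: Dict[str, int]):
        # Keep (token, node) pairs sorted so we can binary-search by token.
        self._entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def replicas_for(self, partition_key: str, replication_factor: int = 1) -> List[str]:
        """Return the owning node, followed by the next nodes clockwise on the ring."""
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        count = min(replication_factor, len(self._entries))
        return [
            self._entries[(start + i) % len(self._entries)][1]
            for i in range(count)
        ]


# Three hypothetical nodes with evenly spaced tokens in a 64-bit space.
ring = Ring({
    "node-a": 2**64 // 3,
    "node-b": 2 * (2**64 // 3),
    "node-c": 2**64 - 1,
})
print(ring.replicas_for("Iron Man", replication_factor=2))
```

The same key always hashes to the same token, so any node can work out where a row lives without asking a central master.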

Data center:

A data center is a group of related nodes.

Node:

A node is the specific place in the cluster where the data resides.

Commit Log:

When you perform a write operation, it’s immediately written to the commit log. The commit log is a crash-recovery mechanism that supports Cassandra’s durability goals. It is an append-only log.

Memtable:

After it’s written to the commit log, the value is written to a memory-resident data structure called the memtable. Each memtable has a threshold at which it is flushed to disk.

SSTable:

When the number of objects stored in the memtable reaches a threshold, the contents of the memtable are flushed to disk in a file called an SSTable. SSTables are immutable.
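The write path described above (commit log, then memtable, then SSTable) can be sketched as a toy Python class. This is a simplified model, not Cassandra's implementation: everything lives in memory, the class and its names are hypothetical, and the flush threshold is a plain object count.

```python
class ToyStorageEngine:
    """A toy model of the commit log -> memtable -> SSTable write path."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # in-memory key/value structure
        self.sstables = []     # immutable snapshots, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the commit log first
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush once the threshold is reached

    def flush(self):
        # A sorted tuple of items stands in for an immutable SSTable file.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs to be replayed

    def read(self, key):
        # Check the memtable first, then the SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


engine = ToyStorageEngine(flush_threshold=2)
engine.write("thor", "Thor Odinson")
engine.write("hulk", "Bruce Banner")   # reaches the threshold and triggers a flush
engine.write("thor", "Thor")           # newer value lives in the memtable
print(engine.read("thor"), len(engine.sstables))
```

Because SSTables are never modified in place, an update simply lands in a newer memtable or SSTable, and reads prefer the newest copy.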

Hinted Handoff:

Suppose that out of n Cassandra nodes in a ring, one node (say node A) is down for the moment. When a write request destined for node A arrives, node B, which receives the request, keeps it as a hint. Once node A comes back up, node B hands off that hint to it. This feature makes Cassandra highly available for writes.
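A rough Python sketch of this behaviour is shown below. The Node class and both functions are hypothetical names used only for illustration; real hints are persisted on disk and expire after a configurable window, which this sketch ignores.

```python
class Node:
    """A toy node that stores data and holds hints for replicas that were down."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target_node, key, value) tuples awaiting delivery


def coordinate_write(coordinator, replicas, key, value):
    """Write to every live replica; store a hint on the coordinator for each down one."""
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
        else:
            coordinator.hints.append((replica, key, value))


def replay_hints(coordinator):
    """Deliver stored hints to targets that are back up; keep the rest for later."""
    remaining = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


node_a, node_b = Node("A"), Node("B")
node_a.up = False                                    # node A is down
coordinate_write(node_b, [node_a, node_b], "hulk", "Bruce Banner")
node_a.up = True                                     # node A comes back
replay_hints(node_b)                                 # node B hands off the hint
print(node_a.data)
```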

Bloom Filters:

A Bloom filter is a probabilistic structure that checks whether the requested row might exist in an SSTable before any disk I/O is done. It can return false positives, but never false negatives. High memory consumption can result from the Bloom filter false-positive ratio being set too low. The higher the Bloom filter setting, the lower the memory consumption.
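The following is a minimal Bloom filter in Python, written only to illustrate the idea. The sizes, the SHA-256 based hashing, and the class itself are assumptions made for this sketch and do not reflect Cassandra's internal implementation.

```python
import hashlib


class BloomFilter:
    """A toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ["Iron Man", "Thor", "Hulk"]:
    sstable_filter.add(row_key)

if sstable_filter.might_contain("Thor"):
    print("Thor may be in this SSTable, so it is worth reading from disk")
```

A larger bit array lowers the false-positive rate at the cost of more memory, which is the trade-off described above.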

Compaction:

A compaction operation in Cassandra is performed in order to merge SSTables. During compaction, the data in SSTables is merged: the keys are merged, columns are combined, tombstones are discarded, and a new index is created. Compaction frees up space by merging the accumulated data files.

Tombstones:

In Cassandra, deleted data is not immediately purged from the disk. Instead, Cassandra writes a special value, known as a tombstone, to indicate that data has been deleted. Tombstones prevent deleted data from being returned during reads, and will eventually allow the data to be dropped via compaction.

Tombstones are writes — they go through the normal write path, take up space on disk, and make use of Cassandra’s consistency mechanisms. Tombstones can be propagated across the cluster via hints and repairs. If a cluster is managed properly, this ensures that data will remain deleted even if a node is down when the delete is issued.
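How compaction and tombstones interact can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each SSTable is a plain dict of key to (timestamp, value), TOMBSTONE is a made-up marker, and tombstones are purged immediately, whereas Cassandra only drops them after the table's gc_grace_seconds has passed.

```python
TOMBSTONE = object()  # hypothetical marker standing in for a deletion record


def compact(sstables, purge_tombstones=True):
    """Merge SSTables, keeping the newest cell per key and optionally dropping tombstones."""
    latest = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            # The cell with the highest timestamp wins.
            if key not in latest or timestamp > latest[key][0]:
                latest[key] = (timestamp, value)
    if purge_tombstones:
        latest = {k: cell for k, cell in latest.items() if cell[1] is not TOMBSTONE}
    # Return the merged data sorted by key, like a freshly written SSTable.
    return dict(sorted(latest.items(), key=lambda item: item[0]))


older = {"hulk": (1, "Bruce Banner"), "thor": (1, "Thor Odinson")}
newer = {"hulk": (2, TOMBSTONE), "thor": (3, "Thor")}

print(compact([older, newer]))
```

Until compaction runs, the tombstone in the newer SSTable is what hides the older value of the deleted key from reads.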

Setup and Practical

Cassandra can run on bare-metal systems, on VMs in the cloud, or in containers. You can follow the well-documented installation steps that match your requirements.

For this blog post we will run a Cassandra cluster on Kind (Kubernetes).

You can use a config file (kind.yml in the command below) to boot a Kind cluster on your laptop.

To create the Kind cluster, run the command below:

kind create cluster --config kind.yml

Once the Kind cluster is ready, deploy the Cassandra cluster as a Kubernetes StatefulSet.

Use the service and StatefulSet manifest files named in the commands below.

To deploy, run the commands below:

kubectl apply -f cassandra-service.yml
kubectl apply -f cassandra-statefulset.yaml

Validate

❯ kubectl get svc cassandra
NAME        TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
cassandra   ClusterIP   None         <none>        9042/TCP   3m
❯ kubectl get statefulset cassandra
NAME        READY   AGE
cassandra   2/3     4m2s
❯ kubectl get pods -l="app=cassandra"
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          5m26s
cassandra-1   1/1     Running   0          3m18s
cassandra-2   0/1     Running   0          94s

Nodetool:

Nodetool is a command-line utility for interacting with a Cassandra cluster. It supports many subcommands; please refer to the nodetool documentation for the full list. For example, to check the status of the cluster:

kubectl exec -it cassandra-0 -- nodetool status
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.18.0.6  84.81 KiB   32      66.4%             e1eae515-c79e-4bbe-807d-9de479cdcefa  Rack1-K8Demo
UN  172.18.0.4  104.55 KiB  32      63.1%             ff955068-179a-401d-8600-1a5d0b3e4cde  Rack1-K8Demo
UN  172.18.0.5  108.9 KiB   32      70.5%             fee85da5-3338-4956-80ec-0091668ab2d4  Rack1-K8Demo

To get into the Cassandra shell (cqlsh):

kubectl exec -it cassandra-0 -- cqlsh

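Let's create a sample keyspace. Note that unquoted identifiers are case-insensitive in CQL, so the keyspace is stored as avengersinfo.

CREATE KEYSPACE avengersInfo WITH replication = {'class' : 'SimpleStrategy', 'replication_factor' : 2};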

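Let's create the avengers table. The table name is qualified with the keyspace, so the statement works without running USE first.

CREATE TABLE avengersInfo.avengers (
    character text,
    name text,
    year int,
    PRIMARY KEY (character)
);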

To access Cassandra Programmatically:

We will use Python to access the Cassandra data. First, install the driver:

pip3 install cassandra-driver
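With the driver installed, a minimal sketch of reading and writing the avengers table might look like this. It assumes that the keyspace and table created above exist and that Cassandra is reachable on 127.0.0.1:9042, for example through kubectl port-forward pod/cassandra-0 9042:9042. The helper function names and the sample row are made up for illustration.

```python
def save_avenger(session, character, name, year):
    """Insert (or overwrite) one row, using a parameterized statement."""
    session.execute(
        "INSERT INTO avengersinfo.avengers (character, name, year) "
        "VALUES (%s, %s, %s)",
        (character, name, year),
    )


def find_avenger(session, character):
    """Look up rows by primary key and return them as a list."""
    rows = session.execute(
        "SELECT character, name, year FROM avengersinfo.avengers "
        "WHERE character = %s",
        (character,),
    )
    return list(rows)


def main():
    # Imported here so the helpers above can be exercised without the driver installed.
    from cassandra.cluster import Cluster

    # Assumed contact point: a local port-forward to one of the Cassandra pods.
    cluster = Cluster(["127.0.0.1"], port=9042)
    session = cluster.connect()
    try:
        save_avenger(session, "Iron Man", "Tony Stark", 2008)  # sample data
        for row in find_avenger(session, "Iron Man"):
            print(row.character, row.name, row.year)
    finally:
        cluster.shutdown()
```

Call main() once the port-forward is running. The queries filter on character because it is the partition key of the table, which lets Cassandra route the request straight to the nodes that own the row.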
