Apache Cassandra Flashcards

1
Q

What is Apache Cassandra?

A

A free, open-source, distributed NoSQL database designed to handle large amounts of data on commodity hardware. It is optimized for write-heavy workloads.

2
Q

What is the shape of sharded clusters in Cassandra?

A

The nodes form a ring, and data is placed on it using consistent hashing.

3
Q

Cassandra data model

A

A column family is the way data is stored and organized.
A table is a two-dimensional view of a multi-dimensional column family.
Operations on tables are performed using the Cassandra Query Language (CQL).
A column family has to be defined beforehand, but its columns are not fixed and can be added at any time.

4
Q

Cassandra Database Elements (components)

A

Cluster - Container of keyspaces
Keyspace - Corresponds to a database
Column Family - Set of rows with similar structure
CQL Table - Table as seen through the Cassandra Query Language

A key in Cassandra can consist of more than one column.

5
Q

Command to create Keyspace in Cassandra

A

CREATE KEYSPACE ABC WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3} AND durable_writes = true;

6
Q

NoSQL Database types

A
1) Key-Value stores
Amazon DynamoDB
Voldemort
Citrusleaf
Membase
Riak
etc.
2) Document Database
MongoDB
Couch One
Terrastore
OrientDB
3) BigTable clones
BigTable (Google)
Cassandra
HBase
Hypertable
4) Graph Databases
FlockDB (Twitter)
AllegroGraph
DEX
InfoGrid
Neo4j
7
Q

What is Dynamo? And who wrote Dynamo White Paper?

A

The Dynamo white paper was written at Amazon. Dynamo is a highly available key-value store, built to solve the shopping cart problem.
It focused on how to build a data store that is:
1) Reliable
2) Always on
3) Performant
It was not entirely new; it cited 24 other white papers.

8
Q

What is BigTable? And who wrote the white paper?

A

BigTable is a high-volume, sequential-access datastore. The paper was written at Google.

1) Richer data model
2) One key, lots of values
3) Fast sequential access
4) 38 papers cited

9
Q

What is Cassandra? And who wrote the Cassandra white paper?

A

It blended the Dynamo and BigTable papers.
It took its distributed features from Dynamo.
It took its data model and storage from BigTable.
It was written at Facebook.

10
Q

Do the nodes in a Cassandra cluster share anything?

A

No, Cassandra is based on a shared-nothing architecture.

11
Q

Basics of Cassandra Replication

A

Cassandra is fully replicated.
Clients write to the local data center.
Data syncs across the WAN.
The replication factor is set per data center.

This is what differentiates Cassandra from other databases.

12
Q

Is replication synchronous or asynchronous?

A

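Asynchronous. The coordinator sends each write to all replicas but waits only for as many acknowledgements as the consistency level requires.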

13
Q

Is there any elected master or leader in Cassandra?

A

No. Cassandra has no master, no slaves, and no elected leader.

14
Q

Explain briefly the write process of Cassandra on a single node.

A

1) The Cassandra client fires a query: UPDATE users SET firstname = 'Patrick' WHERE id = 'pmcfadin';
2) As soon as Cassandra receives the query, the first thing it does is turn it into a mutation, just like in Apache Thrift.

3) The mutation is first written to the commit log. This ensures durability: once the server has received the data, you want that data to stay there.
The commit log is append-only, so it is very fast. On a spinning disk, the spindle does no random seeks; it just writes the data sequentially.

4) The mutation is then applied to the memtable. A memtable is an in-memory structure for a table (here, users). Each row in it is identified by its primary key and can have a very large number of columns attached (billions), a design inspired by BigTable. The memtable holds the rows in memory before they reach disk.
5) The write is acknowledged to the client. The write path is this simple, which is why writes are so fast and scaling is easier.
6) When the memtable fills up, its contents are flushed to disk in a file called an SSTable (sorted string table), which is immutable. The flush is a sequential write. This is very different from relational databases: Cassandra does sequential I/O where a relational database does random I/O.

Because the write is sequential, order is preserved on disk, and reading the data back is also a sequential read. That is why Cassandra suits time-series data so well: the access pattern is sequential too.

7) What happens if the same row is updated twice? The SSTables now hold two records for it, the old one and the new one. Cassandra picks the one with the latest timestamp, but having to choose among multiple records is inefficient. To solve that, it runs compaction: it merge-sorts the records and writes a new, compacted SSTable.
So disk usage keeps going up and down, which is normal.

With Cassandra you defer I/O until later to keep reads and writes fast; in effect, the order of I/O is changed.
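
The write path above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not how Cassandra is implemented: the class name Node, the memtable_limit threshold and the in-memory lists standing in for files are all hypothetical, and the flush runs inside write() here, whereas the real flush happens in the background after the acknowledgement.

```python
class Node:
    """Toy single-node write path: commit log -> memtable -> SSTable flush."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # in-memory rows: key -> (timestamp, columns)
        self.sstables = []     # each flush appends one immutable, sorted tuple
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, columns, timestamp):
        # Step 3: append the mutation to the commit log for durability.
        self.commit_log.append((timestamp, key, columns))
        # Step 4: apply it to the memtable, merging with columns already there.
        _, existing = self.memtable.get(key, (0, {}))
        self.memtable[key] = (timestamp, {**existing, **columns})
        # Step 6 (simplified): flush as soon as the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        # Step 5: acknowledge to the client.
        return "ack"

    def flush(self):
        # Write rows out in sorted key order as one immutable SSTable.
        rows = sorted(self.memtable.items(), key=lambda item: item[0])
        self.sstables.append(tuple((key, ts, cols) for key, (ts, cols) in rows))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


node = Node(memtable_limit=2)
node.write("pmcfadin", {"firstname": "Patrick"}, timestamp=1)
node.write("alice", {"firstname": "Alice"}, timestamp=2)  # triggers a flush
print(node.sstables)
```

The SSTable is stored as a tuple so that it cannot be modified after the flush, which mirrors the immutability described in step 6.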

15
Q

What is sstable?

A

An SSTable (sorted string table) is an on-disk file in which records are stored in sorted key order, and it is immutable: once written, its contents can never be changed. So you get sequential writes on disk, as opposed to relational databases, where reads and writes are random.

16
Q

Given that an SSTable is immutable, what happens if the same record is modified multiple times? How is the latest value maintained?

A

The SSTables will hold multiple entries for the same record, and the one with the latest timestamp is always chosen. So you get the latest data, but having to choose between multiple copies is inefficient. To overcome that, compaction is performed: a merge sort is applied and only the latest values survive. Compaction generates a new SSTable file. So we have big sequential reads and big sequential writes, and no random reads or writes.
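
A minimal sketch of that compaction step in Python, assuming each SSTable is a tuple of (key, timestamp, value) records already sorted by key. The record layout and the function name compact are assumptions made for illustration; real compaction also deals with tombstones and per-column timestamps, which are left out here.

```python
import heapq


def compact(sstables):
    """Merge sorted SSTables, keeping only the newest record for each key."""
    # heapq.merge performs the merge step of a merge sort over sorted inputs.
    merged = heapq.merge(*sstables, key=lambda record: (record[0], record[1]))
    result = []
    for key, timestamp, value in merged:
        if result and result[-1][0] == key:
            # Same key again: records arrive in timestamp order, so this one is newer.
            result[-1] = (key, timestamp, value)
        else:
            result.append((key, timestamp, value))
    return tuple(result)  # the new, immutable SSTable


old = (("alice", 1, "a@old.example"), ("bob", 2, "b@example.com"))
new = (("alice", 5, "a@new.example"),)
compacted = compact([old, new])
print(compacted)
```

The inputs are never modified; the function only reads them sequentially and returns a new tuple, matching the "new SSTable file" behaviour described above.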

18
Q

How is the placement of data decided in a cluster?

A

The partition key (the first part of the primary key) decides where data is placed. It is passed through a hash function: MD5, which yields a 128-bit number (RandomPartitioner), or Murmur3, which yields a 64-bit token (Murmur3Partitioner). Suppose you have 4 nodes: the token space is split into 4 ranges, and the record is stored on the node whose range contains the hash of its key. This gives a good distribution of data, and the placement is consistent.
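
The 4-node example can be sketched in Python as follows. This is a simplification under stated assumptions: the node names A to D, the equal-sized ranges and the helper names token and owner are hypothetical, and the full unsigned 128-bit MD5 value is used as the token, which is not exactly how Cassandra's partitioners derive theirs.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 128  # MD5 digests are 128 bits wide

NODES = ["A", "B", "C", "D"]
# Exclusive upper bound of each node's token range (four equal quarters).
RANGE_ENDS = [RING_SIZE * (i + 1) // len(NODES) for i in range(len(NODES))]


def token(partition_key):
    """Hash a partition key to an integer position on the ring."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def owner(partition_key):
    """Return the node whose token range contains the key's hash."""
    index = bisect.bisect_right(RANGE_ENDS, token(partition_key))
    return NODES[index]


# The same key always lands on the same node.
print(owner("pmcfadin"), owner("pmcfadin"))
```

Because the hash output is spread evenly over the token space, a large set of keys ends up divided roughly equally among the four ranges.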

19
Q

How does replication happen in Cassandra?

A

Suppose data is written to Node C and the replication factor is 3. It will then also be written to Node D and Node A, the next nodes on the ring. That is how replication is done. If a node goes down and comes back up, the three replicas of a record may disagree. On a quorum read, a majority of the replicas (2 out of 3) must respond, and among their responses the value with the latest timestamp wins.
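
The Node C example can be sketched in a few lines of Python. It assumes a four-node ring A, B, C, D and a simple "walk clockwise from the owner" placement; the function name replicas is hypothetical, and rack- or data-center-aware placement is not modelled.

```python
RING = ["A", "B", "C", "D"]  # nodes in ring order (assumed for illustration)


def replicas(first_node, ring=RING, replication_factor=3):
    """Return the owner plus the next nodes clockwise on the ring."""
    start = ring.index(first_node)
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]


print(replicas("C"))  # ['C', 'D', 'A']
```

The modulo wraps the walk around the end of the list, which is what makes Node A follow Node D.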

20
Q

Is hashing really based on ranges in Cassandra, or on something fancier that allows new servers to be added without much penalty?

A

Consistent hash rings with virtual nodes are used: each server is assigned multiple locations on the hash ring. All values that fall in a virtual node belonging to server 0 go to that server.
See the GKCS YouTube video on consistent hashing for details.
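
A minimal sketch of a consistent hash ring with virtual nodes, in Python. The class name HashRing, the choice of 8 virtual nodes per server and the way vnode positions are derived by hashing "server#vnodeN" are all assumptions for illustration; Cassandra assigns vnode tokens differently.

```python
import bisect
import hashlib


def ring_hash(value):
    """Map a string to an integer position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class HashRing:
    def __init__(self, vnodes_per_server=8):
        self.vnodes_per_server = vnodes_per_server
        self.tokens = []   # sorted positions of all virtual nodes
        self.owners = {}   # position -> server that owns that virtual node

    def add_server(self, server):
        # Each server gets several positions scattered around the ring.
        for i in range(self.vnodes_per_server):
            position = ring_hash("%s#vnode%d" % (server, i))
            self.owners[position] = server
            bisect.insort(self.tokens, position)

    def lookup(self, key):
        # The first virtual node clockwise from the key's hash owns the key.
        index = bisect.bisect_right(self.tokens, ring_hash(key)) % len(self.tokens)
        return self.owners[self.tokens[index]]


ring = HashRing()
for name in ("server0", "server1", "server2"):
    ring.add_server(name)

keys = ["user%d" % i for i in range(1000)]
before = {key: ring.lookup(key) for key in keys}
ring.add_server("server3")
after = {key: ring.lookup(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(len(moved), "of", len(keys), "keys moved")
```

Every key that moves ends up on the newly added server; keys never shuffle between the existing servers, which is why adding a server carries little penalty.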

21
Q

How does a client read happen in a cluster?

A

A Cassandra client is aware of all the nodes in the cluster, so a read can go to any node. Suppose the read request arrives at Node 1, while Node 3 is the designated primary for the data and two other nodes hold the replicas. Node 1 then acts as coordinator for the client. Because it knows the tokens and how data is distributed, it sends requests to the primary and the replicas. Once the coordinator, Node 1, has received the responses, it returns the data to the client.

22
Q

Define consistency levels in Cassandra

A

The consistency level is what makes an application resilient to failures while keeping the data consistent.
The consistency level is set on each read and write.
1) ONE - One replica acks
2) QUORUM - A majority (more than 50%) of replicas ack
3) LOCAL_QUORUM - A majority (more than 50%) of local DC replicas ack
4) LOCAL_ONE - One replica in the local DC acks; read repair only in local DC
5) TWO - Two replicas ack
6) ALL - All replicas ack, full consistency.
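
The quorum arithmetic can be worked through in Python. The function names below are hypothetical; the formulas are the usual ones: a quorum is floor(RF / 2) + 1 replicas, and a read is guaranteed to overlap the latest write when the read and write acknowledgement counts add up to more than the replication factor.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_acks, write_acks, replication_factor):
    """True when every read set must overlap every write set."""
    return read_acks + write_acks > replication_factor


rf = 3
print(quorum(rf))                                          # 2
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True: QUORUM + QUORUM
print(reads_see_latest_write(1, 1, rf))                    # False: ONE + ONE
```

With a replication factor of 3, QUORUM tolerates one replica being down, while ALL tolerates none.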

23
Q

CQL Table (Cassandra Query Language) table creation

A

The table creation syntax is similar to what you would write in an RDBMS, but the main difference is that there is no sizing. A column value can be of any size, expandable up to 2 GB.

CREATE TABLE users ( username varchar, firstname varchar, lastname varchar, email list<varchar>, password varchar, created_date timestamp, PRIMARY KEY (username) );

By default, the first column in the primary key becomes the partition key.

24
Q

CQL Table Insert

A

INSERT INTO users (username, firstname, lastname, email, password, created_date) VALUES ('nandupathai', 'Narendra', 'Pathai', ['narendra.pathai@gmail.com'], '123124', '2012-12-12');

25
Q

What happens if we insert the same record twice?

A

By default an insert overwrites the existing row, unless we explicitly tell it to check first.

INSERT INTO users … IF NOT EXISTS;

This uses Paxos to lock and check whether the row exists, and inserts only if it does not.