Apache Cassandra Basics Flashcards
(104 cards)
What is Apache Cassandra and what are its core characteristics as defined in “Cassandra: The Definitive Guide”?
Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent database. Its distribution design is based on Amazon’s Dynamo, and its data model is based on Google’s Bigtable.
How does Cassandra’s architectural choice differ from MongoDB, and what advantage does this offer?
MongoDB uses a primary-secondary architecture, whereas Cassandra uses a simpler peer-to-peer architecture. Every Cassandra node is identical and can accept any read or write request, so there is no single point of failure, and installation is simpler because no node needs a special role.
Explain the role of the Commit Log in Cassandra’s write operations and its importance.
The Commit Log in Cassandra is an append-only log that captures all local mutations on a node before they are written to a Memtable. This ensures data durability: after an unexpected shutdown, any mutations that were not yet persisted are replayed from the Commit Log on restart, so acknowledged writes are recovered.
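The write path described above can be sketched as a toy model. This is a minimal illustration, not Cassandra's implementation: the `Node` class and its fields are hypothetical, the log is a plain Python list standing in for an on-disk file, and flushing to SSTables is left out.

```python
class Node:
    """Toy model of a node's write path: commit log first, then memtable."""

    def __init__(self, commit_log=None):
        # The commit log survives a crash; the in-memory memtable does not.
        self.commit_log = commit_log if commit_log is not None else []
        self.memtable = {}
        # Replay on startup: re-apply every logged mutation in order.
        for key, value in self.commit_log:
            self.memtable[key] = value

    def write(self, key, value):
        self.commit_log.append((key, value))  # append to the log first
        self.memtable[key] = value            # then update the memtable


node = Node()
node.write("user:1", "alice")
node.write("user:1", "alicia")
node.write("user:2", "bob")

# Simulate an unexpected shutdown: the memtable is lost, the log remains.
restarted = Node(commit_log=node.commit_log)
print(restarted.memtable)  # {'user:1': 'alicia', 'user:2': 'bob'}
```

Replaying the log in order reproduces the memtable exactly, which is why the append has to happen before the in-memory update.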
Describe the two main roles of a Primary Key in Cassandra and how they relate to data modeling.
The primary key in Cassandra has two main roles: to optimize the read performance of queries and to provide uniqueness to the entries. Data modeling in Cassandra is query-driven, meaning the primary key should be built based on the specific queries intended to be answered.
What is a Keyspace in Cassandra, and what is the recommended practice for its usage?
A Keyspace is the highest-level organizational unit for data in Cassandra, akin to a database in relational systems, containing one or more tables. It also defines options like the replication strategy for its tables. It is generally encouraged to use one keyspace per application for better organisation.
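As a rough illustration of how a keyspace carries its replication settings, the sketch below builds a CQL `CREATE KEYSPACE` statement as a string. The helper name `create_keyspace_cql` and the keyspace name `shop` are made up for this example. `SimpleStrategy` is used only because it is the simplest option; multi-datacenter clusters would normally use `NetworkTopologyStrategy`.

```python
def create_keyspace_cql(name: str, replication_factor: int) -> str:
    """Build a CREATE KEYSPACE statement (hypothetical helper, SimpleStrategy only)."""
    if not name.isidentifier():
        raise ValueError(f"unsupported keyspace name: {name!r}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


# One keyspace per application, each with its own replication settings.
print(create_keyspace_cql("shop", 3))
```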
How does a Partition Key determine data locality within a Cassandra cluster?
The Partition Key is the mandatory component of the primary key. Cassandra applies a hash function to it, and the resulting hash determines data locality.
This hash is then used to identify which node in the cluster, and its subsequent replicas, will store the data. This ensures an even spread of data across the cluster and enables efficient routing of queries to specific nodes.
All rows with the same partition key go in the same partition.
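The mapping from partition key to nodes can be sketched as a small token ring. This is a simplified model under stated assumptions: SHA-256 stands in for Cassandra's real partitioner, each node owns a single token derived from its name, and replicas are simply the next nodes clockwise on the ring. The `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, node_names, replication_factor):
        # Each node owns one token, derived here from its name for simplicity.
        self.ring = sorted((token_for(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.ring]
        self.rf = replication_factor

    def replicas_for(self, partition_key):
        # First node whose token is >= the key's token, wrapping around the ring.
        start = bisect.bisect_left(self.tokens, token_for(partition_key)) % len(self.ring)
        # The remaining replicas are the next nodes clockwise.
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas_for("user:42"))
# The same partition key always resolves to the same replicas.
print(ring.replicas_for("user:42") == ring.replicas_for("user:42"))  # True
```

Because the placement depends only on the hash of the partition key, any node can work out where a row lives without consulting a central directory.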
Explain the purpose of a Clustering Key and how it influences data storage and retrieval within a partition.
A Clustering Key specifies the order (ascending or descending) in which data is arranged inside a partition. It optimises the retrieval of similar column data within a partition and contributes to the uniqueness of entries. This is crucial for improving read query performance, especially in large partitions, by reducing the amount of data that needs to be read.
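A minimal sketch of why ordering by clustering key helps reads: if rows inside a partition are kept sorted, a range query only needs to touch the matching slice. The `Partition` class and the sensor-reading data are hypothetical, and an in-memory list stands in for Cassandra's on-disk layout.

```python
import bisect


class Partition:
    """Rows of one partition, kept sorted by clustering key."""

    def __init__(self):
        self.keys = []  # clustering keys in ascending order
        self.rows = {}  # clustering key -> row value

    def upsert(self, clustering_key, row):
        if clustering_key not in self.rows:
            bisect.insort(self.keys, clustering_key)
        self.rows[clustering_key] = row

    def slice(self, start, end):
        """Return rows with start <= clustering key < end."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, end)
        return [(key, self.rows[key]) for key in self.keys[lo:hi]]


readings = Partition()  # e.g. all readings for one sensor
readings.upsert("2024-01-03", 19.5)
readings.upsert("2024-01-01", 21.0)
readings.upsert("2024-01-02", 20.2)
readings.upsert("2024-01-04", 18.9)

print(readings.slice("2024-01-02", "2024-01-04"))
# [('2024-01-02', 20.2), ('2024-01-03', 19.5)]
```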
Differentiate between a “static table” and a “dynamic table” in Cassandra’s data model.
A static table in Cassandra has a primary key composed solely of the partition key, without any clustering keys. This means each partition in a static table contains only one entry.
In contrast, a dynamic table’s primary key includes both a partition key and one or more clustering keys, allowing partitions to grow dynamically with multiple distinct entries.
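The difference can be illustrated with two hypothetical schemas and a small row counter. The table and column names are invented for this example, and the counter is only a model of primary-key uniqueness, not of Cassandra's storage.

```python
from collections import defaultdict

# Hypothetical CQL schemas (table and column names are illustrative).
STATIC_TABLE = """
CREATE TABLE users (
    user_id uuid,
    name text,
    PRIMARY KEY (user_id)
)"""

DYNAMIC_TABLE = """
CREATE TABLE user_logins (
    user_id uuid,
    login_time timestamp,
    ip text,
    PRIMARY KEY ((user_id), login_time)
)"""


def rows_per_partition(primary_keys):
    """Count distinct rows per partition.

    Each primary key is a tuple: (partition_key, *clustering_keys).
    """
    partitions = defaultdict(set)
    for pk in primary_keys:
        partitions[pk[0]].add(pk[1:])
    return {partition: len(rows) for partition, rows in partitions.items()}


# Static table: writing the same partition key twice still yields one row.
static_writes = [("u1",), ("u1",), ("u2",)]
# Dynamic table: each new clustering key adds a row to the partition.
dynamic_writes = [("u1", "09:00"), ("u1", "17:30"), ("u2", "08:15")]

print(rows_per_partition(static_writes))   # {'u1': 1, 'u2': 1}
print(rows_per_partition(dynamic_writes))  # {'u1': 2, 'u2': 1}
```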
What is the Cassandra Query Language (CQL), and how does its syntax compare to SQL?
CQL, or Cassandra Query Language, is the primary language for interacting with Apache Cassandra clusters. It has a simple, intuitive SQL-like syntax for operations like creating tables, inserting, updating, deleting, and selecting data. However, CQL differs significantly from SQL: it does not support joins, and inserts, updates, and deletes are applied as writes without first reading the existing data.
Describe two ways in which CQL queries can be executed in Apache Cassandra.
CQL queries can be executed in two primary ways: programmatically, using a Cassandra client driver available for various languages like Java, Python, and Scala, or via the CQL Shell client (cqlsh). cqlsh is a Python-based command-line shell provided with the Cassandra package that connects to a single node.
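For the programmatic route, a minimal sketch using the DataStax Python driver (the `cassandra-driver` package) might look like the following. The function name `fetch_release_version`, the host, and the port are assumptions for illustration. The code needs the driver installed and a reachable node, so the function is defined here but not called.

```python
def fetch_release_version(host="127.0.0.1", port=9042):
    """Connect to one contact point and read the server's Cassandra version."""
    # Imported here so this module still loads when the driver is not installed.
    from cassandra.cluster import Cluster

    cluster = Cluster([host], port=port)
    try:
        session = cluster.connect()
        row = session.execute("SELECT release_version FROM system.local").one()
        return row.release_version
    finally:
        # Always release the driver's connections, even if the query fails.
        cluster.shutdown()
```

The same query can be typed at the cqlsh prompt after starting the shell with `cqlsh 127.0.0.1 9042`.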
What is Aggregation?
The process of summarizing and computing data values.
Aggregation helps in data analysis and reporting.
What does Availability mean in the context of the CAP theorem?
It means that a distributed system remains operational and responsive, even in the presence of failures or network partitions.
Availability ensures that requests can still be served despite issues.
Define BSON.
A binary-encoded serialization format, similar to JSON but designed for compactness and speed, used for efficient data storage and retrieval.
BSON is commonly used in MongoDB.
What is the CAP Theorem?
A theorem highlighting trade-offs in distributed systems, stating that during a network partition (P), a distributed system must choose between consistency (C) and availability (A).
CAP stands for Consistency, Availability, and Partition tolerance.
What is a Cluster in Cassandra?
A group of interconnected servers or nodes that work together to store and manage data in a NoSQL database, providing high availability and fault tolerance.
Clusters are essential for distributed databases.
What is a Clustering Key?
A component of the primary key that determines the order of data within a partition (ascending or descending) and optimizes retrieval of similar values.
Clustering keys enhance query performance.
What is a Commit Log?
An append-only log in Cassandra that captures all local mutations on a node before data is written to a Memtable, ensuring data durability and recovery upon restart.
The commit log plays a critical role in data safety.
Define Consistency in the context of the CAP theorem.
It refers to the guarantee that all nodes in a distributed system have the same data at the same time.
Traditional relational (SQL) databases typically provide strong consistency, whereas Cassandra uses a tunable consistency model: you choose the consistency level for each read or write based on your app’s needs.
Consistency is vital for ensuring data accuracy.
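The arithmetic behind tunable consistency can be sketched briefly. With replication factor RF, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > RF. The function names below are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every acknowledged write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                                          # 2
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True
print(reads_see_latest_write(1, 1, rf))                    # False
```

Reading and writing at quorum trades some latency for overlap. Reading and writing at a single replica is faster but can return stale data.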
What is a Coordinator Node?
The node in a Cassandra cluster that manages a client request (write or read) and interacts with other nodes to replicate data or retrieve information based on the configured consistency level.
The coordinator node orchestrates data operations.
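A simplified sketch of the coordinator's decision for a write: it forwards the mutation to all replicas and reports success once enough of them acknowledge. The names `REQUIRED_ACKS` and `coordinate_write` are hypothetical, only three consistency levels are modeled, and unreachable replicas simply never acknowledge.

```python
# Acknowledgements required for each consistency level, given the replica count.
REQUIRED_ACKS = {
    "ONE": lambda replica_count: 1,
    "QUORUM": lambda replica_count: replica_count // 2 + 1,
    "ALL": lambda replica_count: replica_count,
}


def coordinate_write(replica_up, consistency_level):
    """Return True if the write meets the consistency level.

    replica_up maps each replica name to whether it is reachable.
    """
    required = REQUIRED_ACKS[consistency_level](len(replica_up))
    acks = [name for name, up in replica_up.items() if up]  # replicas that acknowledged
    return len(acks) >= required


replicas = {"node-a": True, "node-b": True, "node-c": False}
print(coordinate_write(replicas, "ONE"))     # True
print(coordinate_write(replicas, "QUORUM"))  # True
print(coordinate_write(replicas, "ALL"))     # False
```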
What does CQL stand for?
Cassandra Query Language.
CQL is used for querying and managing data in Cassandra clusters.
What is CQL Shell (cqlsh)?
A Python-based command-line interface provided with the Cassandra package for interacting with Cassandra databases using CQL.
cqlsh allows users to execute CQL commands.
Define Data Locality.
The concept that data is stored close to where it is most frequently accessed or processed. In Cassandra, data locality is determined by the partition key.
Data locality improves performance.
What does Decentralized mean in Cassandra architecture?
An architectural characteristic where there is no single point of control or failure; every node in the cluster is identical and has equal status, communicating directly with others.
Decentralization enhances fault tolerance.
What is a Distributed system?
A system where data and processing are spread across multiple machines or nodes, but to users and applications, everything appears as a unified whole.
Distributed systems improve scalability and reliability.