Apache Cassandra Basics Flashcards

(104 cards)

1
Q

What is Apache Cassandra and what are its core characteristics as defined in “Cassandra: The Definitive Guide”?

A

Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, row-oriented database. Its distribution design is based on Amazon’s Dynamo, and its data model is based on Google’s Bigtable.

2
Q

How does Cassandra’s architectural choice differ from MongoDB, and what advantage does this offer?

A

MongoDB uses a primary-secondary architecture, whereas Cassandra employs a simpler peer-to-peer architecture. This peer-to-peer design makes Cassandra one of the friendliest NoSQL database installations, as every node is identical and can handle all database operations independently, eliminating a single point of failure.

3
Q

Explain the role of the Commit Log in Cassandra’s write operations and its importance.

A

The Commit Log in Cassandra functions as an append-only log that captures all local mutations on a node before they are written to a Memtable. This ensures data durability, as any unpersisted mutations can be applied from the Commit Log upon restarting after an unexpected shutdown, guaranteeing consistency and recovery.

4
Q

Describe the two main roles of a Primary Key in Cassandra and how they relate to data modeling.

A

The primary key in Cassandra has two main roles: to optimize the read performance of queries and to provide uniqueness to the entries. Data modeling in Cassandra is query-driven, meaning the primary key should be built based on the specific queries intended to be answered.

5
Q

What is a Keyspace in Cassandra, and what is the recommended practice for its usage?

A

A Keyspace is the highest-level organizational unit for data in Cassandra, akin to a database in relational systems, containing one or more tables. It also defines options like the replication strategy for its tables. It is generally encouraged to use one keyspace per application for better organisation.
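
As a rough illustration of the one-keyspace-per-application practice, the sketch below creates a keyspace and one table inside it. The names `shop` and `users` are made up for the example, and `SimpleStrategy` with a replication factor of 1 is only an assumption that suits a single-node test cluster.

```cql
-- Hypothetical keyspace for one application; replication settings are
-- an assumption suited to a single-node test cluster.
CREATE KEYSPACE IF NOT EXISTS shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Make it the working keyspace for the statements that follow.
USE shop;

-- Tables created now live inside the shop keyspace and inherit its replication.
CREATE TABLE IF NOT EXISTS users (
    user_id uuid PRIMARY KEY,
    name    text
);
```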

6
Q

How does a Partition Key determine data locality within a Cassandra cluster?

A

A Partition Key is the mandatory component of a primary key. Cassandra applies a hash function to its value to determine data locality.

This hash is then used to identify which node in the cluster, and its subsequent replicas, will store the data. This ensures an even spread of data across the cluster and enables efficient routing of queries to specific nodes.

All rows with the same partition key go in the same partition.
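
To see the hash that drives placement, CQL's built-in `token()` function can be applied to the partition key. The sketch below is only illustrative: the `demo` keyspace and `users_by_id` table are hypothetical, and the token values returned depend on the partitioner in use.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.users_by_id (
    user_id int PRIMARY KEY,   -- partition key
    name    text
);

INSERT INTO demo.users_by_id (user_id, name) VALUES (123, 'Alice');
INSERT INTO demo.users_by_id (user_id, name) VALUES (456, 'Bob');

-- token(user_id) shows the hash of each partition key, i.e. the position
-- on the token ring that decides which nodes own the partition.
SELECT token(user_id), user_id, name FROM demo.users_by_id;
```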

7
Q

Explain the purpose of a Clustering Key and how it influences data storage and retrieval within a partition.

A

A Clustering Key specifies the order (ascending or descending) in which data is arranged inside a partition. It optimises the retrieval of similar column data within a partition and contributes to the uniqueness of entries. This is crucial for improving read query performance, especially in large partitions, by reducing the amount of data that needs to be read.
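
A small sketch of how a clustering key might be declared and used. The `demo` keyspace, the `sensor_readings` table and its columns are assumptions made up for this example.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.sensor_readings (
    sensor_id    int,         -- partition key
    reading_time timestamp,   -- clustering key
    temperature  double,
    PRIMARY KEY ((sensor_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);  -- newest rows first within a partition

INSERT INTO demo.sensor_readings (sensor_id, reading_time, temperature)
VALUES (1, '2024-01-01 10:00:00+0000', 20.5);
INSERT INTO demo.sensor_readings (sensor_id, reading_time, temperature)
VALUES (1, '2024-01-01 11:00:00+0000', 21.0);

-- A range on the clustering key reads only a slice of the partition.
SELECT reading_time, temperature
FROM demo.sensor_readings
WHERE sensor_id = 1
  AND reading_time >= '2024-01-01 10:30:00+0000';
```

Because rows are already stored in `reading_time` order, the range query can read one contiguous slice instead of the whole partition.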

8
Q

Differentiate between a “static table” and a “dynamic table” in Cassandra’s data model.

A

A static table in Cassandra has a primary key composed solely of the partition key, without any clustering keys. This means each partition in a static table contains only one entry.

In contrast, a dynamic table’s primary key includes both a partition key and one or more clustering keys, allowing partitions to grow dynamically with multiple distinct entries.
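
The difference shows up directly in the `PRIMARY KEY` clause. The two table definitions below are a sketch with made-up names (`demo`, `user_profiles`, `orders_by_user`), not taken from any particular schema.

```cql
-- Hypothetical keyspace, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- "Static" table: primary key = partition key only, so one row per partition.
CREATE TABLE IF NOT EXISTS demo.user_profiles (
    user_id uuid,
    name    text,
    PRIMARY KEY (user_id)
);

-- "Dynamic" table: partition key plus a clustering key, so a partition
-- can hold many rows (one per order_id for a given user_id).
CREATE TABLE IF NOT EXISTS demo.orders_by_user (
    user_id  uuid,
    order_id timeuuid,
    total    decimal,
    PRIMARY KEY ((user_id), order_id)
);
```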

9
Q

What is the Cassandra Query Language (CQL), and how does its syntax compare to SQL?

A

CQL, or Cassandra Query Language, is the primary language for interacting with Apache Cassandra clusters. It has a simple, intuitive SQL-like syntax for operations like creating tables, inserting, updating, deleting, and selecting data. However, CQL differs significantly from SQL by not supporting joins and having different behaviors for operations like inserts, updates, and deletes which are performed directly in memory without prior reads.

10
Q

Describe two ways in which CQL queries can be executed in Apache Cassandra.

A

CQL queries can be executed in two primary ways: programmatically, using a Cassandra client driver available for various languages like Java, Python, and Scala, or via the CQL Shell client (cqlsh). cqlsh is a Python-based command-line shell provided with the Cassandra package that connects to a single node.

11
Q

What is Aggregation?

A

The process of summarizing and computing data values.

Aggregation helps in data analysis and reporting.

12
Q

What does Availability mean in the context of the CAP theorem?

A

It means that a distributed system remains operational and responsive, even in the presence of failures or network partitions.

Availability ensures that requests can still be served despite issues.

13
Q

Define BSON.

A

A binary-encoded serialization format, similar to JSON but designed for compactness and speed, used for efficient data storage and retrieval.

BSON is commonly used in MongoDB.

14
Q

What is the CAP Theorem?

A

A theorem highlighting trade-offs in distributed systems, stating that during a network partition (P), a distributed system must choose between consistency (C) or availability (A).

CAP stands for Consistency, Availability, and Partition tolerance.

15
Q

What is a Cluster in Cassandra?

A

A group of interconnected servers or nodes that work together to store and manage data in a NoSQL database, providing high availability and fault tolerance.

Clusters are essential for distributed databases.

16
Q

What is a Clustering Key?

A

A component of the primary key that determines the order of data within a partition (ascending or descending) and optimizes retrieval of similar values.

Clustering keys enhance query performance.

17
Q

What is a Commit Log?

A

An append-only log in Cassandra that captures all local mutations on a node before data is written to a Memtable, ensuring data durability and recovery upon restart.

The commit log plays a critical role in data safety.

18
Q

Define Consistency in the context of the CAP theorem.

A

It refers to the guarantee that all nodes in a distributed system have the same data at the same time.

Unlike traditional databases that follow strong consistency (like in SQL), Cassandra uses a tunable consistency model, where you choose the level of consistency based on your app’s needs.

Consistency is vital for ensuring data accuracy.

19
Q

What is a Coordinator Node?

A

The node in a Cassandra cluster that manages a client request (write or read) and interacts with other nodes to replicate data or retrieve information based on the configured consistency level.

The coordinator node orchestrates data operations.

20
Q

What does CQL stand for?

A

Cassandra Query Language.

CQL is used for querying and managing data in Cassandra clusters.

21
Q

What is CQL Shell (cqlsh)?

A

A Python-based command-line interface provided with the Cassandra package for interacting with Cassandra databases using CQL.

cqlsh allows users to execute CQL commands.

22
Q

Define Data Locality.

A

The concept that data is stored close to where it is most frequently accessed or processed. In Cassandra, data locality is determined by the partition key.

Data locality improves performance.

23
Q

What does Decentralized mean in Cassandra architecture?

A

An architectural characteristic where there is no single point of control or failure; every node in the cluster is identical and has equal status, communicating directly with others.

Decentralization enhances fault tolerance.

24
Q

What is a Distributed system?

A

A system where data and processing are spread across multiple machines or nodes, but to users and applications, everything appears as a unified whole.

Distributed systems improve scalability and reliability.

25
What is a Dynamic Table in Cassandra?
A Cassandra table whose primary key is composed of both a partition key and one or more clustering keys, allowing partitions to grow dynamically with multiple distinct entries. ## Footnote Dynamic tables offer flexibility in data modeling.
26
Define Fault Tolerance.
The ability of a system to continue operating without interruption despite the failure of one or more of its components. ## Footnote Fault tolerance is crucial for high availability.
27
What is the Gossip Protocol?
A peer-to-peer communication protocol used by Cassandra nodes to exchange details and information, keeping all nodes updated about the status and membership of the cluster. ## Footnote Gossip protocols help maintain cluster health.
28
True or False: Cassandra supports joins.
False. ## Footnote Cassandra does not support joins by design, focusing on denormalized data models.
29
What is a Keyspace in Cassandra?
The highest-level organizational unit for data, similar to a database in traditional relational databases, containing one or more tables and defining options like replication strategy. ## Footnote Keyspaces are essential for data organization.
30
What is a Memtable?
An in-memory data structure within a Cassandra node that buffers write operations before they are flushed to disk as immutable SSTables. ## Footnote Memtables improve write performance.
31
What is a Node in Cassandra?
A standalone unit in Cassandra's distributed system architecture, also known as a single Cassandra instance, which operates independently and communicates with other nodes. ## Footnote Nodes are the building blocks of a Cassandra cluster.
32
What is a Partition?
A group of rows that share the same partition key, and are stored together on disk (or in memory) on the same node. It's the fundamental unit of data storage in Cassandra. Data is grouped into partitions based on the partition key and distributed across nodes. ## Footnote Partitions enhance data distribution and retrieval.
33
What is a Partitioner?
A consistent hashing algorithm used by Cassandra to distribute data evenly across nodes in a cluster, preventing hotspots. ## Footnote The partitioner is crucial for load balancing.
34
What is a Partition Key?
The mandatory component of a primary key that determines how data is distributed across nodes in a cluster by hashing its value. ## Footnote The partition key is essential for data distribution.
35
What does Peer-to-Peer refer to in Cassandra?
The overall architecture where each node in the cluster has equal status and communicates directly with other nodes without relying on a central coordinator. ## Footnote Peer-to-peer architectures improve resilience.
36
What is a Primary Key?
Consists of one or more columns that uniquely identify rows in a table. It includes a partition key and, optionally, clustering columns, and it optimizes read performance. ## Footnote Primary keys are vital for data integrity.
37
What is Replication?
The process of creating and maintaining copies of data on multiple nodes to ensure data availability, reduce data loss, improve fault tolerance, and provide read scalability. ## Footnote Replication is critical for data durability.
38
What does Replication Factor mean?
A setting that specifies the number of nodes that will hold replicas for each partition of data in Cassandra. Example: If RF = 3 → There are 3 replicas (nodes) for each piece of data. ## Footnote The replication factor influences data availability.
39
Define Scalability.
The ability of a system, like Cassandra, to handle increased data and traffic by adding more nodes to the cluster, often achieving linear performance growth. ## Footnote Scalability is essential for handling growing workloads.
40
What are SSTables?
Immutable data files in Cassandra that store data persistently on disk, created by flushing Memtables or streaming from other nodes. ## Footnote SSTables are key for data persistence.
41
What is a Static Table?
A Cassandra table where the primary key contains only the partition key (single or multiple columns) and no clustering key. Partitions in static tables have only one entry. ## Footnote Static tables are simpler in structure.
42
What is a Table in Cassandra?
A collection of related data, organized into rows and columns, defined by a schema that includes a primary key. ## Footnote Tables are fundamental to data organization.
43
What are Time Series Use Cases?
Applications where data arrives continuously in time order and is appended rather than updated, often from sensors or logs. Cassandra is well-suited to them due to its fast write throughput and append-like data handling. ## Footnote Time series applications benefit from Cassandra's architecture.
44
What is Token Ring?
It is a logical structure to distribute data across the cluster, enabling efficient data location and management. It’s like a circular ring where each point represents a range of data values (tokens). Each node is responsible for a segment of this ring, i.e. a **range of tokens**. The token ring goes from the minimum token value to the maximum token value. ## Footnote Token ring helps in data partitioning.
45
What are Transactions in Cassandra?
Sequences of database operations (such as reading and writing data) that are treated as a single, indivisible unit. Cassandra has limited support for transactions. ## Footnote Transactions in Cassandra are not as robust as in traditional RDBMS.
46
What is Tunable Consistency?
A feature in Cassandra that allows developers to control the level of consistency for read and write operations, balancing consistency with availability based on application requirements. ## Footnote Tunable consistency provides flexibility for developers.
47
What is the Consistency Level (CL) and what are some common levels?
It defines how many replicas must respond for a read or write to be considered successful. Common levels are listed below.
* **ONE**: Only 1 replica must respond
* **QUORUM**: A majority of replicas must respond
* **ALL**: All replicas must respond
* **LOCAL_QUORUM**: A majority of replicas in the local data center must respond
* **ANY** (write only): The write must reach at least one node, even if it is only stored temporarily as a hint
48
Give one example for a read and another for a write of how Consistency Level works in practice.
🛠️ **WRITE**
Let’s say the **Replication Factor is set to 3** and we are writing to Cassandra with **Consistency Level = QUORUM**. Then the write is considered successful only if 2 out of 3 replicas confirm it.
📖 **READ**
Same setup. Cassandra reads from 2 of the 3 replicas and returns the most recent version (using timestamps to reconcile differences if needed).
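The same scenario can be tried from cqlsh, roughly as sketched below. This assumes a cluster with at least three nodes (at least two of them up), and the `demo_rf3` keyspace and `users` table are hypothetical. `CONSISTENCY` is a cqlsh shell command rather than a CQL statement. A quorum is floor(RF / 2) + 1, so with RF = 3 it is 2 replicas.

```cql
-- Hypothetical keyspace with RF = 3; needs a cluster of three or more nodes.
CREATE KEYSPACE IF NOT EXISTS demo_rf3
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS demo_rf3.users (
    user_id int PRIMARY KEY,
    name    text
);

-- cqlsh command: every following statement in this session uses QUORUM.
CONSISTENCY QUORUM;

-- Succeeds once 2 of the 3 replicas acknowledge the write.
INSERT INTO demo_rf3.users (user_id, name) VALUES (1, 'Alice');

-- Reads from 2 of the 3 replicas and returns the newest version.
SELECT name FROM demo_rf3.users WHERE user_id = 1;
```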
49
What is Token Range?
A token range is a specific slice of the token ring that a node (or replica) is responsible for. Example: Let’s say the full token range is from -2^63 to +2^63 - 1. With 4 nodes and evenly spaced tokens, Cassandra might assign:
* Node A: -2^63 to -2^62 - 1
* Node B: -2^62 to -1
* Node C: 0 to 2^62 - 1
* Node D: 2^62 to 2^63 - 1
Each of these ranges is a token range.
50
Explain how data is stored using Apache Cassandra, starting from data ingestion and taking into account any replication strategy.
📥 Data Ingestion (Write Request)
A client sends a write request (e.g., inserting a new user or ride) to any node in the Cassandra cluster. This node is called the coordinator node.
🗺️ Partitioning: Where Should the Data Live?
The partition key determines which node(s) should store the data. It is hashed using a consistent hashing algorithm, and the resulting token is mapped to a specific token range, and therefore to the set of nodes responsible for storing it. Example:
* Partition key: user_id = 12345
* Hash result: Falls into Node B’s token range → Node B is the primary replica.
📦 Replication: Making Copies
Based on the Replication Factor (RF) and Replication Strategy defined in the keyspace, the data is replicated across multiple nodes. If RF = 3 and the primary node is Node B, then Cassandra might store the data on:
* Node B (primary replica)
* Node C (replica 2)
* Node A (replica 3)
🧠 MemTable + Commit Log (In-Memory + Durable Storage)
Each node that receives the data **writes it to a commit log** (durability; crash recovery) and **stores it in memory** in a structure called a MemTable (for fast access).
💾 Flushing to SSTables (Disk Storage)
Periodically (or when the MemTable is full), data is flushed to disk as SSTables (Sorted String Tables), which are immutable files optimized for reads and compaction.
🧹 Compaction: Cleaning and Merging
As new data is written (and SSTables pile up), compaction happens, i.e. SSTables are merged and deleted or overwritten values are discarded. Ultimately, this reduces read latency by minimizing how many files must be searched.
📖 Reads & Consistency
When a client wants to read data, it contacts any node (the coordinator). Based on the consistency level, the coordinator queries the required number of replicas. Cassandra reconciles the values (based on timestamps) and returns the latest one.
51
How often is data flushed from MemTables to SSTables? Are there any manual triggers for it?
MemTables are in-memory structures where writes are first stored. They are flushed to disk as SSTables automatically under several conditions, for example:
* The MemTable reaches a size threshold.
* The commit log size exceeds its limit.
* JVM heap pressure or memory limits force an early flush.
Moreover, MemTables can be flushed manually with `nodetool flush`.
52
How does Cassandra know whether the data is in a MemTable or an SSTable?
Cassandra checks both during a read operation.
* MemTable(s): Cassandra **first checks all live MemTables** for the latest data.
* SSTables: Then it **checks SSTables on disk**, possibly multiple files.
* Bloom Filters + Indexes: Used to skip unnecessary SSTables.
* Timestamps: Cassandra compares timestamps to return the latest version of the data.
53
Do I need to change my query depending on whether the data is on MemTables or SSTables?
I don’t need to write different queries for MemTable vs SSTable. From the client’s point of view, it’s totally transparent:
`SELECT * FROM users WHERE user_id = 123;`
Whether the data is in memory or on disk, Cassandra will:
* Gather it,
* Reconcile versions (if needed),
* Return the latest consistent view based on your consistency level.
54
What would be inside a partition from the *users_orders* table as defined below? For the sake of the example, assume the following partition key: `user_id = 123`
55
Explain how query-based design works in Cassandra and how it is different from schema-based design from relational databases (e.g. MySQL or PostgreSQL). ## Footnote This modeling paradigm applies to NoSQL in general.
In Cassandra, you don’t **normalize** data. Instead, you design tables **based on** your queries (**your access patterns**).
💡 Ask yourself: “What questions will my app ask most frequently?” → Design a table **for each**.
Let’s say your most common query is: *“Show me all orders for a user, with products and totals, sorted by date.”* You **denormalize** the data into a single table to make that query efficient:
```
CREATE TABLE user_orders (
    user_id UUID,
    order_date TIMESTAMP,
    order_id UUID,
    product_id UUID,
    product_name TEXT,
    quantity INT,
    total DECIMAL,
    PRIMARY KEY ((user_id), order_date, order_id, product_id)
);
```
The query to read data would be: `SELECT * FROM user_orders WHERE user_id = ?;`
Comparatively, in traditional relational databases, you would need to create four tables (users, orders, products, order_items) to ensure data normalization. Then, to read data, you would need to perform joins to combine data from different tables.
56
How can collection types emulate one-to-many relationships?
We can use the collection type `list` to include multiple values associated with the same instance. See example attached.
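A minimal sketch of the idea, using a hypothetical `users_with_emails` table in a hypothetical `demo` keyspace: one user row carries several email addresses in a `list<text>` column.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.users_with_emails (
    user_id int PRIMARY KEY,
    name    text,
    emails  list<text>        -- the "many" side of the relationship
);

INSERT INTO demo.users_with_emails (user_id, name, emails)
VALUES (1, 'Alice', ['alice@example.com']);

-- Append another address to the same user's list.
UPDATE demo.users_with_emails
SET emails = emails + ['alice@work.example.com']
WHERE user_id = 1;

SELECT name, emails FROM demo.users_with_emails WHERE user_id = 1;
```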
57
Describe the three main categories of data types available in Cassandra CQL.
Cassandra CQL data types are grouped into three main categories: built-in, collection, and user-defined. Built-in types are predefined, straightforward types like text, int, and boolean. Collection types group data in a single column (e.g., LIST, MAP, SET), while user-defined types allow users to create custom complex data types.
57
What does introducing an `IF` in `INSERT` and `UPDATE` statements make possible?
Instructs Cassandra to **look** for the data, **read** it, and only then **perform** a given operation. This is an **exception** to the standard *no-read-before-write* rule implemented in Cassandra.
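For illustration, conditional statements might look like the sketch below. The `demo` keyspace and `accounts` table are made up for the example. Each conditional statement returns an `[applied]` column that tells you whether the condition held.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.accounts (
    id    int PRIMARY KEY,
    email text
);

-- Insert only if no row with id = 123 exists yet.
INSERT INTO demo.accounts (id, email)
VALUES (123, 'alice@example.com')
IF NOT EXISTS;

-- Update only if the current value matches the expected one.
UPDATE demo.accounts
SET email = 'alice@new.example.com'
WHERE id = 123
IF email = 'alice@example.com';
```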
58
Explain the purpose of the BLOB data type in Cassandra.
The BLOB data type in Cassandra stands for Binary Large Object. It is used to store binary data like images, audio, or other multimedia objects. While Cassandra primarily handles text-based information, blobs offer the flexibility to store non-textual data, though it's recommended that their size does not exceed 1 megabyte.
59
How do collection data types in Cassandra, such as LIST and MAP, help in data modelling, and when should they be avoided?
Collection data types group and store related data within a single column, emulating one-to-many relationships without requiring joins, which Cassandra avoids. For example, a LIST can store multiple email addresses for a user. However, collections should be avoided if the data has unbounded growth potential, such as frequently logged sensor events, as this can lead to large partitions and performance issues.
60
What is a User-Defined Type (UDT) in Cassandra, and for what kind of relationships are they best suited?
A User-Defined Type (UDT) in Cassandra allows users to create their own custom data types by combining multiple existing data types into a single column. UDTs are particularly useful for emulating one-to-one relationships, such as storing an entire address (apartment number, building, street) as a single entity within a column.
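The address case could be sketched as follows. The `demo` keyspace, the `address` type and the `customers` table are assumptions invented for this example.

```cql
-- Hypothetical keyspace, type and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- A custom type bundling several fields into one value.
CREATE TYPE IF NOT EXISTS demo.address (
    apartment text,
    building  text,
    street    text
);

CREATE TABLE IF NOT EXISTS demo.customers (
    customer_id  int PRIMARY KEY,
    name         text,
    home_address frozen<address>   -- one address per customer (one-to-one)
);

INSERT INTO demo.customers (customer_id, name, home_address)
VALUES (1, 'Alice', {apartment: '4B', building: 'Maple Court', street: '12 High Street'});

-- Individual fields of the UDT can be selected with dot notation.
SELECT name, home_address.street FROM demo.customers WHERE customer_id = 1;
```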
61
What is a Cassandra Keyspace, and why is it essential to define it before creating tables?
A Cassandra Keyspace is the highest-level organisational unit for data, analogous to a database in relational systems. It is crucial to define a keyspace before creating tables because there is no default keyspace. The keyspace also dictates important cluster-level properties like the replication factor and strategy for all its contained tables.
62
Distinguish between Replication Factor and Replication Strategy in Cassandra.
Replication Factor refers to the total number of copies of data that should be stored across different nodes for fault tolerance and high availability. Replication Strategy, on the other hand, determines how and where these data replicas are going to be located across the cluster nodes, considering factors like data centres and racks.
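Both settings are declared together when a keyspace is created. In the sketch below, the keyspace name and the data centre names `dc1` and `dc2` are placeholders; real data centre names must match the ones your cluster actually reports.

```cql
-- Strategy: NetworkTopologyStrategy places replicas per data centre.
-- Factor: 3 copies in dc1 and 2 copies in dc2 (both names are placeholders).
CREATE KEYSPACE IF NOT EXISTS shop_prod
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 2
  };
```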
63
Explain the behaviour of INSERT and UPDATE operations in Cassandra by default, and how does this relate to "upsert" functionality?
By default, Cassandra does not perform a read operation before a write. This means that both INSERT and UPDATE operations behave similarly as "upserts." If you INSERT data into an existing primary key, it will update the existing record; similarly, if you UPDATE a non-existent record, it will create it.
64
What is the role of the coordinator node during a write operation in a Cassandra cluster?
During a write operation at the cluster level, the node that receives the write request becomes the coordinator. Its role is to ensure the completion of the operation by directing the write to all relevant replicas of the partition and then collecting acknowledgements from at least the minimum number of nodes specified by the consistency setting. Once successful, it returns the result to the user.
65
Why are "full table scans" generally discouraged in Cassandra, and what is the best practice for querying data for optimal performance?
Full table scans are strongly discouraged in Cassandra for production systems because they are resource-intensive and compromise performance, especially for large datasets. They involve sending the request to all nodes in the cluster. For optimal query performance, queries should always start with the partition key, followed by clustering keys in their defined order, to limit reads to specific partitions.
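The contrast might look like this in practice. The `demo` keyspace and `orders_by_customer` table are hypothetical.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.orders_by_customer (
    customer_id int,
    order_date  timestamp,
    order_id    int,
    total       decimal,
    PRIMARY KEY ((customer_id), order_date, order_id)
);

-- Preferred: partition key first, then clustering keys in their defined order.
-- Only the partition for customer 42 is read.
SELECT * FROM demo.orders_by_customer
WHERE customer_id = 42
  AND order_date >= '2024-01-01';

-- Discouraged: no partition key, so every partition on every node is scanned.
-- Cassandra only runs it when ALLOW FILTERING is added explicitly.
SELECT * FROM demo.orders_by_customer
WHERE total > 100
ALLOW FILTERING;
```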
66
Describe what a "tombstone" is in Cassandra and explain why they are necessary for delete operations in a distributed system.
A tombstone in Cassandra is a special marker written during a delete operation, indicating that specific data has been logically deleted at a certain timestamp. They are necessary in Cassandra's peer-to-peer architecture to ensure that even if a delete operation doesn't immediately propagate to all replicas, subsequent reads reconcile the data based on timestamps, with the tombstone signalling the most recent "deleted" state.
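A rough sketch of deletes that produce tombstones, and of the table option that controls how long they are kept before compaction may drop them. The `demo` keyspace and `profiles` table are made up; 864000 seconds (10 days) is the default value of `gc_grace_seconds`.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.profiles (
    user_id int PRIMARY KEY,
    name    text,
    email   text
);

INSERT INTO demo.profiles (user_id, name, email)
VALUES (1, 'Alice', 'alice@example.com');

-- Writes a tombstone for a single column value.
DELETE email FROM demo.profiles WHERE user_id = 1;

-- Writes a tombstone for the whole row (here, the whole partition).
DELETE FROM demo.profiles WHERE user_id = 1;

-- Tombstones older than gc_grace_seconds become eligible for removal
-- during compaction.
ALTER TABLE demo.profiles WITH gc_grace_seconds = 864000;
```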
67
What is Bigint in Cassandra?
A built-in data type for a 64-bit signed long integer
68
What does BLOB stand for in Cassandra?
Binary Large Object
69
What is the recommended maximum size for a BLOB in Cassandra?
1 megabyte
70
What are built-in data types in Cassandra CQL?
* Ascii * Boolean * Decimal * Double * Float * Int * Text * Bigint * Blob
71
What are clustering columns?
Components of the primary key that determine the order of data within a partition
72
What are collection data types in Cassandra?
Data types (Lists, Maps, Sets) used to group and store data together in a single column
73
What is the purpose of the commit log in Cassandra?
Records all mutations to ensure data durability before they are flushed to SSTables
74
What is compaction in Cassandra?
An optimization process that merges multiple SSTables into a single, larger SSTable
75
What does consistency refer to in the context of the CAP theorem?
Guarantee that all nodes in a distributed system have the same data at the same time
76
What is a coordinator node in a Cassandra cluster?
The node that receives a client's read or write request and coordinates the operation
77
What is a cluster in Cassandra?
A group of interconnected servers or nodes that work together to store and manage data
78
What is a dynamic table in Cassandra?
A table whose primary key is composed of both a partition key and a clustering key
79
What are GC grace seconds?
A configurable period after which tombstones are permanently removed during compaction
80
What are lightweight transactions (LWTs) in Cassandra?
Operations that enforce a read-before-write behaviour, providing stronger consistency guarantees. They are an exception to Cassandra's normal write behavior, and are built on the principle of: ✅ “Write this data only if some condition is true.” To evaluate the condition, Cassandra must read the existing data first. ## Footnote The tradeoff of using LWTs is that they are much slower than standard operations, so they need to be used sparingly.
81
What is a list in Cassandra?
A collection data type representing an ordered collection of one or more elements
82
What is a map in Cassandra?
A collection data type representing a collection of key-value pairs
83
What is a memtable in Cassandra?
An in-memory data structure where writes are initially stored before being flushed to disk
84
What is the network topology strategy in Cassandra?
The recommended replication strategy for production systems, specifying data placement per data centre
85
What is a partition key in Cassandra?
A component of the primary key that determines how data is distributed across nodes
86
What is a primary key in Cassandra?
Consists of one or more columns that uniquely identify rows in a table
87
What is the replication factor in Cassandra?
Specifies the number of copies of data (replicas) stored across different nodes
88
What is a replication strategy in Cassandra?
Determines how data replicas are located across the cluster nodes
89
What are SSTables in Cassandra?
Immutable data files on disk where data is flushed from memtables
90
What are secondary indexes in Cassandra?
Allow querying data based on non-primary key columns but can impact write performance
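A minimal sketch, assuming a hypothetical `members` table in a hypothetical `demo` keyspace:

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.members (
    member_id int PRIMARY KEY,
    name      text,
    email     text
);

-- Index a column that is not part of the primary key.
CREATE INDEX IF NOT EXISTS members_email_idx ON demo.members (email);

-- The index makes this query possible without ALLOW FILTERING.
SELECT * FROM demo.members WHERE email = 'alice@example.com';
```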
91
What is a set in Cassandra?
A collection data type representing a collection of sorted and unique elements
92
What is a snapshot in Cassandra?
A backup of a keyspace or table taken prior to certain operations
93
What is a static column in Cassandra?
A special column whose single value is shared by all rows of the same partition in a dynamic table
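As a sketch, with made-up names (`demo`, `orders_with_owner`): the `user_name` column is stored once per partition, so every order row of a user shows the same value.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.orders_with_owner (
    user_id   int,
    order_id  int,
    user_name text STATIC,   -- one value per partition (per user_id)
    total     decimal,
    PRIMARY KEY ((user_id), order_id)
);

INSERT INTO demo.orders_with_owner (user_id, order_id, user_name, total)
VALUES (1, 100, 'Alice', 25.00);

-- No user_name given here; the row still shows the partition's static value.
INSERT INTO demo.orders_with_owner (user_id, order_id, total)
VALUES (1, 101, 40.00);

SELECT user_id, order_id, user_name, total
FROM demo.orders_with_owner
WHERE user_id = 1;
```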
94
What is a static table in Cassandra?
A table where the primary key contains only the partition key (one or more columns) and no clustering columns, so the partition key alone identifies each row
95
What does Time To Live (TTL) mean in Cassandra?
An optional parameter that sets an expiration time for data
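A short sketch of setting and inspecting a TTL. The `demo` keyspace and `sessions` table are hypothetical, and the one-hour value is arbitrary.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.sessions (
    session_id int PRIMARY KEY,
    user_id    int
);

-- The inserted data expires 3600 seconds (1 hour) after the write.
INSERT INTO demo.sessions (session_id, user_id)
VALUES (1, 42)
USING TTL 3600;

-- TTL(column) returns the remaining lifetime in seconds.
SELECT user_id, TTL(user_id) FROM demo.sessions WHERE session_id = 1;
```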
96
What is a timestamp in Cassandra?
A value attached to every write operation, used for data reconciliation
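To make the reconciliation rule visible, the sketch below sets write timestamps by hand with `USING TIMESTAMP` (normally Cassandra assigns them). The `demo` keyspace, the `users_ts` table and the timestamp values are made up for the example.

```cql
-- Hypothetical keyspace and table, for illustration only.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.users_ts (
    user_id int PRIMARY KEY,
    name    text
);

-- First write carries the higher timestamp.
INSERT INTO demo.users_ts (user_id, name) VALUES (1, 'Alice') USING TIMESTAMP 2000;

-- Second write arrives later but carries an older timestamp, so it loses.
INSERT INTO demo.users_ts (user_id, name) VALUES (1, 'Bob') USING TIMESTAMP 1000;

-- WRITETIME(column) shows the timestamp of the value that won ('Alice', 2000).
SELECT name, WRITETIME(name) FROM demo.users_ts WHERE user_id = 1;
```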
97
What is a tombstone in Cassandra?
A marker written during a delete operation to indicate that a specific piece of data has been deleted
98
What does the truncate command do in Cassandra?
Removes all data from a specified table but retains the table's schema
99
What are user-defined data types (UDTs) in Cassandra?
Custom data types created by the user in CQL, allowing multiple data types to be attached to a single column
100
Explain how the no-read-before-write rule in Cassandra differs from how write operations are executed in relational databases
🔄 What Does “No Read Before Write” Mean?
In Apache Cassandra, when you run an INSERT or UPDATE, Cassandra does **not** check whether the row or column already exists.
* It simply writes the data with a timestamp, and the most recent value (highest timestamp) wins during reads or compaction.
⚖️ In Contrast: Relational Databases
In relational databases, write operations often require a read first, especially when maintaining:
* Constraints (e.g., uniqueness, foreign keys)
* Triggers
* Atomic operations
101
How does the following write operation differ between Cassandra and a relational database (e.g. PostgreSQL) in terms of execution?
```
INSERT INTO users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
```
Cassandra
* **No check** is made to see if a user with id=1 already exists.
* If it exists, **the data is overwritten** (the write with the newest timestamp wins).
PostgreSQL
* If **id=1 already exists** (primary key), the insert **fails** with a duplicate key error.
* You need an `ON CONFLICT` clause to allow upserts.
102
What is an upsert?
An upsert is a combination of “**up**date” and “in**sert**”. It means: ✅ Insert the row if it doesn’t exist, 🔄 Update the existing row if it does.
Example: Let’s say you’re saving user information:
* If the user with `ID = 42` already exists → Update their email.
* If they don’t exist yet → Insert a new user with `ID = 42`.
📌 In Cassandra, upserts are performed by **default**: for this purpose, `INSERT` and `UPDATE` behave the same way.
`INSERT INTO users (id, name) VALUES (42, 'Alice');`
`UPDATE users SET name = 'Alice' WHERE id = 42;`
🛠️ Both do the same thing and **no read is performed beforehand**.
🧪 In relational databases, you need special syntax. In PostgreSQL this is `ON CONFLICT` (MySQL uses `ON DUPLICATE KEY UPDATE` instead):
```
INSERT INTO users (id, name)
VALUES (42, 'Alice')
ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;
```
This tells the DB:
* If `id=42` exists → update it.
* Else → insert it.
103
How do you prevent accidental overwrites of important data in Cassandra given the *no-read-before-write rule*?
✅ Option 1: Use Lightweight Transactions (LWT)
```
INSERT INTO users (id, email) VALUES (123, 'alice@example.com') IF NOT EXISTS;
```
✅ Option 2: Handle It in the Client/Application Layer
You can implement application-level checks to avoid overwriting. First, do a read: `SELECT * FROM users WHERE id = 123;` If no row is returned, proceed with the insert: `INSERT INTO users (id, email) VALUES (123, 'alice@example.com');` Note that this check-then-insert is not atomic, so two clients can still race each other.
✅ Option 3: Use UUIDs to Avoid Collisions
Sometimes, the best strategy is to generate unique keys (e.g., UUID, TIMEUUID) so overwrites can’t happen by design.