Apache Cassandra Basics Flashcards

Question

What is a Dynamic Table in Cassandra?

Answer 1

A Cassandra table whose primary key is composed of both a partition key and one or more clustering keys, allowing partitions to grow dynamically with multiple distinct entries. ## Footnote Dynamic tables offer flexibility in data modeling.

Answer 2

The ability of a system to continue operating without interruption despite the failure of one or more of its components. ## Footnote Fault tolerance is crucial for high availability.

Answer 3

A peer-to-peer communication protocol used by Cassandra nodes to exchange details and information, keeping all nodes updated about the status and membership of the cluster. ## Footnote Gossip protocols help maintain cluster health.

Answer 4

False. ## Footnote Cassandra does not support joins by design, focusing on denormalized data models.

Answer 5

The highest-level organizational unit for data, similar to a database in traditional relational databases, containing one or more tables and defining options like replication strategy. ## Footnote Keyspaces are essential for data organization.

Answer 6

An in-memory data structure within a Cassandra node that buffers write operations before they are flushed to disk as immutable SSTables. ## Footnote Memtables improve write performance.

Answer 7

A standalone unit in Cassandra's distributed system architecture, also known as a single Cassandra instance, which operates independently and communicates with other nodes. ## Footnote Nodes are the building blocks of a Cassandra cluster.

Answer 8

A group of rows that share the same partition key, and are stored together on disk (or in memory) on the same node. It's the fundamental unit of data storage in Cassandra. Data is grouped into partitions based on the partition key and distributed across nodes. ## Footnote Partitions enhance data distribution and retrieval.

Answer 9

A consistent hashing algorithm used by Cassandra to distribute data evenly across nodes in a cluster, preventing hotspots. ## Footnote The partitioner is crucial for load balancing.

Answer 10

The mandatory component of a primary key that determines how data is distributed across nodes in a cluster by hashing its value. ## Footnote The partition key is essential for data distribution.

Answer 11

The overall architecture where each node in the cluster has equal status and communicates directly with other nodes without relying on a central coordinator. ## Footnote Peer-to-peer architectures improve resilience.

Answer 12

Consists of one or more columns that uniquely identify rows in a table. It includes a partition key and, optionally, clustering columns, and it optimizes read performance. ## Footnote Primary keys are vital for data integrity.

Answer 13

The process of creating and maintaining copies of data on multiple nodes to ensure data availability, reduce data loss, improve fault tolerance, and provide read scalability. ## Footnote Replication is critical for data durability.

Answer 14

A setting that specifies the number of nodes that will hold replicas for each partition of data in Cassandra. Example: If RF = 3 → There are 3 replicas (nodes) for each piece of data. ## Footnote The replication factor influences data availability.

Answer 15

The ability of a system, like Cassandra, to handle increased data and traffic by adding more nodes to the cluster, often achieving linear performance growth. ## Footnote Scalability is essential for handling growing workloads.

Answer 16

Immutable data files in Cassandra that store data persistently on disk, created by flushing Memtables or streaming from other nodes. ## Footnote SSTables are key for data persistence.

Answer 17

A Cassandra table where the primary key contains only the partition key (single or multiple columns) and no clustering key. Partitions in static tables have only one entry. ## Footnote Static tables are simpler in structure.

Answer 18

A collection of related data, organized into rows and columns, defined by a schema that includes a primary key. ## Footnote Tables are fundamental to data organization.

Answer 19

Applications whose data arrives as a time-ordered stream of appended records, often from sensors or logs. Cassandra suits them well because of its fast write throughput and append-oriented storage. ## Footnote Time series applications benefit from Cassandra's architecture.

Answer 20

It is a logical structure to distribute data across the cluster, enabling efficient data location and management. It’s like a circular ring where each point represents a range of data values (tokens). Each node is responsible for a segment of this ring, i.e. a **range of tokens**. The token ring goes from the minimum token value to the maximum token value. ## Footnote Token ring helps in data partitioning.

Answer 21

Sequences of database operations (such as reading and writing data) that are treated as a single, indivisible unit. Cassandra has limited support for transactions. ## Footnote Transactions in Cassandra are not as robust as in traditional RDBMS.

Answer 22

A feature in Cassandra that allows developers to control the level of consistency for read and write operations, balancing consistency with availability based on application requirements. ## Footnote Tunable consistency provides flexibility for developers.

Answer 23

It defines how many replicas must respond for a read or write to be considered successful. Common levels are listed below.

* **ONE**: Only 1 replica must respond
* **QUORUM**: A majority of replicas must respond
* **ALL**: All replicas must respond
* **LOCAL_QUORUM**: A majority of replicas in the local data center must respond
* **ANY** (write only): Write to at least one node, even if a hint is stored temporarily

Answer 24

🛠️ **WRITE** Let’s say the **Replication Factor is set to 3** and we are writing to Cassandra with **Consistency Level = QUORUM**. The write is considered successful only if 2 out of 3 replicas confirm it.

📖 **READ** Same setup. Cassandra reads from 2 out of the 3 replicas and returns the most recent version (using timestamps to reconcile differences if needed).
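The arithmetic behind the QUORUM example can be sketched in a few lines of Python. This is only an illustration of the majority rule, not Cassandra code; the function names `quorum` and `overlaps` are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when any read set must share at least one replica with any write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                    # 2 replicas must answer at QUORUM when RF = 3
print(overlaps(q, q, rf))   # True: QUORUM reads see QUORUM writes
print(overlaps(1, 1, rf))   # False: ONE + ONE may miss the latest write
```

The overlap check is why reading and writing at QUORUM is a common pairing: at least one replica contacted by the read has acknowledged the latest successful write.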

Answer 25

A token range is a specific slice of the token ring that a node (or replica) is responsible for. Example: Let’s say the full token range is from -2^63 to +2^63 - 1. With 4 nodes and evenly spaced tokens, Cassandra might assign:

* Node A: -2^63 to -2^62 - 1
* Node B: -2^62 to -1
* Node C: 0 to 2^62 - 1
* Node D: 2^62 to 2^63 - 1

Each of these ranges is a token range. The actual boundaries depend on how tokens are assigned.
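Looking up which node owns a token can be sketched as a binary search over the sorted node tokens. The sketch below assumes four hypothetical nodes with evenly spaced tokens, each owning the range that ends at its own token; the names `RING` and `owner` are made up for illustration.

```python
import bisect

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

# (last token of the range, owning node); evenly spaced tokens are an assumption.
RING = [
    (-2**62 - 1, "Node A"),
    (-1, "Node B"),
    (2**62 - 1, "Node C"),
    (MAX_TOKEN, "Node D"),
]
TOKENS = [token for token, _ in RING]


def owner(token: int) -> str:
    """Return the node whose range contains the token."""
    if not MIN_TOKEN <= token <= MAX_TOKEN:
        raise ValueError("token outside the ring")
    # First node token that is >= the requested token; wrap around if none is.
    index = bisect.bisect_left(TOKENS, token)
    return RING[index % len(RING)][1]


print(owner(MIN_TOKEN))  # Node A
print(owner(0))          # Node C
print(owner(2**62))      # Node D
```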

Answer 26

📥 Data Ingestion (Write Request)

A client sends a write request (e.g., inserting a new user or ride) to any node in the Cassandra cluster. This node is called the coordinator node.

🗺️ Partitioning: Where Should the Data Live?

The partition key determines which node(s) should store this data. It is hashed with a consistent hashing algorithm, and the resulting token is mapped to a specific token range, and therefore to the set of nodes responsible for storing it. Example:

* Partition key: user_id = 12345
* Hash result: Falls into Node B’s token range → Node B is the primary replica.

📦 Replication: Making Copies

Based on the Replication Factor (RF) and Replication Strategy of the keyspace, the data is replicated across multiple nodes. If RF = 3 and the primary node is Node B, then Cassandra might store the data on:

* Node B (primary replica)
* Node C (replica 2)
* Node A (replica 3)

🧠 MemTable + Commit Log (In-Memory + Durable Storage)

Each node that receives the data **writes it to a commit log** (durability; crash recovery) and **stores it in memory** in a structure called a MemTable (for fast access).

💾 Flushing to SSTables (Disk Storage)

Periodically (or when the MemTable is full), data is flushed to disk as SSTables (Sorted String Tables), which are immutable files optimized for reads and compaction.

🧹 Compaction: Cleaning and Merging

As new data is written (and SSTables pile up), compaction happens, i.e., SSTables are merged and deleted or overwritten values are discarded. Ultimately, this reduces read latency by minimizing how many files must be searched.

📖 Reads & Consistency

When a client wants to read data, it contacts any node (coordinator). Based on the consistency level, the coordinator queries the required number of replicas. Cassandra reconciles the values (based on timestamps) and returns the latest one.

Answer 27

MemTables are in-memory structures where writes are first stored. They are flushed to disk as SSTables automatically under several conditions, for example:

* The MemTable reaches a size threshold.
* The commit log size exceeds its limit.
* JVM heap pressure or memory limits force an early flush.

MemTables can also be flushed manually with `nodetool flush`.
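The flush behaviour can be pictured with a toy model of one node's write path. This is a simplified sketch, not Cassandra's implementation: the class `ToyNode` and its `memtable_limit` threshold (counted in keys rather than bytes) are assumptions made for illustration.

```python
class ToyNode:
    """Toy model of one node's write path; not Cassandra's real implementation."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical threshold, counted in keys
        self.commit_log = []   # append-only log used for crash recovery
        self.memtable = {}     # key -> (timestamp, value), held in memory
        self.sstables = []     # flushed, immutable, key-sorted snapshots

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # durability first
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()  # automatic flush once the threshold is reached

    def flush(self):
        """Write the memtable out as an immutable SSTable and start a fresh one."""
        if not self.memtable:
            return
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed mutations no longer need replaying


node = ToyNode(memtable_limit=2)
node.write("user:1", "alice", timestamp=1)
node.write("user:2", "bob", timestamp=2)     # threshold reached, automatic flush
node.write("user:1", "alicia", timestamp=3)  # lands in a new memtable
node.flush()                                 # manual flush, like `nodetool flush`
print(len(node.sstables))  # 2
```

Each flush produces a new immutable snapshot rather than editing an old one, which is why the same key can appear in several SSTables until compaction merges them.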

How does Cassandra know whether the data is in a MemTable or an SSTable?

Cassandra checks both during a read operation.

* MemTable(s): Cassandra **first checks all live MemTables** for the latest data.
* SSTables: Then it **checks SSTables on disk**, possibly multiple files.
* Bloom Filters + Indexes: Used to skip unnecessary SSTables.
* Timestamps: Cassandra compares timestamps to return the latest version of the data.
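The read steps above can be sketched as a merge over every place a key might live, keeping the cell with the highest timestamp. The sketch assumes plain dictionaries stand in for the MemTable and SSTables; `read_latest` is a hypothetical helper, and the Bloom filter step is only mentioned in a comment.

```python
def read_latest(key, memtable, sstables):
    """Return the newest value for key across the memtable and all SSTables."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        # A real node would consult a Bloom filter here to skip files cheaply.
        if key in sstable:
            candidates.append(sstable[key])
    if not candidates:
        return None
    _, value = max(candidates, key=lambda cell: cell[0])  # highest timestamp wins
    return value


memtable = {"user:123": (30, "new@example.com")}
sstables = [
    {"user:123": (10, "old@example.com")},
    {"user:456": (20, "bob@example.com")},
]
print(read_latest("user:123", memtable, sstables))  # new@example.com
print(read_latest("user:456", memtable, sstables))  # bob@example.com
print(read_latest("user:999", memtable, sstables))  # None
```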

Do I need to change my query depending on whether the data is on MemTables or SSTables?

I don’t need to write different queries for MemTable vs SSTable. From the client’s point of view, it’s totally transparent: `SELECT * FROM users WHERE user_id = 123;`

Whether the data is in memory or on disk, Cassandra will:

* Gather it,
* Reconcile versions (if needed),
* Return the latest consistent view based on your consistency level.

What would be inside a partition from the *users_orders* table as defined below? For the sake of the example, assume the following partition key: `user_id = 123`

Explain how query-based design works in Cassandra and how it is different from schema-based design from relational databases (e.g. MySQL or PostgreSQL). ## Footnote This modeling paradigm applies to NoSQL in general.

In Cassandra, you don’t **normalize** data. Instead, you design tables **based on** your queries (**your access patterns**).

💡 Ask yourself: “What questions will my app ask most frequently?” → Design a table **for each**.

Let’s say your most common query is: *“Show me all orders for a user, with products and totals, sorted by date.”* You **denormalize** the data into a single table to make that query efficient:

```
CREATE TABLE user_orders (
    user_id UUID,
    order_date TIMESTAMP,
    order_id UUID,
    product_id UUID,
    product_name TEXT,
    quantity INT,
    total DECIMAL,
    PRIMARY KEY ((user_id), order_date, order_id, product_id)
);
```

The query to read data would be: `SELECT * FROM user_orders WHERE user_id = ?;`

By comparison, in a traditional relational database you would create four tables (users, orders, products, order_items) to keep the data normalized. Then, to read data, you would need joins to combine data from the different tables.

How can collection types emulate one-to-many relationships?

We can use the collection type `list` to include multiple values associated with the same instance. See example attached.

Describe the three main categories of data types available in Cassandra CQL.

Cassandra CQL data types are grouped into three main categories: built-in, collection, and user-defined. Built-in types are predefined, straightforward types like text, int, and boolean. Collection types group data in a single column (e.g., LIST, MAP, SET), while user-defined types allow users to create custom complex data types.

What does introducing an `IF` in `INSERT` and `UPDATE` statements make possible?

Instructs Cassandra to **look** for the data, **read** it, and only then **perform** a given operation. This is an **exception** to the standard *no-read-before-write* rule implemented in Cassandra.

Explain the purpose of the BLOB data type in Cassandra.

The BLOB data type in Cassandra stands for Binary Large Object. It is used to store binary data like images, audio, or other multimedia objects. While Cassandra primarily handles text-based information, blobs offer the flexibility to store non-textual data, though it's recommended that their size does not exceed 1 megabyte.

How do collection data types in Cassandra, such as LIST and MAP, help in data modelling, and when should they be avoided?

Collection data types group and store related data within a single column, emulating one-to-many relationships without requiring joins, which Cassandra avoids. For example, a LIST can store multiple email addresses for a user. However, collections should be avoided if the data has unbounded growth potential, such as frequently logged sensor events, as this can lead to large partitions and performance issues.

What is a User-Defined Type (UDT) in Cassandra, and for what kind of relationships are they best suited?

A User-Defined Type (UDT) in Cassandra allows users to create their own custom data types by combining multiple existing data types into a single column. UDTs are particularly useful for emulating one-to-one relationships, such as storing an entire address (apartment number, building, street) as a single entity within a column.

What is a Cassandra Keyspace, and why is it essential to define it before creating tables?

A Cassandra Keyspace is the highest-level organisational unit for data, analogous to a database in relational systems. It is crucial to define a keyspace before creating tables because there is no default keyspace. The keyspace also defines the replication settings (replication factor and strategy) that apply to all the tables it contains.

Distinguish between Replication Factor and Replication Strategy in Cassandra.

Replication Factor refers to the total number of copies of data that should be stored across different nodes for fault tolerance and high availability. Replication Strategy, on the other hand, determines how and where these data replicas are going to be located across the cluster nodes, considering factors like data centres and racks.
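The split between "how many copies" and "where they go" can be sketched with the simplest placement rule, used by SimpleStrategy: start at the node that owns the token and walk clockwise around the ring. The ring contents and the helper `replicas_for` are hypothetical; NetworkTopologyStrategy additionally takes data centres and racks into account, which this sketch ignores.

```python
def replicas_for(primary_index, ring, replication_factor):
    """Pick the primary node plus the next nodes clockwise, up to RF distinct nodes."""
    count = min(replication_factor, len(ring))  # cannot have more replicas than nodes
    return [ring[(primary_index + step) % len(ring)] for step in range(count)]


ring = ["Node A", "Node B", "Node C", "Node D"]  # nodes in token order (assumed)

print(replicas_for(1, ring, 3))  # ['Node B', 'Node C', 'Node D']
print(replicas_for(3, ring, 3))  # ['Node D', 'Node A', 'Node B'] (wraps around)
```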

Explain the behaviour of INSERT and UPDATE operations in Cassandra by default, and how does this relate to "upsert" functionality?

By default, Cassandra does not perform a read operation before a write. This means that both INSERT and UPDATE operations behave as "upserts." If you INSERT data for an existing primary key, it updates the existing record; similarly, if you UPDATE a non-existent record, it creates it.

What is the role of the coordinator node during a write operation in a Cassandra cluster?

During a write operation at the cluster level, the node that receives the write request becomes the coordinator. Its role is to ensure the completion of the operation by directing the write to all relevant replicas of the partition and then collecting acknowledgements from at least the minimum number of nodes specified by the consistency setting. Once successful, it returns the result to the user.

Why are "full table scans" generally discouraged in Cassandra, and what is the best practice for querying data for optimal performance?

Full table scans are strongly discouraged in Cassandra for production systems because they are resource-intensive and compromise performance, especially for large datasets. They involve sending the request to all nodes in the cluster. For optimal query performance, queries should always start with the partition key, followed by clustering keys in their defined order, to limit reads to specific partitions.

Describe what a "tombstone" is in Cassandra and explain why they are necessary for delete operations in a distributed system.

A tombstone in Cassandra is a special marker written during a delete operation, indicating that specific data has been logically deleted at a certain timestamp. They are necessary in Cassandra's peer-to-peer architecture to ensure that even if a delete operation doesn't immediately propagate to all replicas, subsequent reads reconcile the data based on timestamps, with the tombstone signalling the most recent "deleted" state.
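Why a marker is needed, rather than simply erasing the data, can be seen in a small reconciliation sketch. It assumes each replica answers with a `(timestamp, value)` pair and uses a sentinel object as the tombstone; `TOMBSTONE` and `reconcile` are hypothetical names.

```python
TOMBSTONE = object()  # sentinel standing in for a delete marker


def reconcile(replica_responses):
    """Return the newest value among replica answers, or None if it was deleted."""
    _, value = max(replica_responses, key=lambda response: response[0])
    return None if value is TOMBSTONE else value


# Replica A missed the delete issued at timestamp 20; B and C recorded it.
with_tombstones = [(10, "alice@example.com"), (20, TOMBSTONE), (20, TOMBSTONE)]
print(reconcile(with_tombstones))  # None: the delete is newer than the stale value

# If B and C had simply erased the row, only A's stale copy would answer.
without_tombstones = [(10, "alice@example.com")]
print(reconcile(without_tombstones))  # alice@example.com: deleted data reappears
```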

What is Bigint in Cassandra?

A built-in data type for a 64-bit signed long integer

What does BLOB stand for in Cassandra?

Binary Large Object

What is the recommended maximum size for a BLOB in Cassandra?

1 megabyte

What are built-in data types in Cassandra CQL?

* Ascii
* Boolean
* Decimal
* Double
* Float
* Int
* Text
* Bigint
* Blob

What are clustering columns?

Components of the primary key that determine the order of data within a partition

What are collection data types in Cassandra?

Data types (Lists, Maps, Sets) used to group and store data together in a single column

What is the purpose of the commit log in Cassandra?

Records all mutations to ensure data durability before they are flushed to SSTables

What is compaction in Cassandra?

An optimization process that merges multiple SSTables into a single, larger SSTable
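A compaction pass can be sketched as a merge that keeps only the newest cell per key and drops deleted ones. This is a simplification: dictionaries stand in for SSTables, `DELETED` stands in for a tombstone, and the sketch purges tombstones immediately, whereas a real node waits until they are older than the GC grace period.

```python
DELETED = None  # stand-in for a tombstone in this sketch


def compact(sstables):
    """Merge SSTables into one, keeping the newest cell per key."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Keep keys in sorted order and drop the ones whose newest cell is a delete.
    return {key: cell for key, cell in sorted(merged.items()) if cell[1] is not DELETED}


older = {"a": (1, "x"), "b": (1, "y")}
newer = {"a": (2, "z"), "b": (3, DELETED)}
print(compact([older, newer]))  # {'a': (2, 'z')}
```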

What does consistency refer to in the context of the CAP theorem?

Guarantee that all nodes in a distributed system have the same data at the same time

What is a coordinator node in a Cassandra cluster?

The node that receives a client's read or write request and coordinates the operation

What is a cluster in Cassandra?

A group of interconnected servers or nodes that work together to store and manage data

What is a dynamic table in Cassandra?

A table whose primary key is composed of both a partition key and a clustering key

What are GC grace seconds?

A configurable period after which tombstones are permanently removed during compaction

What are lightweight transactions (LWTs) in Cassandra?

Operations that enforce a read-before-write behaviour, providing stronger consistency guarantees. They are an exception to Cassandra's normal write behavior, and are built on the principle of: ✅ “Write this data only if some condition is true.” To evaluate the condition, Cassandra must read the existing data first. ## Footnote The tradeoff of using LWTs is that they are much slower than standard operations, so they need to be used sparingly.

What is a list in Cassandra?

A collection data type representing an ordered collection of one or more elements

What is a map in Cassandra?

A collection data type representing a collection of key-value pairs

What is a memtable in Cassandra?

An in-memory data structure where writes are initially stored before being flushed to disk

What is the network topology strategy in Cassandra?

The recommended replication strategy for production systems, specifying data placement per data centre

What is a partition key in Cassandra?

A component of the primary key that determines how data is distributed across nodes

What is a primary key in Cassandra?

Consists of one or more columns that uniquely identify rows in a table

What is the replication factor in Cassandra?

Specifies the number of copies of data (replicas) stored across different nodes

What is a replication strategy in Cassandra?

Determines how data replicas are located across the cluster nodes

What are SSTables in Cassandra?

Immutable data files on disk where data is flushed from memtables

What are secondary indexes in Cassandra?

Allow querying data based on non-primary key columns but can impact write performance

What is a set in Cassandra?

A collection data type representing a collection of sorted and unique elements

What is a snapshot in Cassandra?

A backup of a keyspace or table taken prior to certain operations

What is a static column in Cassandra?

A special column in a dynamic table whose single value is shared by all rows of the same partition

What is a static table in Cassandra?

A table whose primary key contains only the partition key (one or more columns) and no clustering columns, so each partition holds a single row

What does Time To Live (TTL) mean in Cassandra?

An optional parameter that sets an expiration time for data

What is a timestamp in Cassandra?

A value attached to every write operation, used for data reconciliation

What is a tombstone in Cassandra?

A marker written during a delete operation to indicate that a specific piece of data has been deleted

What does the truncate command do in Cassandra?

Removes all data from a specified table but retains the table's schema

What are user-defined data types (UDTs) in Cassandra?

Custom data types created by the user in CQL, allowing multiple data types to be attached to a single column

Explain how the no-read-before-write rule in Cassandra differs from how write operations are executed in relational databases

🔄 What Does “No Read Before Write” Mean?

In Apache Cassandra, when you run an INSERT or UPDATE, Cassandra does **not** check if the row or column already exists. It simply writes the data with a timestamp, and the most recent value (highest timestamp) wins during reads or compaction.

⚖️ In Contrast: Relational Databases

In relational databases, write operations often require a read first, especially when maintaining:

* Constraints (e.g., uniqueness, foreign keys)
* Triggers
* Atomic operations

How does the following write operation differ between Cassandra and a relational database (e.g. MySQL) in terms of execution?

```
INSERT INTO users (id, name, email) VALUES (1, 'Alice', 'alice@example.com');
```

Cassandra

* **No check** is made to see if a user with id=1 already exists.
* If it exists, **data is overwritten** (the write with the highest timestamp wins).

Relational database (e.g., MySQL or PostgreSQL)

* If **id=1 already exists** (primary key), the insert **fails** with a duplicate key error.
* To allow upserts, you need explicit syntax such as `ON CONFLICT` in PostgreSQL or `ON DUPLICATE KEY UPDATE` in MySQL.

What is an upsert?

An upsert is a combination of “**up**date” and “in**sert**”. It means:

* ✅ Insert the row if it doesn’t exist,
* 🔄 Update the existing row if it does.

Example: Let’s say you’re saving user information:

* If the user with `ID = 42` already exists → Update their email.
* If they don’t exist yet → Insert a new user with `ID = 42`.

📌 In Cassandra, upserts are performed by **default**: for practical purposes, `INSERT` and `UPDATE` behave the same way.

* `INSERT INTO users (id, name) VALUES (42, 'Alice');`
* `UPDATE users SET name = 'Alice' WHERE id = 42;`

🛠️ Both do the same thing and **no read is performed beforehand**.

🧪 In relational databases (like PostgreSQL or MySQL), you need special syntax. The PostgreSQL form is shown here; MySQL uses `ON DUPLICATE KEY UPDATE` instead:

```
INSERT INTO users (id, name)
VALUES (42, 'Alice')
ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;
```

This tells the DB:

* If `id=42` exists → update it.
* Else → insert it.
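The upsert behaviour can be mimicked with a plain dictionary in Python. This is only an analogy for the semantics described above, not how Cassandra stores data; `table`, `write`, `insert`, and `update` are hypothetical names.

```python
table = {}  # primary key -> dict of column values


def write(primary_key, **columns):
    """Create the row if it is missing, then overwrite the given columns."""
    table.setdefault(primary_key, {}).update(columns)


# In this model INSERT and UPDATE are the same operation, with no existence check.
insert = write
update = write

insert(42, name="Alice")
update(42, email="alice@example.com")  # row exists: columns are merged in
update(7, name="Bob")                  # row does not exist: it is created
print(table)
```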

How do you prevent accidental overwrites of important data in Cassandra given the *no-read-before-write rule*?

✅ Option 1: Use Lightweight Transactions (LWT)

```
INSERT INTO users (id, email)
VALUES (123, 'alice@example.com')
IF NOT EXISTS;
```

✅ Option 2: Handle It in the Client/Application Layer

You can implement application-level checks to avoid overwriting. First, do a read: `SELECT * FROM users WHERE id = 123;` If no row is returned, proceed with the insert: `INSERT INTO users (id, email) VALUES (123, 'alice@example.com');` Note that this check is not atomic: another client can write the same key between the read and the insert.

✅ Option 3: Use UUIDs to Avoid Collisions

Sometimes, the best strategy is to generate unique keys (e.g., UUID, TIMEUUID) so overwrites can’t happen by design.
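The difference between a plain write and a conditional `IF NOT EXISTS` write can be sketched with a single-process analogy. In this sketch a lock makes the check and the write one atomic step; Cassandra reaches agreement across replicas with a Paxos-based protocol instead, so the class `ToyTable` and its methods are purely illustrative.

```python
import threading


class ToyTable:
    """Single-process stand-in for a table; not how Cassandra implements LWTs."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert(self, key, row):
        """Plain write: overwrites silently, no read beforehand."""
        self._rows[key] = row

    def insert_if_not_exists(self, key, row):
        """Conditional write: returns (applied, existing_row)."""
        with self._lock:  # check and write happen as one atomic step
            if key in self._rows:
                return False, self._rows[key]
            self._rows[key] = row
            return True, None

    def get(self, key):
        return self._rows.get(key)


users = ToyTable()
print(users.insert_if_not_exists(123, {"email": "alice@example.com"}))
# (True, None)
print(users.insert_if_not_exists(123, {"email": "mallory@example.com"}))
# (False, {'email': 'alice@example.com'})
```

The second call is rejected and reports the existing row, which is the kind of feedback a plain write never gives.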



(104 cards)