Wide Column Stores (Cassandra, HBase) Flashcards
Handle high-throughput, write-optimized distributed workloads. Topics: column-family structure (rows, partitions, clustering columns); Cassandra data model and CQL (Cassandra Query Language); partitioning and data distribution (consistent hashing); tunable consistency in Cassandra (QUORUM, ONE, ALL); replication and replication factor; write path and read path in Cassandra; HBase architecture (RegionServers, HMaster); use cases: time-series data, analytics platforms. (40 cards)
How does Cassandra achieve high availability?
Cassandra uses a peer-to-peer architecture with no single point of failure, replicates data across nodes, and offers tunable consistency levels, so it stays available even during node failures.
Explain the data model of Cassandra.
Cassandra’s data model is a wide-column structure: tables hold rows, rows are grouped into partitions by partition key, and rows within a partition are ordered by clustering columns.
What workloads is Cassandra optimized for?
Cassandra is optimized for high-throughput, write-heavy, distributed workloads with linear scalability and high availability.
What is the column-family structure in Cassandra?
A column family is the older name for a table: rows are grouped into partitions by partition key and, within each partition, sorted by clustering columns.
What is a partition key in Cassandra?
The partition key is hashed to a token, and consistent hashing maps that token to the nodes that store the row; all rows sharing a partition key form one partition.
What are clustering columns in Cassandra?
Clustering columns define how data is sorted within a partition and support range queries on related rows.
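To illustrate the card above, here is a minimal sketch of one partition kept sorted by a clustering value, which is what makes range queries cheap. The `Partition` class and the sample sensor rows are hypothetical and only model the idea; this is not how Cassandra stores data internally.

```python
import bisect


class Partition:
    """Toy partition: rows stay sorted by their clustering key."""

    def __init__(self):
        self._keys = []   # sorted clustering values
        self._rows = {}   # clustering value -> row payload

    def insert(self, clustering_key, row):
        # Keep the key list sorted; overwrite the row if the key already exists.
        if clustering_key not in self._rows:
            bisect.insort(self._keys, clustering_key)
        self._rows[clustering_key] = row

    def range_query(self, start, end):
        """Return rows with start <= clustering key < end, in sorted order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [self._rows[k] for k in self._keys[lo:hi]]


partition = Partition()
# Rows arrive out of order but are stored in clustering order.
for ts, temp in [(30, 21.5), (10, 20.1), (20, 20.9), (40, 22.0)]:
    partition.insert(ts, {"ts": ts, "temp": temp})

print(partition.range_query(10, 30))
```

Because the keys are already sorted, a range query is two binary searches plus a contiguous slice rather than a scan of every row.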
What is CQL (Cassandra Query Language)?
CQL is a SQL-like language used to interact with Cassandra, including creating tables, inserting data, and querying.
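As a rough illustration, the three kinds of statement named above might look like the following. The table `sensor_readings` and its columns are hypothetical, and the statements are held as plain Python strings only; running them would need a live cluster and a driver, which this sketch does not assume.

```python
# Hypothetical schema: sensor_id is the partition key, reading_time the clustering column.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id    text,
    reading_time timestamp,
    temperature  double,
    PRIMARY KEY ((sensor_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
"""

INSERT_ROW = """
INSERT INTO sensor_readings (sensor_id, reading_time, temperature)
VALUES ('sensor-1', '2024-01-01 00:00:00+0000', 21.5);
"""

# The query names the partition key and ranges over the clustering column.
SELECT_RANGE = """
SELECT reading_time, temperature
FROM sensor_readings
WHERE sensor_id = 'sensor-1'
  AND reading_time >= '2024-01-01 00:00:00+0000'
  AND reading_time <  '2024-01-02 00:00:00+0000';
"""

for statement in (CREATE_TABLE, INSERT_ROW, SELECT_RANGE):
    print(statement.strip())
```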
What is consistent hashing in Cassandra?
Consistent hashing is used to distribute data across nodes evenly by assigning token ranges to nodes.
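A minimal sketch of a token ring follows. The `TokenRing` class, the node names, and the use of MD5 are assumptions for illustration only (Cassandra's default partitioner is Murmur3); the point is that each node owns the range ending at its token, and a key belongs to the first token at or after its own hash.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    # MD5 is only a stable stand-in hash for this sketch.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class TokenRing:
    def __init__(self, nodes, vnodes=8):
        # Each node gets several tokens (virtual nodes) for a smoother spread.
        self._ring = sorted(
            (token_for(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        # First token >= the key's token; wrap to the start past the last token.
        idx = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._ring[idx % len(self._ring)][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(200)]
before = {key: ring.owner(key) for key in keys}

# Adding a node moves keys only onto the new node; the rest stay put.
bigger = TokenRing(["node-a", "node-b", "node-c", "node-d"])
moved = [key for key in keys if bigger.owner(key) != before[key]]
print(len(moved), "of", len(keys), "keys moved")
```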
What is tunable consistency in Cassandra?
Cassandra lets clients choose a consistency level such as ONE, QUORUM, or ALL for each read and write, trading latency and availability against consistency.
What is QUORUM in Cassandra consistency?
QUORUM requires a majority of replicas, floor(RF / 2) + 1 (e.g., 2 of 3), to respond, offering a balance between consistency and availability.
What does consistency level ONE mean in Cassandra?
ONE means only one replica must respond, favoring availability and speed over consistency.
What does consistency level ALL mean in Cassandra?
ALL requires every replica to acknowledge the read/write, giving the strongest consistency, but the operation fails if any single replica is unavailable.
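The arithmetic behind the three levels above can be sketched as follows. The helper names are hypothetical; the overlap rule (read replicas + write replicas > replication factor) is the usual condition for a read to reach at least one replica that saw the latest write.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Replicas that must respond for the given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {level}")


def can_serve(level: str, replication_factor: int, replicas_up: int) -> bool:
    """True if enough replicas are alive to satisfy the level."""
    return replicas_up >= required_acks(level, replication_factor)


def reads_see_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True if every read set must overlap every write set."""
    return required_acks(read_level, rf) + required_acks(write_level, rf) > rf


rf = 3
for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, rf), can_serve(level, rf, replicas_up=2))
```

With a replication factor of 3 and one replica down, ONE and QUORUM still succeed while ALL does not, which is the availability cost that the ALL card describes.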
What is replication in Cassandra?
Replication refers to storing multiple copies of data across different nodes to ensure redundancy and fault tolerance.
What is replication factor in Cassandra?
The replication factor, set per keyspace, determines how many copies of each row are stored; e.g., a factor of 3 means 3 copies on 3 different nodes.
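One simple way to picture replica placement is to find the node that owns the key's token and then take the next nodes clockwise around the ring until the replication factor is met. The sketch below assumes that simple scheme, one token per node, and an MD5 stand-in hash; the function and node names are hypothetical, and rack- or datacenter-aware strategies are not modeled.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    # Stable stand-in hash for this sketch, not Cassandra's partitioner.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def replicas_for(partition_key: str, nodes, replication_factor: int):
    """Return the nodes holding copies of the key, primary first."""
    ring = sorted((token_for(node), node) for node in nodes)
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(partition_key)) % len(ring)
    # There can never be more copies than there are nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


NODES = ["node-a", "node-b", "node-c", "node-d"]
replicas = replicas_for("user:42", NODES, replication_factor=3)
print(replicas)
```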
What is the write path in Cassandra?
A write is appended to the commit log for durability and applied to an in-memory memtable; when the memtable fills, it is flushed to an immutable SSTable on disk.
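The write path on a single node can be sketched as below. `ToyNode` and its `memtable_limit` parameter are hypothetical, and everything lives in memory; a real node writes the commit log and SSTables to disk and flushes based on size rather than row count.

```python
class ToyNode:
    """Toy model of one node's write path."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []    # append-only record used for crash recovery
        self.memtable = {}      # in-memory view of recent writes
        self.sstables = []      # immutable, sorted snapshots (oldest first)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Record the write for durability, 2. apply it in memory.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. Persist the memtable as a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed writes no longer need replaying after a crash.
        self.commit_log.clear()


node = ToyNode(memtable_limit=2)
node.write("b", 2)
node.write("a", 1)   # reaches the limit and triggers a flush
print(node.sstables)
```

Writes never update an SSTable in place, which is why this path is fast: it is an append plus an in-memory update.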
What is the read path in Cassandra?
A read checks the memtable, uses Bloom filters to skip SSTables that cannot contain the partition, merges results from the remaining SSTables, and may contact multiple replicas depending on the consistency level.
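A single-node version of that read path might look like the sketch below. The `BloomFilter` and `SSTable` classes are hypothetical simplifications: each stored cell is a `(timestamp, value)` pair, and the newest timestamp wins when the same key appears in several places. Replica coordination is left out.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    def __init__(self, cells):
        self.cells = dict(cells)          # key -> (timestamp, value)
        self.bloom = BloomFilter()
        for key in self.cells:
            self.bloom.add(key)


def read(key, memtable, sstables):
    """Merge the memtable and candidate SSTables; newest timestamp wins."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for table in sstables:
        # The Bloom filter lets us skip tables that cannot hold the key.
        if table.bloom.might_contain(key) and key in table.cells:
            candidates.append(table.cells[key])
    if not candidates:
        return None
    return max(candidates, key=lambda cell: cell[0])[1]


memtable = {"a": (5, "new")}
sstables = [SSTable({"a": (1, "old"), "b": (2, "bee")})]
print(read("a", memtable, sstables), read("b", memtable, sstables))
```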
What is HBase?
HBase is an open-source, distributed, scalable wide-column store built on top of HDFS for real-time read/write access to large datasets.
What are RegionServers in HBase?
RegionServers handle read and write requests for regions (subsets of tables), store HFiles and MemStores, and manage compactions.
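Since each region covers a contiguous, sorted range of row keys, finding the server for a row is a search over region start keys. The sketch below assumes a hypothetical three-region table with string keys; real HBase clients resolve regions through the cluster's metadata table, and row keys are byte arrays.

```python
import bisect

# Hypothetical layout: each region spans [its start key, the next start key).
# The empty string marks the start of the first region.
REGIONS = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-3"),
]
START_KEYS = [start for start, _ in REGIONS]


def locate(row_key: str) -> str:
    """Return the server hosting the region that contains row_key."""
    # Last region whose start key is <= row_key.
    idx = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[idx][1]


for key in ("apple", "g", "zebra"):
    print(key, "->", locate(key))
```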
What is the HMaster in HBase?
HMaster coordinates RegionServers, handles schema changes, and manages region assignment and load balancing.
What are use cases for Cassandra?
Common use cases include time-series data, IoT, messaging systems, recommendation engines, and real-time analytics.
What are use cases for HBase?
HBase is used in data lakes, batch and real-time analytics, time-series storage, and storing sparse datasets.
What are wide-column stores?
Wide-column stores are NoSQL databases that store data in tables, rows, and dynamic columns grouped by column families (e.g., Cassandra, HBase).
What are advantages of wide-column stores?
Advantages include high write throughput, horizontal scalability, flexible schemas, and efficient handling of sparse data.
What are disadvantages of wide-column stores?
Disadvantages include complex data modeling, limited ad-hoc querying, and a steeper learning curve than relational databases.