Wide Column Stores (Cassandra, HBase) Flashcards

Handle high-throughput, write-optimized distributed workloads. Topics: column-family structure (rows, partitions, clustering columns); Cassandra data model and CQL (Cassandra Query Language); partitioning and data distribution (consistent hashing); tunable consistency in Cassandra (QUORUM, ONE, ALL); replication and replication factor; write path and read path in Cassandra; HBase architecture (RegionServers, HMaster); use cases: time-series data, analytics platforms. (40 cards)

1
Q

How does Cassandra achieve high availability?

A

Cassandra uses peer-to-peer architecture, replication across nodes, and tunable consistency levels to ensure high availability even during node failures.

2
Q

Explain the data model of Cassandra.

A

Cassandra’s data model is a wide-column structure: tables hold rows that are grouped into partitions by the partition key and ordered within each partition by clustering columns.

3
Q

What workloads is Cassandra optimized for?

A

Cassandra is optimized for high-throughput, write-heavy, distributed workloads with linear scalability and high availability.

4
Q

What is the column-family structure in Cassandra?

A

A column family (the older name for what CQL calls a table) organizes data into rows grouped by partition key and, within each partition, sorted by clustering columns.

5
Q

What is a partition key in Cassandra?

A

The partition key is hashed to a token that determines which nodes store the data (via consistent hashing); all rows that share a partition key form one partition.

6
Q

What are clustering columns in Cassandra?

A

Clustering columns define how data is sorted within a partition and support range queries on related rows.
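
As a rough illustration of why sorted clustering columns make range queries cheap, the sketch below keeps one partition's rows ordered by a timestamp and slices out a range with binary search. The `partition` data and the `range_query` helper are hypothetical names for this example and are not part of Cassandra.

```python
from bisect import bisect_left, bisect_right

# Hypothetical partition: rows for one sensor, kept sorted by the
# clustering column (an integer timestamp in this sketch).
partition = [(1, 20.5), (3, 21.0), (5, 21.4), (8, 22.1)]

def range_query(rows, start, end):
    """Return rows whose clustering value lies in [start, end]."""
    keys = [ts for ts, _ in rows]
    lo = bisect_left(keys, start)   # first row with ts >= start
    hi = bisect_right(keys, end)    # one past the last row with ts <= end
    return rows[lo:hi]

print(range_query(partition, 3, 5))
```

Because the rows are already ordered, the range is one contiguous slice and no scan of the whole partition is needed.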

7
Q

What is CQL (Cassandra Query Language)?

A

CQL is a SQL-like language used to interact with Cassandra, including creating tables, inserting data, and querying.

8
Q

What is consistent hashing in Cassandra?

A

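Consistent hashing places nodes and partition keys on the same token ring: each node owns one or more token ranges, data is spread roughly evenly, and adding or removing a node moves only the data in the affected ranges.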
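
A minimal sketch of a token ring, assuming one token per node and MD5 as a stand-in hash (Cassandra's default partitioner uses Murmur3 and usually several virtual nodes per host). The `Ring` class and the node names are hypothetical.

```python
import hashlib
from bisect import bisect_left

def token(key: str) -> int:
    """Map a key to a ring position (MD5 is only a stand-in hash here)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the token range that ends at its own token.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def owner(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        i = bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        return self.entries[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("sensor-42"))
```

In this sketch, adding a fourth node only reassigns keys that fall into the new node's range; every other key keeps its owner.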

9
Q

What is tunable consistency in Cassandra?

A

Cassandra lets clients choose a consistency level such as ONE, QUORUM, or ALL for each read and write, trading latency and availability against consistency.

10
Q

What is QUORUM in Cassandra consistency?

A

QUORUM requires a majority of replicas to respond, i.e. floor(RF / 2) + 1 where RF is the replication factor, offering a balance between consistency and availability.

11
Q

What does consistency level ONE mean in Cassandra?

A

ONE means only one replica must respond, favoring availability and speed over consistency.

12
Q

What does consistency level ALL mean in Cassandra?

A

ALL requires every replica to acknowledge the read or write, giving the strongest consistency but the lowest availability: a single unavailable replica fails the request.
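
The three levels can be compared with a little arithmetic. The sketch below computes how many replicas each level needs for a replication factor `rf`, and checks the usual overlap rule R + W > RF under which a read is guaranteed to contact at least one replica that acknowledged the latest write. The function names are hypothetical and only illustrate the rule.

```python
# Replicas that must respond at each level, for replication factor rf.
REQUIRED = {
    "ONE": lambda rf: 1,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}

def required_replicas(level: str, rf: int) -> int:
    return REQUIRED[level](rf)

def read_sees_latest_write(write_level: str, read_level: str, rf: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    return required_replicas(write_level, rf) + required_replicas(read_level, rf) > rf

for w, r in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    print(w, r, read_sees_latest_write(w, r, rf=3))
```

With a replication factor of 3, QUORUM needs 2 replicas, so QUORUM writes plus QUORUM reads overlap while ONE plus ONE does not.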

13
Q

What is replication in Cassandra?

A

Replication refers to storing multiple copies of data across different nodes to ensure redundancy and fault tolerance.

14
Q

What is replication factor in Cassandra?

A

The replication factor determines how many replicas of each row are stored in the cluster; e.g., a factor of 3 means each row lives on 3 different nodes.
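
One simple way to picture replica placement is to find the node that owns the key's token and then walk clockwise around the ring until enough nodes are collected, roughly in the spirit of Cassandra's SimpleStrategy. This is a sketch under that assumption, with a hypothetical `replicas` helper, one token per node, and MD5 standing in for the real partitioner.

```python
import hashlib
from bisect import bisect_left

def token(key: str) -> int:
    """Ring position for a key (MD5 is a stand-in hash for this sketch)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replicas(nodes, partition_key, rf):
    """Pick rf distinct nodes: the token owner, then its clockwise neighbours."""
    ring = sorted((token(n), n) for n in nodes)
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token(partition_key)) % len(ring)
    count = min(rf, len(ring))  # cannot place more replicas than nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

cluster = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(replicas(cluster, "sensor-42", rf=3))
```

Rack-aware and datacenter-aware placement is not modeled here.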

15
Q

What is the write path in Cassandra?

A

Writes go to the commit log, then to an in-memory Memtable, and are eventually flushed to SSTables on disk.
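
The write path above can be sketched as a toy in-memory model. The `Node` class, its `flush_threshold` parameter, and the choice to clear the commit log on flush are simplifying assumptions; a real node flushes on memory pressure and recycles commit log segments rather than counting keys.

```python
class Node:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only log for durability
        self.memtable = {}     # recent writes held in memory
        self.sstables = []     # immutable, sorted "on-disk" tables
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record for crash recovery
        self.memtable[key] = value            # 2. apply in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. eventually persist

    def flush(self):
        # Write the memtable out as a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replay

node = Node()
node.write("a", 1)
node.write("b", 2)   # reaches the threshold and triggers a flush
node.write("c", 3)
print(node.sstables, node.memtable)
```

Writes never modify an existing SSTable, which is a large part of why this design favors write throughput.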

16
Q

What is the read path in Cassandra?

A

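Reads check the memtable, use Bloom filters to skip SSTables that cannot contain the requested partition, and merge results from the remaining SSTables; the coordinator contacts as many replicas as the consistency level requires.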
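
A simplified single-node read can be sketched as follows. The `BloomFilter`, `SSTable`, and `read` names are hypothetical, and the sketch returns the value from the newest table that holds the key, whereas a real node merges cells by write timestamp and also consults caches and indexes.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> p) & 1 for p in self._positions(key))

class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for k in self.rows:
            self.bloom.add(k)

def read(key, memtable, sstables):
    """Return the freshest value; sstables are ordered oldest to newest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):
        # The Bloom filter lets us skip tables that cannot hold the key.
        if table.bloom.might_contain(key) and key in table.rows:
            return table.rows[key]
    return None

tables = [SSTable({"a": 1, "b": 2}), SSTable({"a": 10})]
print(read("a", {}, tables), read("b", {"b": 99}, tables))
```

A Bloom filter can report false positives but never false negatives, so skipping a table on a negative answer is always safe.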

17
Q

What is HBase?

A

HBase is an open-source, distributed, scalable wide-column store built on top of HDFS for real-time read/write access to large datasets.

18
Q

What are RegionServers in HBase?

A

RegionServers handle read and write requests for regions (subsets of tables), store HFiles and MemStores, and manage compactions.

19
Q

What is the HMaster in HBase?

A

HMaster coordinates RegionServers, handles schema changes, and manages region assignment and load balancing.

20
Q

What are use cases for Cassandra?

A

Common use cases include time-series data, IoT, messaging systems, recommendation engines, and real-time analytics.

21
Q

What are use cases for HBase?

A

HBase is used in data lakes, batch and real-time analytics, time-series storage, and storing sparse datasets.

22
Q

What are wide-column stores?

A

Wide-column stores are NoSQL databases that store data in tables, rows, and dynamic columns grouped by column families (e.g., Cassandra, HBase).

23
Q

What are advantages of wide-column stores?

A

Advantages include high write throughput, horizontal scalability, flexible schemas, and efficient handling of sparse data.

24
Q

What are disadvantages of wide-column stores?

A

Disadvantages include complex data modeling, limited ad-hoc querying, and a steeper learning curve than relational databases.

25
What are best practices for Cassandra schema design?
Model tables around the queries you need to serve, choose composite primary keys (partition key plus clustering columns) deliberately, avoid large partitions, and keep write and read paths efficient.
26
What are best practices for HBase usage?
Design row keys for even distribution, monitor region splits, use bulk loading for batch inserts, and leverage TTLs for cleanup.
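
One common way to design row keys for even distribution is to prefix them with a salt or hash bucket so that monotonically increasing keys do not all land in the same region. The sketch below assumes a hypothetical `salted_row_key` helper, 8 buckets, and a `|` separator; none of these come from the HBase API.

```python
import hashlib

def salted_row_key(device_id: str, timestamp: int, buckets: int = 8) -> str:
    """Build a row key with a stable hash-bucket prefix.

    The prefix depends only on device_id, so all rows for one device stay
    contiguous while different devices spread across key ranges.
    """
    digest = hashlib.md5(device_id.encode()).digest()
    salt = digest[0] % buckets
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    return f"{salt:02d}|{device_id}|{timestamp:013d}"

print(salted_row_key("device-17", 1700000000000))
```

The trade-off is that a scan across all devices now has to issue one scan per bucket.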
27
What are architectural implications of using Cassandra?
Cassandra enables a decentralized architecture with no single point of failure, supporting global distribution and eventual consistency.
28
What are architectural implications of using HBase?
HBase relies on Hadoop’s HDFS for storage and uses a master-slave model (HMaster and RegionServers), so it requires integration with HDFS and ZooKeeper.
29
How do wide-column stores perform under load?
They scale horizontally with high availability and write throughput, especially when using proper partitioning and compaction strategies.
30
How do Cassandra and HBase provide fault tolerance?
Cassandra uses replication and hinted handoff; HBase uses HDFS replication and RegionServer failover managed by HMaster.
31
How can you monitor Cassandra?
Tools like Prometheus, Grafana, DataStax OpsCenter, and `nodetool` are used to monitor latency, disk usage, and node health.
32
How can you monitor HBase?
HBase metrics are available via JMX and can be visualized using Grafana, Ganglia, or Cloudera Manager.
33
How do you debug performance issues in Cassandra?
Use tracing, slow query logs, `nodetool` stats, and check for hotspots, tombstones, and GC pauses.
34
How do you debug HBase issues?
Check RegionServer and HMaster logs, use the HBase web UI, review HDFS and ZooKeeper health, and analyze compaction and GC logs.
35
What are real-world tradeoffs of using Cassandra?
Cassandra offers high availability and scalability but trades off immediate consistency and requires careful data modeling.
36
What are real-world tradeoffs of using HBase?
HBase provides strong consistency and Hadoop integration but has operational complexity and less flexibility in querying.
37
What are common Cassandra interview questions?
Examples: 'Explain tunable consistency', 'What is a partition key?', 'Describe the write path', 'How does Cassandra handle replication?'
38
What are common HBase interview questions?
Examples: 'What is a RegionServer?', 'How does HBase store data?', 'Describe the HMaster's role', 'Explain compaction in HBase.'
39
What are potential gotchas in Cassandra?
Hot partitions, excessive tombstones, unbounded partition growth, and misconfigured consistency levels can cause performance issues.
40
What are potential gotchas in HBase?
Region hotspots, unbalanced region splits, improper row key design, and compaction storms can affect performance.