Week 11: Fast Data Streaming in the Cloud Flashcards
We'll explore Fast Data and real-time streaming. We'll introduce Storm technology, which is used widely at companies such as Yahoo, and then continue our exploration of it. We'll also learn about Spark Streaming and the Lambda and Kappa architectures (30 cards)
Why do we need real-time stream processing systems? What are the fundamental goals of streaming analytics?
Continuous ingestion of data with very low latency, enabling near‑instant feedback loops and minimizing data staleness for critical business decisions.
Dynamic scaling to handle fluctuating workloads and resilient pipelines that sustain heavy throughput.
Support for real‑time dashboards and rapid decision‑making across domains such as fraud detection, IoT sensor monitoring, and user personalization.
How do batch and streaming architectures differ, and what are their ideal use cases?
Batch: Accumulates data over fixed intervals; ideal for aggregated reporting, historical analytics, and workloads that tolerate higher latency.
Streaming: Processes data continuously as it arrives; ideal for real‑time monitoring, event‑driven actions, and applications requiring sub‑second responsiveness.
Hybrid: Combines both to balance cost, latency, and operational overhead depending on workload patterns.
What are the typical components of a streaming data pipeline (e.g., producers, brokers, stream processors, sinks)?
Producers – Real‑time event sources (e.g., sensors, user interactions, service logs)
Brokers (Ingestion Layer) – Message queues or services (e.g., Kafka, Kinesis, Pulsar) that buffer and distribute events
Stream Processors – Engines that perform transformations, aggregations, filtering, and routing on the fly
Sinks – Storage or serving layers (e.g., databases, dashboards, analytics APIs) where processed results are written for consumption
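A minimal sketch of these four roles in plain Java, with an in-memory BlockingQueue standing in for a real broker such as Kafka or Kinesis (all names and events here are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a BlockingQueue stands in for a real broker.
public class MiniPipeline {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> broker = new LinkedBlockingQueue<>();

        // Producer: emits raw events into the broker.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                broker.offer("click,user" + i);
            }
            broker.offer("POISON"); // sentinel to stop the processor
        });

        // Stream processor + sink: transforms each event and "writes" it out.
        Thread processor = new Thread(() -> {
            try {
                while (true) {
                    String event = broker.take();
                    if (event.equals("POISON")) break;
                    String enriched = event.toUpperCase();      // transformation
                    System.out.println("sink <- " + enriched);  // stdout stands in for a DB/dashboard
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        processor.start();
        producer.join();
        processor.join();
    }
}
```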
What challenges arise in stream processing, especially event vs. processing time and exactly-once semantics?
Event vs. Processing Time:
Event time is when the action actually occurred; processing time is when it’s seen by the pipeline.
Out‑of‑order or late‑arriving events require watermarking (see the sketch after this card) and holding windows open for a configurable delay to ensure correct results.
Delivery Semantics:
Ensuring exactly‑once processing demands sophisticated checkpointing, state management, and idempotent writes.
Without it, pipelines may exhibit at‑least‑once or at‑most‑once behaviours, risking duplicates or data loss.
Scaling & Fault Tolerance: Managing high throughput and maintaining consistent state across distributed nodes adds further complexity.
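As one concrete form of the watermarking mentioned above, Flink's DataStream API offers a bounded-out-of-orderness strategy. A sketch, where the Event class and the five-second lateness bound are assumptions:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Assumed event type carrying its own event-time timestamp.
class Event {
    long timestampMillis; // when the action actually occurred
    String payload;
}

public class WatermarkExample {
    // Tolerate events arriving up to 5 seconds late: the watermark trails the
    // maximum seen event time by 5s, so windows stay open that long before closing.
    static DataStream<Event> withEventTime(DataStream<Event> events) {
        return events.assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTs) -> event.timestampMillis));
    }
}
```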
What motivated the shift from batch-only systems to real-time architectures in big data pipelines?
Even as Hadoop made large‑scale batch processing widely available, organizations still needed immediate insights and instant feedback loops for analytics and decision‑making.
End users “always have wanted data faster,” and by the early 2010s, in‑memory streaming at billions of events/day became cost‑effective thanks to commodity hardware.
What are the key elements of Apache Storm’s processing model, including topologies, spouts, and bolts?
Topologies: Directed acyclic graphs of spouts and bolts that run continuously (no batch boundaries), processing unbounded streams
Spouts: Sources of input streams (e.g., event generators) that emit tuples into the topology
Bolts: Processing units that transform, filter, aggregate, or join tuples; special “sink” bolts write results out to external systems
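A minimal word-splitting topology in Storm's Java API (Storm 2.x signatures; the sentence spout and the bolt's parallelism of two are illustrative choices):

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Spout: emits an endless stream of sentence tuples.
class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        collector.emit(new Values("the quick brown fox")); // stand-in event source
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

// Bolt: splits each sentence tuple into word tuples.
class SplitBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        for (String word : input.getStringByField("sentence").split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

public class WordTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        // shuffleGrouping: distribute tuples randomly across the bolt's tasks
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("word-topology", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let it run briefly in-process, then shut down
        }
    }
}
```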
How do hybrid systems like Lambda Architecture combine batch and real-time processing layers to ensure accuracy and responsiveness?
Batch Layer (e.g., Hadoop): Ingests all data to compute a comprehensive, accurate “master” dataset via transforms, joins, validation, and aggregations
Speed Layer (e.g., Storm): Processes real‑time events for low‑latency, incremental updates
Serving Layer (e.g., Druid): Merges outputs from both layers to provide queries that are both up‑to‑date and correct, balancing freshness with accuracy
What are the core components of the Lambda Architecture, and how does it manage failures and state consistency?
Batch Layer: Ingests all raw data and periodically recomputes accurate views to recover from any errors or data loss
Speed (Real‑Time) Layer: Processes incoming events immediately, updating an in‑memory key–value store; must guarantee idempotent updates to avoid duplicates (sketched below)
Serving Layer: Merges batch and speed views into a unified, queryable result for low‑latency access
Failure & Consistency Handling: Batch recomputation corrects any mistakes from the speed layer; stream processors use idempotency and checkpointing of their in‑memory state to ensure exactly‑once semantics
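A sketch of the speed layer's idempotency requirement, using an in-memory view that deduplicates by event ID (all names here are illustrative):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative speed-layer view: counts per key, deduplicated by event ID so
// that redelivered events (at-least-once input) do not double-count.
public class IdempotentView {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    private final Set<String> appliedEventIds = ConcurrentHashMap.newKeySet();

    public void apply(String eventId, String key) {
        // add() returns false if this event was already applied -> skip the update
        if (appliedEventIds.add(eventId)) {
            counts.merge(key, 1L, Long::sum);
        }
    }

    public long count(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        IdempotentView view = new IdempotentView();
        view.apply("evt-1", "user42");
        view.apply("evt-1", "user42"); // duplicate delivery: ignored
        System.out.println(view.count("user42")); // prints 1
    }
}
```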
How does the Kappa Architecture differ from Lambda in terms of simplicity and reliance on a single streaming path?
Single Path: Eliminates the separate batch layer—everything is handled as a continuous stream
Event Log Replay: Historical reprocessing is achieved by replaying the immutable event log through the same stream processor (see the sketch after this card)
Reduced Operational Complexity: One codebase and one processing framework to maintain
State Management: Can leverage micro‑batching within the streaming engine or stateful stream stores, but avoids maintaining dual pipelines
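With Kafka as the event log, replay can be as simple as rewinding a consumer to the earliest retained offset. A sketch using the kafka-clients API (topic, partition, and broker address are assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromLog {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "replay-job");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign partition 0 of the (assumed) "events" topic explicitly,
            // then rewind to the earliest retained offset to reprocess history.
            TopicPartition tp = new TopicPartition("events", 0);
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // The same processing code handles both the replay and the live tail.
                    System.out.println(r.offset() + ": " + r.value());
                }
            }
        }
    }
}
```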
What is the role of Amazon Data Firehose in building serverless ETL pipelines for real-time data delivery?
Ingests streaming data and delivers it automatically to storage or analytics services with no servers to manage
Manages buffering, encryption, scaling, retries and backups under the hood
Enables near‑real‑time ETL with minimal custom code, lowering operational overhead and storage costs
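A sketch of the producer side using the AWS SDK for Java v2; the delivery-stream name and payload are assumptions:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;

public class FirehoseProducer {
    public static void main(String[] args) {
        try (FirehoseClient firehose = FirehoseClient.create()) {
            // Firehose buffers these records and delivers batches to the
            // configured destination (S3, Redshift, OpenSearch, ...).
            PutRecordRequest request = PutRecordRequest.builder()
                .deliveryStreamName("clickstream-delivery") // assumed stream name
                .record(Record.builder()
                    .data(SdkBytes.fromUtf8String("{\"event\":\"click\",\"user\":\"42\"}\n"))
                    .build())
                .build();
            firehose.putRecord(request);
        }
    }
}
```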
How does Firehose handle buffering, transformations, and delivery to destinations like S3, Redshift, or OpenSearch?
Buffering: Spans multiple AZs; batches data based on configurable size or time intervals to balance latency vs. throughput
Transformations: Optionally invokes AWS Lambda inline for JSON parsing, format conversion (e.g., Parquet), enrichment, or filtering
Delivery: Auto‑scales throughput and writes batch files to S3, loads into Redshift or indexes into OpenSearch, with built‑in retries, backups, and encryption
What are the benefits and use cases of inline data transformations using AWS Lambda in Firehose?
Perform lightweight ETL on the fly: JSON parsing/field extraction, format conversion to Parquet, schema validation
Enrich or filter records before storage, ensuring only clean, compliant data is persisted
Handle transformation errors gracefully by sending faulty records to S3 backup
Reduces downstream storage/query costs and simplifies downstream pipeline logic
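A transformation Lambda receives a batch of base64-encoded records and must echo back each recordId with a result status (Ok, Dropped, or ProcessingFailed) and the transformed data. A generic sketch using raw maps (the uppercase transform is purely illustrative):

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// Firehose invokes this with {"records":[{"recordId":..., "data":<base64>}, ...]}
// and expects each record back with a recordId, result status, and (new) data.
public class TransformHandler
        implements RequestHandler<Map<String, Object>, Map<String, Object>> {

    @Override
    @SuppressWarnings("unchecked")
    public Map<String, Object> handleRequest(Map<String, Object> event, Context ctx) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> rec : (List<Map<String, Object>>) event.get("records")) {
            byte[] raw = Base64.getDecoder().decode((String) rec.get("data"));
            // Illustrative transform: uppercase the payload; real code would
            // parse JSON, validate the schema, enrich, or filter here.
            String transformed = new String(raw).toUpperCase();
            out.add(Map.of(
                "recordId", rec.get("recordId"),
                "result", "Ok", // or "Dropped" / "ProcessingFailed"
                "data", Base64.getEncoder().encodeToString(transformed.getBytes())));
        }
        return Map.of("records", out);
    }
}
```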
What are the key features of Amazon Managed Service for Apache Flink, and how does it support real-time analytics?
Fully managed Flink runtime with event‑time semantics, checkpointing/savepoints, and exactly‑once guarantees
Scalable parallel execution via JobManager (coordinator) and TaskManagers (workers)
Automatic restarts on failures and AWS‑managed scaling/version upgrades
Native support for windowed aggregations, streaming joins, and integrations with S3, DynamoDB, Kinesis, Kafka, etc.
How can Flink be configured for streaming jobs using SQL or APIs, and what are the best practices for deployment and monitoring?
Configuration:
- Deploy via AWS console by uploading a JAR or defining SQL; choose SQL mode for declarative queries or DataStream/Table APIs for custom logic
- Specify parallelism and checkpoint intervals (see the sketch after this list); connect to Kinesis Data Streams or Kafka as a source
Best Practices:
- Enable detailed logging and CloudWatch metrics for throughput, latency, and resource usage
- Test in a staging environment under scaled traffic
- Tune checkpointing, watermark strategies, and state TTLs
- Securely rotate credentials (e.g., for Redshift) and partition outputs (e.g., time‑based S3 prefixes)
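A sketch of how those checkpoint and parallelism settings look in the DataStream API (the specific values are arbitrary choices):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state every 60s with exactly-once guarantees.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Cap concurrent checkpoints and require a minimum pause between them
        // so checkpointing does not starve normal processing.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);
        env.setParallelism(4); // match to source shards/partitions

        // ... define sources (e.g., Kinesis or Kafka), transformations, sinks ...
        // env.execute("my-streaming-job");
    }
}
```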
What problems did earlier streaming frameworks face with duplicate processing and partial updates?
At‑least‑once delivery (e.g., Storm’s acking) led to duplicate records unless sinks implemented manual deduplication.
Partial writes to external ACID stores meant failures could leave inconsistent state.
Developers had to track message IDs or version columns and build Lambda/Kappa layers to reconcile real‑time and batch views.
How does Flink use checkpoint barriers and operator state to achieve exactly-once processing?
Checkpoint Barriers: Periodically injected into streams; barriers flow downstream, pausing outputs and buffering incoming data until all inputs align.
Operator State Snapshots: Once aligned, each operator snapshots its local (keyed or operator‑wide) state to durable storage.
Recovery: On failure, Flink restores state from the last completed checkpoint and replays upstream data, ensuring no partial snapshots or duplicate processing.
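The state in those snapshots is ordinary keyed state registered by user functions. A sketch of a per-key counter whose ValueState is captured at every checkpoint and restored on recovery (names are illustrative):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the ValueState below is exactly what Flink snapshots
// at each checkpoint barrier and restores after a failure.
public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();        // null on the first event for this key
        long next = (current == null ? 0L : current) + 1;
        count.update(next);                  // checkpointed automatically
        out.collect(Tuple2.of(in.f0, next));
    }
}
```

Applied after a keyBy, each parallel instance of this function holds state only for its share of the keys, which is what lets the snapshots scale out with the job.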
What is the purpose of two-phase commit for external sinks in Flink, and how does it ensure consistency?
Prepare Phase: During a checkpoint, the sink writes data in a “pending” or transactional state.
Commit Phase: On successful checkpoint completion, the sink finalizes (commits) those writes atomically.
Rollback on Failure: If the checkpoint fails, the sink discards pending writes, preventing partial updates and guaranteeing exactly‑once delivery to external systems.
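The protocol itself is easy to sketch in isolation; this illustrates the prepare/commit/abort idea in plain Java, not Flink's actual TwoPhaseCommitSinkFunction API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative two-phase-commit sink: writes are staged in a pending buffer
// during the checkpoint (prepare), then made visible atomically on commit.
public class TwoPhaseSink {
    private final List<String> pending = new ArrayList<>();   // transactional staging
    private final List<String> committed = new ArrayList<>(); // what readers see

    void write(String record) {   // called while a checkpoint is in flight
        pending.add(record);      // "pending" state: invisible to readers
    }

    void commit() {               // checkpoint completed on all operators
        committed.addAll(pending); // finalize atomically
        pending.clear();
    }

    void abort() {                // checkpoint failed somewhere
        pending.clear();          // discard partial writes: no duplicates, no half-applied batches
    }

    public static void main(String[] args) {
        TwoPhaseSink sink = new TwoPhaseSink();
        sink.write("a");
        sink.abort();                       // failure: "a" never becomes visible
        sink.write("a");                    // replayed after recovery
        sink.commit();
        System.out.println(sink.committed); // [a]
    }
}
```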
How does Flink unify real-time correctness with recovery strategies, eliminating the need for Lambda/Kappa reconciliation?
Co‑located Compute & State: Operators hold state locally, reducing external write dependencies.
Unified Checkpoint‑Based Snapshots: Same mechanism drives both real‑time processing and historical replays.
Single Pipeline for Live & Backfill: Recovery via checkpoint restore and log replay uses identical code paths, removing separate batch layers and manual merge logic.
What are the core features of Apache Flink’s architecture?
Native event‑driven runtime: Processes records as true streams, not micro‑batches
Stateful stream computations: Built‑in keyed and operator state with distributed scaling
Exactly‑once semantics: Checkpointing and two‑phase commits ensure end‑to‑end consistency
JobManager & TaskManagers: Central coordinator (JobManager) and worker nodes (TaskManagers with slots and operator chaining) for parallel execution
Pipelined data exchange: Continuous pipelines for low latency, with fault‑tolerance via coordinated checkpoints/savepoints
How do the DataStream, Table, and SQL APIs support a wide range of streaming use cases in Flink?
DataStream API:
- Unified model for streaming & batch; high‑/low‑level operators (map, filter, flatMap)
- Stateful processing with keyed state, event‑time watermarks, windows, custom logic
- Connectors for Kafka, Kinesis, file systems, etc.
Table API:
- Relational, schema‑based abstraction over DataStream/DataSet
- SQL‑like transformations, time attributes for event‑time grouping, joins, aggregations
- Single codebase for both batch and streaming
SQL API:
- Standard SQL over streaming tables with streaming extensions (OVER windows, temporal joins)
- Integrated event‑time semantics and same optimizer/runtime as Table API
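A sketch of the SQL API over a generated source, using the built-in datagen connector (the schema, watermark bound, and window size are illustrative):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlApiExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Built-in datagen connector produces an unbounded stream of rows.
        tEnv.executeSql(
            "CREATE TABLE clicks (" +
            "  user_id INT," +
            "  ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" + // event-time attribute
            ") WITH ('connector' = 'datagen')");

        // Standard SQL over the streaming table: a tumbling-window count.
        tEnv.executeSql(
            "SELECT window_start, COUNT(*) AS clicks " +
            "FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTE)) " +
            "GROUP BY window_start, window_end").print();
    }
}
```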
What is Spark Streaming’s discretized stream processing model, and how does it differ from traditional record-at-a-time systems?
Discretized model: chops the live data stream into micro‑batches at a fixed, configurable interval (e.g., one second) and runs each batch as a Spark job.
Traditional model: processes one record at a time, maintaining mutable in‑node state that is lost on failure.
Key difference: Spark Streaming treats each interval as immutable RDDs (resilient batches), avoiding per‑record state loss and enabling fault tolerance via RDD lineage.
How does Spark Streaming use micro-batches (DStreams) to achieve fault-tolerant, near-real-time processing?
DStreams: represent a continuous stream as a sequence of RDDs, one per batch interval.
Processing: each RDD is processed with standard Spark transformations and actions.
Fault tolerance: RDDs are recomputed from lineage on failure, so any lost partition can be rebuilt automatically.
Latency: batch intervals as low as 0.5 s yield end‑to‑end latencies around 1 s.
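The canonical DStream example is a streaming word count over a TCP socket source, producing one RDD per one-second batch (the host and port are assumptions):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WordCount");
        // Each 1-second interval of input becomes one immutable RDD.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaReceiverInputDStream<String> lines =
            jssc.socketTextStream("localhost", 9999); // assumed TCP source
        JavaDStream<String> words =
            lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
            words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);

        counts.print();       // runs once per batch interval
        jssc.start();         // start receiving and processing
        jssc.awaitTermination();
    }
}
```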
What is the role of RDDs in Spark Streaming’s computation model?
Each micro‑batch of incoming data is materialized as one or more Resilient Distributed Datasets (RDDs).
RDDs provide:
Immutable, partitioned data for parallel processing
Lineage information for automatic recomputation on failure
Seamless integration with Spark’s core APIs (SQL, MLlib, GraphX)
What types of data sources can Spark Streaming integrate with out-of-the-box (e.g., Kafka, HDFS, TCP sockets)?
Kafka
HDFS
Flume
Akka Actors
Raw TCP sockets
(Plus custom sources via easy‑to‑write receivers)