Week 11: Fast Data Streaming in the Cloud Flashcards
We'll explore Fast Data and real-time streaming. We'll introduce Storm technology, which is used widely at companies such as Yahoo, and then continue our exploration of it. We'll also learn about Spark Streaming and the Lambda and Kappa architectures (30 cards)
Why do we need real-time stream processing systems? What are the fundamental goals of streaming analytics?
Continuous ingestion of data with very low latency, enabling near‑instant feedback loops and minimizing data staleness for critical business decisions.
Dynamic scaling to handle fluctuating workloads and resilient pipelines that sustain heavy throughput.
Support for real‑time dashboards and rapid decision‑making across domains such as fraud detection, IoT sensor monitoring, and user personalization.
How do batch and streaming architectures differ, and what are their ideal use cases?
Batch: Accumulates data over fixed intervals; ideal for aggregated reporting, historical analytics, and workloads that tolerate higher latency.
Streaming: Processes data continuously as it arrives; ideal for real‑time monitoring, event‑driven actions, and applications requiring sub‑second responsiveness.
Hybrid: Combines both to balance cost, latency, and operational overhead depending on workload patterns.
What are the typical components of a streaming data pipeline (e.g., producers, brokers, stream processors, sinks)?
Producers – Real‑time event sources (e.g., sensors, user interactions, service logs)
Brokers (Ingestion Layer) – Message queues or services (e.g., Kafka, Kinesis, Pulsar) that buffer and distribute events
Stream Processors – Engines that perform transformations, aggregations, filtering, and routing on the fly
Sinks – Storage or serving layers (e.g., databases, dashboards, analytics APIs) where processed results are written for consumption
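A minimal sketch of these four roles in plain Java, with an in-memory BlockingQueue standing in for a real broker such as Kafka or Kinesis (all names and events here are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: a BlockingQueue stands in for a real broker.
public class MiniPipeline {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> broker = new LinkedBlockingQueue<>();

        // Producer: emits raw events into the broker.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                broker.offer("click,user" + i);
            }
            broker.offer("POISON"); // sentinel to stop the processor
        });

        // Stream processor + sink: transforms each event and "writes" it out.
        Thread processor = new Thread(() -> {
            try {
                while (true) {
                    String event = broker.take();
                    if (event.equals("POISON")) break;
                    String enriched = event.toUpperCase();      // transformation
                    System.out.println("sink <- " + enriched);  // stdout stands in for a DB/dashboard
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        processor.start();
        producer.join();
        processor.join();
    }
}
```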
What challenges arise in stream processing, especially event vs. processing time and exactly-once semantics?
Event vs. Processing Time:
Event time is when the action actually occurred; processing time is when it’s seen by the pipeline.
Out‑of‑order or late‑arriving events require watermarking (see the sketch after this card) and holding windows open for a configurable delay to ensure correct results.
Delivery Semantics:
Ensuring exactly‑once processing demands sophisticated checkpointing, state management, and idempotent writes.
Without it, pipelines may exhibit at‑least‑once or at‑most‑once behaviours, risking duplicates or data loss.
Scaling & Fault Tolerance: Managing high throughput and maintaining consistent state across distributed nodes adds further complexity.
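As one concrete form of the watermarking mentioned above, Flink's DataStream API offers a bounded-out-of-orderness strategy. A sketch, where the Event class and the five-second lateness bound are assumptions:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Assumed event type carrying its own event-time timestamp.
class Event {
    long timestampMillis; // when the action actually occurred
    String payload;
}

public class WatermarkExample {
    // Tolerate events arriving up to 5 seconds late: the watermark trails the
    // maximum seen event time by 5s, so windows stay open that long before closing.
    static DataStream<Event> withEventTime(DataStream<Event> events) {
        return events.assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTs) -> event.timestampMillis));
    }
}
```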
What motivated the shift from batch-only systems to real-time architectures in big data pipelines?
Even as Hadoop made large‑scale batch processing widely available, organizations still needed immediate insights and instant feedback loops for analytics and decision‑making.
End users “always have wanted data faster,” and by the early 2010s, in‑memory streaming at billions of events/day became cost‑effective thanks to commodity hardware.
What are the key elements of Apache Storm’s processing model, including topologies, spouts, and bolts?
Topologies: Directed acyclic graphs of spouts and bolts that run continuously (no batch boundaries), processing unbounded streams
Spouts: Sources of input streams (e.g., event generators) that emit tuples into the topology
Bolts: Processing units that transform, filter, aggregate, or join tuples; special “sink” bolts write results out to external systems
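A minimal word-splitting topology in Storm's Java API (Storm 2.x signatures; the sentence spout and the bolt's parallelism of two are illustrative choices):

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Spout: emits an endless stream of sentence tuples.
class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        collector.emit(new Values("the quick brown fox")); // stand-in event source
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

// Bolt: splits each sentence tuple into word tuples.
class SplitBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        for (String word : input.getStringByField("sentence").split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

public class WordTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        // shuffleGrouping: distribute tuples randomly across the bolt's tasks
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("word-topology", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let it run briefly in-process, then shut down
        }
    }
}
```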
How do hybrid systems like Lambda Architecture combine batch and real-time processing layers to ensure accuracy and responsiveness?
Batch Layer (e.g., Hadoop): Ingests all data to compute a comprehensive, accurate “master” dataset via transforms, joins, validation, and aggregations
Speed Layer (e.g., Storm): Processes real‑time events for low‑latency, incremental updates
Serving Layer (e.g., Druid): Merges outputs from both layers to provide queries that are both up‑to‑date and correct, balancing freshness with accuracy
What are the core components of the Lambda Architecture, and how does it manage failures and state consistency?
Batch Layer: Ingests all raw data and periodically recomputes accurate views to recover from any errors or data loss
Speed (Real‑Time) Layer: Processes incoming events immediately, updating an in‑memory key–value store; must guarantee idempotent updates to avoid duplicates (sketched below)
Serving Layer: Merges batch and speed views into a unified, queryable result for low‑latency access
Failure & Consistency Handling: Batch recomputation corrects any mistakes from the speed layer; stream processors use idempotency and checkpointing of their in‑memory state to ensure exactly‑once semantics
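A sketch of the speed layer's idempotency requirement, using an in-memory view that deduplicates by event ID (all names here are illustrative):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative speed-layer view: counts per key, deduplicated by event ID so
// that redelivered events (at-least-once input) do not double-count.
public class IdempotentView {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    private final Set<String> appliedEventIds = ConcurrentHashMap.newKeySet();

    public void apply(String eventId, String key) {
        // add() returns false if this event was already applied -> skip the update
        if (appliedEventIds.add(eventId)) {
            counts.merge(key, 1L, Long::sum);
        }
    }

    public long count(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        IdempotentView view = new IdempotentView();
        view.apply("evt-1", "user42");
        view.apply("evt-1", "user42"); // duplicate delivery: ignored
        System.out.println(view.count("user42")); // prints 1
    }
}
```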
How does the Kappa Architecture differ from Lambda in terms of simplicity and reliance on a single streaming path?
Single Path: Eliminates the separate batch layer—everything is handled as a continuous stream
Event Log Replay: Historical reprocessing is achieved by replaying the immutable event log through the same stream processor (see the sketch after this card)
Reduced Operational Complexity: One codebase and one processing framework to maintain
State Management: Can leverage micro‑batching within the streaming engine or stateful stream stores, but avoids maintaining dual pipelines
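With Kafka as the event log, replay can be as simple as rewinding a consumer to the earliest retained offset. A sketch using the kafka-clients API (topic, partition, and broker address are assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromLog {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "replay-job");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign partition 0 of the (assumed) "events" topic explicitly,
            // then rewind to the earliest retained offset to reprocess history.
            TopicPartition tp = new TopicPartition("events", 0);
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // The same processing code handles both the replay and the live tail.
                    System.out.println(r.offset() + ": " + r.value());
                }
            }
        }
    }
}
```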
What is the role of Amazon Data Firehose in building serverless ETL pipelines for real-time data delivery?
Ingests streaming data and delivers it automatically to storage or analytics services with no servers to manage
Manages buffering, encryption, scaling, retries and backups under the hood
Enables near‑real‑time ETL with minimal custom code, lowering operational overhead and storage costs
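A sketch of the producer side using the AWS SDK for Java v2; the delivery-stream name and payload are assumptions:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;

public class FirehoseProducer {
    public static void main(String[] args) {
        try (FirehoseClient firehose = FirehoseClient.create()) {
            // Firehose buffers these records and delivers batches to the
            // configured destination (S3, Redshift, OpenSearch, ...).
            PutRecordRequest request = PutRecordRequest.builder()
                .deliveryStreamName("clickstream-delivery") // assumed stream name
                .record(Record.builder()
                    .data(SdkBytes.fromUtf8String("{\"event\":\"click\",\"user\":\"42\"}\n"))
                    .build())
                .build();
            firehose.putRecord(request);
        }
    }
}
```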
How does Firehose handle buffering, transformations, and delivery to destinations like S3, Redshift, or OpenSearch?
Buffering: Spans multiple AZs; batches data based on configurable size or time intervals to balance latency vs. throughput
Transformations: Optionally invokes AWS Lambda inline for JSON parsing, format conversion (e.g., Parquet), enrichment, or filtering
Delivery: Auto‑scales throughput and writes batch files to S3, loads into Redshift or indexes into OpenSearch, with built‑in retries, backups, and encryption
What are the benefits and use cases of inline data transformations using AWS Lambda in Firehose?
Perform lightweight ETL on the fly: JSON parsing/field extraction, format conversion to Parquet, schema validation
Enrich or filter records before storage, ensuring only clean, compliant data is persisted
Handle transformation errors gracefully by sending faulty records to S3 backup
Reduces downstream storage/query costs and simplifies downstream pipeline logic
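A transformation Lambda receives a batch of base64-encoded records and must echo back each recordId with a result status (Ok, Dropped, or ProcessingFailed) and the transformed data. A generic sketch using raw maps (the uppercase transform is purely illustrative):

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// Firehose invokes this with {"records":[{"recordId":..., "data":<base64>}, ...]}
// and expects each record back with a recordId, result status, and (new) data.
public class TransformHandler
        implements RequestHandler<Map<String, Object>, Map<String, Object>> {

    @Override
    @SuppressWarnings("unchecked")
    public Map<String, Object> handleRequest(Map<String, Object> event, Context ctx) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> rec : (List<Map<String, Object>>) event.get("records")) {
            byte[] raw = Base64.getDecoder().decode((String) rec.get("data"));
            // Illustrative transform: uppercase the payload; real code would
            // parse JSON, validate the schema, enrich, or filter here.
            String transformed = new String(raw).toUpperCase();
            out.add(Map.of(
                "recordId", rec.get("recordId"),
                "result", "Ok", // or "Dropped" / "ProcessingFailed"
                "data", Base64.getEncoder().encodeToString(transformed.getBytes())));
        }
        return Map.of("records", out);
    }
}
```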
What are the key features of Amazon Managed Service for Apache Flink, and how does it support real-time analytics?
Fully managed Flink runtime with event‑time semantics, checkpointing/savepoints, and exactly‑once guarantees
Scalable parallel execution via JobManager (coordinator) and TaskManagers (workers)
Automatic restarts on failures and AWS‑managed scaling/version upgrades
Native support for windowed aggregations, streaming joins, and integrations with S3, DynamoDB, Kinesis, Kafka, etc.
How can Flink be configured for streaming jobs using SQL or APIs, and what are the best practices for deployment and monitoring?
Configuration:
- Deploy via AWS console by uploading a JAR or defining SQL; choose SQL mode for declarative queries or DataStream/Table APIs for custom logic
- Specify parallelism and checkpoint intervals (see the sketch after this list); connect to Kinesis Data Streams or Kafka as a source
Best Practices:
- Enable detailed logging and CloudWatch metrics for throughput, latency, and resource usage
- Test in a staging environment under scaled traffic
- Tune checkpointing, watermark strategies, and state TTLs
- Securely rotate credentials (e.g., for Redshift) and partition outputs (e.g., time‑based S3 prefixes)
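A sketch of how those checkpoint and parallelism settings look in the DataStream API (the specific values are arbitrary choices):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state every 60s with exactly-once guarantees.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Cap concurrent checkpoints and require a minimum pause between them
        // so checkpointing does not starve normal processing.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);
        env.setParallelism(4); // match to source shards/partitions

        // ... define sources (e.g., Kinesis or Kafka), transformations, sinks ...
        // env.execute("my-streaming-job");
    }
}
```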
What problems did earlier streaming frameworks face with duplicate processing and partial updates?
At‑least‑once delivery (e.g., Storm’s acking) led to duplicate records unless sinks implemented manual deduplication.
Partial writes to external ACID stores meant failures could leave inconsistent state.
Developers had to track message IDs or version columns and build Lambda/Kappa layers to reconcile real‑time and batch views.
How does Flink use checkpoint barriers and operator state to achieve exactly-once processing?
Checkpoint Barriers: Periodically injected into streams; barriers flow downstream, pausing outputs and buffering incoming data until all inputs align.
Operator State Snapshots: Once aligned, each operator snapshots its local (keyed or operator‑wide) state to durable storage.
Recovery: On failure, Flink restores state from the last completed checkpoint and replays upstream data, ensuring no partial snapshots or duplicate processing.
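The state in those snapshots is ordinary keyed state registered by user functions. A sketch of a per-key counter whose ValueState is captured at every checkpoint and restored on recovery (names are illustrative):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the ValueState below is exactly what Flink snapshots
// at each checkpoint barrier and restores after a failure.
public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();        // null on the first event for this key
        long next = (current == null ? 0L : current) + 1;
        count.update(next);                  // checkpointed automatically
        out.collect(Tuple2.of(in.f0, next));
    }
}
```

Applied after a keyBy, each parallel instance of this function holds state only for its share of the keys, which is what lets the snapshots scale out with the job.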
What is the purpose of two-phase commit for external sinks in Flink, and how does it ensure consistency?
Prepare Phase: During a checkpoint, the sink writes data in a “pending” or transactional state.
Commit Phase: On successful checkpoint completion, the sink finalizes (commits) those writes atomically.
Rollback on Failure: If the checkpoint fails, the sink discards pending writes, preventing partial updates and guaranteeing exactly‑once delivery to external systems.
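The protocol itself is easy to sketch in isolation; this illustrates the prepare/commit/abort idea in plain Java, not Flink's actual TwoPhaseCommitSinkFunction API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative two-phase-commit sink: writes are staged in a pending buffer
// during the checkpoint (prepare), then made visible atomically on commit.
public class TwoPhaseSink {
    private final List<String> pending = new ArrayList<>();   // transactional staging
    private final List<String> committed = new ArrayList<>(); // what readers see

    void write(String record) {   // called while a checkpoint is in flight
        pending.add(record);      // "pending" state: invisible to readers
    }

    void commit() {               // checkpoint completed on all operators
        committed.addAll(pending); // finalize atomically
        pending.clear();
    }

    void abort() {                // checkpoint failed somewhere
        pending.clear();          // discard partial writes: no duplicates, no half-applied batches
    }

    public static void main(String[] args) {
        TwoPhaseSink sink = new TwoPhaseSink();
        sink.write("a");
        sink.abort();                       // failure: "a" never becomes visible
        sink.write("a");                    // replayed after recovery
        sink.commit();
        System.out.println(sink.committed); // [a]
    }
}
```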
How does Flink unify real-time correctness with recovery strategies, eliminating the need for Lambda/Kappa reconciliation?
Co‑located Compute & State: Operators hold state locally, reducing external write dependencies.
Unified Checkpoint‑Based Snapshots: Same mechanism drives both real‑time processing and historical replays.
Single Pipeline for Live & Backfill: Recovery via checkpoint restore and log replay uses identical code paths, removing separate batch layers and manual merge logic.
What are the core features of Apache Flink’s architecture?
Native event‑driven runtime: Processes records as true streams, not micro‑batches
Stateful stream computations: Built‑in keyed and operator state with distributed scaling
Exactly‑once semantics: Checkpointing and two‑phase commits ensure end‑to‑end consistency
JobManager & TaskManagers: Central coordinator (JobManager) and worker nodes (TaskManagers with slots and operator chaining) for parallel execution
Pipelined data exchange: Continuous pipelines for low latency, with fault‑tolerance via coordinated checkpoints/savepoints
How do the DataStream, Table, and SQL APIs support a wide range of streaming use cases in Flink?
DataStream API:
- Unified model for streaming & batch; high‑/low‑level operators (map, filter, flatMap)
- Stateful processing with keyed state, event‑time watermarks, windows, custom logic
- Connectors for Kafka, Kinesis, file systems, etc.
Table API:
- Relational, schema‑based abstraction over DataStream/DataSet
- SQL‑like transformations, time attributes for event‑time grouping, joins, aggregations
- Single codebase for both batch and streaming
SQL API:
- Standard SQL over streaming tables with streaming extensions (OVER windows, temporal joins)
- Integrated event‑time semantics and same optimizer/runtime as Table API
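A sketch of the SQL API over a generated source, using the built-in datagen connector (the schema, watermark bound, and window size are illustrative):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlApiExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Built-in datagen connector produces an unbounded stream of rows.
        tEnv.executeSql(
            "CREATE TABLE clicks (" +
            "  user_id INT," +
            "  ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" + // event-time attribute
            ") WITH ('connector' = 'datagen')");

        // Standard SQL over the streaming table: a tumbling-window count.
        tEnv.executeSql(
            "SELECT window_start, COUNT(*) AS clicks " +
            "FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTE)) " +
            "GROUP BY window_start, window_end").print();
    }
}
```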
What is Spark Streaming’s discretized stream processing model, and how does it differ from traditional record-at-a-time systems?
Discretized model: chops the live data stream into micro‑batches at a fixed, configurable interval (e.g., one second) and runs each batch as a Spark job.
Traditional model: processes one record at a time, maintaining mutable in‑node state that is lost on failure.
Key difference: Spark Streaming treats each interval as immutable RDDs (resilient batches), avoiding per‑record state loss and enabling fault tolerance via RDD lineage.
How does Spark Streaming use micro-batches (DStreams) to achieve fault-tolerant, near-real-time processing?
DStreams: represent a continuous stream as a sequence of RDDs, one per batch interval.
Processing: each RDD is processed with standard Spark transformations and actions.
Fault tolerance: RDDs are recomputed from lineage on failure, so any lost partition can be rebuilt automatically.
Latency: batch intervals as low as 0.5 s yield end‑to‑end latencies around 1 s.
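The canonical DStream example is a streaming word count over a TCP socket source, producing one RDD per one-second batch (the host and port are assumptions):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WordCount");
        // Each 1-second interval of input becomes one immutable RDD.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaReceiverInputDStream<String> lines =
            jssc.socketTextStream("localhost", 9999); // assumed TCP source
        JavaDStream<String> words =
            lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
            words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);

        counts.print();       // runs once per batch interval
        jssc.start();         // start receiving and processing
        jssc.awaitTermination();
    }
}
```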
What is the role of RDDs in Spark Streaming’s computation model?
Each micro‑batch of incoming data is materialized as one or more Resilient Distributed Datasets (RDDs).
RDDs provide:
Immutable, partitioned data for parallel processing
Lineage information for automatic recomputation on failure
Seamless integration with Spark’s core APIs (SQL, MLlib, GraphX)
What types of data sources can Spark Streaming integrate with out-of-the-box (e.g., Kafka, HDFS, TCP sockets)?
Kafka
HDFS
Flume
Akka Actors
Raw TCP sockets
(Plus custom sources via easy‑to‑write receivers)