Week 10: Cloud-Based Analytics Flashcards
(40 cards)
What are column stores, and what are some examples of them?
Column stores (a.k.a. column‑oriented systems) store each attribute’s values together in columns rather than whole rows, dramatically reducing I/O for analytical queries.
Examples:
MonetDB
VectorWise (now Actian Vector)
C‑Store (commercialized as Vertica)
Sybase IQ
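A tiny illustration of why this matters for analytics, using a hypothetical wide sales table: a query that touches one column lets a column store read only that column's blocks, while a row store must read every row in full.

    -- Hypothetical table: sales(sale_id, customer_id, sale_date, amount, notes, ...)
    -- A column store answers this by scanning only the "amount" column;
    -- a row store has to pull entire rows off disk.
    SELECT SUM(amount) AS total_revenue
    FROM sales;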
What are some hardware optimizations for columnar storage?
Cache & SIMD
- Column reads bring in contiguous values, boosting CPU cache utilization.
- Vector (SIMD) instructions process many values in one CPU operation.
Compression
- Storing same‑type data together lowers entropy, yielding higher compression ratios.
- Engines can pick per‑column algorithms and automatically compress/decompress.
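As a concrete, hedged example of per-column compression selection: Amazon Redshift (covered later in this deck) can sample an existing table and recommend an encoding for each column. The table name below is hypothetical.

    -- Sample the table and report a suggested compression encoding per column
    -- (e.g., AZ64 for numeric/date columns, LZO or ZSTD for long text).
    ANALYZE COMPRESSION sales;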
How do updates work in Columnar Stores?
Updating a single logical row requires touching every column file, so point updates are slow.
Many systems adopt an append‑only or immutable‑blocks approach (e.g., Google Dremel, Redshift).
Redshift: 1 MB blocks are immutable; an update clones the affected block and writes the new version (copy‑on‑write), so small and large writes incur similar cost.
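A sketch of what this looks like on Redshift, with a hypothetical sales table: an UPDATE is internally a delete plus an append into fresh blocks, so old row versions linger until VACUUM reclaims them.

    -- Point update: the old row is only marked deleted; the new version is
    -- written into cloned blocks rather than edited in place.
    UPDATE sales
    SET amount = 42.00
    WHERE sale_id = 1001;

    -- Reclaim space held by deleted rows and restore sort order.
    VACUUM sales;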
What are some column store file formats?
Modern column‑oriented file formats include:
Apache Parquet: open‑source Hadoop format used by Hive, Pig, Impala, Spark; stores nested structures in a flat column layout.
Apache ORC: “Optimized Row Columnar” format for Hive.
Others: RCFile (an earlier Hive format), plus columnar storage engines such as Apache Kudu and ClickHouse.
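A minimal HiveQL sketch of how the file format is chosen per table (table and column names are hypothetical):

    -- Store this Hive table as ORC; use STORED AS PARQUET for Parquet instead.
    CREATE TABLE web_logs (
      event_time TIMESTAMP,
      user_id    STRING,
      url        STRING
    )
    STORED AS ORC;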
What motivates modern data warehouse architecture? What technologies enabled it, and what key features does it provide?
Motivation: Support efficient analytical queries over large, structured datasets for reporting, dashboarding, and long-term retention by minimizing I/O and scaling compute.
Enabling Technologies: Columnar storage formats, MPP (Massively Parallel Processing) engines, and cloud services that abstract hardware setup with optional serverless configurations.
Key Features: Columnar I/O reduction; parallel scanning and aggregation; automated maintenance (e.g., partition optimization); security (encryption at rest, role‑based access, auditing); pay‑per‑query billing based on bytes scanned or query time.
What are some examples of columnar-based data warehouses?
Amazon Redshift: PostgreSQL‑derived, MPP cluster with leader & compute nodes; Spectrum for external data.
Google BigQuery: Fully serverless Dremel engine; pay‑per‑query (bytes processed); external tables over GCS/Drive.
Azure Synapse Analytics: Dedicated SQL Pools (MPP) + serverless SQL; hybrid access via Azure Data Lake Storage.
Snowflake: Multi‑cloud; decoupled compute (virtual warehouses) & storage; proprietary micro‑partitions; time travel.
Others: Teradata, IBM Netezza/Db2 Warehouse, Oracle Exadata—all leveraging MPP and columnar principles.
How do different cloud-based data warehouse solutions like Redshift, BigQuery, Synapse, and Snowflake compare in design and cost model?
Redshift: Cluster‑based MPP (leader + compute nodes); provisioned clusters billed by size and uptime (with optional serverless mode).
BigQuery: Fully serverless; auto‑scales; pay‑per‑query billing based on bytes processed.
Synapse: Dedicated SQL Pools use reserved MPP compute; serverless SQL pools for on‑demand queries; billing by reserved capacity or per‑query.
Snowflake: Decoupled compute (billed per‑second for virtual warehouses) and storage (separate billing); auto‑suspend/resume for cost efficiency.
What are the advantages of serverless and pay-per-query models in cloud warehouses?
Serverless: No dedicated cluster management; automatic scaling of compute; reduced operational overhead.
Pay‑per‑query: Charges based on actual usage (bytes scanned or execution time); avoids idle resource costs; aligns costs with workload.
How do data warehouses integrate with broader cloud ecosystems and support advanced analytics like machine learning?
Redshift: Integrates with S3 and AWS Glue Data Catalog; Spectrum for external tables; offers Serverless configs.
BigQuery: External tables on Google Cloud Storage/Drive; BigQuery ML for in‑warehouse model training & inference (see the sketch after this list).
Synapse: Connects to Azure Data Lake Storage; tight integration with Power BI and Azure ML; serverless SQL for ad‑hoc analytics.
Snowflake: Works across AWS/Azure/GCP; secure data sharing; time travel for historical snapshots aiding data science.
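For example, BigQuery ML lets you train and apply a model with plain SQL inside the warehouse; the dataset, table, and column names below are hypothetical.

    -- Train a logistic-regression churn model directly in BigQuery ...
    CREATE OR REPLACE MODEL `analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT churned, tenure_months, monthly_spend
    FROM `analytics.customers`;

    -- ... then run in-warehouse inference with ML.PREDICT.
    SELECT *
    FROM ML.PREDICT(MODEL `analytics.churn_model`,
                    TABLE `analytics.new_customers`);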
What is the role of hybrid and lakehouse architectures in modern analytics?
Hybrid Architectures: Combine dedicated warehouse compute with data lake storage to balance performance and flexibility (e.g., Synapse + Azure Data Lake).
Lakehouse: Unifies warehouse‑style management with data lake agility, supporting BI, batch/stream processing, machine learning, and diverse data types in one platform.
What is the architecture of Amazon Redshift, and how does it support large-scale analytics?
Redshift uses an MPP design with a leader node orchestrating parallel SQL execution across multiple compute nodes that store data in columnar format. This massively parallel processing and columnar I/O reduction enable fast analytical queries over very large datasets.
How is data ingested and optimized in Redshift?
Data is bulk‑loaded via the COPY command from S3 or DynamoDB, with parallel streams splitting data across nodes. It supports CSV, JSON, Parquet, etc. Post‑load, VACUUM reclaims space and reorganizes rows, and ANALYZE updates table statistics to optimize query planning.
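A minimal sketch of this load-then-maintain flow; the bucket, table, and IAM role names are placeholders.

    -- Bulk-load Parquet files from S3, split in parallel across compute nodes.
    COPY sales
    FROM 's3://my-bucket/sales/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS PARQUET;

    -- Reclaim space from deleted rows and restore sort order.
    VACUUM sales;

    -- Refresh planner statistics so the optimizer sees accurate row counts.
    ANALYZE sales;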
What are the key techniques for performance tuning in Redshift?
Distribution keys: Minimize data movement across nodes.
Sort keys: Speed up range scans and aggregations.
Compression encodings: Reduce disk I/O and storage.
VACUUM & ANALYZE: Reorder data and refresh statistics.
Node choice: DC2 for local high‑performance storage; RA3 for separated compute/storage and dynamic scaling.
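A sketch of how these levers appear in table DDL (the column choices are hypothetical):

    -- DISTKEY co-locates rows sharing a customer_id on one node (cheaper joins);
    -- SORTKEY speeds range filters on sale_date; ENCODE sets per-column compression.
    CREATE TABLE sales (
      sale_id     BIGINT        ENCODE az64,
      customer_id BIGINT        ENCODE az64,
      sale_date   DATE          ENCODE az64,
      amount      DECIMAL(12,2) ENCODE az64,
      notes       VARCHAR(256)  ENCODE lzo
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date);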
What are common use cases for Redshift?
Enterprise BI dashboards and scheduled reporting
Customer analytics combining sales, marketing, and CRM data
Forecasting and trend analysis for supply chains and inventory
Regulatory compliance reporting with large data‑retention windows
Ad‑hoc analytics that join structured and semi‑structured feeds in a single SQL platform.
What is Redshift Serverless, and how does it differ from traditional Redshift?
Serverless Redshift auto‑provisions and scales compute resources on demand, eliminating manual cluster sizing. It pauses/resumes automatically, bills via Redshift Processing Units (RPUs) instead of fixed node‑hours, and decouples compute from storage more explicitly.
What are the benefits of Redshift Serverless for cost and scalability?
No idle compute costs: Automatic pause/resume means minimal charges when idle.
Elastic scaling: Handles unpredictable or spiky workloads seamlessly.
Pay‑per‑use billing: Aligns costs with actual query demand, reducing upfront provisioning risks.
How does Redshift Serverless preserve compatibility with classic Redshift?
It uses the same metadata and catalogs, maintains the same table/schema definitions, and preserves MPP parallelism and columnar storage optimizations, so existing workloads run unchanged.
What workloads are best suited for Redshift Serverless?
Ad‑hoc analytics with sporadic load
Occasional finance or marketing dashboards
Development and testing environments
Data science experiments without constant clusters
Event‑driven analytics and proof‑of‑concept BI integrations
What are the key architectural features of Redshift Serverless?
On‑demand orchestration of compute nodes by AWS
Explicit decoupling of compute and storage
Persistent columnar storage with MPP query execution
Unified metadata/catalog with classic Redshift
Transparent auto‑scaling and warm‑up handling
What is Redshift Spectrum, and what problem does it solve?
Spectrum extends Redshift SQL to query “cold” data stored in S3 via external tables—eliminating the need to load infrequently accessed or archival data into the warehouse, while still enabling joins with “hot” warehouse tables.
What are the best practices and trade-offs of Redshift Spectrum?
Offload infrequently accessed data to Spectrum to reduce warehouse storage costs.
Partition your external data (e.g., by date) to limit the amount of S3 scanned and cut query costs (see the sketch after this list).
Choose efficient formats (Parquet or ORC) so Spectrum can skip unneeded columns and rows.
Consider performance: external reads incur higher latency than local scans, so reserve Spectrum for truly cold or archival datasets.
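A hedged sketch of partition pruning, assuming an external schema named spectrum already exists and using hypothetical S3 paths:

    -- Partitioned external table over Parquet files in S3.
    CREATE EXTERNAL TABLE spectrum.sales_archive (
      sale_id     BIGINT,
      customer_id BIGINT,
      amount      DECIMAL(12,2)
    )
    PARTITIONED BY (sale_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales-archive/';

    -- Register one partition; queries filtering on sale_date scan only its files.
    ALTER TABLE spectrum.sales_archive
    ADD PARTITION (sale_date = '2024-01-01')
    LOCATION 's3://my-bucket/sales-archive/sale_date=2024-01-01/';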
What file formats and access methods does Spectrum support?
It reads Parquet, ORC, and text formats (e.g., CSV), using AWS Glue Data Catalog or a Hive Metastore for metadata. External schemas map tables to S3 locations containing these files.
How are internal and external data joined in Spectrum?
After creating an external schema (via Glue or a Hive metastore), you define external tables pointing to S3 data. Queries can then join these external tables with internal Redshift tables; Spectrum scans only the relevant partitions, pushes filters and partial aggregation down to the Spectrum layer, and returns intermediate results to the cluster for the final join and assembly.
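A minimal sketch of this schema-plus-join flow; the Glue database, IAM role ARN, and table names are placeholders, and spectrum.sales_archive is the external table sketched in the previous card.

    -- Map a Glue Data Catalog database into Redshift as an external schema.
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'sales_archive_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Join cold S3 data (external) with a hot internal Redshift table.
    SELECT c.region, SUM(a.amount) AS total_amount
    FROM spectrum.sales_archive AS a      -- external table over S3 Parquet
    JOIN customers AS c                   -- internal Redshift table
      ON c.customer_id = a.customer_id
    WHERE a.sale_date >= DATE '2023-01-01'
    GROUP BY c.region;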
What is a data lake, and how does it differ from a traditional data warehouse? What are the key components of the architecture?
Data Lake: Centralized repository for storing massive volumes of raw data in its native form—structured, semi‑structured, and unstructured—all in one location.
Difference: Unlike traditional data warehouses (which require schema‑on‑write and handle mainly structured data), data lakes use schema‑on‑read, accommodate all data types, and scale cheaply to petabytes.
Key Components:
Object Storage: e.g., Azure Data Lake Storage for low‑cost, scalable storage.
Data Movement: e.g., AWS Data Pipeline, Azure Data Factory to ingest and orchestrate data flows.
Schema Discovery & Catalog: e.g., AWS Glue, Azure Data Catalog to crawl, infer schemas, and register metadata.
SQL‑on‑Lake Engines: e.g., Presto/Trino, AWS Athena for interactive querying (example below).
Governance & Sharing: e.g., AWS Lake Formation, Azure Data Share for security, access control, and data sharing.
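As a schema-on-read illustration, an Athena-style external table simply overlays a schema on files already sitting in object storage; the bucket, table, and column names are hypothetical.

    -- The raw CSV files in S3 stay untouched; the schema is applied at query time.
    CREATE EXTERNAL TABLE raw_events (
      event_time STRING,
      user_id    STRING,
      event_type STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-data-lake/raw/events/';

    -- Interactive query over the raw files.
    SELECT event_type, COUNT(*) AS n
    FROM raw_events
    GROUP BY event_type;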