Week 10: Cloud Based Analytics Flashcards

(40 cards)

1
Q

What are column stores, and what are some examples of them?

A

Column stores (a.k.a. column‑oriented systems) store each attribute’s values together in columns rather than whole rows, dramatically reducing I/O for analytical queries.
Examples:

MonetDB

VectorWise (now Actian Vector)

C‑Store (now Vertica)

Sybase IQ
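
A minimal sketch of the layout difference, using plain Python containers (the table, column names, and values are made up for illustration):

```python
# Row store: each record keeps all of its attributes together.
rows = [
    (1, "2024-01-03", 19.99),
    (2, "2024-01-03", 5.00),
    (3, "2024-01-04", 42.50),
]

# Column store: each attribute's values are stored contiguously.
columns = {
    "id":     [1, 2, 3],
    "date":   ["2024-01-03", "2024-01-03", "2024-01-04"],
    "amount": [19.99, 5.00, 42.50],
}

# An analytical query (e.g., SUM over one attribute) only has to read
# the "amount" column in the columnar layout, not every full row.
print(sum(columns["amount"]))
```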

2
Q

What are some hardware optimizations for columnar storage?

A

Cache & SIMD
- Column reads bring in contiguous values, boosting CPU cache utilization.
- Vector (SIMD) instructions process many values in one CPU operation.

Compression
- Storing same‑type data together lowers entropy, yielding higher compression ratios.
- Engines can pick per‑column algorithms and automatically compress/decompress.
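
A small illustration of why contiguous, same-typed columns suit vectorized execution, using NumPy as a stand-in for a columnar engine (NumPy's reductions run over a contiguous buffer and can use SIMD internally; this is an analogy, not how any particular warehouse is implemented):

```python
import numpy as np

# One column of same-typed values stored contiguously in memory.
amounts = np.random.default_rng(0).random(1_000_000)

# Vectorized aggregation over the whole contiguous buffer:
# cache-friendly, and eligible for SIMD under the hood.
total = amounts.sum()

# Row-at-a-time style processing touches values one by one instead.
total_loop = 0.0
for v in amounts:
    total_loop += v

print(total, total_loop)
```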

3
Q

How do updates work in Columnar Stores?

A

Updating a single logical row requires touching every column file, so point updates are slow.

Many systems adopt an append‑only or immutable‑blocks approach (e.g., Google Dremel, Redshift).

Redshift: 1 MB blocks are immutable; on update it “clones‑on‑write” a block, so small and large writes incur similar cost.

4
Q

What are some modern column‑store file formats?

A

Modern column‑oriented file formats include:

Apache Parquet: open‑source Hadoop format used by Hive, Pig, Impala, Spark; stores nested structures in a flat column layout.

Apache ORC: “Optimized Row Columnar” format for Hive.

Others: RCFile (an earlier Hadoop columnar format); Apache Kudu and ClickHouse also use columnar storage internally.
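
A quick sketch of working with Parquet from Python (assumes pandas with the pyarrow engine installed; the file name and columns are illustrative):

```python
import pandas as pd

# Write a small table to Parquet.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "JP"],
    "clicks":  [10, 4, 7],
})
df.to_parquet("events.parquet", engine="pyarrow")

# Column projection: only the requested columns are read back,
# which is the core benefit of a columnar file format.
subset = pd.read_parquet("events.parquet", columns=["country", "clicks"])
print(subset)
```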

5
Q

What motivates modern data warehouse architecture? What technologies enabled it? What are its key features?

A

Motivation: Support efficient analytical queries over large, structured datasets for reporting, dashboarding, and long-term retention by minimizing I/O and scaling compute.

Enabling Technologies: Columnar storage formats, MPP (Massively Parallel Processing) engines, and cloud services that abstract hardware setup with optional serverless configurations.

Key Features: Columnar I/O reduction; parallel scanning and aggregation; automated maintenance (e.g., partition optimization); security (encryption at rest, role‑based access, auditing); pay‑per‑query billing based on bytes scanned or query time.

6
Q

What are some examples of columnar-based data warehouses?

A

Amazon Redshift: PostgreSQL‑derived, MPP cluster with leader & compute nodes; Spectrum for external data.

Google BigQuery: Fully serverless Dremel engine; pay‑per‑query (bytes processed); external tables over GCS/Drive.

Azure Synapse Analytics: Dedicated SQL Pools (MPP) + serverless SQL; hybrid access via Azure Data Lake Storage.

Snowflake: Multi‑cloud; decoupled compute (virtual warehouses) & storage; proprietary micro‑partitions; time travel.

Others: Teradata, IBM Netezza/Db2 Warehouse, Oracle Exadata—all leveraging MPP and columnar principles.

7
Q

How do different cloud-based data warehouse solutions like Redshift, BigQuery, Synapse, and Snowflake compare in design and cost model?

A

Redshift: Cluster‑based MPP (leader + compute nodes); provisioned clusters billed by size and uptime (with optional serverless mode).

BigQuery: Fully serverless; auto‑scales; pay‑per‑query billing based on bytes processed.

Synapse: Dedicated SQL Pools use reserved MPP compute; serverless SQL pools for on‑demand queries; billing by reserved capacity or per‑query.

Snowflake: Decoupled compute (billed per‑second for virtual warehouses) and storage (separate billing); auto‑suspend/resume for cost efficiency.

8
Q

What are the advantages of serverless and pay-per-query models in cloud warehouses?

A

Serverless: No dedicated cluster management; automatic scaling of compute; reduced operational overhead.

Pay‑per‑query: Charges based on actual usage (bytes scanned or execution time); avoids idle resource costs; aligns costs with workload.

9
Q

How do data warehouses integrate with broader cloud ecosystems and support advanced analytics like machine learning?

A

Redshift: Integrates with S3 and AWS Glue Data Catalog; Spectrum for external tables; offers Serverless configs.

BigQuery: External tables on Google Cloud Storage/Drive; BigQuery ML for in‑warehouse model training & inference.

Synapse: Connects to Azure Data Lake Storage; tight integration with Power BI and Azure ML; serverless SQL for ad‑hoc analytics.

Snowflake: Works across AWS/Azure/GCP; secure data sharing; time travel for historical snapshots aiding data science.

10
Q

What is the role of hybrid and lakehouse architectures in modern analytics?

A

Hybrid Architectures: Combine dedicated warehouse compute with data lake storage to balance performance and flexibility (e.g., Synapse + Azure Data Lake).

Lakehouse: Unifies warehouse‑style management with data lake agility, supporting BI, batch/stream processing, machine learning, and diverse data types in one platform.

11
Q

What is the architecture of Amazon Redshift, and how does it support large-scale analytics?

A

Redshift uses an MPP design with a leader node orchestrating parallel SQL execution across multiple compute nodes that store data in columnar format. This massively parallel processing and columnar I/O reduction enable fast analytical queries over very large datasets.

12
Q

How is data ingested and optimized in Redshift?

A

Data is bulk‑loaded via the COPY command from S3 or DynamoDB, with parallel streams splitting data across nodes. It supports CSV, JSON, Parquet, etc. Post‑load, VACUUM reclaims space and reorganizes rows, and ANALYZE updates table statistics to optimize query planning.
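
A minimal sketch of this load-and-maintain cycle from Python, assuming a Redshift cluster reachable over the PostgreSQL protocol via psycopg2; the host, credentials, table, bucket, and IAM role are placeholders:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# Bulk load from S3; Redshift splits the work across compute nodes.
cur.execute("""
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
""")

# Post-load maintenance: re-sort rows and reclaim space, then refresh
# the planner statistics.
cur.execute("VACUUM sales;")
cur.execute("ANALYZE sales;")
```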

13
Q

What are the key techniques for performance tuning in Redshift?

A

Distribution keys: Minimize data movement across nodes.

Sort keys: Speed up range scans and aggregations.

Compression encodings: Reduce disk I/O and storage.

VACUUM & ANALYZE: Reorder data and refresh statistics.

Node choice: DC2 for local high‑performance storage; RA3 for separated compute/storage and dynamic scaling.
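
A hypothetical DDL sketch showing how these knobs appear in a table definition (the table, columns, keys, and encodings are illustrative choices, not recommendations):

```python
# Redshift DDL kept as a Python string so it can be executed over any
# Redshift connection (e.g., the psycopg2 cursor from the COPY example).
CREATE_SALES = """
CREATE TABLE sales (
    sale_id     BIGINT        ENCODE az64,
    customer_id BIGINT        ENCODE az64,
    sale_date   DATE          ENCODE az64,
    region      VARCHAR(32)   ENCODE lzo,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows that join on customer_id
SORTKEY (sale_date);    -- speeds up date-range scans and aggregations
"""
print(CREATE_SALES)
```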

14
Q

What are common use cases for Redshift?

A

Enterprise BI dashboards and scheduled reporting

Customer analytics combining sales, marketing, and CRM data

Forecasting and trend analysis for supply chains and inventory

Regulatory compliance reporting with large data‑retention windows

Ad‑hoc analytics that join structured and semi‑structured feeds in a single SQL platform.

15
Q

What is Redshift Serverless, and how does it differ from traditional Redshift?

A

Serverless Redshift auto‑provisions and scales compute resources on demand, eliminating manual cluster sizing. It pauses/resumes automatically, bills via Redshift Processing Units (RPUs) instead of fixed node‑hours, and decouples compute from storage more explicitly.

16
Q

What are the benefits of Redshift Serverless for cost and scalability?

A

No idle compute costs: Automatic pause/resume means minimal charges when idle.

Elastic scaling: Handles unpredictable or spiky workloads seamlessly.

Pay‑per‑use billing: Aligns costs with actual query demand, reducing upfront provisioning risks.

17
Q

How does Redshift Serverless preserve compatibility with classic Redshift?

A

It uses the same metadata and catalogs, maintains the same table/schema definitions, and preserves MPP parallelism and columnar storage optimizations, so existing workloads run unchanged.

18
Q

What workloads are best suited for Redshift Serverless?

A

Ad‑hoc analytics with sporadic load

Occasional finance or marketing dashboards

Development and testing environments

Data science experiments without constant clusters

Event‑driven analytics and proof‑of‑concept BI integrations

19
Q

What are the key architectural features of Redshift Serverless?

A

On‑demand orchestration of compute nodes by AWS

Explicit decoupling of compute and storage

Persistent columnar storage with MPP query execution

Unified metadata/catalog with classic Redshift

Transparent auto‑scaling and warm‑up handling

20
Q

What is Redshift Spectrum, and what problem does it solve?

A

Spectrum extends Redshift SQL to query “cold” data stored in S3 via external tables—eliminating the need to load infrequently accessed or archival data into the warehouse, while still enabling joins with “hot” warehouse tables.

21
Q

What are the best practices and trade-offs of Redshift Spectrum?

A

Offload infrequently accessed data to Spectrum to reduce warehouse storage costs.

Partition your external data (e.g., by date) to limit the amount of S3 scanned and cut query costs.

Choose efficient formats (Parquet or ORC) so Spectrum can skip unneeded columns and rows.

Consider performance: external reads incur higher latency than local scans, so reserve Spectrum for truly cold or archival datasets.

22
Q

What file formats and access methods does Spectrum support?

A

It reads Parquet, ORC, and text formats (e.g., CSV), using AWS Glue Data Catalog or a Hive Metastore for metadata. External schemas map tables to S3 locations containing these files.

23
Q

How are internal and external data joined in Spectrum?

A

After creating an external schema (via Glue or Hive), you define external tables pointing to S3 data. Queries can then join these external tables with internal Redshift tables; Spectrum scans only the relevant partitions, applies filters and partial aggregations in its own layer, and returns intermediate results to the cluster for final assembly.
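
A hypothetical end‑to‑end sketch of that flow as Redshift SQL (kept in a Python string; the schema, table, S3 path, and IAM role are placeholders):

```python
SPECTRUM_SQL = """
-- Register an external schema backed by the Glue Data Catalog.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table mapping to partitioned Parquet files in S3.
CREATE EXTERNAL TABLE spectrum.sales_archive (
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
PARTITIONED BY (sale_date DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales-archive/';

-- Join cold S3 data with a hot internal table; the partition filter
-- limits how much S3 data Spectrum has to scan.
SELECT c.segment, SUM(a.amount) AS revenue
FROM spectrum.sales_archive a
JOIN customers c ON c.customer_id = a.customer_id
WHERE a.sale_date >= '2023-01-01'
GROUP BY c.segment;
"""
print(SPECTRUM_SQL)
```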

24
Q

What is a data lake, and how does it differ from a traditional data warehouse? What are the key components of the architecture?

A

Data Lake: Centralized repository for storing massive volumes of raw data in its native form—structured, semi‑structured, and unstructured—all in one location.

Difference: Unlike traditional data warehouses (which require schema‑on‑write and handle mainly structured data), data lakes use schema‑on‑read, accommodate all data types, and scale cheaply to petabytes.

Key Components:

Object Storage: e.g., Azure Data Lake Storage for low‑cost, scalable storage.

Data Movement: e.g., AWS Data Pipeline, Azure Data Factory to ingest and orchestrate data flows.

Schema Discovery & Catalog: e.g., AWS Glue, Azure Data Catalog to crawl, infer schemas, and register metadata.

SQL‑on‑Lake Engines: e.g., Apache Presto, AWS Athena for interactive querying.

Governance & Sharing: e.g., AWS Lake Formation, Azure Data Share for security, access control, and data sharing.

25

Q

How do services like AWS Glue, Azure Data Catalog, and Apache Atlas support schema discovery and metadata management?

A

AWS Glue Data Catalog: Crawls data sources (S3, RDS, DynamoDB), extracts metadata, infers schemas, and centralizes them for discovery.

AWS Glue ETL: Leverages the catalog to auto‑generate and run Spark‑based ETL jobs, offering built‑in and customizable transformations.

Azure Data Catalog: Provides a fully managed registry and search interface for enterprise data assets, enabling users to discover datasets and their schemas.

Apache Atlas: Crawls Hadoop‑based data stores to extract metadata and manage data classifications, lineage, and governance policies.

26

Q

What tools and engines are used to run SQL queries directly on data lakes without needing intermediary databases?

A

Apache Presto: In‑memory, distributed SQL engine optimized for fast, ad‑hoc analytics and star‑schema joins over large datasets.

AWS Athena: Managed, serverless Presto service offering pay‑per‑query SQL access directly over S3.

Azure Data Lake Analytics: Serverless U‑SQL engine that scales compute on demand over Azure storage.

Apache Spark SQL: SQL interface on Spark for batch and streaming queries on lake data.

IBM Cloud SQL Query: Serverless SQL engine for querying data in IBM Cloud Object Storage.

27

Q

How do cloud services automate data ingestion, discovery, transformation (ETL), and analytics in a data lake environment?

A

AWS Lake Formation: Orchestrates secure data ingestion, applies fine‑grained access controls, and helps curate governed data zones.

AWS Glue: Automates discovery with crawlers, builds a central data catalog, and generates/runs Spark ETL jobs serverlessly.

Query Automation: AWS Athena, Redshift Spectrum, and EMR (with Zeppelin or EMR Notebooks) automate interactive and batch analytics on lake data.

Azure Ecosystem: Azure Data Factory pipelines automate data movement; Azure Data Catalog manages discovery; Azure Data Lake Analytics and other services run on‑demand transformations and analytics.

28

Q

What is AWS Athena, and how does it enable serverless SQL querying directly on Amazon S3 data?

A

Athena is a serverless, interactive query service powered by Trino (formerly Presto). It lets you run standard SQL directly on data stored in Amazon S3 without provisioning any infrastructure. You pay only for the amount of data scanned, and results are written back to S3. It uses the Glue Data Catalog (or internal schemas) for table definitions and partitions.
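
One way a query round trip might look from Python with boto3 (the database, table, region, and results bucket are placeholders):

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; results land in the given S3 location.
qid = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS n FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "lake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes; billing is based on bytes of S3 data scanned.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```
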
29

Q

How does Athena’s architecture leverage Presto/Trino to achieve parallel and efficient query execution?

A

Coordinator‑Worker Model: A coordinator node parses SQL, creates a query plan, and schedules tasks. Worker nodes process data splits in parallel.

In‑Place S3 Reads: Workers read data directly from S3 (no data loading), splitting files into “splits” for distributed scanning.

Dynamic Scaling: Coordinator and workers are ephemeral, auto‑scaling with query concurrency.

Optimizations: Partition pruning skips irrelevant S3 paths, projection pushdown reads only requested columns, and columnar formats (Parquet/ORC) enable efficient block reads.

30

Q

What is AWS Glue, and how does it support data integration, cataloging, and serverless ETL pipelines?

A

Fully managed service for data integration, ETL, and metadata cataloging.

Automates schema detection via Glue Crawlers.

Stores table definitions, column schemas, data types, and file locations in the Glue Data Catalog.

Runs serverless Spark‑based ETL jobs (Python/Scala) with auto‑scaling executors.

Coordinates data flows between S3, Redshift, RDS, and other sources.

31

Q

How do Glue Crawlers and the Glue Data Catalog automate schema discovery and metadata management across services like Athena and Redshift Spectrum?

A

Glue Crawlers: Sample file headers/content to detect formats (CSV, JSON, Avro, Parquet, ORC); infer schemas; support custom classifiers; auto‑merge evolving schemas; schedule recurring crawls to discover new partitions/columns.

Glue Data Catalog: Centralized metadata store with versioned table definitions, schemas, partition keys, and S3 URIs; populated by crawlers or manual definitions; enables creation of external tables in Athena and Redshift Spectrum; supports cross‑account/region sharing and IAM‑based access control.
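
A sketch of driving a crawler from Python with boto3 (the crawler name, IAM role, database, S3 path, and schedule are placeholders):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler over an S3 prefix; it infers schemas and writes table
# definitions into the Glue Data Catalog, where Athena and Redshift
# Spectrum can then reference them as external tables.
glue.create_crawler(
    Name="sales-archive-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_lake",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales-archive/"}]},
    Schedule="cron(0 3 * * ? *)",  # recurring crawl to pick up new partitions
)
glue.start_crawler(Name="sales-archive-crawler")

# Inspect what the crawler registered in the catalog.
for table in glue.get_tables(DatabaseName="sales_lake")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```
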
32

Q

What is Spark SQL? What are its main properties and use cases?

A

Structured Data Processing: Built on top of Spark’s RDD abstraction with an explicit schema for columns, enabling query optimizations.

Data Sources: HDFS files, Hive tables, JSON, JDBC, etc.

API Flexibility: Write queries in SQL or switch seamlessly to Python/Java/Scala.

Strong Query Engine: Leverages the Catalyst optimizer and Tungsten execution for fast, in‑memory analytics.
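
A minimal PySpark sketch of the same query expressed through both the SQL and DataFrame interfaces (the S3 path and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Read structured data; Spark attaches an explicit column schema.
events = spark.read.json("s3a://my-bucket/events/")

# SQL interface over a temporary view ...
events.createOrReplaceTempView("events")
by_country = spark.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country")

# ... and the equivalent DataFrame API; both plans go through the
# Catalyst optimizer and Tungsten execution engine.
by_country_df = events.groupBy("country").count()

by_country.show()
```
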
33

Q

What are the differences and use cases for Datasets and DataFrames in Spark, and how do they support different languages?

A

Dataset: Typed, distributed collection of JVM objects. Combines RDD’s strong typing & lambda functions with Spark SQL’s optimized engine. Supported in Scala & Java (no native Dataset API in Python; Python achieves similar flexibility via dynamic row access).

DataFrame: A Dataset organized into named columns (analogous to relational tables or R/Python data frames). Can be built from structured files, Hive tables, external DBs, or existing RDDs. API available in Scala, Java, Python, and R.
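
A short DataFrame example in PySpark (values are illustrative); note that the typed Dataset API exists only in Scala and Java:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# A DataFrame is a distributed collection organized into named columns,
# here built from in-memory rows.
people = spark.createDataFrame([
    Row(name="Ada",  age=36),
    Row(name="Alan", age=41),
])

# Column-aware operations are planned by the Catalyst optimizer.
people.filter(people.age > 40).select("name").show()

# In Scala/Java the same data could be a typed Dataset[Person]; Python
# instead works with untyped Rows and dynamic column access.
```
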
34

Q

Serverless Analytics

A

Cloud‑based analytics services that operate without dedicated server provisioning, offering on‑demand scaling and pay‑per‑use pricing. Examples include Azure Analysis Services (built on SSAS with tabular models supporting partitions, perspectives, row‑level security, bi‑directional relationships, and translations), AWS Redshift Spectrum, and AWS Athena.

35

Q

Search-based Analytics

A

Analytics driven by full‑text search engines and log‑analysis stacks. Key components include the ELK Stack (Elasticsearch, Logstash, Kibana), AWS CloudSearch (Apache Solr‑based, MapReduce‑powered), and Azure Cognitive Search, enabling rapid exploration and visualization of text and log data.

36

Q

Big Data Analytics

A

Scalable data processing and analytics using open‑source frameworks and managed services. Covers batch workloads via Hadoop, Spark, and Hive on AWS EMR, Azure HDInsight, and Azure Databricks, as well as real‑time stream processing with AWS Kinesis Data Analytics, Amazon MSK, and Azure Stream Analytics.

37

Q

Graphical BI Tools

A

Interactive visualization platforms for business intelligence and dashboarding. Prominent examples are Tableau, AWS QuickSight, and Microsoft Power BI, which provide drag‑and‑drop interfaces to explore and present data insights visually.

38

Q

What is Amazon QuickSight, and how does its serverless architecture support interactive dashboards and visual analytics?

A

A cloud‑based BI and analytics service from AWS for building and sharing interactive dashboards, visualizations, and reports.

Integrates natively with AWS sources (Redshift, Athena, S3).

Serverless: no clusters to provision or manage; auto‑scales compute to meet user concurrency.

Underpinned by an in‑memory engine (SPICE) that delivers sub‑second interactions on loaded datasets.

39

Q

What are the differences between SPICE and Direct Query modes, and how do they impact performance and data freshness?

A

SPICE mode (Super‑fast, Parallel, In‑memory Calculation Engine):
- Data is imported into SPICE and queried in‑memory for low‑latency dashboards.
- Freshness depends on scheduled refresh jobs.

Direct Query mode:
- Queries are pushed live to the underlying data source (e.g., Redshift, Athena).
- Always returns up‑to‑date data.
- Performance is tied to source system capacity and can incur higher latency.

40

Q

How do QuickSight’s advanced features enhance business intelligence capabilities?

A

ML Insights: auto‑detects outliers, trends, and anomalies in your data.

Forecasting: projects future metrics based on historical patterns.

QuickSight Q: natural‑language querying for conversational analytics.

Embedding APIs: embed dashboards and visuals into external applications or portals.