Week 10: Cloud-Based Analytics Flashcards
(40 cards)
What are column stores, and what are some examples of them?
Column stores (a.k.a. column‑oriented systems) store each attribute’s values together in columns rather than whole rows, dramatically reducing I/O for analytical queries.
Examples:
MonetDB
VectorWise (now Actian Vector)
C‑Store (commercialized as Vertica)
Sybase IQ
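A tiny illustration of why this matters for analytics, using a hypothetical wide sales table: a query that touches one column lets a column store read only that column's blocks, while a row store must read every row in full.

    -- Hypothetical table: sales(sale_id, customer_id, sale_date, amount, notes, ...)
    -- A column store answers this by scanning only the "amount" column;
    -- a row store has to pull entire rows off disk.
    SELECT SUM(amount) AS total_revenue
    FROM sales;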
What are some hardware optimizations for columnar storage?
Cache & SIMD
- Column reads bring in contiguous values, boosting CPU cache utilization.
- Vector (SIMD) instructions process many values in one CPU operation.
Compression
- Storing same‑type data together lowers entropy, yielding higher compression ratios.
- Engines can pick per‑column algorithms and automatically compress/decompress.
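As a concrete, hedged example of per-column compression selection: Amazon Redshift (covered later in this deck) can sample an existing table and recommend an encoding for each column. The table name below is hypothetical.

    -- Sample the table and report a suggested compression encoding per column
    -- (e.g., AZ64 for numeric/date columns, LZO or ZSTD for long text).
    ANALYZE COMPRESSION sales;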
How do updates work in Columnar Stores?
Updating a single logical row requires touching every column file, so point updates are slow.
Many systems adopt an append‑only or immutable‑blocks approach (e.g., Google Dremel, Redshift).
Redshift: 1 MB blocks are immutable; an update clones the affected block and writes the new version (copy‑on‑write), so small and large writes incur similar cost.
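A sketch of what this looks like on Redshift, with a hypothetical sales table: an UPDATE is internally a delete plus an append into fresh blocks, so old row versions linger until VACUUM reclaims them.

    -- Point update: the old row is only marked deleted; the new version is
    -- written into cloned blocks rather than edited in place.
    UPDATE sales
    SET amount = 42.00
    WHERE sale_id = 1001;

    -- Reclaim space held by deleted rows and restore sort order.
    VACUUM sales;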
What are some column store file formats?
Modern column‑oriented file formats include:
Apache Parquet: open‑source Hadoop format used by Hive, Pig, Impala, Spark; stores nested structures in a flat column layout.
Apache ORC: “Optimized Row Columnar” format for Hive.
Others: RCFile (an earlier Hive format), plus columnar storage engines such as Apache Kudu and ClickHouse.
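A minimal HiveQL sketch of how the file format is chosen per table (table and column names are hypothetical):

    -- Store this Hive table as ORC; use STORED AS PARQUET for Parquet instead.
    CREATE TABLE web_logs (
      event_time TIMESTAMP,
      user_id    STRING,
      url        STRING
    )
    STORED AS ORC;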
What motivates modern data warehouse architecture? What technologies enabled it, and what key features does it provide?
Motivation: Support efficient analytical queries over large, structured datasets for reporting, dashboarding, and long-term retention by minimizing I/O and scaling compute.
Enabling Technologies: Columnar storage formats, MPP (Massively Parallel Processing) engines, and cloud services that abstract hardware setup with optional serverless configurations.
Key Features: Columnar I/O reduction; parallel scanning and aggregation; automated maintenance (e.g., partition optimization); security (encryption at rest, role‑based access, auditing); pay‑per‑query billing based on bytes scanned or query time.
What are some examples of columnar-based data warehouses?
Amazon Redshift: PostgreSQL‑derived, MPP cluster with leader & compute nodes; Spectrum for external data.
Google BigQuery: Fully serverless Dremel engine; pay‑per‑query (bytes processed); external tables over GCS/Drive.
Azure Synapse Analytics: Dedicated SQL Pools (MPP) + serverless SQL; hybrid access via Azure Data Lake Storage.
Snowflake: Multi‑cloud; decoupled compute (virtual warehouses) & storage; proprietary micro‑partitions; time travel.
Others: Teradata, IBM Netezza/Db2 Warehouse, Oracle Exadata—all leveraging MPP and columnar principles.
How do different cloud-based data warehouse solutions like Redshift, BigQuery, Synapse, and Snowflake compare in design and cost model?
Redshift: Cluster‑based MPP (leader + compute nodes); provisioned clusters billed by size and uptime (with optional serverless mode).
BigQuery: Fully serverless; auto‑scales; pay‑per‑query billing based on bytes processed.
Synapse: Dedicated SQL Pools use reserved MPP compute; serverless SQL pools for on‑demand queries; billing by reserved capacity or per‑query.
Snowflake: Decoupled compute (billed per‑second for virtual warehouses) and storage (separate billing); auto‑suspend/resume for cost efficiency.
What are the advantages of serverless and pay-per-query models in cloud warehouses?
Serverless: No dedicated cluster management; automatic scaling of compute; reduced operational overhead.
Pay‑per‑query: Charges based on actual usage (bytes scanned or execution time); avoids idle resource costs; aligns costs with workload.
How do data warehouses integrate with broader cloud ecosystems and support advanced analytics like machine learning?
Redshift: Integrates with S3 and AWS Glue Data Catalog; Spectrum for external tables; offers Serverless configs.
BigQuery: External tables on Google Cloud Storage/Drive; BigQuery ML for in‑warehouse model training & inference (see the sketch after this list).
Synapse: Connects to Azure Data Lake Storage; tight integration with Power BI and Azure ML; serverless SQL for ad‑hoc analytics.
Snowflake: Works across AWS/Azure/GCP; secure data sharing; time travel for historical snapshots aiding data science.
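For example, BigQuery ML lets you train and apply a model with plain SQL inside the warehouse; the dataset, table, and column names below are hypothetical.

    -- Train a logistic-regression churn model directly in BigQuery ...
    CREATE OR REPLACE MODEL `analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT churned, tenure_months, monthly_spend
    FROM `analytics.customers`;

    -- ... then run in-warehouse inference with ML.PREDICT.
    SELECT *
    FROM ML.PREDICT(MODEL `analytics.churn_model`,
                    TABLE `analytics.new_customers`);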
What is the role of hybrid and lakehouse architectures in modern analytics?
Hybrid Architectures: Combine dedicated warehouse compute with data lake storage to balance performance and flexibility (e.g., Synapse + Azure Data Lake).
Lakehouse: Unifies warehouse‑style management with data lake agility, supporting BI, batch/stream processing, machine learning, and diverse data types in one platform.
What is the architecture of Amazon Redshift, and how does it support large-scale analytics?
Redshift uses an MPP design with a leader node orchestrating parallel SQL execution across multiple compute nodes that store data in columnar format. This massively parallel processing and columnar I/O reduction enable fast analytical queries over very large datasets.
How is data ingested and optimized in Redshift?
Data is bulk‑loaded via the COPY command from S3 or DynamoDB, with parallel streams splitting data across nodes. It supports CSV, JSON, Parquet, etc. Post‑load, VACUUM reclaims space and reorganizes rows, and ANALYZE updates table statistics to optimize query planning.
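A minimal sketch of this load-then-maintain flow; the bucket, table, and IAM role names are placeholders.

    -- Bulk-load Parquet files from S3, split in parallel across compute nodes.
    COPY sales
    FROM 's3://my-bucket/sales/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS PARQUET;

    -- Reclaim space from deleted rows and restore sort order.
    VACUUM sales;

    -- Refresh planner statistics so the optimizer sees accurate row counts.
    ANALYZE sales;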
What are the key techniques for performance tuning in Redshift?
Distribution keys: Minimize data movement across nodes.
Sort keys: Speed up range scans and aggregations.
Compression encodings: Reduce disk I/O and storage.
VACUUM & ANALYZE: Reorder data and refresh statistics.
Node choice: DC2 for local high‑performance storage; RA3 for separated compute/storage and dynamic scaling.
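A sketch of how these levers appear in table DDL (the column choices are hypothetical):

    -- DISTKEY co-locates rows sharing a customer_id on one node (cheaper joins);
    -- SORTKEY speeds range filters on sale_date; ENCODE sets per-column compression.
    CREATE TABLE sales (
      sale_id     BIGINT        ENCODE az64,
      customer_id BIGINT        ENCODE az64,
      sale_date   DATE          ENCODE az64,
      amount      DECIMAL(12,2) ENCODE az64,
      notes       VARCHAR(256)  ENCODE lzo
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date);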
What are common use cases for Redshift?
Enterprise BI dashboards and scheduled reporting
Customer analytics combining sales, marketing, and CRM data
Forecasting and trend analysis for supply chains and inventory
Regulatory compliance reporting with large data‑retention windows
Ad‑hoc analytics that join structured and semi‑structured feeds in a single SQL platform.
What is Redshift Serverless, and how does it differ from traditional Redshift?
Serverless Redshift auto‑provisions and scales compute resources on demand, eliminating manual cluster sizing. It pauses/resumes automatically, bills via Redshift Processing Units (RPUs) instead of fixed node‑hours, and decouples compute from storage more explicitly.
What are the benefits of Redshift Serverless for cost and scalability?
No idle compute costs: Automatic pause/resume means minimal charges when idle.
Elastic scaling: Handles unpredictable or spiky workloads seamlessly.
Pay‑per‑use billing: Aligns costs with actual query demand, reducing upfront provisioning risks.
How does Redshift Serverless preserve compatibility with classic Redshift?
It uses the same metadata and catalogs, maintains the same table/schema definitions, and preserves MPP parallelism and columnar storage optimizations, so existing workloads run unchanged.
What workloads are best suited for Redshift Serverless?
Ad‑hoc analytics with sporadic load
Occasional finance or marketing dashboards
Development and testing environments
Data science experiments without constant clusters
Event‑driven analytics and proof‑of‑concept BI integrations
What are the key architectural features of Redshift Serverless?
On‑demand orchestration of compute nodes by AWS
Explicit decoupling of compute and storage
Persistent columnar storage with MPP query execution
Unified metadata/catalog with classic Redshift
Transparent auto‑scaling and warm‑up handling
What is Redshift Spectrum, and what problem does it solve?
Spectrum extends Redshift SQL to query “cold” data stored in S3 via external tables—eliminating the need to load infrequently accessed or archival data into the warehouse, while still enabling joins with “hot” warehouse tables.
What are the best practices and trade-offs of Redshift Spectrum?
Offload infrequently accessed data to Spectrum to reduce warehouse storage costs.
Partition your external data (e.g., by date) to limit the amount of S3 scanned and cut query costs (see the sketch after this list).
Choose efficient formats (Parquet or ORC) so Spectrum can skip unneeded columns and rows.
Consider performance: external reads incur higher latency than local scans, so reserve Spectrum for truly cold or archival datasets.
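A hedged sketch of partition pruning, assuming an external schema named spectrum already exists and using hypothetical S3 paths:

    -- Partitioned external table over Parquet files in S3.
    CREATE EXTERNAL TABLE spectrum.sales_archive (
      sale_id     BIGINT,
      customer_id BIGINT,
      amount      DECIMAL(12,2)
    )
    PARTITIONED BY (sale_date DATE)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales-archive/';

    -- Register one partition; queries filtering on sale_date scan only its files.
    ALTER TABLE spectrum.sales_archive
    ADD PARTITION (sale_date = '2024-01-01')
    LOCATION 's3://my-bucket/sales-archive/sale_date=2024-01-01/';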
What file formats and access methods does Spectrum support?
It reads Parquet, ORC, and text formats (e.g., CSV), using AWS Glue Data Catalog or a Hive Metastore for metadata. External schemas map tables to S3 locations containing these files.
How are internal and external data joined in Spectrum?
After creating an external schema (via Glue or a Hive metastore), you define external tables pointing to S3 data. Queries can then join these external tables with internal Redshift tables; Spectrum scans only the relevant partitions, pushes filters and partial aggregation down to the Spectrum layer, and returns intermediate results to the cluster for the final join and assembly.
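A minimal sketch of this schema-plus-join flow; the Glue database, IAM role ARN, and table names are placeholders, and spectrum.sales_archive is the external table sketched in the previous card.

    -- Map a Glue Data Catalog database into Redshift as an external schema.
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'sales_archive_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Join cold S3 data (external) with a hot internal Redshift table.
    SELECT c.region, SUM(a.amount) AS total_amount
    FROM spectrum.sales_archive AS a      -- external table over S3 Parquet
    JOIN customers AS c                   -- internal Redshift table
      ON c.customer_id = a.customer_id
    WHERE a.sale_date >= DATE '2023-01-01'
    GROUP BY c.region;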
What is a data lake, and how does it differ from a traditional data warehouse? What are the key components of the architecture?
Data Lake: Centralized repository for storing massive volumes of raw data in its native form—structured, semi‑structured, and unstructured—all in one location.
Difference: Unlike traditional data warehouses (which require schema‑on‑write and handle mainly structured data), data lakes use schema‑on‑read, accommodate all data types, and scale cheaply to petabytes.
Key Components:
Object Storage: e.g., Azure Data Lake Storage for low‑cost, scalable storage.
Data Movement: e.g., AWS Data Pipeline, Azure Data Factory to ingest and orchestrate data flows.
Schema Discovery & Catalog: e.g., AWS Glue, Azure Data Catalog to crawl, infer schemas, and register metadata.
SQL‑on‑Lake Engines: e.g., Presto/Trino, AWS Athena for interactive querying (example below).
Governance & Sharing: e.g., AWS Lake Formation, Azure Data Share for security, access control, and data sharing.
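As a schema-on-read illustration, an Athena-style external table simply overlays a schema on files already sitting in object storage; the bucket, table, and column names are hypothetical.

    -- The raw CSV files in S3 stay untouched; the schema is applied at query time.
    CREATE EXTERNAL TABLE raw_events (
      event_time STRING,
      user_id    STRING,
      event_type STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-data-lake/raw/events/';

    -- Interactive query over the raw files.
    SELECT event_type, COUNT(*) AS n
    FROM raw_events
    GROUP BY event_type;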