Analytics Flashcards
What is AWS Glue?
A serverless ETL service for discovering, preparing, and combining data for analytics, machine learning, and application development.
What are the key components of AWS Glue?
Data Catalog, Crawler, ETL Engine, Studio, Data Quality, DataBrew, Workflows
What is the purpose of the Glue Data Catalog?
Stores table definitions and schemas for data located in various sources like S3, RDS, and DynamoDB.
What does a Glue Crawler do?
Scans data stores to infer schemas and create table definitions in the Data Catalog.
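The inference a crawler performs can be sketched in plain Python (this is an illustrative sketch, not the Glue Crawler API): sample rows, assign each column a type, and widen to string when types conflict.

```python
# Illustrative sketch of crawler-style schema inference (not the Glue API):
# infer a Glue/Hive type for each column by sampling rows.
def infer_schema(rows):
    """Return {column: type_name} inferred from a list of dicts."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            if isinstance(val, bool):       # check bool before int (bool subclasses int)
                t = "boolean"
            elif isinstance(val, int):
                t = "bigint"
            elif isinstance(val, float):
                t = "double"
            else:
                t = "string"
            # Widen to string when rows disagree on a column's type
            if schema.get(col, t) != t:
                t = "string"
            schema[col] = t
    return schema

rows = [
    {"id": 1, "price": 9.99, "name": "widget"},
    {"id": 2, "price": 4.50, "name": "gadget"},
]
print(infer_schema(rows))  # {'id': 'bigint', 'price': 'double', 'name': 'string'}
```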
What programming languages are supported for Glue ETL jobs?
Scala and Python
What is a DynamicFrame in AWS Glue?
A collection of self-describing DynamicRecords, similar to a Spark DataFrame but without requiring a schema up front, and with additional ETL-oriented transforms.
What are some common transformations in Glue ETL?
DropFields, DropNullFields, Filter, Join, Map, FindMatches ML, format conversions
How can you modify the Data Catalog using Glue ETL scripts?
By using options like enableUpdateCatalog, partitionKeys, and updateBehavior to add partitions, update schemas, or create new tables.
What are AWS Glue Development Endpoints?
Provisioned environments for interactively developing and testing Glue ETL scripts, typically accessed through a notebook.
What are Glue Job Bookmarks used for?
Persisting state from a job run to prevent reprocessing of old data and ensure incremental processing.
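The bookmark mechanism amounts to persisting a high-water mark between runs. A minimal sketch of that idea in plain Python (not the awsglue API; record shape and field names are hypothetical):

```python
# Illustrative sketch of job-bookmark semantics (not the awsglue API):
# keep a high-water mark between runs so only new records are processed.
def run_job(records, bookmark):
    """Process records newer than the bookmark; return (processed, new_bookmark)."""
    new = [r for r in records if r["ts"] > bookmark]
    new_bookmark = max((r["ts"] for r in new), default=bookmark)
    return new, new_bookmark

records = [{"ts": 1}, {"ts": 2}, {"ts": 3}]
first, bm = run_job(records, bookmark=0)    # first run processes everything
second, bm = run_job(records, bookmark=bm)  # re-run processes nothing
print(len(first), len(second))  # 3 0
```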
How are AWS Glue costs calculated?
Crawlers and ETL jobs are billed per second based on DPU usage, with additional charges for Data Catalog storage and requests beyond the free tier and for development endpoints.
What is AWS Glue Studio?
A visual interface for creating and managing ETL workflows without writing code.
What is the purpose of AWS Glue Data Quality?
To define and enforce data quality rules within Glue jobs using the Data Quality Definition Language (DQDL).
What is AWS Glue DataBrew?
A visual data preparation tool with over 250 pre-built transformations for cleaning and normalizing data.
What is AWS Lake Formation?
A service that simplifies the setup and management of secure data lakes.
What are Governed Tables in AWS Lake Formation?
A type of S3-backed table that supports ACID transactions and fine-grained access control.
What is Amazon Athena?
A serverless interactive query service that allows you to analyze data in S3 using standard SQL.
What data formats are supported by Athena?
CSV, TSV, JSON, ORC, Parquet, Avro, and various compression formats.
How is Amazon Athena priced?
Pay-as-you-go, based on the amount of data scanned by queries.
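A back-of-envelope estimator makes the pricing model concrete. The $5-per-TB rate and 10 MB per-query minimum used here reflect commonly published Athena pricing; verify both against the current AWS price list for your region.

```python
# Back-of-envelope Athena cost estimator.
# Assumptions: $5 per TB scanned, 10 MB minimum billed per query
# (check current regional pricing before relying on these numbers).
PRICE_PER_TB = 5.00
MIN_BYTES = 10 * 1024**2  # 10 MB minimum

def query_cost(bytes_scanned):
    billed = max(bytes_scanned, MIN_BYTES)
    return PRICE_PER_TB * billed / 1024**4

# Scanning 1 TB of raw CSV vs ~100 GB after converting to Parquet:
print(round(query_cost(1024**4), 2))        # 5.0
print(round(query_cost(100 * 1024**3), 2))  # 0.49
```

This is also why the optimization advice below (columnar formats, partitioning) translates directly into cost savings: Athena bills on bytes scanned, not on query runtime.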
What are Athena Workgroups?
A way to organize users, teams, and workloads in Athena, allowing for cost tracking and query access control.
How can you optimize Athena query performance?
Use columnar data formats, partition data effectively, and minimize the number of small files.
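Effective partitioning usually means a Hive-style key=value layout in S3, which Athena can prune when the table is partitioned on those keys. A small sketch (bucket and table names are placeholders):

```python
# Build a Hive-style partition path, the layout Athena prunes against
# when the table is partitioned on these keys. Names are placeholders.
def partition_path(bucket, table, **keys):
    parts = "/".join(f"{k}={v}" for k, v in keys.items())
    return f"s3://{bucket}/{table}/{parts}/"

print(partition_path("my-bucket", "events", year=2024, month="03", day="15"))
# s3://my-bucket/events/year=2024/month=03/day=15/
```

A query filtering on year, month, and day then reads only the matching prefixes instead of the whole table.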
What is the CREATE TABLE AS SELECT (CTAS) statement used for in Athena?
Creating a new table from the results of a query, including data format conversion.
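The general shape of such a statement looks like the following sketch, which converts query results to partitioned Parquet; the table, column, and S3 location names are placeholders:

```python
# Shape of an Athena CTAS statement converting results to partitioned
# Parquet. All table/column/location names below are hypothetical.
ctas = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/sales_parquet/',
    partitioned_by = ARRAY['year']
) AS
SELECT item, price, year  -- partition columns must come last in the SELECT
FROM sales_csv
""".strip()
print(ctas.splitlines()[0])  # CREATE TABLE sales_parquet
```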
What is Apache Spark?
A distributed processing framework for big data processing, known for its in-memory caching and optimized query execution.
What are the core components of Apache Spark?
Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX