[MLS] Data Engineering Flashcards

Question 1

Q

What is the maximum object size in Amazon S3?

Question 2

Q

What are the key characteristics of S3 as a data lake?

Answer

A

It has ‘infinite size’, doesn’t require provisioning, decouples storage from compute, is centralized, and supports any file format

Question 3

Q

Why is S3 data partitioning used?

Answer

A

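To speed up queries that filter on a range of values, since only the matching partitions need to be scanned. Data can be partitioned by date, product, or any other criterion you commonly filter on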
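As a rough illustration of date partitioning, the sketch below builds Hive-style key prefixes (`year=/month=/day=`) for a range of days. The dataset name `sales` and the helper names are hypothetical, and the exact layout is an assumption; any consistent scheme that matches your query filters would serve the same purpose.

```python
from datetime import date, timedelta


def partition_prefix(dataset: str, day: date) -> str:
    """Return a Hive-style key prefix for one day of data (hypothetical layout)."""
    return f"{dataset}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prefixes_for_range(dataset: str, start: date, end: date) -> list[str]:
    """List the prefixes a range query would need to read, inclusive of both ends."""
    days = (end - start).days
    return [partition_prefix(dataset, start + timedelta(days=i)) for i in range(days + 1)]


if __name__ == "__main__":
    # A three-day query touches three prefixes instead of the whole dataset.
    for prefix in prefixes_for_range("sales", date(2024, 1, 14), date(2024, 1, 16)):
        print(prefix)
```

A query restricted to those three days only has to read objects under the three listed prefixes, which is where the speed-up comes from.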

Question 4

Q

What are the main S3 storage classes?

Answer

A

S3 Standard, S3 Standard-IA, S3 One Zone-IA, S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, and S3 Intelligent-Tiering

Question 5

Q

What is the durability and availability standard for S3?

Answer

A

Durability is 11 9s (99.999999999%), Availability is 99.99% (approximately 53 minutes downtime per year)
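The "53 minutes" figure can be checked with a little arithmetic. This is a minimal sketch assuming a 365-day year; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into allowed downtime per year."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # 99.99% availability leaves 0.01% of the year as allowed downtime.
    print(round(downtime_minutes_per_year(99.99), 2))  # about 52.56 minutes
```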

Question 6

Q

What are the main types of S3 access controls?

Answer

A

Bucket policies (resource-based, bucket-wide), Object ACLs (fine-grained control), and Bucket ACLs (less common)
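To make "resource-based, bucket-wide" concrete, here is a small sketch that assembles a read-only bucket policy as a Python dictionary and prints it as JSON. The bucket name, account ID, role name, and helper function are all hypothetical placeholders; only the overall policy shape (`Version`, `Statement`, `Effect`, `Principal`, `Action`, `Resource`) follows the standard IAM policy document format.

```python
import json


def read_only_bucket_policy(bucket: str, principal_arn: str) -> dict:
    """Build a bucket policy allowing one principal to read every object in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                # The policy is attached to the bucket and covers all of its objects.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder bucket and role ARN, for illustration only.
    policy = read_only_bucket_policy(
        "example-bucket", "arn:aws:iam::123456789012:role/example-role"
    )
    print(json.dumps(policy, indent=2))
```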

Question 7

Q

What is the primary function of Kinesis Data Firehose?

Answer

A

To ingest massive amounts of data in near-real-time

Question 8

Q

What are the main capabilities of Managed Service for Apache Flink?

Answer

A

Automatic provisioning, parallelization, automatic scaling, real-time data processing, streaming ETL

Question 9

Q

What are the key components of Kinesis Video Streams?

Answer

A

Supports one producer per video stream, provides video playback, can have consumers like EC2, containers, SageMaker, and Rekognition

Question 10

Q

What can Glue Crawlers do?

Answer

A

Infer schemas and partitions from data, work with various formats (JSON, Parquet, CSV) and sources (S3, Redshift, RDS)

Question 11

Q

What programming languages does Glue ETL support?

Answer

A

Python and Scala

Question 12

Q

What are some example transformations available in Glue?

Answer

A

Filter, Join, Map, DropFields, DropNullFields, FindMatches ML, format conversions

Question 13

Q

What is AWS Glue DataBrew?

Answer

A

A tool to clean and normalize data without coding, using over 250 pre-made transformations

Question 14

Q

What are the main data stores used in ML on AWS?

Answer

A

Redshift, RDS, DynamoDB, S3, OpenSearch, ElastiCache

Question 15

Q

How does AWS Data Pipeline differ from AWS Glue?

Answer

A

Data Pipeline is an orchestrator using EC2 instances in your account, while Glue focuses on ETL using Spark and manages the resources for you

Question 16

Q

What is AWS Batch used for?

Answer

A

Running batch jobs from Docker images. It is serverless, can run on Spot Instances, and can be orchestrated with Step Functions

Question 17

Q

What are the key features of AWS Database Migration Service?

Answer

A

Supports homogeneous and heterogeneous migrations, keeps source database available during migration, provides continuous data replication, runs on EC2