[MLS] Data Engineering Flashcards

(21 cards)

1
Q

What is the maximum object size in Amazon S3?

A

5TB

2
Q

What are the key characteristics of S3 as a data lake?

A

It offers virtually unlimited capacity with no provisioning, decouples storage from compute, provides a centralized architecture, and supports any file format

3
Q

Why is S3 data partitioning used?

A

To speed up range queries by letting query engines scan only the relevant prefixes. Data can be partitioned by date, product, or any other criterion that matches how it is queried
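
As a minimal sketch of date-based partitioning, the snippet below builds a Hive-style object key (`year=/month=/day=`). The helper name `partition_key`, the `sales` prefix, and the file name are hypothetical and chosen only for illustration; the point is that a query restricted to a date range only needs to read keys under the matching prefixes.

```python
from datetime import date


def partition_key(prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (hypothetical helper)."""
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Assumed example values; any prefix and file name would work the same way.
key = partition_key("sales", date(2024, 3, 7), "part-0000.parquet")
print(key)
```

Partitioning by product or another attribute follows the same pattern with a different `name=value` segment.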

4
Q

What are the main S3 storage classes?

A

S3 Standard, S3 Standard-IA, S3 One Zone-IA, S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, and S3 Intelligent-Tiering

5
Q

What is the durability and availability standard for S3?

A

Durability is eleven 9s (99.999999999%). Availability for S3 Standard is 99.99% (approximately 53 minutes of downtime per year)
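
The downtime figure can be checked with simple arithmetic. This is a small sketch, assuming a 365-day year; the function name `downtime_minutes_per_year` is made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# 99.99% availability leaves 0.01% of the year as permitted downtime.
print(round(downtime_minutes_per_year(99.99), 2))
```

For 99.99% this works out to roughly 52.6 minutes, which the card rounds to 53.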

6
Q

What are the main types of S3 access controls?

A

Bucket policies (resource-based, bucket-wide), Object ACLs (fine-grain control), and Bucket ACLs (less common)
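
To make the idea of a resource-based bucket policy concrete, here is a hedged sketch that builds a read-only policy document as a Python dictionary and prints it as JSON. The bucket name `example-bucket`, the account ID `123456789012`, and the helper `build_read_only_policy` are placeholders for illustration, not values from the deck.

```python
import json


def build_read_only_policy(bucket: str, account_id: str) -> dict:
    """Build a bucket policy allowing one account to read objects (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadFromOneAccount",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "s3:GetObject",
                # The trailing /* applies the statement to every object in the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# Placeholder bucket name and account ID.
policy = build_read_only_policy("example-bucket", "123456789012")
print(json.dumps(policy, indent=2))
```

The policy is attached to the bucket itself, which is why it applies bucket-wide, whereas an object ACL is set on an individual object.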

7
Q

What is the primary function of Kinesis Data Firehose?

A

To ingest massive amounts of streaming data in near real time and deliver it to destinations such as S3, Redshift, and OpenSearch

8
Q

What are the main capabilities of Managed Service for Apache Flink?

A

Automatic provisioning, parallelization, automatic scaling, real-time data processing, streaming ETL

9
Q

What are the key components of Kinesis Video Streams?

A

One producer per video stream, video playback capability, and consumers such as EC2, containers, SageMaker, and Rekognition

10
Q

What can Glue Crawlers do?

A

Infer schemas and partitions from data, work with various formats (JSON, Parquet, CSV) and sources (S3, Redshift, RDS)
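
The snippet below illustrates what "inferring partitions" means for a Hive-style S3 layout. It is a simplified stand-in written for this card, not Glue's actual crawler logic; the function name `infer_partitions` and the sample key are assumptions.

```python
def infer_partitions(key: str) -> dict[str, str]:
    """Extract name=value partition segments from an S3 key (illustrative only)."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key laid out by year, month, and day.
print(infer_partitions("sales/year=2024/month=03/day=07/part-0000.parquet"))
```

A real crawler also samples file contents to infer column names and types, which this sketch does not attempt.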

11
Q

What programming languages does Glue ETL support?

A

Python and Scala

12
Q

What are some example transformations available in Glue?

A

Filter, Join, Map, DropFields, DropNullFields, FindMatches (ML), and format conversions

13
Q

What is AWS Glue DataBrew?

A

A tool to clean and normalize data without coding, using over 250 pre-made transformations

14
Q

What are the main data stores used in ML on AWS?

A

Redshift, RDS, DynamoDB, S3, OpenSearch, ElastiCache

15
Q

How does AWS Data Pipeline differ from AWS Glue?

A

Data Pipeline is an orchestrator using EC2 instances in your account, while Glue focuses on ETL using Spark and manages the resources for you

16
Q

What is AWS Batch used for?

A

Running batch jobs packaged as Docker images. AWS provisions and manages the underlying instances for you, jobs can run on Spot Instances, and they can be orchestrated with Step Functions

17
Q

What are the key features of AWS Database Migration Service?

A

Supports homogeneous and heterogeneous migrations, keeps the source database available during migration, provides continuous data replication, and runs on an EC2 replication instance

18
Q

What is the maximum execution time for AWS Step Functions?

A

One year for Standard Workflows (Express Workflows are limited to five minutes)

19
Q

What is MQTT used for in AWS?

A

It’s an IoT protocol used for messaging, particularly for getting sensor data into ML models

20
Q

What does AWS DataSync require for on-premises connections?

A

A DataSync Agent deployed as a VM to connect to internal storage for replication

21
Q

What is AWS Data Pipeline?

A

A service that moves data from one place to another. It is only an orchestrator: it schedules the tasks, retries them, and sends notifications on failure