Data Engineering - ML data repositories compared Flashcards

1
Q

What are the three characteristics of storage most relevant to ML?

A
  1. Cost
  2. Availability - how long it takes for data to be ready for processing
  3. Usability - whether the preferred ML and preprocessing tools can access the storage, and how quickly
2
Q

What does availability mean in relation to storage?

A

How long it takes for data to be ready for processing.

3
Q

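What does usability mean in relation to storage?

A

Whether the preferred ML and preprocessing tools can access the storage, and how quickly.
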
4
Q

Which 4 repositories can SageMaker accept data from?

A
  1. S3
  2. Amazon EFS
  3. Amazon FSx for Lustre
  4. EBS Volumes
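
As a rough illustration of how these repositories reach a training job, the sketch below builds input channel dictionaries in the shape used by the SageMaker `CreateTrainingJob` request (`InputDataConfig`). The helper names, bucket, file system ID and paths are hypothetical. EBS is left out on the assumption that it is the instance's own attached volume rather than a named channel.

```python
def s3_channel(name, s3_uri):
    """Build an input channel that reads every object under an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def file_system_channel(name, file_system_id, file_system_type, directory_path):
    """Build an input channel backed by EFS or FSx for Lustre (read-only)."""
    if file_system_type not in ("EFS", "FSxLustre"):
        raise ValueError(f"unsupported file system type: {file_system_type!r}")
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "FileSystemAccessMode": "ro",
                "DirectoryPath": directory_path,
            }
        },
    }


# Hypothetical identifiers, for illustration only.
channels = [
    s3_channel("train", "s3://example-bucket/train/"),
    file_system_channel("validation", "fs-0123456789abcdef0", "FSxLustre", "/fsx/validation"),
]
```

The resulting list would be passed as `InputDataConfig` when creating the training job.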
5
Q

Describe S3

A

S3 is an object data repository.

6
Q

How are files stored in S3?

A

Files are stored as single objects identified using a key
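
A small sketch of what "identified using a key" means in practice: an S3 URI splits into a bucket name and an object key, and anything that looks like a folder is just a prefix of that key. The bucket and key below are made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The key is everything after the bucket name, without the leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical object, for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/training/2024/images/cat-001.jpg")
prefix = key.rsplit("/", 1)[0] + "/"  # the "folder" is only a key prefix
```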

7
Q

What are the 4 advantages of S3?

A

Highly scalable, available, durable and low cost

8
Q

Name the two steps of the S3 lifecycle

A
  1. Transition
  2. Expiration
9
Q

What is the transition phase in the S3 lifecycle?

A

The process of moving datasets through storage classes with different characteristics, normally from highly available storage (S3 Standard) to cheaper storage (S3 Glacier) as the data gets older.

10
Q

What is the expiration phase in the S3 lifecycle?

A

Data is deleted after a certain period. Important for regulatory requirements.

11
Q

What is the order of S3 repositories during the transition phase?

A
  1. S3 Standard - regular access, highly available
  2. S3 Standard-IA - infrequent access, low value or easily recreated data
  3. S3 Glacier + S3 Glacier Deep Archive - long-term low-cost archiving
  4. Expire - delete data no longer needed or required by regulators.
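
The order above can be sketched as a single S3 lifecycle rule. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The function name, prefix and day counts are illustrative assumptions, not recommended values.

```python
def lifecycle_rule(prefix, ia_days=30, glacier_days=90,
                   deep_archive_days=365, expire_days=2555):
    """Build one lifecycle rule: Standard -> IA -> Glacier -> Deep Archive -> expire."""
    ages = [ia_days, glacier_days, deep_archive_days, expire_days]
    if ages != sorted(ages) or len(set(ages)) != len(ages):
        raise ValueError("object ages must strictly increase from step to step")
    return {
        "ID": f"age-out-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Transition phase: move objects to cheaper storage classes as they age.
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
        # Expiration phase: delete objects once they reach this age.
        "Expiration": {"Days": expire_days},
    }


config = {"Rules": [lifecycle_rule("raw/")]}
# Could then be applied with, for example:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=config)
```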
12
Q

Which S3 storage class would you use for general-purpose, regular access?

A

S3 Standard - for data that is regularly required and needs to be accessed instantly.

13
Q

Which S3 storage class would you use for unknown or changing access?

A

S3 Intelligent-Tiering - for data accessed in an unpredictable way.

14
Q

What does S3 Intelligent-Tiering do?

A

It automatically moves data between access tiers, from instant access to longer-term storage, depending on how recently the data was accessed.

15
Q

Which S3 storage class would you use for infrequent access?

A

S3 Standard-IA

16
Q

Which S3 storage classes would you use for archiving data?

A

S3 Glacier + S3 Glacier Deep Archive - long-term low-cost archiving of data

17
Q

Explain the usage of AWS Lake Formation

A

Used to rapidly set up a data lake with S3 as the data repository.

18
Q

What type of data can AWS Lake Formation store?

A

Structured + unstructured data at scale.

19
Q

What is AWS Lake Formation built on top of?

A

AWS Glue

20
Q

What are the steps during the setup of Lake Formation?

A
  1. Find the input data sources
  2. Set up the S3 data lake
  3. Move the data to the S3 lake
  4. Crawl the data to determine its structure and build a data catalogue
  5. Perform ETL
  6. Set up security to protect the data
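
Step 4 (crawling the data to build a catalogue) could be sketched as an AWS Glue crawler request. This assumes the crawl is done with a Glue crawler. The function name, crawler name, role ARN, database and path are hypothetical placeholders. The dictionary keys follow the boto3 Glue `create_crawler` parameters.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for a Glue crawler that scans one S3 path."""
    if not s3_path.startswith("s3://"):
        raise ValueError(f"expected an s3:// path, got {s3_path!r}")
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler runs as
        "DatabaseName": database,    # catalogue database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


# Hypothetical values, for illustration only.
request = crawler_request(
    name="example-lake-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_lake_db",
    s3_path="s3://example-lake-bucket/raw/",
)
# Could then be used as, for example:
# glue = boto3.client("glue")
# glue.create_crawler(**request)
# glue.start_crawler(Name=request["Name"])
```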
21
Q

Describe FSx for Lustre storage

A

A high-performance combination of S3 and SSD storage. Data is presented as files to the ML models so processing can start immediately, without having to wait for the data to be loaded from S3.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Give the five features of FSx for Lustre

A
  • high performance storage system
  • low latency
  • high throughput
  • high IOPS
  • multiple underlying storage types
23
Q

Explain the Machine Learning use case of Amazon FSx for Lustre

A

For serving massive training data to SageMaker. The file store is concurrent, so multiple compute instances can work on the data at the same time, and it integrates with SageMaker.

24
Q

Describe EBS Volumes

A

A virtual version of your computer's hard drive. Data is stored as files and fast access can be specified. The data can be backed up using snapshots and it's possible to set up RAID configurations.

25
Q

What are instances created by SageMaker for SageMaker notebooks?

A

EC2 instances with EBS volumes

26
Q

Describe EFS

A

The networked-drive counterpart of EBS. It is a managed network file system (NFS) that can be mounted by multiple compute instances at the same time, so they can all access the same data.

27
Q

Name the different versions of EFS

A

Standard EFS and EFS IA (infrequent access)

28
Q

Name the secondary data repositories in AWS

A

RDS, DynamoDB, Redshift, Redshift Spectrum, Timestream, DocumentDB

29
Q

Can a secondary data repository be directly ingested by SageMaker?

A

No, the data has to be moved to another repository first, for example S3.

30
Q

Describe RDS

A

Amazon Relational Database Service - makes it easy to set up, operate and scale relational databases. AWS takes care of most of the admin and maintenance.

31
Q

Which databases can RDS supply?

A

Open source (MySQL, PostgreSQL) and vendor-owned (Oracle, Microsoft SQL Server)

32
Q

Name four use cases of RDS

A
  • Data that is relational and structured
  • Data Warehouse
  • online transaction processing
  • Running relational joins and complex updates
33
Q

Describe DynamoDB

A

A NoSQL database where data is stored as key-value pairs. It treats each item within it as being composed of a list of attributes and values.
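
To make "a list of attributes and values" concrete, the sketch below converts a plain Python record into DynamoDB's low-level attribute-value notation, where each value is tagged with its type (`S`, `N`, `BOOL`, `L`, `M`, `NULL`). It is a simplified stand-in covering only those types. The record and function name are invented for illustration.

```python
def to_attribute_value(value):
    """Convert a Python value to DynamoDB's typed attribute-value form."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # checked before numbers: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are sent as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Hypothetical record, for illustration only.
record = {"device_id": "sensor-42", "reading": 21.5, "active": True,
          "tags": ["lab", "north"], "meta": {"firmware": "1.2"}}
item = {name: to_attribute_value(value) for name, value in record.items()}
```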

34
Q

Name the use cases of DynamoDB

A
  • non-relational database
  • Structured and less structured data
  • Storing JSON objects
35
Q

Describe Redshift

A

A fast, massively scalable data warehouse system. It stores structured data that can be accessed and manipulated using standard SQL.

36
Q

List the 3 use cases of Redshift

A
  • Data Warehouse
  • Structured relational data
  • complex analytical queries
37
Q

Name and describe two useful commands for Redshift

A

UNLOAD - used to save a table to a set of files on S3
COPY - used to take data from an S3 bucket and place it in a Redshift table
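
A minimal sketch of what the two commands look like, written as Python helpers that only build the SQL text. The table name, S3 locations and IAM role ARN are hypothetical, and the CSV and Parquet format choices are assumptions for the example.

```python
def copy_statement(table, s3_uri, iam_role_arn):
    """SQL that loads CSV files from S3 into a Redshift table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


def unload_statement(table, s3_prefix, iam_role_arn):
    """SQL that writes a table's rows to a set of Parquet files on S3."""
    return (
        f"UNLOAD ('SELECT * FROM {table}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical names, for illustration only. Plain string formatting is
# acceptable here only because none of the values come from untrusted input.
role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"
load_sql = copy_statement("sales", "s3://example-bucket/incoming/sales/", role)
export_sql = unload_statement("sales", "s3://example-bucket/export/sales_", role)
```

UNLOAD is the direction that matters for ML: it puts warehouse data on S3, where SageMaker can read it.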

38
Q

Describe Redshift Spectrum

A

Can be used for ad-hoc ETL. It uses the data catalogue to access raw data files in S3, so standard SQL queries can be used to clean and transform the data structure.

39
Q

Give the use cases of Redshift Spectrum

A
  • Data Lake
  • Semi-structured data
40
Q

Describe Amazon Timestream

A

A serverless database for storing time series data, e.g. log data or data from IoT devices. Data is stored and queried by time intervals. Accessing data is very fast.

41
Q

What would you use Amazon Timestream for?

A

To identify trends, patterns, and anomalies in time series data

42
Q

Describe DocumentDB

A

A repository optimised for storing and querying JSON documents. It is a MongoDB-compatible database hosted on AWS infrastructure and marketed by AWS as a way to migrate existing MongoDB instances on to AWS-managed infrastructure.

43
Q

State the use cases of DocumentDB

A
  • Used to migrate MongoDB to AWS
  • Store JSON documents
  • Non-relational data and less structured data
44
Q

Why is data in EBS and EFS so quickly available?

A

Because the data can be used directly without being moved

45
Q

Which repositories support all types of data structures? (structured, semi-structured and unstructured)

A

S3, FSx for Lustre, EBS and EFS

46
Q

Which repositories only support structured data?

A

RDS, Redshift and Timestream (schemaless)

47
Q

Which repositories support only semi-structured data?

A

Redshift Spectrum

48
Q

Which repositories support semi-structured + structured but not unstructured data?

A

DynamoDB + DocumentDB

49
Q

Which repositories only support structured and unstructured data?

A

Lake Formation

50
Q

What is the maximum amount of data an S3 bucket can hold?

A

unlimited

51
Q

What data stores are suitable for structured data?

A

Any RDS database, e.g. Oracle, Microsoft SQL Server, MySQL and PostgreSQL

52
Q

What types of data can DynamoDB support?

A

Data can be both structured and semi-structured

53
Q
A
54
Q

What does UNLOAD do?

A

Used to save a table to a set of files on S3.