Data Engineering - ML data repositories compared Flashcards by Annette Reid

What are the three characteristics of storage most relevant to ML?

Cost
Availability - how quickly the data is ready for processing
Usability - whether the preferred ML and preprocessing tools can access the storage, and how quickly

What does availability mean in relation to storage?

How long it takes for data to be ready for processing.

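What does usability mean in relation to storage?

Whether the preferred ML and preprocessing tools can access the storage, and how quickly.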

Which 4 repositories can SageMaker accept data from?

S3
Amazon EFS
Amazon FSx for Lustre
EBS Volumes

Describe S3

S3 is an object data repository.

How are files stored in S3?

Files are stored as single objects, each identified by a key.
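
A minimal sketch of what "identified by a key" means in practice: an object's address is just a bucket name plus a key, and any apparent folders are only prefixes inside that key. The helper `split_s3_uri` and the bucket and key names are hypothetical, chosen for illustration.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything after the first "/" is the key, slashes included.
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key


# Hypothetical bucket and key, for illustration only.
bucket, key = split_s3_uri("s3://my-ml-bucket/train/2024/data.csv")
print(bucket)  # my-ml-bucket
print(key)     # train/2024/data.csv
```

The whole "path" after the bucket name is one key; S3 has no real directory tree.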

What are the 4 advantages of S3?

Highly scalable, available, durable and low cost

Name the two steps of the S3 lifecycle

Transition
Expiration

What is the transition phase in the S3 lifecycle?

The process of moving datasets between storage classes with different characteristics, normally from highly available storage (S3 Standard) to cheaper storage (S3 Glacier) as the data gets older.

What is the expiration phase in the S3 lifecycle?

Data is deleted after a certain period. Important for regulatory requirements.

What is the order of S3 storage classes during the transition phase?

S3 Standard - regular access, highly available
S3 Standard-IA - infrequent access, low-value or easily recreated data
S3 Glacier and S3 Glacier Deep Archive - long-term, low-cost archiving
Expire - delete data that is no longer needed or no longer required by regulators
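
The transition and expiration steps can be sketched as a single lifecycle rule. This is an illustrative configuration, not a recommended policy: the rule ID, the prefix and every day count are assumptions, and `apply_lifecycle` needs boto3 and AWS credentials to run.

```python
def build_lifecycle_config(prefix: str) -> dict:
    """Build an S3 lifecycle configuration with transition and expiration."""
    return {
        "Rules": [
            {
                "ID": "age-out-training-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Transition: move objects to cheaper classes as they age.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Expiration: delete objects after a fixed period.
                "Expiration": {"Days": 365},
            }
        ]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Apply the configuration to a bucket (needs boto3 and credentials)."""
    import boto3  # imported here so the config can be built without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = build_lifecycle_config("raw/")
```

Building the dictionary separately from the API call keeps the rule easy to inspect before it is applied to a real bucket.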

Which S3 storage class would you use for general-purpose, regular access?

S3 Standard - for data that is regularly required and needs to be accessed instantly.

Which S3 storage class would you use for unknown or changing access?

S3 Intelligent-Tiering - for data accessed in an unpredictable way.

What does S3 Intelligent-Tiering do?

It automatically moves data between instant-access and longer-term storage tiers, depending on when the data was last accessed.

Which S3 storage class would you use for infrequent access?

S3 Standard-IA

Which S3 storage class would you use for archiving data?

S3 Glacier and S3 Glacier Deep Archive - long-term, low-cost archiving of data
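
The storage class cards above can be summarised as a small lookup. This is a sketch: the access-pattern labels are hypothetical names invented here, while the values are the `StorageClass` strings the S3 API accepts.

```python
def choose_storage_class(access: str) -> str:
    """Map an access pattern (hypothetical labels) to an S3 StorageClass."""
    classes = {
        "regular": "STANDARD",             # general purpose, instant access
        "unknown": "INTELLIGENT_TIERING",  # unpredictable or changing access
        "infrequent": "STANDARD_IA",       # infrequent access
        "archive": "GLACIER",              # long-term archiving
        "deep-archive": "DEEP_ARCHIVE",    # lowest-cost archiving
    }
    try:
        return classes[access]
    except KeyError:
        raise ValueError(f"unknown access pattern: {access!r}") from None


def upload(bucket: str, key: str, body: bytes, access: str) -> None:
    """Upload an object in the chosen class (needs boto3 and credentials)."""
    import boto3  # imported here so the lookup works without boto3

    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=body,
        StorageClass=choose_storage_class(access),
    )


print(choose_storage_class("infrequent"))  # STANDARD_IA
```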

Explain the use of AWS Lake Formation

Used to rapidly set up a data lake with S3 as the data repository.

What type of data can AWS Lake Formation store?

Structured and unstructured data at scale.

What is AWS Lake Formation built on top of?

AWS Glue

What are the steps during the setup of Lake Formation?

Find the input data sources
Set up the S3 data lake
Move the data to the S3 lake
Crawl the data to determine its structure and build a data catalogue
Perform ETL
Set up security to protect the data
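
A rough sketch of the register, crawl and secure steps using boto3-style clients. It makes several assumptions: the clients are passed in (for example `boto3.client("glue")` and `boto3.client("lakeformation")`), the crawler name is hypothetical, all ARNs and paths are placeholders supplied by the caller, and the data movement and ETL steps are left out.

```python
def set_up_data_lake(glue, lakeformation, bucket_arn, s3_path,
                     database, crawler_role_arn, analyst_arn):
    """Register an S3 location, crawl it, and grant access to a principal."""
    crawler_name = "lake-crawler"  # hypothetical crawler name

    # Register the S3 location as data lake storage.
    lakeformation.register_resource(
        ResourceArn=bucket_arn, UseServiceLinkedRole=True
    )
    # Create the catalogue database the crawler will write tables into.
    glue.create_database(DatabaseInput={"Name": database})
    # Crawl the data to infer its structure and build the data catalogue.
    glue.create_crawler(
        Name=crawler_name,
        Role=crawler_role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    # The crawl runs asynchronously; this call returns before it finishes.
    glue.start_crawler(Name=crawler_name)
    # Security: let a principal see the catalogue database.
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": analyst_arn},
        Resource={"Database": {"Name": database}},
        Permissions=["DESCRIBE"],
    )
```

Passing the clients in as arguments keeps the function free of credentials and makes it possible to exercise the call sequence with stand-in objects.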

Describe FSx for Lustre storage

A high-performance combination of S3 and SSD storage. Data is presented to the ML models as files, so processing can start immediately without waiting for the data to be loaded from S3.

Give the five features of FSx for Lustre

high performance storage system
low latency
high throughput
high IOPS
multiple underlying storage types

Explain the Machine Learning use case of Amazon FSx for Lustre

For serving massive training datasets to SageMaker. The file system supports concurrent access, so multiple compute instances can work on the data at the same time. It integrates with SageMaker.
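
As a sketch of that integration, a training job's input channels can point either at S3 or at an FSx for Lustre file system. The dictionaries follow the shape of the `InputDataConfig` parameter of the SageMaker `CreateTrainingJob` API as I understand it; the helper names, channel names, file system ID and paths are hypothetical.

```python
def s3_channel(name: str, s3_uri: str) -> dict:
    """Describe a training channel that reads from an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def fsx_channel(name: str, file_system_id: str, directory_path: str) -> dict:
    """Describe a training channel that reads from FSx for Lustre."""
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro",  # read-only is enough for training
                "DirectoryPath": directory_path,
            }
        },
    }


# Placeholder identifiers, for illustration only.
channels = [
    s3_channel("validation", "s3://my-ml-bucket/validation/"),
    fsx_channel("train", "fs-0123456789abcdef0", "/fsx/train"),
]
```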

Describe EBS Volumes

A virtual version of your computer's hard drive. Data is stored as files and fast access can be specified. The data can be backed up using snapshots, and it's possible to set up RAID configurations.

What instances does SageMaker create for SageMaker notebooks?

EC2 instances with EBS volumes

Describe EFS

The networked-drive counterpart of EBS. It behaves like multiple drives networked together, so the data can be accessed by multiple compute instances at the same time.

Name the different versions of EFS

Standard EFS and EFS IA (infrequent access)

Name the secondary data repositories in AWS

RDS, DynamoDB, Redshift, Redshift Spectrum, Timestream, DocumentDB

Can a secondary data repository be directly ingested by SageMaker?

No, the data has to be moved to another repository first, for example S3.
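
A minimal sketch of that move: query a table, serialise the rows as CSV, and upload the result to S3. The standard-library `sqlite3` module stands in for a real relational database such as one hosted on RDS, and the table, bucket and key names are hypothetical. `upload_csv` needs boto3 and AWS credentials.

```python
import csv
import io
import sqlite3


def table_to_csv(conn: sqlite3.Connection, query: str) -> str:
    """Run a query and return the result as CSV text with a header row."""
    cursor = conn.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


def upload_csv(bucket: str, key: str, body: str) -> None:
    """Upload CSV text to S3 (needs boto3 and credentials)."""
    import boto3  # imported here so the export can run without boto3

    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=body.encode("utf-8")
    )


# sqlite3 is only a stand-in for the source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
csv_text = table_to_csv(conn, "SELECT id, amount FROM sales ORDER BY id")
```

Once the file is in S3, SageMaker can read it like any other training input.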

Describe RDS

Amazon Relational Database Service - makes it easy to set up, operate and scale relational databases. AWS takes care of most of the admin and maintenance.

Which databases can RDS supply?

Open source (MySQL, PostgreSQL) and vendor-owned (Oracle, Microsoft SQL Server)

Name four use cases of RDS

- Data that is relational and structured
- Data warehouse
- Online transaction processing
- Running relational joins and complex updates

Describe DynamoDB

A NoSQL database where data is stored as key-value pairs. It treats each item stored in it as a list of attributes and values.
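
To illustrate "a list of attributes and values", the sketch below converts a plain Python record into the typed attribute-value form used by the low-level DynamoDB API. The helper `to_dynamodb_item` is hypothetical and handles only strings, numbers and booleans; the record fields are invented for the example.

```python
def to_dynamodb_item(record: dict) -> dict:
    """Convert a flat record into DynamoDB's typed attribute-value format."""
    item = {}
    for name, value in record.items():
        if isinstance(value, bool):  # check bool first: bool subclasses int
            item[name] = {"BOOL": value}
        elif isinstance(value, (int, float)):
            item[name] = {"N": str(value)}  # numbers are sent as strings
        elif isinstance(value, str):
            item[name] = {"S": value}
        else:
            raise TypeError(f"unsupported type for {name!r}: {type(value)}")
    return item


item = to_dynamodb_item({"device_id": "sensor-1", "reading": 21, "active": True})
print(item)
# {'device_id': {'S': 'sensor-1'}, 'reading': {'N': '21'}, 'active': {'BOOL': True}}
```

An item in this form could then be written with `boto3.client("dynamodb").put_item(TableName=..., Item=item)`, where the table name is whatever the caller has created.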

Name the use cases of DynamoDB

- Non-relational database
- Structured and less structured data
- Storing JSON objects

Describe Redshift

A fast, massively scalable data warehouse system. It stores structured data that can be accessed and manipulated with standard SQL.

List the 3 use cases of Redshift

- Data warehouse
- Structured relational data
- Complex analytical queries

Name and describe two useful commands for Redshift

UNLOAD - used to save a table to a set of files on S3
COPY - used to take data from an S3 bucket and place it in a Redshift table
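
As a sketch, the two commands look roughly like this when built as SQL strings. The helper names, table name, S3 locations and IAM role ARN are placeholders, and the file formats (CSV in, Parquet out) are assumptions. Plain string formatting is used only for illustration and should not be fed untrusted input.

```python
def copy_statement(table: str, s3_uri: str, iam_role: str) -> str:
    """Build a COPY statement that loads CSV files from S3 into a table."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1;"
    )


def unload_statement(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build an UNLOAD statement that writes a query result to S3."""
    # Single quotes inside the query are doubled because the query
    # itself is passed to UNLOAD as a quoted string.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )


role = "arn:aws:iam::123456789012:role/RedshiftS3Access"  # placeholder ARN
print(copy_statement("sales", "s3://my-ml-bucket/sales/", role))
print(unload_statement("SELECT * FROM sales", "s3://my-ml-bucket/export/sales_", role))
```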

Describe Redshift Spectrum

Can be used for ad hoc ETL. It uses the data catalogue to access raw data files in S3, so standard SQL queries can clean and transform the data.

Give the use cases of Redshift Spectrum

- Data lake
- Semi-structured data

Describe Amazon Timestream

A serverless database for storing time series data, e.g. log data or readings from IoT devices. Data is stored and queried by time interval. Accessing data is very fast.

What would you use Amazon Timestream for?

To identify trends, patterns, and anomalies in time series data

Describe DocumentDB

A repository optimised for storing and querying JSON documents. It is a MongoDB-compatible database hosted on AWS infrastructure, marketed by AWS as a way to migrate existing MongoDB workloads onto AWS-managed infrastructure.

State the use cases of DocumentDB

- Used to migrate MongoDB to AWS
- Store JSON documents
- Non-relational data and less structured data

Why is data in EBS and EFS so quickly available?

Because the data can be used directly without being moved

Which repositories support all types of data structures? (structured, semi-structured and unstructured)

S3, FSx for Lustre, EBS and EFS

Which repositories only support structured data?

RDS, Redshift and Timestream (schemaless)

Which repositories support only semi-structured data?

Redshift Spectrum

Which repositories support semi-structured + structured but not unstructured data?

DynamoDB and DocumentDB

Which repositories support only structured and unstructured data?

Lake Formation

What is the maximum amount of data an S3 bucket can hold?

unlimited

What datastores are suitable for structured data?

Any RDS database, e.g. Oracle, Microsoft SQL Server, MySQL and PostgreSQL

What types of data can DynamoDB support?

Data can be both structured and semi-structured

What does UNLOAD do?

It saves a table to a set of files on S3.

(54 cards)