Cloud Storage Flashcards
(38 cards)
3 Main Categories (steps) of the AWS Data Pipeline
- Ingest (Gateway)
- Transform and store (S3)
- Serve and consume (EMR)
Google's version of the AWS 3-step pipeline
- Ingest
- Analyze
- Serve
(similar, but more generic; used in the professor's example)
Supporting services that supplement Ingest, Analyze, and Serve (2)
- Storage
- Security
(also: computing and networking)
DIKW pyramid
Data -> Information -> Knowledge -> Wisdom
3 tiers of data structure
- Structured
- Semi-structured
- Unstructured
3 levels of data abstraction
- Block level (EBS volumes attached to EC2)
- File level (S3)
- Database model (Amazon RDS)
Data access models (2)
- NoSQL, which has 4 types: key-value, document, wide-column, and graph
- Relational database (Amazon RDS)
AWS S3
Simple Storage Service: a place to store all types of data, decoupled from processing, enabling a multi-user setup where different users can bring their own data while maintaining isolation and access control.
Object - a file and its metadata
Bucket - logical container for objects
(you can configure access to a bucket and the geographical region it lives in)
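A minimal boto3 sketch of the bucket/object model; the bucket name and key are hypothetical, and AWS credentials are assumed to be configured already:
```python
import boto3

# Region is part of the bucket's configuration (us-east-1 here is an assumption).
s3 = boto3.client("s3", region_name="us-east-1")

# Bucket: a logical container for objects.
s3.create_bucket(Bucket="example-course-bucket")

# Object: the file's bytes plus its metadata.
s3.put_object(
    Bucket="example-course-bucket",
    Key="datasets/sales.csv",
    Body=b"store,amount\nA,100\n",
    Metadata={"owner": "analytics-team"},  # user-defined metadata travels with the object
)
```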
AWS S3 storage classes are like bank/investment accounts, why?
Different classes for different access frequencies, from S3 Standard (frequent) to S3 Glacier Deep Archive (accessed once or twice a year!)
It's like how a bank offers better interest on accounts that are accessed less often!
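In boto3 the "account type" is just a parameter at upload time; a sketch with hypothetical bucket and key names:
```python
import boto3

s3 = boto3.client("s3")

# "Checking account": frequently accessed data stays in S3 Standard.
s3.put_object(Bucket="example-course-bucket", Key="hot/report.csv",
              Body=b"...", StorageClass="STANDARD")

# "Long-term deposit": rarely touched data goes to S3 Glacier Deep Archive.
s3.put_object(Bucket="example-course-bucket", Key="cold/2019-audit.csv",
              Body=b"...", StorageClass="DEEP_ARCHIVE")
```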
Google Cloud Storage Classes (4)
- Standard (frequent access)
- Nearline (about once a month)
- Coldline (about once a quarter)
- Archive (less than once a year)
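A sketch with the google-cloud-storage Python client, assuming a hypothetical bucket and default credentials:
```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-course-bucket")

# Choose the class before upload; monthly-access backups fit Nearline.
blob = bucket.blob("backups/2024-01.tar.gz")
blob.storage_class = "NEARLINE"  # or "COLDLINE" / "ARCHIVE" for colder data
blob.upload_from_filename("2024-01.tar.gz")
```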
Why is AWS better?
It offers the S3 Intelligent-Tiering storage class, which automatically moves objects between access tiers based on observed access patterns
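Opting in is just another storage class at upload time (a sketch; names are hypothetical):
```python
import boto3

s3 = boto3.client("s3")
# S3 then shifts the object between tiers as its access pattern changes.
s3.put_object(Bucket="example-course-bucket", Key="unknown-pattern/data.csv",
              Body=b"...", StorageClass="INTELLIGENT_TIERING")
```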
AWS Lifecycle Configuration
Set of rules that define actions that S3 will apply to a group of objects
Action types:
Transition - move objects from one storage class to another (e.g., S3 Glacier to S3 Glacier Deep Archive)
Expiration - define when S3 objects should be deleted
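A boto3 sketch of one rule combining both action types; the bucket name, prefix, and day counts are assumptions:
```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-course-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},  # the group of objects the rule applies to
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},        # transition action
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # a second transition
            ],
            "Expiration": {"Days": 730},  # expiration action: delete after two years
        }]
    },
)
```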
Data Pipelines
Automated workflows that move and process data from one system to another, cleaning, transforming and enriching the data along the way
Landing Area (LA) Data Lake
Where raw data lands as it arrives from the ingestion layer
Staging Area (SA) Data Lake
Where raw data goes after basic quality transformations, ensuring it conforms to existing schemas
Archive Area (A)
Stores the original raw data for future reference, debugging, or reprocessing
Production Area (PA) Data Lake
Applies business logic to data from the Staging Area (SA)
Aggregating: summarizing sales by store, region, or product
Business-specific calculations (e.g., profit margin)
Pass-through job (Optional) Data Lake
A copy of data from the Staging Area (SA) is passed directly to the cloud data warehouse without business logic applied.
Used for comparison and debugging
Failed Area (FA) Data Lake
Captures data that encounters issues, such as bugs in pipeline code or cloud resource failures, so errors can be handled and the data reprocessed
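A pure-Python sketch of how one batch fans out across the areas; the validation rule and prefix names are assumptions, not a prescribed implementation:
```python
def route_batch(batch: dict) -> list[str]:
    """Return the data lake area prefixes a raw batch gets written to."""
    destinations = ["archive/"]  # original raw data is always kept in the Archive Area
    try:
        if not batch["records"]:  # stand-in for real schema/quality checks
            raise ValueError("empty batch")
        destinations.append("staging/")  # passed basic quality transformations
    except (KeyError, ValueError):
        destinations.append("failed/")   # captured for error handling and reprocessing
    return destinations

print(route_batch({"records": [{"sale": 100}]}))  # ['archive/', 'staging/']
print(route_batch({"records": []}))               # ['archive/', 'failed/']
```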
4 folders used to organize data in a logical structure (hint: file directory)
- Namespace - groups pipelines together
- Pipeline name - reflects the pipeline's purpose; for example, a pipeline that takes data from the LA, applies processing steps, and saves it to the SA could be one pipeline
- Data source name - assigned by the ingestion layer
- Batch ID - unique identifier for any batch of data saved into the LA
(a path-building sketch follows the per-folder cards below)
Namespace
Groups multiple pipelines or areas logically, such as "landing," "staging," or "archive"
Pipeline Name
Each pipeline gets a name reflecting its purpose (e.g., sales_oracle_ingest)
Data Source Name
The specific data source, such as customer data
Batch ID
Unique identifier assigned to each batch of data ingested into the landing area
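Putting the four levels together, a minimal path-building sketch (the function name and the UUID choice for the batch ID are assumptions):
```python
import uuid

def landing_path(namespace: str, pipeline: str, source: str, filename: str) -> str:
    """Compose namespace/pipeline/data-source/batch-id/filename for the landing area."""
    batch_id = uuid.uuid4().hex  # unique per batch saved into the LA
    return f"{namespace}/{pipeline}/{source}/{batch_id}/{filename}"

print(landing_path("landing", "sales_oracle_ingest", "customers", "part-0001.csv"))
# e.g. landing/sales_oracle_ingest/customers/3f2a9c.../part-0001.csv
```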