ML Fundamentals Flashcards
Allows people to store objects (files) in “buckets” (directories)
Amazon S3
What is this full path called:
* <my_bucket>/my_folder1/another_folder/my_file.txt
S3 object key
- Pattern for speeding up range queries (e.g. in Amazon Athena)
- By Date: s3://bucket/my-dataset/year/month/day/hour/data_00.csv
- By Product: s3://bucket/my-dataset/product-id/data_32.csv
Amazon S3 Data Partitioning
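The two key layouts on this card can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the dataset name and file names are the placeholders from the card.

```python
from datetime import datetime, timezone

def date_partitioned_key(dataset: str, ts: datetime, filename: str) -> str:
    """Build a key laid out as dataset/year/month/day/hour/filename."""
    # strftime-style format codes keep month, day and hour zero-padded,
    # so keys sort in chronological order.
    return f"{dataset}/{ts:%Y/%m/%d/%H}/{filename}"

def product_partitioned_key(dataset: str, product_id: str, filename: str) -> str:
    """Build a key laid out as dataset/product-id/filename."""
    return f"{dataset}/{product_id}/{filename}"

ts = datetime(2024, 3, 9, 14, 30, tzinfo=timezone.utc)
print("s3://bucket/" + date_partitioned_key("my-dataset", ts, "data_00.csv"))
print("s3://bucket/" + product_partitioned_key("my-dataset", "product-17", "data_32.csv"))
```

A query engine that understands the layout can then skip every prefix outside the requested date range or product instead of scanning the whole dataset.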
Durability or availability:
* If you store 10,000,000 objects with Amazon S3, you can on average
expect to incur a loss of a single object once every 10,000 years
* Same for all storage classes
Durability
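The "one object every 10,000 years" figure can be checked with a little arithmetic. The sketch below assumes the 99.999999999% durability figure is an annual, per-object probability of survival, which is an interpretation chosen for this example.

```python
from fractions import Fraction

# Assumed: 99.999999999% (eleven nines) annual durability per object.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)

def expected_annual_losses(num_objects: int) -> Fraction:
    """Expected number of objects lost per year."""
    return num_objects * (1 - DURABILITY)

losses = expected_annual_losses(10_000_000)
years_per_loss = 1 / losses
print(f"expected losses per year: {losses}")
print(f"years per lost object:   {years_per_loss}")
```

With 10,000,000 objects the expected loss is 1/10,000 of an object per year, which is the same statement as one object every 10,000 years.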
Durability or availability:
* Measures how readily available a service is
* Varies depending on storage class
Availability
What S3 storage class is the below:
* 99.99% Availability
* Used for frequently accessed data
* Low latency and high throughput
* Sustain 2 concurrent facility failures
* Use Cases: Big Data analytics, mobile & gaming applications,
content distribution…
S3 Standard – General Purpose
What S3 Storage class:
* For data that is less frequently accessed, but requires rapid access
when needed
* Lower cost than S3 Standard
* 99.9% Availability
* Use cases: Disaster Recovery, backups
Amazon S3 Standard-Infrequent Access (S3 Standard-IA)
What S3 Storage class:
* For data that is less frequently accessed, but requires rapid access
when needed
* Lower cost than S3 Standard
* High durability (99.999999999%) in a single AZ; data is lost if the AZ is destroyed
* 99.5% Availability
* Use Cases: Storing secondary backup copies of on-premises data, or data you
can recreate
Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA)
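To make the availability percentages on the storage-class cards above easier to compare, the sketch below converts each one into the downtime it allows per year. It assumes a 365-day year and treats the percentage as a simple fraction of total time, which is a simplification.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def max_downtime_minutes(availability_pct: float) -> float:
    """Minutes per year a service may be unavailable at this availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for name, pct in [
    ("S3 Standard", 99.99),
    ("S3 Standard-IA", 99.9),
    ("S3 One Zone-IA", 99.5),
]:
    print(f"{name}: about {max_downtime_minutes(pct):.0f} minutes per year")
```

Dropping from 99.99% to 99.5% raises the allowed downtime from under an hour to nearly two days per year.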
What S3 Storage class:
* Small monthly monitoring and auto-tiering fee
* Moves objects automatically between Access Tiers based on usage
* There are no retrieval charges in S3 Intelligent-Tiering
S3 Intelligent-Tiering
Fill in the S3 Intelligent-Tiering access tiers below:
* __________: default tier
* Infrequent Access tier (automatic): objects not accessed for 30 days
* __________: objects not accessed for 90 days
* __________: configurable from 90 days to 700+ days
* __________: configurable from 180 days to 700+ days
* Frequent Access tier (automatic): default tier
* Infrequent Access tier (automatic): objects not accessed for 30 days
* Archive Instant Access tier (automatic): objects not accessed for 90 days
* Archive Access tier (optional): configurable from 90 days to 700+ days
* Deep Archive Access tier (optional): configurable from 180 days to 700+ days
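The tier thresholds can be written as a small lookup function. This is a sketch for memorising the rules, not how S3 implements tiering: the function name and parameters are made up, and treating each threshold as "at least N days" is an assumption.

```python
from typing import Optional

def intelligent_tier(days_idle: int,
                     archive_after: Optional[int] = None,
                     deep_archive_after: Optional[int] = None) -> str:
    """Return the access tier for an object not accessed for days_idle days.

    archive_after and deep_archive_after model the two optional tiers;
    leave them as None when those tiers are not enabled.
    """
    if archive_after is not None and archive_after < 90:
        raise ValueError("Archive Access is configurable from 90 days")
    if deep_archive_after is not None and deep_archive_after < 180:
        raise ValueError("Deep Archive Access is configurable from 180 days")

    # Check the optional tiers first, deepest first.
    if deep_archive_after is not None and days_idle >= deep_archive_after:
        return "Deep Archive Access"
    if archive_after is not None and days_idle >= archive_after:
        return "Archive Access"
    # Automatic tiers.
    if days_idle >= 90:
        return "Archive Instant Access"
    if days_idle >= 30:
        return "Infrequent Access"
    return "Frequent Access"

for days in (10, 45, 120):
    print(days, "->", intelligent_tier(days))
print(400, "->", intelligent_tier(400, archive_after=180, deep_archive_after=365))
```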
- Helps you decide when to transition objects to the right storage class
- Recommendations for Standard and Standard IA
- Does NOT work for One-Zone IA or Glacier
- Report is updated daily
- 24 to 48 hours to start seeing data analysis
- Good first step to put together Lifecycle Rules (or improve them)!
Amazon S3 Analytics
Bucket-wide rules set from the S3 console; allow cross-account access
S3 Bucket policies
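As an illustration of a cross-account rule, the sketch below builds a bucket policy document that lets another AWS account read objects. The bucket name and the 12-digit account ID are placeholders, and granting only `s3:GetObject` is a choice made for this example.

```python
import json

def cross_account_read_policy(bucket: str, account_id: str) -> dict:
    """Bucket policy allowing another account to read every object."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountRead",
                "Effect": "Allow",
                # The other account, identified by its root principal ARN.
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": ["s3:GetObject"],
                # "/*" applies the rule to objects rather than the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }

policy = cross_account_read_policy("example-bucket", "111122223333")
print(json.dumps(policy, indent=2))
```

The printed JSON is the form that would be pasted into the bucket's policy editor or passed to an API call.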
_____ is a managed alternative to Apache Kafka
* Great for application logs, metrics, IoT, clickstreams
* Great for “real-time” big data
* Great for streaming processing frameworks (Spark, NiFi, etc…)
* Data is automatically replicated synchronously to 3 AZs
Amazon Kinesis
__________ low latency streaming ingest at scale
Kinesis Streams
________ perform real-time analytics on streams using SQL
Kinesis Analytics
_________ load streams into S3, Redshift, ElasticSearch & Splunk
Kinesis Firehose
______ meant for streaming video in real-time
Kinesis Video Streams
Kinesis Streams are divided into ordered ______
Shards
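Records reach a shard through their partition key: Kinesis takes the MD5 hash of the key as a 128-bit integer and routes the record to the shard whose hash key range contains it. Records sharing a key therefore land on the same shard, which is what keeps them in order. The sketch below assumes the shards split the hash space evenly, which a real stream does not guarantee after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value

def shard_for(partition_key: str, num_shards: int) -> int:
    """Index of the shard that would receive this partition key.

    Assumes num_shards equal-sized hash key ranges (an illustration only).
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * num_shards // HASH_SPACE

for key in ("user-1", "user-2", "user-1"):
    print(key, "-> shard", shard_for(key, 4))
```

The same key always prints the same shard index, while different keys spread across the shards.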
What are the two capacity modes for Kinesis Data streams?
Provisioned and On-Demand modes
What Kinesis data stream capacity mode is below:
* You choose the number of shards provisioned, scale manually or using API
* Each shard gets 1 MB/s in (or 1000 records per second)
* Each shard gets 2 MB/s out (classic or enhanced fan-out consumer)
* You pay per shard provisioned per hour
Provisioned
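The per-shard limits on this card give a quick way to size a provisioned stream. The sketch below is a rough estimate: it treats the 2 MB/s read limit as shared by all consumers of a shard, which is a simplifying assumption, and the traffic figures are made up.

```python
import math

IN_MB_PER_SHARD = 1.0        # write limit per shard, MB/s
RECORDS_PER_SHARD = 1000     # write limit per shard, records/s
OUT_MB_PER_SHARD = 2.0       # read limit per shard, MB/s

def shards_needed(in_mb_per_s: float, records_per_s: float,
                  out_mb_per_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(in_mb_per_s / IN_MB_PER_SHARD),
        math.ceil(records_per_s / RECORDS_PER_SHARD),
        math.ceil(out_mb_per_s / OUT_MB_PER_SHARD),
    )

# Hypothetical workload: 5 MB/s in, 3000 records/s, 8 MB/s out.
print(shards_needed(5, 3000, 8))
```

Whichever limit is hit first decides the shard count, so a stream of many tiny records can need more shards than its byte rate suggests.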
What Kinesis data stream capacity mode is below:
* No need to provision or manage the capacity
* Default capacity provisioned (4 MB/s in or 4000 records per second)
* Scales automatically based on observed throughput peak during the last 30
days
* Pay per stream per hour & data in/out per GB
On-demand mode
What Kinesis service is this:
* Fully Managed Service, no administration
* Near Real Time (60 seconds latency minimum for non full batches)
* Data Ingestion into Redshift / Amazon S3 / ElasticSearch / Splunk
* Automatic scaling
* Supports many data formats
* Data Conversions from CSV / JSON to Parquet / ORC (only for S3)
* Data Transformation through AWS Lambda (ex: CSV => JSON)
* Supports compression when target is Amazon S3 (GZIP, ZIP, and
SNAPPY)
Kinesis Data Firehose
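The "Data Transformation through AWS Lambda (ex: CSV => JSON)" bullet can be sketched as a Lambda handler. Firehose passes a batch of base64-encoded records and expects each one back with the same `recordId`, a `result`, and re-encoded `data`. The column names below are hypothetical, since the card does not give a schema.

```python
import base64
import csv
import io
import json

# Hypothetical column names; the flashcard does not specify a schema.
COLUMNS = ["user_id", "event", "timestamp"]

def lambda_handler(event, context):
    """Convert each incoming CSV line into one JSON document."""
    output = []
    for record in event["records"]:
        # Firehose delivers each record's payload base64-encoded.
        line = base64.b64decode(record["data"]).decode("utf-8")
        rows = list(csv.reader(io.StringIO(line)))
        if len(rows) != 1 or len(rows[0]) != len(COLUMNS):
            # Return the record unchanged and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        # Newline-delimited JSON keeps records separable at the destination.
        payload = json.dumps(dict(zip(COLUMNS, rows[0]))) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

The handler can be exercised locally by passing a dictionary shaped like `{"records": [{"recordId": "1", "data": "<base64 CSV line>"}]}` and `None` for the context.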
What's the difference between Kinesis Data Streams and Firehose?
* Streams
  * You write custom code (producer / consumer)
  * Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
  * Automatic scaling with On-demand Mode
  * Data Storage for 1 to 365 days, replay capability, multi consumers
* Firehose
  * Fully managed, send to S3, Splunk, Redshift, ElasticSearch
  * Serverless data transformations with Lambda
  * Near real time (lowest buffer time is 1 minute)
  * Automated Scaling
  * No data storage
What Kinesis tool is this:
* Use cases
  * Streaming ETL: select columns, make simple transformations, on streaming data
  * Continuous metric generation: live leaderboard for a mobile game
  * Responsive analytics: look for certain criteria and build alerting (filtering)
* Features
  * Pay only for resources consumed (but it’s not cheap)
  * Serverless; scales automatically
  * Use IAM permissions to access streaming source and destination(s)
  * SQL or Flink to write the computation
  * Schema discovery
  * Lambda can be used for pre-processing
Kinesis Data Analytics