Domain 1 Solutions Flashcards
Helps you set up a secure data lake and govern, secure, and globally share data for ML and analytics. It manages fine-grained access control on S3 data and on metadata in the Glue Data Catalog, using its own permissions model that augments IAM
Lake Formation
Preferred storage option
S3
Used to build, train, and deploy ML models
SageMaker
A file system service that speeds up training jobs by serving your S3 data to SageMaker at high speeds
FSx for Lustre
A training data source from which SageMaker can launch training jobs directly, without the need for data movement, resulting in faster training start times
EFS
Block-level storage device that you can attach to your instances and use as you would use a physical hard drive
EBS
An ETL service used to categorize, clean, enrich, and move data between various data stores. Used for batch ingestion; automates data discovery
Glue
This batch ingestion service reads historical data from source systems, such as relational database management systems, data warehouses, and NoSQL databases, at any desired interval
DMS
Batch ingestion service that automates various ETL tasks that involve complex workflows
Step Functions
Uses the Kinesis Producer Library (KPL) to write to a Kinesis data stream
Kinesis Data Streams
Batches and compresses data to generate incremental views, and can execute custom transformation logic using Lambda before delivering the incremental view to S3
Kinesis Firehose
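The Lambda transformation step on the card above can be sketched as follows. This is a minimal sketch, assuming the standard Firehose data-transformation contract (each incoming record carries a recordId and base64-encoded data, and each returned record carries recordId, result, and data). The field names user_id and amount and the transformation itself are hypothetical, chosen only for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the shape Firehose expects."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except (ValueError, KeyError):
            # Undecodable or non-JSON input: report failure and hand the data back unchanged.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record.get("data", ""),
            })
            continue

        # Hypothetical transformation: keep two fields and tag the record.
        transformed = {
            "user_id": payload.get("user_id"),
            "amount": payload.get("amount"),
            "processed": True,
        }
        # A trailing newline keeps the delivered S3 objects line-delimited.
        line = json.dumps(transformed) + "\n"
        data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
        output.append({"recordId": record["recordId"], "result": "Ok", "data": data})

    return {"records": output}
```

Records marked ProcessingFailed are not silently dropped by this handler; they are returned with their original data so the delivery stream can treat them as failed records.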
Easiest way to process and transform data streaming through Kinesis Data Streams or Firehose using SQL. Provides insights in near real time from incremental streams before the data is stored in S3
Kinesis Data Analytics
Used to ingest/analyze video/audio data
Kinesis Video Streams
A distributed data store optimized for ingesting and processing streaming data in real time. Used to publish and subscribe to streams of records, store streams of records in the order in which they were generated, and process streams of records in real time
Apache Kafka
Supports many instance types that have proportionally high CPU with increased network performance, which is well suited for HPC (high-performance computing) applications
EMR
A serverless interactive query service: customers can keep a single source of data in Amazon S3 and perform ad hoc analysis on it using SQL
Athena
Uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning to deliver the best price performance at any scale
Redshift
Provides a protocol of data processing and node task distribution and management and uses algorithms to split datasets into subsets and distribute them across nodes in a compute cluster
Spark
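The split-and-distribute idea on the card above can be illustrated without a cluster. This is only a standard-library sketch, not Spark's API: threads stand in for cluster nodes, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions roughly equal, contiguous subsets."""
    size, remainder = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra record.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """The task each hypothetical worker runs on its own subset."""
    return sum(x * x for x in partition)


def distributed_sum_of_squares(records, num_partitions=4):
    """Process every partition independently, then combine the partial results."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for nodes; a real cluster ships each partition to a separate machine.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    return sum(partial_results)
```

The key property is that each subset is processed independently, so only the small partial results need to be brought together at the end.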
A managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data
EMR
Used to build visual dashboards for metrics
QuickSight
An open-source Java software framework that supports massive data processing across a cluster of instances. Uses various processing models, such as MapReduce, to distribute processing across multiple instances and also uses a distributed file system called HDFS to store data across multiple instances
Hadoop
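The MapReduce processing model named on the card above can be sketched as a word count. This is a single-process, standard-library sketch of the map, shuffle, and reduce phases, not Hadoop's own Java API; in a real cluster each phase runs in parallel across instances. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of text documents."""
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    return reduce_phase(shuffle_phase(mapped))
```

For example, word_count(["big data big", "data lake"]) counts "big" and "data" twice each and "lake" once.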
A serverless, fully managed NoSQL database with single-digit millisecond performance at any scale that addresses the need to overcome the scaling and operational complexities of relational databases
DynamoDB
A service that allows you to visually prepare and clean your data, normalize your data, and run a number of different feature transforms on the dataset without writing code
Glue DataBrew
An agnostic, free, open-source command-line tool that works on top of Git repositories
Data Version Control (DVC)