[MLS] Data Engineering Flashcards
(21 cards)
What is the maximum object size in Amazon S3?
5TB
What are the key characteristics of S3 as a data lake?
It has ‘infinite size’, doesn’t require provisioning, decouples storage from compute, is centralized, and supports any file format
Why is S3 data partitioning used?
To speed up queries over a range of values. Data can be partitioned by date, product, or any other criterion that queries commonly filter on
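As a rough illustration of date partitioning, the sketch below builds Hive-style key prefixes (`year=/month=/day=`), a common layout for partitioned data on S3. The function names and the `logs` base prefix are hypothetical; the point is that a date-range query only needs to read the prefixes for the days it covers.

```python
from datetime import date, timedelta


def partition_prefix(base: str, day: date) -> str:
    """Return a Hive-style partition prefix for a single day."""
    return f"{base}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prefixes_for_range(base: str, start: date, end: date) -> list[str]:
    """Return the partition prefixes covering start..end (inclusive)."""
    days = (end - start).days
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days + 1)]


# A three-day query touches three prefixes instead of the whole dataset.
for prefix in prefixes_for_range("logs", date(2024, 2, 28), date(2024, 3, 1)):
    print(prefix)
```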
What are the main S3 storage classes?
S3 Standard, S3 Standard-IA, S3 One Zone-IA, S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, and S3 Intelligent-Tiering
What is the durability and availability standard for S3?
Durability is 11 9s (99.999999999%), Availability is 99.99% (approximately 53 minutes downtime per year)
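The downtime figure follows directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into expected downtime per year."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# 99.99% availability leaves roughly 52.6 minutes of downtime per year.
print(round(downtime_minutes_per_year(99.99), 2))
```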
What are the main types of S3 access controls?
Bucket policies (resource-based, bucket-wide), Object ACLs (fine-grain control), and Bucket ACLs (less common)
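To make "resource-based" concrete, here is a sketch of what a bucket policy document can look like, built as a Python dictionary and serialized to JSON. The bucket name, account ID, and role name are placeholders, and the single read-only statement is only one example of what such a policy might grant.

```python
import json


def read_only_bucket_policy(bucket: str, principal_arn: str) -> str:
    """Return bucket policy JSON that lets one principal read objects."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowObjectRead",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                # Object-level actions apply to keys, hence the trailing /*.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Placeholder bucket and role, for illustration only.
print(read_only_bucket_policy(
    "example-ml-data-lake",
    "arn:aws:iam::123456789012:role/ExampleAnalyst",
))
```

The policy is attached to the bucket itself and names who may act on it, which is what distinguishes it from an identity-based IAM policy attached to a user or role.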
What is the primary function of Kinesis Data Firehose?
To ingest massive amounts of streaming data in near-real-time and load it into destinations such as S3, Redshift, and OpenSearch
What are the main capabilities of Managed Service for Apache Flink?
Automatic provisioning, parallelization, automatic scaling, real-time data processing, streaming ETL
What are the key components of Kinesis Video Streams?
Supports one producer per video stream, provides video playback, can have consumers like EC2, containers, SageMaker, and Rekognition
What can Glue Crawlers do?
Infer schemas and partitions from data, work with various formats (JSON, Parquet, CSV) and sources (S3, Redshift, RDS)
What programming languages does Glue ETL support?
Python and Scala
What are some example transformations available in Glue?
Filter, Join, Map, DropFields, DropNullFields, FindMatches ML, format conversions
What is AWS Glue DataBrew?
A tool to clean and normalize data without coding, using over 250 pre-made transformations
What are the main data stores used in ML on AWS?
Redshift, RDS, DynamoDB, S3, OpenSearch, ElastiCache
How does AWS Data Pipeline differ from AWS Glue?
Data Pipeline is an orchestrator using EC2 instances in your account, while Glue focuses on ETL using Spark and manages the resources for you
What is AWS Batch used for?
Running batch jobs packaged as Docker images. Instances are provisioned and managed for you (no cluster to administer), jobs can run on Spot Instances, and they can be orchestrated with Step Functions
What are the key features of AWS Database Migration Service?
Supports homogeneous and heterogeneous migrations, keeps source database available during migration, provides continuous data replication, runs on EC2
What is the maximum execution time for AWS Step Functions?
One year (for Standard workflows)
What is MQTT used for in AWS?
It’s an IoT protocol used for messaging, particularly for getting sensor data into ML models
What does AWS DataSync require for on-premises connections?
A DataSync Agent deployed as a VM to connect to internal storage for replication
What is AWS Data Pipeline?
A service that moves data from one place to another. It is only an orchestrator, which retries and notifies on failures