Data Engineering Flashcards

(57 cards)

1
Q

What is an S3 key?

A

It’s the full path of the object within the bucket,

all the way to the file extension

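Concretely, the key is everything after the bucket in the object's address. A minimal sketch of the split (the bucket and path are hypothetical examples):

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    assert parsed.scheme == "s3"
    # netloc is the bucket; the path (minus the leading slash) is the key
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri("s3://my-bucket/raw/2024/events.csv")
# bucket == "my-bucket"; key == "raw/2024/events.csv"
```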
2
Q

Max S3 object size?

A

5TB

3
Q

S3 object tag use cases?

A

Tags are key/value pairs attached to objects. Use cases:

Lifecycle
Classify data
Security

4
Q

S3 handles the storage part. Name some compute services that work with it.

A
EC2
Amazon Athena
Amazon Redshift Spectrum
Rekognition
AWS Glue
5
Q

Data partitioning on S3. how and why?

A

Prefix object keys with partition values, e.g. s3://bucket/year=…/

to speed up range queries: engines only scan the matching partitions

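A sketch of building such Hive-style partitioned keys (the `events` prefix and the filename are hypothetical):

```python
from datetime import date

def partitioned_key(prefix: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key: prefix/year=YYYY/month=MM/day=DD/file."""
    return f"{prefix}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"

key = partitioned_key("events", date(2021, 3, 7), "part-0000.parquet")
# -> "events/year=2021/month=03/day=07/part-0000.parquet"
```

A query restricted to, say, `year=2021` then only touches keys under that prefix.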
6
Q

S3 Encryption options

A

SSE-S3
SSE-KMS
SSE-C
CSE

7
Q

S3 Access

A

User-based:
- IAM policies

Resource-based:
- Bucket policies
- ACLs
8
Q

What if we do not want to move the data in S3 over the internet?

A

Use a VPC Endpoint Gateway, so traffic stays on the AWS network

9
Q

S3 logs

A

S3 Access logs in another S3 bucket

API calls in CloudTrail

10
Q

Can you do S3 policy based on the tags?

A

Yes. For example, add the tag classification=PHI
and impose the restriction on any object carrying that tag.

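A sketch of what such a tag-based bucket policy could look like, built as JSON (the bucket name is a hypothetical placeholder; `s3:ExistingObjectTag/<key>` is the S3 condition key for object tags):

```python
import json

# Deny s3:GetObject on any object tagged classification=PHI.
# "my-bucket" is a hypothetical placeholder.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        "Condition": {
            "StringEquals": {"s3:ExistingObjectTag/classification": "PHI"}
        },
    }],
}
print(json.dumps(policy, indent=2))
```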
11
Q

The Apache alternative to Kinesis?

A

Kafka

12
Q

Kinesis use cases

A

Logs
Metrics
IoT
ClickStream

13
Q

Some streaming frameworks

A

Spark
NiFi
etc…

14
Q

The Kinesis family?

A

KDS (Kinesis Data Streams): low-latency streaming at scale

KDA (Kinesis Data Analytics): real-time analytics on streams using SQL

KDF (Kinesis Data Firehose): load streams into S3, Redshift, Elasticsearch, Splunk

KVS (Kinesis Video Streams): stream video in real time

15
Q

KDS facts:

  • provisioning
  • retention
  • replaying data
  • consumer quantity
  • editing ingested data
  • record size
A
  • Provision shards in advance
  • Retention: 24 hours up to 7 days
  • Ability to reprocess and replay data
  • Multiple consumers can read from the same stream
  • Records are immutable once ingested
  • Max record size: 1 MB
16
Q

KDS Producer limits

A

1 MB/s or 1,000 messages/s per shard

17
Q

Consumer Classic limits

A

2 MB/s per shard, shared across all consumers

5 GetRecords API calls/s per shard across all consumers

18
Q

KDF min latency

A

Near real-time

Minimum buffer interval of 60 seconds

19
Q

KDF targets

A

Redshift
Amazon S3
Elasticsearch
Splunk

20
Q

KDF scaling

A

Managed Auto-Scaling

21
Q

KDF Data conversion

A

JSON → Parquet / ORC (convert CSV to JSON first with a Lambda transform)

only when the target is S3

22
Q

KDF Data Transformation

A

using a Lambda function

e.g. CSV to JSON

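A minimal sketch of such a Firehose transformation Lambda, assuming each incoming record is a base64-encoded CSV line with two hypothetical fields (`user_id,action`):

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: CSV lines -> newline-delimited JSON.
    Assumes each record's data is a base64-encoded 'user_id,action' line
    (the field names are hypothetical)."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        user_id, action = line.split(",")
        transformed = json.dumps({"user_id": user_id, "action": action}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

# Local check with a fake event:
event = {"records": [{"recordId": "1",
                      "data": base64.b64encode(b"42,click").decode()}]}
result = handler(event, None)
```

Firehose hands the function a batch of records and expects each one echoed back with its `recordId`, a `result`, and re-encoded `data`.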
23
Q

KDF Compression

A

when target is S3

GZIP, ZIP, SNAPPY

24
Q

KDF Pricing

A

Pay as you go

25
Q

KDF Sources

A

SDK
KPL
Kinesis Agent
KDS
CloudWatch Logs and Events
IoT rule actions

26
Q

KDS Latency

A

~70 ms (enhanced fan-out) or ~200 ms (classic)
27
Q

KDF vs KDS

A

Ingestion (delivery) vs. streaming

28
Q

Anomaly detection

A

RANDOM_CUT_FOREST; it uses recent history to build its model

29
Q

Detect dense areas

A

HOTSPOTS: locates and returns information about relatively dense areas

30
Q

Runtime options on Kinesis Data Analytics

A

SQL
Apache Flink
31
Q

How many video inputs can each Kinesis Video Stream receive?

A

Just one. 1,000 cameras? Run 1,000 KVS streams.

32
Q

KVS inputs?

A

Cameras
AWS DeepLens
Smartphone camera
Audio feed
Images
RTSP camera
Producer SDK

33
Q

KVS targets?

A

SageMaker
Amazon Rekognition Video
EC2 consumers
- TensorFlow
- MXNet

34
Q

KVS data retention

A

1 hour to 10 years
35
Q

Fargate?

A

Runs containers and scales automatically

36
Q

KVS use cases

A

Feed a camera into KVS
Run a consumer container on Fargate, using DynamoDB for checkpointing
Send decoded frames to SageMaker for ML inference
Publish the results to KDS
Fire off a Lambda to e.g. send notifications

37
Q

Which AWS services can use the Glue Data Catalog?

A

Amazon Redshift
Amazon Athena > QuickSight
Amazon EMR
38
Q

Does Glue transform as well somehow?

A

Yes, Glue does transformation:
- Cleaning
- Enriching data

using generated ETL code in Python or Scala, or your own Spark / PySpark scripts

39
Q

Glue targets

A

S3
JDBC (RDS, Redshift)
Glue Data Catalog

40
Q

Glue cost model

A

Pay as you go for the resources consumed
41
Q

Where does Glue run?

A

On a serverless Spark platform

42
Q

Glue Scheduler?

A

Schedules job runs

43
Q

Glue Triggers?

A

Automates job runs based on events

44
Q

Glue transformations, how?

A

Bundled:
- DropFields, DropNullFields: remove (null) fields
- Filter: filter records
- Join: enrich data
- Map: add fields, delete fields, perform external lookups

Machine learning transformations:
- Identify duplicate or matching records

Format conversions: CSV, JSON, Avro, ORC, Parquet, XML

Apache Spark transformations: e.g. K-Means
45
Q

Glue job types

A

Spark
- Python 2
- Python 3
- Scala
Python Shell

46
Q

Name some AWS storage services

A

Redshift
- Columnar
- SQL
- OLAP
- Load from S3
- Redshift Spectrum queries S3 without loading

RDS
- OLTP
- SQL

DynamoDB
- NoSQL

S3
- Object storage

Elasticsearch
- Search amongst data points
- Indexing
- Clickstream analytics

ElastiCache
- Caching and in-memory storage

47
Q

AWS Data Pipeline sources/destinations

A

Include:
- S3
- RDS
- Redshift
- DynamoDB
- EMR

The data source may be on-premises.

48
Q

Glue vs Data Pipeline

A

Glue
- Managed
- Runs Apache Spark (Scala, Python)
- Focus on ETL rather than configuring or managing resources

Data Pipeline
- Orchestration service
- Gives more control over the environment, code, EC2, etc.
- Allows access to the underlying EC2 or EMR instances
49
Q

AWS Batch?

A

Runs batch jobs as Docker images
Dynamically provisions instances (EC2 & Spot) in optimal quantity
No need to manage clusters; fully serverless
Pay for the EC2 instances used

50
Q

How to schedule Batch jobs?

A

Using CloudWatch Events

51
Q

How to orchestrate Batch jobs?

A

Using AWS Step Functions

52
Q

AWS Batch vs Glue

A

Batch
- Resources are created in your account
- A Docker image must be provided
- Better for non-ETL work, e.g. cleaning an S3 bucket

Glue
- Better for ETL and transformation
53
Q

AWS DMS: does the source remain available during the job?

A

Yes

54
Q

DMS vs Glue?

A

Glue: batch-oriented, minimum schedule interval of 5 minutes
Database Migration Service (DMS): real-time, but does little transformation

55
Q

How is DMS real-time?

A

It uses continuous replication via Change Data Capture (CDC)
56
Q

AWS Step Functions?

A

Design workflows
Easy visualization
Error handling and retries
Audit history
Option to wait for an arbitrary amount of time
Max execution time of a state machine is 1 year
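A sketch of a minimal state machine definition in Amazon States Language, showing the retry and error-handling options (the Lambda ARN is a hypothetical placeholder):

```python
import json

# Minimal Amazon States Language definition: one Task state with
# retry/backoff and a catch-all failure state.
# The Lambda ARN is a hypothetical placeholder.
state_machine = {
    "Comment": "Minimal workflow with retry and catch",
    "StartAt": "ProcessData",
    "States": {
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{
                "ErrorEquals": ["States.ALL"],
                "Next": "Failed",
            }],
            "End": True,
        },
        "Failed": {"Type": "Fail"},
    },
}
print(json.dumps(state_machine, indent=2))
```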
57
Q

AWS services for any sort of ETL?

A

Glue
Batch
Data Pipeline
Step Functions (to orchestrate)