Data Engineering Flashcards

(57 cards)

1
Q

What is an S3 key?

A

It’s the full path of the object within the bucket,

all the way to the file extension

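Concretely, the key is everything after the bucket in the object's address. A minimal sketch of the split (the bucket and path are hypothetical examples):

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    assert parsed.scheme == "s3"
    # netloc is the bucket; the path (minus the leading slash) is the key
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri("s3://my-bucket/raw/2024/events.csv")
# bucket == "my-bucket"; key == "raw/2024/events.csv"
```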
2
Q

Max S3 object size?

A

5TB

3
Q

S3 object tag use cases?

A

Tags are key/value pairs attached to objects. Use cases:

Lifecycle
Classify data
Security

4
Q

S3 handles the storage part. Name some compute services that work with it.

A
EC2
Amazon Athena
Amazon Redshift Spectrum
Rekognition
AWS Glue
5
Q

Data partitioning on S3. how and why?

A

Prefix object keys with partition values, e.g. s3://bucket/year=…/

to speed up range queries: engines only scan the matching partitions

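A sketch of building such Hive-style partitioned keys (the `events` prefix and the filename are hypothetical):

```python
from datetime import date

def partitioned_key(prefix: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key: prefix/year=YYYY/month=MM/day=DD/file."""
    return f"{prefix}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"

key = partitioned_key("events", date(2021, 3, 7), "part-0000.parquet")
# -> "events/year=2021/month=03/day=07/part-0000.parquet"
```

A query restricted to, say, `year=2021` then only touches keys under that prefix.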
6
Q

S3 Encryption options

A

SSE-S3
SSE-KMS
SSE-C
CSE

7
Q

S3 Access

A

User-based:
- IAM policies

Resource-based:
- Bucket policies
- ACLs
8
Q

What if we do not want to move the data in S3 over the internet?

A

Use a VPC Endpoint Gateway, so traffic stays on the AWS network

9
Q

S3 logs

A

S3 Access logs in another S3 bucket

API calls in CloudTrail

10
Q

Can you do S3 policy based on the tags?

A

Yes. For example, add the tag classification=PHI
and impose the restriction on any object carrying that tag.

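A sketch of what such a tag-based bucket policy could look like, built as JSON (the bucket name is a hypothetical placeholder; `s3:ExistingObjectTag/<key>` is the S3 condition key for object tags):

```python
import json

# Deny s3:GetObject on any object tagged classification=PHI.
# "my-bucket" is a hypothetical placeholder.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        "Condition": {
            "StringEquals": {"s3:ExistingObjectTag/classification": "PHI"}
        },
    }],
}
print(json.dumps(policy, indent=2))
```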
11
Q

The Apache alternative to Kinesis?

A

Kafka

12
Q

Kinesis use cases

A

Logs
Metrics
IoT
ClickStream

13
Q

Some streaming frameworks

A

Spark
NiFi
etc…

14
Q

The Kinesis family?

A

KDS (Kinesis Data Streams): low-latency streaming at scale

KDA (Kinesis Data Analytics): real-time analytics on streams using SQL

KDF (Kinesis Data Firehose): load streams into S3, Redshift, Elasticsearch, Splunk

KVS (Kinesis Video Streams): stream video in real time

15
Q

KDS facts:

  • provisioning
  • retention
  • replaying data
  • consumer quantity
  • editing ingested data
  • record size
A
  • Provision shards in advance
  • Retention: 24 hours up to 7 days
  • Ability to reprocess and replay data
  • Multiple consumers can read from the same stream
  • Records are immutable once ingested
  • Max record size: 1 MB
16
Q

KDS Producer limits

A

1 MB/s or 1,000 messages/s per shard

17
Q

Consumer Classic limits

A

2 MB/s per shard, shared across all consumers

5 GetRecords API calls/s per shard across all consumers

18
Q

KDF min latency

A

Near real-time

Minimum buffer interval of 60 seconds

19
Q

KDF targets

A

Redshift
Amazon S3
Elasticsearch
Splunk

20
Q

KDF scaling

A

Managed Auto-Scaling

21
Q

KDF Data conversion

A

JSON → Parquet / ORC (convert CSV to JSON first with a Lambda transform)

only when the target is S3

22
Q

KDF Data Transformation

A

using a Lambda function

e.g. CSV to JSON

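A minimal sketch of such a Firehose transformation Lambda, assuming each incoming record is a base64-encoded CSV line with two hypothetical fields (`user_id,action`):

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: CSV lines -> newline-delimited JSON.
    Assumes each record's data is a base64-encoded 'user_id,action' line
    (the field names are hypothetical)."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        user_id, action = line.split(",")
        transformed = json.dumps({"user_id": user_id, "action": action}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

# Local check with a fake event:
event = {"records": [{"recordId": "1",
                      "data": base64.b64encode(b"42,click").decode()}]}
result = handler(event, None)
```

Firehose hands the function a batch of records and expects each one echoed back with its `recordId`, a `result`, and re-encoded `data`.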
23
Q

KDF Compression

A

when target is S3

GZIP, ZIP, SNAPPY

24
Q

KDF Pricing

A

Pay as you go

25
Q

KDF Sources

A

SDK
KPL
Kinesis Agent
KDS
CloudWatch Logs and Events
IoT rule actions

26
Q

KDS Latency

A

~70 ms (enhanced fan-out) or ~200 ms (classic)
27
Q

KDF vs KDS

A

Ingestion (delivery) vs. streaming

28
Q

Anomaly detection

A

RANDOM_CUT_FOREST; it uses recent history to build its model

29
Q

Detect dense areas

A

HOTSPOTS: locates and returns information about relatively dense areas

30
Q

Runtime options on Kinesis Data Analytics

A

SQL
Apache Flink
31
Q

How many video inputs can each Kinesis Video Stream receive?

A

Just one. 1,000 cameras? Run 1,000 KVS streams.

32
Q

KVS inputs?

A

Cameras
AWS DeepLens
Smartphone camera
Audio feed
Images
RTSP camera
Producer SDK

33
Q

KVS targets?

A

SageMaker
Amazon Rekognition Video
EC2 consumers
- TensorFlow
- MXNet

34
Q

KVS data retention

A

1 hour to 10 years
35
Q

Fargate?

A

Runs containers and scales automatically

36
Q

KVS use cases

A

Feed a camera into KVS
Run a consumer container on Fargate, using DynamoDB for checkpointing
Send decoded frames to SageMaker for ML inference
Publish the results to KDS
Fire off a Lambda to e.g. send notifications

37
Q

Which AWS services can use the Glue Data Catalog?

A

Amazon Redshift
Amazon Athena > QuickSight
Amazon EMR
38
Q

Does Glue transform as well somehow?

A

Yes, Glue does transformation:
- Cleaning
- Enriching data

using generated ETL code in Python or Scala, or your own Spark / PySpark scripts

39
Q

Glue targets

A

S3
JDBC (RDS, Redshift)
Glue Data Catalog

40
Q

Glue cost model

A

Pay as you go for the resources consumed
41
Q

Where does Glue run?

A

On a serverless Spark platform

42
Q

Glue Scheduler?

A

Schedules job runs

43
Q

Glue Triggers?

A

Automates job runs based on events

44
Q

Glue transformations, how?

A

Bundled:
- DropFields, DropNullFields: remove (null) fields
- Filter: filter records
- Join: enrich data
- Map: add fields, delete fields, perform external lookups

Machine learning transformations:
- Identify duplicate or matching records

Format conversions: CSV, JSON, Avro, ORC, Parquet, XML

Apache Spark transformations: e.g. K-Means
45
Q

Glue job types

A

Spark
- Python 2
- Python 3
- Scala
Python Shell

46
Q

Name some AWS storage services

A

Redshift
- Columnar
- SQL
- OLAP
- Load from S3
- Redshift Spectrum queries S3 without loading

RDS
- OLTP
- SQL

DynamoDB
- NoSQL

S3
- Object storage

Elasticsearch
- Search amongst data points
- Indexing
- Clickstream analytics

ElastiCache
- Caching and in-memory storage

47
Q

AWS Data Pipeline sources/destinations

A

Include:
- S3
- RDS
- Redshift
- DynamoDB
- EMR

The data source may be on-premises.

48
Q

Glue vs Data Pipeline

A

Glue
- Managed
- Runs Apache Spark (Scala, Python)
- Focus on ETL rather than configuring or managing resources

Data Pipeline
- Orchestration service
- Gives more control over the environment, code, EC2, etc.
- Allows access to the underlying EC2 or EMR instances
49
Q

AWS Batch?

A

Runs batch jobs as Docker images
Dynamically provisions instances (EC2 & Spot) in optimal quantity
No need to manage clusters; fully serverless
Pay for the EC2 instances used

50
Q

How to schedule Batch jobs?

A

Using CloudWatch Events

51
Q

How to orchestrate Batch jobs?

A

Using AWS Step Functions

52
Q

AWS Batch vs Glue

A

Batch
- Resources are created in your account
- A Docker image must be provided
- Better for non-ETL work, e.g. cleaning an S3 bucket

Glue
- Better for ETL and transformation
53
Q

AWS DMS: does the source remain available during the job?

A

Yes

54
Q

DMS vs Glue?

A

Glue: batch-oriented, minimum schedule interval of 5 minutes
Database Migration Service (DMS): real-time, but does little transformation

55
Q

How is DMS real-time?

A

It uses continuous replication via Change Data Capture (CDC)
56
Q

AWS Step Functions?

A

Design workflows
Easy visualization
Error handling and retries
Audit history
Option to wait for an arbitrary amount of time
Max execution time of a state machine is 1 year
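A sketch of a minimal state machine definition in Amazon States Language, showing the retry and error-handling options (the Lambda ARN is a hypothetical placeholder):

```python
import json

# Minimal Amazon States Language definition: one Task state with
# retry/backoff and a catch-all failure state.
# The Lambda ARN is a hypothetical placeholder.
state_machine = {
    "Comment": "Minimal workflow with retry and catch",
    "StartAt": "ProcessData",
    "States": {
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{
                "ErrorEquals": ["States.ALL"],
                "Next": "Failed",
            }],
            "End": True,
        },
        "Failed": {"Type": "Fail"},
    },
}
print(json.dumps(state_machine, indent=2))
```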
57
Q

AWS services for any sort of ETL?

A

Glue
Batch
Data Pipeline
Step Functions (to orchestrate)