Data Analytics Flashcards

(33 cards)

1
Q

How many hours is data available in the moving time window that Kinesis Stream uses?

A

24 hours by default (can be extended to 7 days, or up to 365 days with long-term retention, for additional cost)

2
Q

How many MB per second does a single Shard in Kinesis allow for ingestion and consumption?

A

1 MB/s for ingestion

2 MB/s for consumption

3
Q

How many Shards does a Kinesis Stream have when newly created?

A

1

4
Q

What’s the maximum size of a single Kinesis Data Record?

A

1 MB

5
Q

How quickly is data delivered using Kinesis Firehose?

A

Near real-time: anywhere between 1 and 60 seconds, depending on the volume being ingested (i.e. how quickly the 1 MB buffer it uses fills up).
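Note: the buffering behaviour in this card can be pictured as a "flush on size or on time" rule. The sketch below is a toy model only, not Firehose's implementation. The class and method names are hypothetical, the 1 MB and 60 second thresholds are simply the figures from the card, and the time-based flush is checked lazily on the next `put` rather than by a timer.

```python
class BufferedDelivery:
    """Toy model: flush when the buffer is full or the interval has elapsed."""

    def __init__(self, max_bytes=1_000_000, max_seconds=60.0):
        self.max_bytes = max_bytes      # size threshold (card's 1 MB figure)
        self.max_seconds = max_seconds  # time threshold (card's 60 s figure)
        self.buffer = []
        self.buffered_bytes = 0
        self.window_start = None
        self.deliveries = []            # (flush_time, record_count) per batch

    def put(self, now, payload: bytes):
        # Deliver a stale buffer before accepting the new record.
        if self.buffer and now - self.window_start >= self.max_seconds:
            self._flush(now)
        if self.window_start is None:
            self.window_start = now
        self.buffer.append(payload)
        self.buffered_bytes += len(payload)
        # Deliver immediately once the size threshold is reached.
        if self.buffered_bytes >= self.max_bytes:
            self._flush(now)

    def _flush(self, now):
        self.deliveries.append((now, len(self.buffer)))
        self.buffer, self.buffered_bytes, self.window_start = [], 0, None


d = BufferedDelivery()
d.put(0.0, b"x" * 600_000)
d.put(1.0, b"x" * 600_000)  # buffer passes 1 MB: size-based flush at t=1
d.put(2.0, b"x" * 10)
d.put(70.0, b"x" * 10)      # oldest record waited over 60 s: time-based flush
print(d.deliveries)         # [(1.0, 2), (70.0, 1)]
```

High-volume streams hit the size threshold quickly and see low latency; low-volume streams wait for the time threshold.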

6
Q

What are the 11 valid destinations for Kinesis Firehose?

A
  • Amazon OpenSearch/Elasticsearch Service
  • Amazon Redshift
  • Amazon S3
  • HTTP endpoints
  • Datadog
  • Dynatrace
  • LogicMonitor
  • MongoDB
  • New Relic
  • Splunk
  • Sumo Logic
7
Q

How quickly is data delivered through Kinesis Streams?

A

In Real-Time (~ 200 ms)

Not to be confused with Kinesis Firehose, which delivers in near real-time only!

8
Q

What’s the right product to use when (potentially complex) real-time SQL processing is required?

A

Kinesis Data Analytics

9
Q

Which six third-party big data products does Amazon EMR provide as a managed service?

A
  • Spark
  • Hadoop (incl. Pig)
  • HBase
  • Hive
  • Hudi
  • Presto
10
Q

Is Amazon EMR a Multi-AZ or Single-AZ product?

A

Single-AZ

11
Q

What compute products can be used with Amazon EMR (i.e. which compute products are used to run EMR)?

A

EC2 & EKS

12
Q

What’s the master node used for with Amazon EMR?

A
  • manages the cluster and its health
  • distributes workloads
  • acts as the HDFS NameNode
  • allows SSH access to the cluster
  • if it’s the only node in the cluster: runs MapReduce workload
13
Q

What are core nodes used for with Amazon EMR?

A
  • provide the HDFS (Hadoop Distributed File System)
  • run task trackers
  • can run MapReduce workload

Note: losing a core node means losing HDFS data and track of tasks, so core nodes should not be run on Spot instances!
Note #2: Multi-node clusters have at least one core node.

14
Q

What are task nodes used for with Amazon EMR?

A
  • run MapReduce workload

Note: ideal to be run on Spot instances
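Note #2: the node roles from the last three cards can be written down as an instance-group layout. The sketch below uses the dictionary shape that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`; the group names, instance types and counts are placeholders, and `spot_risks` is a hypothetical helper, not part of any AWS SDK.

```python
# Instance groups in the shape used by boto3's EMR run_job_flow call.
# Names, instance types and counts are placeholders for illustration.
INSTANCE_GROUPS = [
    {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
     "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
     "InstanceType": "m5.xlarge", "InstanceCount": 4},
]


def spot_risks(groups):
    """Return names of groups that hold cluster state but run on Spot."""
    return [g["Name"] for g in groups
            if g["InstanceRole"] in ("MASTER", "CORE") and g["Market"] == "SPOT"]


print(spot_risks(INSTANCE_GROUPS))  # prints []
```

Only the task group uses Spot here, because task nodes hold no HDFS data and can be lost without losing cluster state.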

15
Q

What’s EMRFS?

A

S3-based file system for EMR. Can be used to store the results of EMR workloads so that they survive the loss or termination of the cluster.

16
Q

What’s the right product to use when you want to directly query S3 data via Redshift?

A

Redshift Spectrum

17
Q

Is Amazon Redshift a Multi-AZ or Single-AZ product?

18
Q

What’s the role of the Leader Node in Amazon Redshift?

A

Receive query input and distribute it to Compute nodes for execution

19
Q

If you want to customize the network options for Amazon Redshift, what do you need to enable?

A

Enhanced VPC Routing

20
Q

At which intervals are automatic snapshots taken with Amazon Redshift?

A

Every ~8 hours or after every ~5 GB of data changes per node, whichever comes first

21
Q

What are valid data sources for Amazon Redshift (name 7)?

A
  • Amazon S3
  • Amazon RDS
  • Amazon DynamoDB
  • Amazon EMR
  • AWS Glue
  • AWS Data Pipeline
  • SSH-enabled host on Amazon EC2 or on-premises
22
Q

What retention periods are available for automatic snapshots taken with Amazon Redshift?

A

Anything from 1 day (default) up to 35 days.

23
Q

What are valid data sources for AWS Batch?

A
  • AWS Step Functions
  • AWS Lambda
  • Amazon EventBridge
  • Amazon S3
24
Q

What’s the right product to use for long-running (> 15 minutes) compute tasks?

A

NOT AWS Lambda!

Use AWS Batch, EC2, or ECS instead, for example.

25
Q

Is AWS Batch serverless?

A

No

26
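Q

How many records per second does a single Shard in Kinesis allow for ingestion and consumption?

A

Ingestion: 1,000 records per second

Consumption: up to 5 read transactions per second (each GetRecords call can return up to 10,000 records), capped at 2 MB per second
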
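Note: the per-shard limits from the cards (1 MB/s and 1,000 records/s for ingestion, 2 MB/s for consumption) give a quick way to estimate how many shards a stream needs. This is a back-of-the-envelope sketch; the function name and the example workload are hypothetical.

```python
import math

# Per-shard limits as stated in the cards.
INGEST_MB_PER_SEC = 1.0
CONSUME_MB_PER_SEC = 2.0
INGEST_RECORDS_PER_SEC = 1000


def shards_needed(write_mb_per_sec, write_records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / INGEST_MB_PER_SEC)
    by_write_records = math.ceil(write_records_per_sec / INGEST_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / CONSUME_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# 3.5 MB/s in, 6,000 records/s in, 7 MB/s out: the record rate dominates.
print(shards_needed(3.5, 6000, 7.0))  # prints 6
```

Whichever limit is hit first decides the shard count, so many small records can require more shards than the byte rate alone suggests.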
27
Q

What is the default limit for the number of Shards in Kinesis?

A

500, but this can be raised on request with no fixed upper limit

28
Q

What does a Kinesis data record (the unit stored in a Shard) consist of?

A

Partition Key, Sequence Number, Data (blob)

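Note: the partition key decides which shard a record lands in. Kinesis hashes the key with MD5 into a 128-bit integer and picks the shard whose hash key range contains that value. The sketch below assumes the shards split the hash space evenly, which is a simplification (real ranges come from the stream's shard description), and `shard_index_for` is a hypothetical helper.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_index_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the remainder of an uneven split falls into the last shard.
    return min(hash_key // range_size, shard_count - 1)


for key in ("user-1", "user-2", "user-3"):
    print(key, "->", shard_index_for(key, shard_count=4))
```

Records with the same partition key always map to the same shard, which is what keeps them in order relative to each other.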
29
Q

What are the two main file systems used with Amazon EMR and what are their key differences?

A

HDFS and EMRFS.

HDFS is fast, but ephemeral. EMRFS is slower, but persistent, as it is backed by S3.

30
Q

What's the Amazon Kinesis Client Library (KCL) and when would you use it?

A

Library for reading and processing data from an Amazon Kinesis data stream. Removes some of the heavy lifting when working with stream data, and is therefore more efficient than using the Kinesis API directly.

31
Q

What are the differences between AWS Glue and AWS Data Pipeline with regard to the compute infrastructure (and control of that infrastructure)?

A
  • AWS Glue is serverless (but uses Apache Spark behind the scenes). No direct control over the compute resources.
  • AWS Data Pipeline spins up EMR clusters and EC2 instances, which can be accessed directly.
32
Q

What are the differences between AWS Glue and AWS Data Pipeline with regard to the engines they use?

A
  • AWS Glue uses a serverless Apache Spark engine and generates Scala or Python code.
  • AWS Data Pipeline uses Amazon EMR and through that is flexible on the engine (Spark, Hive, Hudi, Pig, etc.).
33
Q

What AWS service makes use of Apache Flink capabilities?

A

Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink)