Data Ingestion and Storage Flashcards

(23 cards)

1
Q

What are the main types of data?

A

Structured, unstructured and semi-structured data

2
Q

What is structured data?

A

Data organised according to a defined schema, typically found in relational databases

3
Q

What is unstructured data?

A

Data that doesn’t have a predefined structure or schema, e.g. audio files or raw text

4
Q

What is semi-structured data?

A

Data that has some structure (tags, keys or nesting) but no rigid schema, e.g. XML or JSON
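
A minimal sketch of why JSON counts as semi-structured: each record describes its own fields, but records need not share the same ones. The sample records and field names are made up for illustration.

```python
import json

# Two hypothetical records: both are valid JSON, but they do not share a fixed schema.
raw_records = [
    '{"id": 1, "name": "Ana", "tags": ["new"]}',
    '{"id": 2, "email": "bo@example.com", "address": {"city": "Leeds"}}',
]

records = [json.loads(line) for line in raw_records]

# The structure is discovered from the data itself rather than declared up front.
all_fields = sorted({key for record in records for key in record})
shared_fields = sorted(set.intersection(*(set(record) for record in records)))

print(all_fields)     # every field seen in any record
print(shared_fields)  # fields present in every record
```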

5
Q

What are the 3 Vs of data?

A

Volume, velocity, variety

6
Q

What is the core difference between a data warehouse and a data lake?

A

A data warehouse applies schema-on-write (the schema is enforced when data is loaded), so it typically uses ETL.
A data lake applies schema-on-read (structure is applied when data is queried), so it typically uses ELT.
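
The contrast can be sketched with a toy model. This is an illustration only, not how any real warehouse or lake is implemented; `SCHEMA`, `Warehouse` and `Lake` are hypothetical names invented for the example.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"user_id": int, "amount": float}


def validate(record: dict) -> dict:
    """Raise ValueError unless the record matches SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    return record


class Warehouse:
    """Schema-on-write: records are checked before they are stored."""

    def __init__(self):
        self.rows = []

    def load(self, record: dict) -> None:
        self.rows.append(validate(record))  # bad data is rejected at load time


class Lake:
    """Schema-on-read: raw text is stored as-is and interpreted when queried."""

    def __init__(self):
        self.objects = []

    def load(self, raw: str) -> None:
        self.objects.append(raw)  # anything is accepted

    def read(self) -> list:
        rows = []
        for raw in self.objects:
            try:
                rows.append(validate(json.loads(raw)))
            except ValueError:
                continue  # objects that do not fit the schema are skipped at read time
        return rows
```

The warehouse pays the validation cost once, on the way in; the lake accepts everything cheaply and defers the work (and the failures) to query time.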

7
Q

What is an advantage of a data warehouse over a data lake?

A

Optimised for fast, complex queries over structured data

8
Q

What is an advantage of a data lake over a data warehouse?

A

More flexible, scalable and cost-effective

9
Q

What is the idea of a data mesh?

A

Each team owns the data that it uses and knows the most about.
Each team must keep its data secure and compliant with centralised data security standards.

10
Q

What is Avro?

A

A row-based binary data format that stores both the data and its schema. Used in big data and real-time systems such as Kafka, Spark, Flink and Hadoop

11
Q

What is Parquet?

A

A columnar data format for analytics with efficient compression and encoding. Useful for queries that only need specific columns
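
The benefit of a columnar layout can be sketched in plain Python. This does not read or write Parquet; it only contrasts row-oriented and column-oriented storage on a made-up dataset, with run-length encoding as one example of an encoding that works well when a column's values sit together.

```python
# Hypothetical dataset in a row-oriented layout: one dict per record.
rows = [
    {"order_id": 1, "country": "UK", "total": 20.0},
    {"order_id": 2, "country": "FR", "total": 35.5},
    {"order_id": 3, "country": "UK", "total": 12.5},
]

# The same data in a column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: every whole record is visited to sum a single field.
total_from_rows = sum(row["total"] for row in rows)

# Column layout: only the "total" column is read; the others are never touched.
total_from_columns = sum(columns["total"])


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded
```

For example, `run_length_encode(sorted(columns["country"]))` stores the repeated country codes as a handful of pairs instead of one entry per row.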

12
Q

Can you decrease EBS volume size on-the-fly?

A

No, you can only increase it

13
Q

Can you change your EBS volume type without having to stop the instance or detach the volume?

A

Yes

14
Q

What output options does Kinesis Data Streams have?

A

KDF, Lambda, Managed Service for Apache Flink, or your own applications

15
Q

How long does Kinesis Data Streams hold data for replay?

A

Up to 365 days
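
Retention is configured in hours, so the 365-day ceiling works out to 8760 hours. The helper below is a small sketch of that arithmetic; `retention_hours` and the stream name are hypothetical, and the 24-hour lower bound reflects the default retention period of a stream.

```python
DEFAULT_RETENTION_HOURS = 24    # default retention for a new stream
MAX_RETENTION_HOURS = 365 * 24  # 8760 hours, the 365-day maximum


def retention_hours(days: int) -> int:
    """Convert a retention period in days to hours, checking the allowed range."""
    hours = days * 24
    if not DEFAULT_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError(f"retention must be between 1 and 365 days, got {days}")
    return hours


# With boto3 installed and AWS credentials configured, the value could be applied as:
#   import boto3
#   boto3.client("kinesis").increase_stream_retention_period(
#       StreamName="example-stream",  # hypothetical stream name
#       RetentionPeriodHours=retention_hours(7),
#   )
```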

16
Q

Is Kinesis Data Streams encrypted at rest and in-flight?

A

Yes - at rest with KMS, in-flight with HTTPS

17
Q

What are the two capacity modes of Kinesis Data Streams? How do you pay for each?

A

Provisioned and on-demand. In provisioned mode you pay per shard per hour; in on-demand mode you pay per stream per hour, plus per GB of data in and out.

18
Q

What are the differences between Kinesis Data Streams and Kinesis Data Firehose?

A

KDS is real-time, has fewer possible output destinations, and supports data storage and replay.
KDF is near-real-time, has more possible output destinations, and has no data storage or replay.

19
Q

What is Amazon Managed Service for Apache Flink?

A

Formerly Kinesis Data Analytics for Apache Flink.
A managed service for querying or transforming data as it is being streamed

20
Q

Where can Managed Service for Apache Flink take inputs from and put outputs to?

A

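Legacy SQL applications (Kinesis Data Analytics for SQL) take inputs from KDS and KDF and write outputs to KDS, KDF and Lambda.
Flink applications typically read from KDS or Amazon MSK.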

21
Q

What is Amazon Managed Streaming for Apache Kafka (MSK) an alternative to?

A

Kinesis Data Streams

22
Q

What is the maximum message size in Amazon MSK? How does this compare to KDS?

A

1 MB by default, configurable higher (e.g. 10 MB). KDS is limited to 1 MB.
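
A small sketch of checking a payload against the 1 MB and 10 MB figures from this card before choosing where to send it. The constant and function names are hypothetical, and treating 1 MB as 1,048,576 bytes is an assumption made for the example.

```python
MB = 1024 * 1024  # assumption: 1 MB is treated as 1,048,576 bytes here

KDS_MAX_RECORD_BYTES = 1 * MB            # the KDS figure from the card
MSK_EXAMPLE_MAX_MESSAGE_BYTES = 10 * MB  # the 10 MB MSK figure from the card


def fits(payload: bytes, limit_bytes: int) -> bool:
    """Return True if the payload is within the given size limit."""
    return len(payload) <= limit_bytes


def eligible_services(payload: bytes) -> list:
    """List which of the two services could accept the payload under these limits."""
    services = []
    if fits(payload, KDS_MAX_RECORD_BYTES):
        services.append("KDS")
    if fits(payload, MSK_EXAMPLE_MAX_MESSAGE_BYTES):
        services.append("MSK")
    return services
```

A 2 MB payload, for instance, would be rejected by the KDS check but accepted by the MSK one.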

23
Q

What is Amazon MSK Connect?

A

A managed Kafka Connect feature of Amazon MSK that runs connectors (plug-ins) to move data into and out of MSK.