Data Ingestion and Storage Flashcards
(23 cards)
What are the three types of data?
Structured, unstructured and semi-structured data
What is structured data?
Data organised in a defined manner or schema, typically found in relational databases
What is unstructured data?
Data that doesn’t have a predefined structure or schema, e.g. audio files or raw text
What is semi-structured data?
Data that has some structure but isn’t organised under a rigid schema, e.g. XML or JSON
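A minimal sketch of what "some structure" means in practice, using only Python's standard library. The two records are made up for illustration: both are valid JSON, but they do not share the same fields, so the reader has to cope with the gaps.

```python
import json

# Two hypothetical records: each is valid JSON, but the fields differ.
raw_records = [
    '{"id": 1, "name": "Ada", "tags": ["admin"]}',
    '{"id": 2, "email": "bob@example.com"}',
]

records = [json.loads(line) for line in raw_records]

# There is no fixed schema to consult, so collect every field actually seen.
all_fields = sorted({key for record in records for key in record})

# Fields that a record lacks must be handled explicitly by the reader.
names = [record.get("name", "<missing>") for record in records]

print(all_fields)
print(names)
```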
What are the 3 Vs of data?
Volume, velocity, variety
What is the core difference between a data warehouse and a data lake?
A data warehouse applies schema-on-write: data is transformed to fit the schema before it is loaded, so it uses ETL.
A data lake applies schema-on-read: raw data is loaded first and structured when it is queried, so it uses ELT.
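The difference can be sketched with two toy classes. This is an illustration only, not how any real warehouse or lake is implemented; `SCHEMA`, `conform`, `Warehouse` and `Lake` are hypothetical names chosen for the example.

```python
import json

# Hypothetical schema: field name -> type used to cast the value.
SCHEMA = {"user_id": int, "amount": float}


def conform(record):
    """Cast a record to SCHEMA; raises if a field is missing or cannot be cast."""
    return {field: cast(record[field]) for field, cast in SCHEMA.items()}


class Warehouse:
    """Schema-on-write: transform and validate before storing (ETL)."""

    def __init__(self):
        self.rows = []

    def load(self, record):
        self.rows.append(conform(record))  # bad records are rejected here


class Lake:
    """Schema-on-read: store raw data as-is, apply the schema when reading (ELT)."""

    def __init__(self):
        self.objects = []

    def load(self, raw):
        self.objects.append(raw)  # anything is accepted at load time

    def read(self):
        rows = []
        for raw in self.objects:
            try:
                rows.append(conform(json.loads(raw)))
            except (KeyError, TypeError, ValueError):
                continue  # records that do not fit are skipped at read time
        return rows


warehouse = Warehouse()
warehouse.load({"user_id": "7", "amount": "9.5"})

lake = Lake()
lake.load('{"user_id": 7, "amount": 9.5}')
lake.load("not json at all")
```

In this sketch the warehouse fails fast on bad input, while the lake accepts everything and defers the cost of cleaning to whoever reads it.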
What is an advantage of a data warehouse over a data lake?
Better for faster and more complex queries
What is an advantage of a data lake over a data warehouse?
More flexible, scalable and cost effective
What is the idea of a data mesh?
Each team owns the data that it uses and knows the most about.
Each team must keep its data secure and compliant with centralised standards on data security.
What is Avro?
A row-based binary data format that stores both the data and its schema. Used in big data and real-time systems such as Kafka, Spark, Flink and Hadoop.
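Avro schemas are themselves written in JSON. The sketch below builds one as a Python dict and serialises it with the standard library; the `User` record and its fields are hypothetical. Writing actual Avro files needs a third-party library, which is not shown here.

```python
import json

# A hypothetical Avro record schema, expressed as a Python dict.
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        # A union with "null" makes the field optional.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# This JSON text is what travels with the data in an Avro file.
schema_json = json.dumps(user_schema)

field_names = [field["name"] for field in json.loads(schema_json)["fields"]]
print(field_names)
```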
What is Parquet?
A columnar data format for analytics that compresses and encodes data efficiently. Useful for queries that only need specific columns.
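A toy illustration of why a column layout helps, in plain Python. This is not the Parquet file format, only the idea behind it; the sample rows and the `run_length_encode` helper are assumptions made for the example.

```python
from itertools import groupby

# Hypothetical rows, as a row-oriented store would hold them.
rows = [
    {"country": "UK", "year": 2023, "sales": 10},
    {"country": "UK", "year": 2023, "sales": 12},
    {"country": "FR", "year": 2023, "sales": 7},
]

# Column-oriented layout: one list per column instead of one dict per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query that needs only "sales" touches a single list.
total_sales = sum(columns["sales"])


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Values within a column are often repetitive, so they encode compactly.
encoded_country = run_length_encode(columns["country"])

print(total_sales)
print(encoded_country)
```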
Can you decrease EBS volume size on-the-fly?
No
Can you change your EBS volume type without restarting the instance or detaching the volume?
Yes
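The two EBS cards above can be captured in a small helper that builds the parameters for an EC2 ModifyVolume request and refuses a size decrease. `build_modify_volume_request` is a hypothetical helper, and the volume ID is a placeholder; no AWS call is made in this sketch.

```python
def build_modify_volume_request(volume_id, current_size_gib,
                                new_size_gib=None, new_type=None):
    """Return keyword arguments for an EC2 ModifyVolume request.

    Raises ValueError on a size decrease, since a volume cannot be
    shrunk in place.
    """
    params = {"VolumeId": volume_id}
    if new_size_gib is not None:
        if new_size_gib < current_size_gib:
            raise ValueError("EBS volume size can only be increased")
        params["Size"] = new_size_gib
    if new_type is not None:
        params["VolumeType"] = new_type
    return params


# Placeholder volume ID; grow from 100 GiB to 200 GiB and switch type.
request = build_modify_volume_request(
    "vol-0123456789abcdef0", current_size_gib=100,
    new_size_gib=200, new_type="gp3",
)
print(request)

# Assuming boto3 is installed and credentials are configured, the request
# could then be sent with: boto3.client("ec2").modify_volume(**request)
```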
What output options does Kinesis Data Streams have?
KDF, Lambda, Managed Service for Apache Flink, or your own applications
How long does Kinesis Data Streams hold data for replay?
Up to 365 days
Is Kinesis Data Streams encrypted at rest and in-flight?
Yes: at rest with KMS, in-flight with HTTPS
What are the two capacity modes of Kinesis Data Streams? How do you pay for each?
Provisioned and on-demand. In provisioned mode you pay per shard provisioned per hour; in on-demand mode you pay per stream per hour, plus per GB of data in and out.
What are the differences between Kinesis Data Streams and Kinesis Data Firehose?
KDS is real-time, has fewer possible output locations, and has replay and data storage.
KDF is near-real-time, has more possible output locations, and has no replay or data storage.
What is Amazon Managed Service for Apache Flink?
Formerly Kinesis Data Analytics.
It lets you query or transform data as it is being streamed.
Where can Managed Service for Apache Flink take inputs from and put outputs to?
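Inputs are KDF and KDS for the legacy Kinesis Data Analytics SQL applications; Flink applications read from KDS or Amazon MSK, not from KDF.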
Outputs are KDF, KDS and Lambda.
What is Amazon Managed Streaming for Apache Kafka (MSK) an alternative to?
Kinesis Data Streams
What is the maximum message size in Amazon MSK? How does this compare to KDS?
1MB by default but configurable higher, e.g. 10MB; KDS is 1MB
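A small sketch of how a producer might check a payload against these limits before sending. The constants take the figures from the card (1MB for KDS, 10MB for a suitably configured MSK cluster) and assume 1MB means 1024 * 1024 bytes; `fits` and `choose_targets` are hypothetical helpers.

```python
# Limits taken from the card; the byte interpretation is an assumption.
KDS_MAX_BYTES = 1 * 1024 * 1024
MSK_MAX_BYTES = 10 * 1024 * 1024


def fits(payload: bytes, limit: int) -> bool:
    """Return True if the payload is within the given size limit."""
    return len(payload) <= limit


def choose_targets(payload: bytes) -> list:
    """List which services could accept the payload as a single message."""
    targets = []
    if fits(payload, KDS_MAX_BYTES):
        targets.append("KDS")
    if fits(payload, MSK_MAX_BYTES):
        targets.append("MSK")
    return targets


small = b"x" * 1024                  # 1 KB
medium = b"x" * (5 * 1024 * 1024)    # 5 MB
large = b"x" * (11 * 1024 * 1024)    # 11 MB

print(choose_targets(small), choose_targets(medium), choose_targets(large))
```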
What is Amazon MSK Connect?
A managed Kafka Connect service: you deploy connectors as plug-ins to move data in and out of MSK.