Big Data Flashcards
What are the 3 V’s of Big Data?
- Volume (ranges from terabytes to petabytes of data)
- Variety (wide range of sources and formats)
- Velocity (businesses require speed; data needs to be collected, stored, processed, and analyzed within a short period of time)
What is Redshift?
Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It’s a very large relational database, traditionally used in big data applications.
What are the features that make Redshift different from a traditional relational database?
- Size - Redshift can hold up to 16 petabytes of data so you don’t have to split up your datasets into multiple databases
- Relational - very big relational database
- Usage - it isn’t a replacement for RDS; its focus is BI (business intelligence)
Is Redshift a highly available service?
No, it only comes online in one AZ; if you want it in multiple AZs, you will have to create multiple copies
What is ETL?
Extract-Transform-Load
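As a rough illustration of the three ETL stages, here is a minimal sketch in plain Python. The sample CSV text, the field names, and the list standing in for a warehouse are all hypothetical; a real pipeline would read from and write to actual data stores.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a file, queue, or API.
RAW_CSV = "order_id,amount\n1,10.50\n2,4.25\n3,not_a_number\n"


def extract(raw_text):
    """Extract: parse the CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: convert fields to numeric types and drop rows that fail to parse."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append(
                {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
            )
        except ValueError:
            continue  # skip malformed rows
    return cleaned


def load(rows, target):
    """Load: append the cleaned rows to the target store (a plain list here)."""
    target.extend(rows)
    return len(rows)


warehouse = []  # stand-in for a real destination such as a data warehouse table
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```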
What is EMR?
EMR is a managed big data platform that allows you to process vast amounts of data using open-source tools, such as Spark, Hive, HBase, Flink, Hudi and Presto.
It’s AWS’s ETL tool.
It’s not proprietary to Amazon.
What is the architecture of EMR?
When you spin up an EMR cluster, it will live inside of your VPC.
For the purposes of the exam, focus on EMR running on EC2 instances (it can also run on EKS and Outposts).
EMR spins up the instances for you, keeps them online, and manages them. It takes in data, processes it into the form you want, and then stores the results in an S3 bucket.
If you see a scenario asking about optimizing cost of EC2 instances in EMR, what options do you have?
You can use reserved instances and spot instances because you have control over the types of instances used.
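One way this might look in practice: the sketch below builds a request dictionary in the shape accepted by boto3's EMR `run_job_flow` call, placing the task nodes on the Spot market. The cluster name, release label, instance types, counts, and role names are assumptions chosen for illustration, and the code only builds the dictionary; it does not call AWS.

```python
def build_emr_cluster_config(name, task_instance_count):
    """Build keyword arguments for boto3's EMR run_job_flow (not sent anywhere here)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    # Reserved Instance pricing is a billing discount that applies
                    # to matching On-Demand usage; it is not requested here.
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 2,
                },
                {
                    "Name": "Task",
                    "InstanceRole": "TASK",
                    "Market": "SPOT",  # cheaper, but instances can be reclaimed
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": task_instance_count,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


config = build_emr_cluster_config("flashcard-demo", task_instance_count=4)
markets = {g["InstanceRole"]: g["Market"] for g in config["Instances"]["InstanceGroups"]}
print(markets)
```

With credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**config)`. Keeping Spot capacity on task nodes is a common choice because losing a task node does not remove data stored on the cluster.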
What is Kinesis?
Kinesis allows you to ingest, process, and analyze real-time streaming data. You can think of it as a huge data highway connected to your AWS account.
What are the two types of Kinesis?
Data Streams
- Purpose is real-time streaming for ingesting data
- Real-time, but a lot of work to put together
- You’re responsible for creating the consumer and scaling the stream
Firehose
- Data transfer tool to get information to S3, Redshift, Elasticsearch, or Splunk
- Near-real time (within 60 seconds), but much easier
- Plug-and-play with AWS architecture
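To make the difference on the producer side concrete, the sketch below builds the arguments that boto3's `put_record` expects for each service. The stream names and the sample event are hypothetical, and nothing is sent to AWS; the helpers only assemble the request dictionaries.

```python
import json


def data_streams_put_args(stream_name, event, partition_key):
    """Arguments for boto3 client("kinesis").put_record.

    Data Streams needs a partition key, which determines the shard
    the record lands on.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": partition_key,
    }


def firehose_put_args(delivery_stream_name, event):
    """Arguments for boto3 client("firehose").put_record.

    Firehose takes no partition key; it buffers records and delivers
    them to the configured destination.
    """
    return {
        "DeliveryStreamName": delivery_stream_name,
        # A trailing newline keeps records separable once batched into one object.
        "Record": {"Data": (json.dumps(event) + "\n").encode("utf-8")},
    }


event = {"sensor": "s-1", "temp_c": 21.5}  # hypothetical event
stream_args = data_streams_put_args("demo-stream", event, partition_key="s-1")
firehose_args = firehose_put_args("demo-delivery-stream", event)
print(stream_args)
print(firehose_args)
```

With credentials configured, these could be passed as `boto3.client("kinesis").put_record(**stream_args)` and `boto3.client("firehose").put_record(**firehose_args)`. For Data Streams you would still need to write a consumer; Firehose delivers to its destination without one.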
What is the architecture for Kinesis Data Streams?
What is the architecture for Kinesis Firehose?
What do you use if you need to analyze data as it is flowing through Kinesis Data Streams or Firehose?
Kinesis Data Analytics (using standard SQL)
- Easy to tie Data Analytics into your Kinesis Pipeline; it’s directly supported by Data Firehose and Data Streams
- No servers - it is a fully managed, real-time serverless service
- Cost - you pay for the data that passes through
When you are looking for a messaging broker, which do you pick?
- SQS (simple, doesn’t require much configuration, doesn’t offer real-time message delivery)
- Kinesis (a bit more complicated to configure and mostly used in big data applications, but it does provide real-time communication)
If you are given a scenario where you need a message broker that delivers in real-time, what would you recommend?
Kinesis (Data Streams)
If you are given a scenario where you need a message broker that delivers in near real-time, what would you recommend?
Kinesis Data Firehose
If a scenario talks about streaming data, what service would you recommend?
Some form of Kinesis
If you are given a scenario that needs to automatically scale your streaming service, what service would you recommend?
Kinesis Data Firehose (only option that offers automatic scaling)
What is Athena?
Athena is a serverless, interactive query service that makes it easy to analyze data in S3 using SQL. It lets you query data directly in your S3 bucket without loading it into a database.
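As a small sketch of what querying S3 data through Athena might involve, the helper below assembles the arguments for boto3's Athena `start_query_execution` call. The database, table, and bucket names are hypothetical, and the code does not contact AWS.

```python
def athena_query_args(sql, database, output_location):
    """Arguments for boto3 client("athena").start_query_execution."""
    return {
        "QueryString": sql,
        # The database is a catalog entry whose tables point at data in S3.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


query_args = athena_query_args(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",  # hypothetical table
    database="demo_logs_db",  # hypothetical database name
    output_location="s3://demo-athena-results/",  # hypothetical bucket
)
print(query_args)
```

With credentials configured, this could be passed as `boto3.client("athena").start_query_execution(**query_args)`. The call returns a query execution ID, and the results are retrieved once the query finishes.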