Big Data Flashcards by Jana Dohman

What are the 3 V’s of Big Data?

Volume (ranges from terabytes to petabytes of data)
Variety (wide range of sources and formats)
Velocity (businesses require speed; data needs to be collected, stored, processed, and analyzed within a short period of time)

What is Redshift?

Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It’s a very large relational database traditionally used in big data applications.

What are the features that make Redshift different from a traditional relational database?

Size - Redshift can hold up to 16 petabytes of data so you don’t have to split up your datasets into multiple databases
Relational - very big relational database
Usage - it isn’t a replacement for RDS; its focus is BI

Is Redshift a highly available service?

No, it only comes online in one AZ; if you want it in multiple AZs, you will have to create multiple copies

What is ETL?

Extract-Transform-Load
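
As a rough illustration of the three steps, here is a minimal sketch in plain Python. The sample CSV, the field names, and the `warehouse_table` list (standing in for a real warehouse table) are all assumptions made up for this example; it is not how any AWS service implements ETL.

```python
import csv
import io

# Hypothetical raw input; the second row has a missing amount.
RAW_CSV = """user_id,amount,currency
1,10.50,usd
2,,usd
3,7.25,eur
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount, fix types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append({
            "user_id": int(row["user_id"]),
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),
        })
    return cleaned

def load(rows, target):
    """Load: append cleaned rows to the target store and report the count."""
    target.extend(rows)
    return len(rows)

warehouse_table = []  # stand-in for a warehouse table
loaded = load(transform(extract(RAW_CSV)), warehouse_table)
```

After running, `warehouse_table` holds only the two complete rows, with numeric types and upper-case currency codes.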

What is EMR?

EMR is a managed big data platform that allows you to process vast amounts of data using open-source tools, such as Spark, Hive, HBase, Flink, Hudi and Presto.

It’s AWS’s ETL tool.

It’s not proprietary to Amazon.

What is the architecture of EMR?

When you spin up an EMR cluster, it will live inside of your VPC.

For the purposes of the exam, focus on EMR running on EC2 instances (it can also run on EKS and Outposts).

EMR will spin up the instances for you, keep them online, and manage them. It takes in data, processes it into the form you want, and then stores the result in an S3 bucket.

If you see a scenario asking about optimizing cost of EC2 instances in EMR, what options do you have?

You can use reserved instances and spot instances because you have control over the types of instances used.

What is Kinesis?

Allows you to ingest, process and analyze real-time streaming data. You can think of it as a huge data highway connected to your AWS account.
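
To make the "ingest" side concrete, the sketch below shows what a producer might look like. It assumes a client object exposing `put_record(StreamName=..., Data=..., PartitionKey=...)`, which is the shape of boto3's Kinesis client; the `FakeKinesisClient`, the stream name `clickstream`, and the event fields are hypothetical stand-ins so the example runs without AWS, and the fake response is simplified.

```python
import json

def send_event(client, stream_name, event):
    """Serialize one event as JSON and put it on the stream.

    `client` is assumed to expose put_record like boto3's Kinesis client.
    The user id is used as the partition key, so one user's events stay together.
    """
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),
    )

class FakeKinesisClient:
    """Hypothetical in-memory stand-in so the sketch runs without AWS."""

    def __init__(self):
        self.records = []

    def put_record(self, StreamName, Data, PartitionKey):
        self.records.append((StreamName, Data, PartitionKey))
        # Simplified response; a real client returns more fields.
        return {"SequenceNumber": str(len(self.records))}

fake = FakeKinesisClient()
response = send_event(fake, "clickstream", {"user_id": 42, "page": "/home"})
```

Passing the client in as a parameter keeps the producer logic testable; in a real account you would hand it an actual Kinesis client instead of the fake.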

What are the two types of Kinesis?

Data Streams

Purpose is real-time streaming for ingesting data
Real-time, but a lot of work to put together
You’re responsible for creating the consumer and scaling the stream

Firehose

Data transfer tool to get information to S3, Redshift, Elasticsearch, or Splunk
Near-real time (within 60 seconds), but much easier
Easier to plug-and-play with AWS architecture
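
One way to picture why Firehose is near-real time rather than real time: it buffers incoming records and delivers them in batches once a size or time threshold is reached. The sketch below is a simplified simulation of that idea, not Firehose's actual implementation. The class name, the thresholds, and the explicit `now` argument are assumptions for illustration, and unlike a real service it only checks the age threshold when a new record arrives.

```python
class BufferedDelivery:
    """Hypothetical buffer that flushes on record count or on age."""

    def __init__(self, max_records=500, max_age_seconds=60.0):
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self._buffer = []
        self._first_arrival = None

    def put(self, record, now):
        """Buffer a record; return the flushed batch if a threshold is hit, else None."""
        if not self._buffer:
            self._first_arrival = now  # age is measured from the oldest buffered record
        self._buffer.append(record)
        too_many = len(self._buffer) >= self.max_records
        too_old = now - self._first_arrival >= self.max_age_seconds
        if too_many or too_old:
            batch, self._buffer = self._buffer, []
            return batch
        return None

delivery = BufferedDelivery(max_records=3, max_age_seconds=60.0)
# Three quick records: the third one fills the buffer and triggers a flush.
by_size = [delivery.put(r, now=t) for r, t in [("a", 0.0), ("b", 1.0), ("c", 2.0)]]
# Two slow records: the second arrives 65 seconds after the first, so age triggers the flush.
by_age = [delivery.put(r, now=t) for r, t in [("d", 10.0), ("e", 75.0)]]
```

With Data Streams, by contrast, each record is available to consumers as soon as it is written, which is why the consumer side is yours to build.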

What is the architecture for Kinesis Data Streams?

What is the architecture for Kinesis Firehose?

What do you use if you need to analyze data as it is flowing through Kinesis Data Stream or Firehose?

Kinesis Data Analytics (using standard SQL)

Easy to tie Data Analytics into your Kinesis Pipeline; it’s directly supported by Data Firehose and Data Streams
No servers - it is a fully managed, real-time serverless service
Cost - you pay for the data that passes through

When you are looking for a messaging broker, which do you pick?

SQS (simple, doesn’t require much configuration, but doesn’t offer real-time message delivery)
Kinesis (a bit more complicated to configure and mostly used in big data applications, but it does provide real-time communication)

If you are given a scenario where you need a message broker that delivers in real-time, what would you recommend?

Kinesis (Data Streams)

If you are given a scenario where you need a message broker that delivers in near real-time, what would you recommend?

Kinesis Data Firehose

If a scenario talks about streaming data, what service would you recommend?

Some form of Kinesis

If you are given a scenario that needs to automatically scale your streaming service, what service would you recommend?

Kinesis Data Firehose (only option that offers automatic scaling)

What is Athena?

Athena is a serverless, interactive query service that makes it easy to analyze data in S3 using SQL. This allows you to directly query data in your S3 bucket without loading it into a database.
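
A hedged sketch of what submitting such a query could look like from code. It assumes a client exposing `start_query_execution` with the `QueryString`, `QueryExecutionContext`, and `ResultConfiguration` parameters, as boto3's Athena client does. The `FakeAthenaClient`, the `access_logs` table, the `logs_db` database, and the `s3://example-bucket/athena-results/` location are hypothetical names invented for this example.

```python
def run_athena_query(client, sql, database, output_location):
    """Submit a SQL query and return its execution id.

    `client` is assumed to behave like boto3's Athena client. Results are
    written to the given S3 location; nothing is loaded into a database.
    """
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]

class FakeAthenaClient:
    """Hypothetical stand-in that records calls so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def start_query_execution(self, **kwargs):
        self.calls.append(kwargs)
        return {"QueryExecutionId": "query-%d" % len(self.calls)}

# Hypothetical table of web server logs stored as files in S3.
SQL = "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status"

fake = FakeAthenaClient()
query_id = run_athena_query(
    fake, SQL, "logs_db", "s3://example-bucket/athena-results/"
)
```

The query runs asynchronously in the real service, so a caller would normally poll for completion using the returned id before reading results.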

What is AWS Glue?

Glue is a serverless data integration service that makes it easy to discover, prepare and combine data. It allows you to perform ETL workloads without managing underlying servers.

For serverless ETL workloads, it can replace EMR.

How do you put Athena and Glue together?

Point Glue at the data in S3 to build a catalog.

Once that is built, you have a couple of options:

Use Amazon Redshift Spectrum (a flavor of Redshift that lets you query the data without loading all of it into Redshift)
If you aren’t already using Redshift, another option is to use Athena to take the data that was structured by Glue and run queries on it without having to load it into a database

You could then use something like QuickSight (Amazon’s version of Tableau) to visualize the data in a dashboard

If you are given a scenario that asks about needing a serverless SQL solution to query BI data or logs, what service would you recommend?

Athena

What is QuickSight?

Amazon QuickSight is a fully managed business intelligence (BI) data visualization service. It allows you to easily create dashboards and share them within your company.

Similar to Tableau

What are common ways to incorporate QuickSight into an architecture?

Somewhere you need a data visualization tool (integrates with Athena)

If you are given a scenario that talks about sharing your data, interpreting that data, or anything related to business intelligence, what service would you recommend?

Possibly QuickSight

What is AWS Data Pipeline?

A managed Extract, Transform, Load (ETL) service for automating movement and transformation of your data
1. Define data-driven workflows (steps are dependent on previous tasks completing successfully)
2. Define parameters (it enforces your chosen logic)
3. Highly available
4. Handles failures (automatically retries failed activities and can notify SNS for failures or even successes)
5. Integrates with many data storage services (DynamoDB, RDS, Redshift, S3)
6. Works closely with EC2 and EMR for compute needs
Use data-driven workflows to create dependencies between tasks and activities
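
The combination of dependencies and automatic retries can be sketched in a few lines of plain Python. This is only a toy model of the idea, not the AWS Data Pipeline engine: the `run_workflow` function, the activity names, the status labels, and the retry count of three are all assumptions chosen for illustration.

```python
def run_workflow(activities, max_attempts=3):
    """Run activities in order, honoring dependencies and retrying failures.

    `activities` is a list of (name, depends_on, func) tuples, listed so that
    every dependency appears before the activity that needs it. Returns a dict
    mapping each name to "FINISHED", "FAILED", or "SKIPPED" (illustrative labels).
    """
    status = {}
    for name, depends_on, func in activities:
        # A step only runs if every step it depends on finished successfully.
        if any(status.get(dep) != "FINISHED" for dep in depends_on):
            status[name] = "SKIPPED"
            continue
        status[name] = "FAILED"
        for _attempt in range(max_attempts):
            try:
                func()
                status[name] = "FINISHED"
                break
            except Exception:
                continue  # retry until attempts run out
    return status

attempts = {"copy": 0}

def export_rds():
    pass  # succeeds immediately

def copy_to_s3():
    attempts["copy"] += 1
    if attempts["copy"] < 3:
        raise RuntimeError("transient failure")  # succeeds on the third try

def load_redshift():
    raise RuntimeError("permanent failure")  # never succeeds

def report():
    pass  # never reached, because its dependency fails

result = run_workflow([
    ("export_rds", [], export_rds),
    ("copy_to_s3", ["export_rds"], copy_to_s3),
    ("load_redshift", ["copy_to_s3"], load_redshift),
    ("report", ["load_redshift"], report),
])
```

Here the transient failure is absorbed by the retries, while the permanent failure stops everything downstream of it, which is the point at which a failure notification would be sent.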

What is a Pipeline Definition in AWS Data Pipeline?

Specify the business logic of your data management needs

What is Managed Compute in AWS Data Pipeline?

Service will create EC2 instances to perform your activities (or leverage existing EC2s)

What are Task Runners in AWS Data Pipeline?

Task runners (EC2) poll for different tasks and perform them when found

What are Data Nodes in AWS Data Pipeline?

Define the locations and types of data that will be input and output

What are Activities in AWS Data Pipeline?

Pipeline components that define the work to perform

What are some popular use cases for AWS Data Pipeline?

1. Processing data in EMR using Hadoop streaming
2. Importing or exporting DynamoDB data
3. Copying CSV files or data between S3 buckets
4. Exporting RDS data to S3
5. Copying data to Redshift

What services integrate with AWS Data Pipeline?

Storage Integrations: DynamoDB, RDS, Redshift, S3
Compute Integrations: EC2 and EMR
Notifications: SNS for failure (or success) notifications

If you are given a scenario that asks about managed ETL services with automatic retries for data driven workflows, what service would you recommend?

AWS Data Pipeline

What is Amazon MSK?

Fully managed streaming service for running data streaming applications that leverage Apache Kafka
Provides control-plane operations (creates, updates, deletes clusters as required)
Leverage the Kafka data plane for producing and consuming streaming data

What is Amazon MSK best used for?

Existing applications that need to leverage the open source version of Apache Kafka

What is a Broker Node in Amazon MSK?

Specify the number of broker nodes per AZ you want at the time of cluster creation

What is a Zookeeper Node in Amazon MSK?

Zookeeper nodes are created for you

What are Producers, Consumers and Topics in Amazon MSK?

Kafka data-plane operations allow creation of topics and ability to produce/consume data
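
For readers new to the terms, the toy model below shows how a topic, a producer, and consumers relate. It is a deliberately simplified in-memory sketch, not a Kafka client: the `Topic` and `Consumer` classes are hypothetical, and it ignores partitions, consumer groups, and brokers entirely.

```python
class Topic:
    """Append-only log; each record's position in the list is its offset."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def produce(self, value):
        """Producer side: append a record and return its offset."""
        self.log.append(value)
        return len(self.log) - 1

class Consumer:
    """Tracks its own offset, so several consumers can read one topic independently."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self, max_records=10):
        """Return the next unread records and advance this consumer's offset."""
        batch = self.topic.log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

orders = Topic("orders")
offsets = [orders.produce(v) for v in ("created", "paid", "shipped")]

billing = Consumer(orders)
audit = Consumer(orders)
first = billing.poll(max_records=2)   # billing reads the first two records
second = billing.poll()               # then picks up where it left off
audit_all = audit.poll()              # audit reads the whole log on its own schedule
```

The key property illustrated is that consuming does not remove records from the topic; each consumer simply remembers how far it has read.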

What are Flexible Cluster Operations in Amazon MSK?

Performs cluster operations with the console, CLI, or APIs within the SDK

How resilient is Amazon MSK?

1. Automatic Recovery (detection and recovery from common failure scenarios with minimal impact)
2. Detection (detected broker failures result in mitigation or replacement of unhealthy nodes)
3. Reduce data (tries to reuse storage from older brokers during failures to reduce the data needing replication)
4. Impact time is limited to however long it takes Amazon MSK to complete detection and recovery
5. After recovery, producer and consumer apps continue to communicate with the same broker IP as before

What is Amazon MSK Serverless?

A cluster type within Amazon MSK offering serverless cluster management (automatic provisioning and scaling)
Fully compatible with Apache Kafka

What is Amazon MSK Connect?

Allows developers to easily stream data to and from Apache Kafka clusters.

What are the security and logging features of Amazon MSK?

1. Integrates with AWS KMS for server-side encryption requirements
2. Encryption at rest is the default
3. Uses TLS 1.2 for encryption in transit between brokers in clusters
4. Broker logs can be delivered to CloudWatch, S3, and Kinesis Data Firehose
5. By default, metrics are collected and sent to CloudWatch
6. All API calls are tracked using CloudTrail

What is Amazon OpenSearch?

Managed service allowing you to run search and analytics engines for various use cases
It is the successor to Amazon Elasticsearch Service

What are the features of Amazon OpenSearch?

1. Quick Analysis (quickly ingest, search and analyze data in your clusters; commonly part of an ETL process)
2. Highly scalable services
3. Ties into IAM, encryption at rest and field-level security
4. Multi-AZ capable with master nodes and automated snapshots
5. Allows you to use SQL for BI apps
6. Easily integrates with CloudWatch, CloudTrail, S3 and Kinesis

If you are given a scenario on the exam that talks about creating a logging solution involving visualization of log file analytics or BI reports, what service might be included?

Amazon OpenSearch

What is the predecessor to Amazon OpenSearch?

Amazon Elasticsearch Service (the same concepts apply)

Big Data Flashcards

(48 cards)