Big Data Flashcards

1
Q

What are the 3 V’s of Big Data?

A
  1. Volume (ranges from terabytes to petabytes of data)
  2. Variety (wide range of sources and formats)
  3. Velocity (businesses require speed; data needs to be collected, stored, processed, and analyzed within a short period of time)
2
Q

What is Redshift?

A

Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It’s a very large relational database traditionally used in big data applications.

3
Q

What are the features that make Redshift different from a traditional relational database?

A
  1. Size - Redshift can hold up to 16 petabytes of data, so you don’t have to split up your datasets across multiple databases
  2. Relational - it’s a very large relational database
  3. Usage - it isn’t a replacement for RDS; its focus is BI
4
Q

Is Redshift a highly available service?

A

No. It only runs in one AZ; if you want it in multiple AZs, you will have to create multiple copies of the cluster yourself.

5
Q

What is ETL?

A

Extract, Transform, Load: extracting data from its sources, transforming it into the format you need, and loading it into a destination data store (such as a data warehouse).

6
Q

What is EMR?

A

EMR is a managed big data platform that allows you to process vast amounts of data using open-source tools, such as Spark, Hive, HBase, Flink, Hudi and Presto.

It’s AWS’s ETL tool.

The tools it runs are open source, not proprietary to Amazon.

7
Q

What is the architecture of EMR?

A

When you spin up an EMR cluster, it will live inside of your VPC.

For the purposes of the exam, the focus is on using EC2 instances (but EMR can also run on EKS and Outposts).

EMR spins up the instances for you, keeps them online, and manages them. It takes in data, processes it into the form you want, and then stores the results in an S3 bucket.
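
As a rough illustration (not part of the original card), here is a minimal boto3 sketch of provisioning such a cluster on EC2. The cluster name, release label, instance types, log bucket, and IAM role names are placeholder assumptions:

```python
# Minimal sketch: spin up a small EMR cluster on EC2 with boto3.
# All names, the release label, and the roles below are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-etl-cluster",                 # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",                  # assumed release label
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,    # keep the cluster running between steps
    },
    LogUri="s3://example-bucket/emr-logs/",     # hypothetical log bucket
    JobFlowRole="EMR_EC2_DefaultRole",          # assumed default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])                    # cluster (job flow) ID
```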

8
Q

If you see a scenario asking about optimizing cost of EC2 instances in EMR, what options do you have?

A

You can use reserved instances and spot instances because you have control over the types of instances used.

9
Q

What is Kinesis?

A

Allows you to ingest, process and analyze real-time streaming data. You can think of it as a huge data highway connected to your AWS account.

10
Q

What are the two types of Kinesis?

A

Data Streams

  1. Purpose is real-time streaming for ingesting data
  2. Real-time, but a lot of work to put together
  3. You’re responsible for creating the consumer and scaling the stream (see the producer/consumer sketch after this list)

Firehose

  1. Data transfer tool to get information to S3, Redshift, Elasticsearch, or Splunk
  2. Near real-time (within 60 seconds), but much easier to set up
  3. Plug and play with AWS architecture
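
As a hedged illustration of the extra work Data Streams involves, here is a minimal boto3 sketch of the producer and consumer sides; the stream name, shard ID, and payload are placeholder assumptions:

```python
# Minimal sketch: write to and read from a Kinesis Data Stream with boto3.
# Stream name, shard ID, and payload are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer side: put a record onto the stream.
event = {"sensor_id": "sensor-42", "temperature": 21.7}     # hypothetical event
kinesis.put_record(
    StreamName="example-stream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["sensor_id"],            # decides which shard receives the record
)

# Consumer side (the part you are responsible for building and scaling).
shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId="shardId-000000000000",             # placeholder shard ID
    ShardIteratorType="TRIM_HORIZON",           # read from the oldest available record
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=shard_iterator, Limit=10)["Records"]
```
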
11
Q

What is the architecture for Kinesis Data Streams?

A

Producers (applications, EC2 instances, IoT devices, etc.) write records into the stream, which is made up of shards. You build the consumers (e.g., EC2 instances or Lambda) that read from the shards, process the data, and hand it off to storage or analytics services such as S3, DynamoDB, Redshift, or EMR.
12
Q

What is the architecture for Kinesis Firehose?

A

Producers send data to Firehose; there are no shards to manage and no consumers to build. Firehose can optionally transform the data (e.g., with a Lambda function) and then delivers it to a destination such as S3, Redshift, Elasticsearch/OpenSearch, or Splunk.
13
Q

What do you use if you need to analyze data as it is flowing through Kinesis Data Stream or Firehose?

A

Kinesis Data Analytics (using standard SQL)

  1. Easy to tie Data Analytics into your Kinesis Pipeline; it’s directly supported by Data Firehose and Data Streams
  2. No servers - it is a fully managed, real-time serverless service
  3. Cost - you pay for the data that passes through
14
Q

When you are looking for a messaging broker, which do you pick?

A
  1. SQS (simple, doesn’t require much configuration, doesn’t offer real-time message delivery)
  2. Kinesis (a bit more complicated to configure, mostly used in big data applications, and it does provide real-time message delivery)
15
Q

If you are given a scenario where you need a message broker that delivers in real-time, what would you recommend?

A

Kinesis (Data Streams)

16
Q

If you are given a scenario where you need a message broker that delivers in near real-time, what would you recommend?

A

Kinesis Data Firehose

17
Q

If a scenario talks about streaming data, what service would you recommend?

A

Some form of Kinesis

18
Q

If you are given a scenario that needs to automatically scale your streaming service, what service would you recommend?

A

Kinesis Data Firehose (only option that offers automatic scaling)

19
Q

What is Athena?

A

Athena is a serverless, interactive query service that makes it easy to analyze data in S3 using SQL. This allows you to directly query data in your S3 bucket without loading it into a database.
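
As a rough illustration (not part of the original card), here is a minimal boto3 sketch of running such a query; the database, table, and result bucket are placeholder assumptions:

```python
# Minimal sketch: query data sitting in S3 with Athena via boto3.
# Database, table, and result bucket are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "example_db"},   # hypothetical catalog database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
# After polling get_query_execution() until the query has SUCCEEDED:
results = athena.get_query_results(QueryExecutionId=query["QueryExecutionId"])
```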

20
Q

What is AWS Glue?

A

Glue is a serverless data integration service that makes it easy to discover, prepare and combine data. It allows you to perform ETL workloads without managing underlying servers.

In essence, it can replace EMR.

21
Q

How do you put Athena and Glue together?

A

Point Glue at the data in S3 to build a catalog.

Once that is built, you have a couple of options:

  1. Use Amazon Redshift Spectrum (a flavor of Redshift that lets you query the data in S3 without loading it all into Redshift)
  2. If you aren’t already using Redshift, another option is to use Athena to query the data that was structured by Glue without having to load it into a database

You could then use something like Quicksight (Amazon’s version of Tableau) to visualize the data in a dashboard.
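
As a hedged sketch of the first step (pointing Glue at S3 to build the catalog), here is a minimal boto3 example; the crawler name, IAM role, database, and S3 path are placeholder assumptions:

```python
# Minimal sketch: a Glue crawler that catalogs raw data in S3 so Athena or
# Redshift Spectrum can query it in place. All names and paths are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="example-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",   # hypothetical role
    DatabaseName="example_db",                                      # catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw-data/"}]},
)
glue.start_crawler(Name="example-crawler")
# Once the crawl finishes, the resulting tables can be queried with Athena or
# Redshift Spectrum without moving the data out of S3.
```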

22
Q

If you are given a scenario that asks about needing a serverless SQL solution to query BI data or logs, what service would you recommend?

A

Athena

23
Q

What is Quicksight?

A

Amazon Quicksight is a fully managed business intelligence (BI) data visualization service. It allows you to easily create dashboards and share them within your company.

Similar to Tableau

24
Q

What are common ways to incorporate Quicksight into an architecture?

A

Anywhere you need a data visualization tool (it integrates well with Athena)

25
Q

If you are given a scenario that talks about sharing your data, interpreting that data, or anything related to business intelligence, what service would you recommend?

A

Possibly Quicksight

26
Q

What is AWS Data Pipeline?

A

A managed Extract, Transform, Load (ETL) service for automating movement and transformation of your data

  1. Define data-driven workflows (steps are dependent on previous tasks completing successfully)
  2. Define parameters (it enforces your chosen logic)
  3. Highly available
  4. Handles failures (automatically retries failed activities and can notify SNS of failures or even successes)
  5. Integrates with many data storage services (DynamoDB, RDS, Redshift, S3)
  6. Works closely with EC2 and EMR for compute needs

Use data-driven workflows to create dependencies between tasks and activities
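
As a hedged illustration (not part of the original card), here is a minimal boto3 sketch of creating and activating a pipeline; the names and the stripped-down definition are placeholders, and a real definition would add data nodes, activities, and a schedule:

```python
# Minimal sketch: create, define, and activate an AWS Data Pipeline with boto3.
# Names and the definition below are placeholders.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(
    name="example-pipeline",
    uniqueId="example-pipeline-001",            # idempotency token
)["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            ],
        },
        # ...data nodes and activities (e.g. an S3-to-Redshift copy) would go here...
    ],
)
dp.activate_pipeline(pipelineId=pipeline_id)
```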

27
Q

What is a Pipeline Definition in AWS Data Pipeline?

A

Specify the business logic of your data management needs

28
Q

What is Managed Compute in AWS Data Pipeline?

A

Service will create EC2 instances to perform your activities (or leverage existing EC2s)

29
Q

What are Task Runners in AWS Data Pipeline?

A

Task runners (EC2) poll for different tasks and perform them when found

30
Q

What are Data Nodes in AWS Data Pipeline?

A

Define the locations and types of data that will be input and output

31
Q

What are Activities in AWS Data Pipeline?

A

Pipeline components that define the work to perform

32
Q

What are some popular use cases for AWS Data Pipeline?

A
  1. Processing data in EMR using Hadoop streaming
  2. Importing or exporting DynamoDB data
  3. Copying CSV files or data between S3 buckets
  4. Exporting RDS data to S3
  5. Copying data to Redshift
33
Q

What services integrate with AWS Data Pipeline?

A

Storage Integrations:
DynamoDB, RDS, Redshift, S3

Compute Integrations:
EC2 and EMR

Notifications:
SNS for failure (or success) notifications

34
Q

If you are given a scenario that asks about a managed ETL service with automatic retries for data-driven workflows, what service would you recommend?

A

AWS Data Pipeline

35
Q

What is Amazon MSK?

A

Fully managed streaming service for running data streaming applications that leverage Apache Kafka

Provides control-plane operations (creates, updates, deletes clusters as required)

You leverage the Kafka data plane for producing and consuming streaming data

36
Q

What is Amazon MSK best used for?

A

Existing applications that need to leverage the open source version of Apache Kafka

37
Q

What is a Broker Node in Amazon MSK?

A

Specify the number of broker nodes per AZ you want at the time of cluster creation

38
Q

What is a Zookeeper Node in Amazon MSK?

A

Zookeeper nodes are created for you

39
Q

What are Producers, Consumers and Topics in Amazon MSK?

A

Kafka data-plane operations allow you to create topics and to produce and consume data
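
As a hedged sketch of those data-plane operations, here is a minimal producer/consumer example using the kafka-python package (an assumption; any Kafka client works). The bootstrap-broker string and topic name are placeholders you would get from your MSK cluster:

```python
# Minimal sketch: produce to and consume from an MSK topic with kafka-python.
# Broker endpoint and topic name are placeholders; TLS (port 9094) is assumed.
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP = "b-1.example.kafka.us-east-1.amazonaws.com:9094"    # hypothetical broker

producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, security_protocol="SSL")
producer.send("example-topic", b"hello from a producer")
producer.flush()

consumer = KafkaConsumer(
    "example-topic",
    bootstrap_servers=BOOTSTRAP,
    security_protocol="SSL",
    auto_offset_reset="earliest",               # start from the oldest message
)
for message in consumer:
    print(message.value)
    break
```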

40
Q

What are Flexible Cluster Operations in Amazon MSK?

A

You can perform cluster operations with the console, the CLI, or the APIs in the SDK

41
Q

How resilient is Amazon MSK?

A
  1. Automatic Recovery (detection and recovery from common scenarios with minimal impact)
  2. Detection (detected broker failures result in mitigation or replacement of unhealthy nodes)
  3. Reduced replication (tries to reuse storage from older brokers during failures to reduce the data that needs to be replicated)
  4. Impact time is limited to however long it takes Amazon MSK to complete detection and recovery
  5. After recovery, producer and consumer apps continue to communicate with the same broker IP as before
42
Q

What is Amazon MSK Serverless?

A

A cluster type within Amazon MSK offering serverless cluster management (automatic provisioning and scaling)

Fully compatible with Apache Kafka

43
Q

What is Amazon MSK Connect?

A

Allows developers to easily stream data to and from their Apache Kafka clusters (a managed Kafka Connect offering).

44
Q

What are the security and logging features of Amazon MSK?

A
  1. Integrates with Amazon KMS for server-side encryption requirements
  2. Encryption at rest is the default
  3. Uses TLS 1.2 for encryption in transit between brokers in clusters
  4. Broker logs can be delivered to CloudWatch, S3, and Kinesis Data Firehose
  5. By default, metrics are collected and sent to CloudWatch
  6. All API calls are tracked using CloudTrail
45
Q

What is Amazon OpenSearch?

A

Managed service allowing you to run search and analytics engines for various use cases

It is the successor to Amazon Elasticsearch Service

46
Q

What are the features of Amazon OpenSearch?

A
  1. Quick Analysis (quickly ingest, search, and analyze data in your clusters; commonly part of an ETL process; see the sketch after this list)
  2. Highly scalable service
  3. Ties into IAM, encryption at rest and field-level security
  4. Multi-AZ capable with master nodes and automated snapshots
  5. Allows you to use SQL for BI apps
  6. Easily integrates with CloudWatch, CloudTrail, S3 and Kinesis
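
As a hedged sketch of ingesting and searching log data, here is a minimal example using the opensearch-py client (an assumption); the domain endpoint, credentials, and index name are placeholders, and fine-grained access control with a master user is assumed:

```python
# Minimal sketch: index and search a log document in an Amazon OpenSearch domain.
# Endpoint, credentials, and index name are placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "search-example-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "example-password"),    # hypothetical master user
    use_ssl=True,
)

client.index(index="app-logs", body={"level": "ERROR", "message": "disk full"})
client.indices.refresh(index="app-logs")        # make the document searchable immediately
results = client.search(index="app-logs", body={"query": {"match": {"level": "ERROR"}}})
print(results["hits"]["total"])
```
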
47
Q

If you are given a scenario on the exam that talks about creating a logging solution involving visualization of log-file analytics or BI reports, what service might be included?

A

Amazon OpenSearch

48
Q

What is the predecessor to Amazon OpenSearch?

A

Amazon Elasticsearch Service (the same concepts apply)