Section - Big Data Flashcards

1
Q

The 3 V’s of Big Data?

A
  • Volume
    • Ranges from terabytes to petabytes of data
  • Variety
    • Includes data from a wide range of sources and formats
  • Velocity
    • Businesses require speed.
    • Data needs to be collected, stored, processed and analyzed within a short period of time.
2
Q

What is Redshift?

A
  • Redshift is a fully managed, petabyte-scale data warehouse service in the cloud
  • It’s a very large relational database traditionally used in big data applications.
  • Redshift is very large: it can hold up to 16 petabytes of data.
  • Redshift is not a high-availability service by default; a cluster runs in a single Availability Zone.
  • Automatic backups are retained for 1 day by default, but retention can be extended to 35 days.
3
Q

What is an ETL?

A
  • Extract
  • Transform
  • Load
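The three steps can be illustrated with a minimal sketch. The sample CSV text, the `readings` table, and the Fahrenheit-to-Celsius transform are all hypothetical choices made for illustration; an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API, or stream.
RAW_CSV = """device_id,temperature_f
sensor-1,68.0
sensor-2,212.0
"""


def extract(raw):
    """Extract: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Transform: convert each Fahrenheit reading to Celsius."""
    return [
        (row["device_id"], round((float(row["temperature_f"]) - 32) * 5 / 9, 1))
        for row in rows
    ]


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temperature_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


# In-memory SQLite is an assumption standing in for a real warehouse.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT device_id, temperature_c FROM readings ORDER BY device_id"
).fetchall()
print(result)
```

Keeping each step as its own function mirrors how ETL tools usually separate the source, the transformation logic, and the destination.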
4
Q

What is AWS Elastic MapReduce (EMR)?

A
  • EMR is a managed big data platform that allows you to process vast amounts of data using open-source tools such as Spark, Hive, HBase, Flink, Hudi, and Presto.
  • It’s AWS’s ETL tool.
  • It runs these open-source frameworks on a cluster (a fleet of EC2 instances).
  • EC2 Rules Apply
    • You can use Reserved Instances and Spot instances to reduce your cost.
  • The architecture lives inside a VPC.
5
Q

What is AWS Kinesis?

A

Kinesis is originally a Greek word meaning movement or motion. Amazon Kinesis deals with data that is in motion, that is, streaming data.

Streaming Data?

Data generated continuously by thousands of data sources, which typically send data records simultaneously and in small sizes (on the order of kilobytes).

  • Financial Transactions
  • Stock prices
  • Game data (as the gamer plays)
  • Social media feeds
  • Location tracking data (Uber)
  • IoT sensors
  • Clickstream
  • Log files
6
Q

What are the 4 core services of AWS Kinesis?

A
  • Kinesis Video Streams
    • Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
  • Kinesis Data Streams
    • Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.
  • Kinesis Data Firehose
    • Capture, transform, and load data streams into AWS data stores to enable near-real-time analytics with BI tools.
  • Kinesis Data Analytics
    • Analyze, query and transform streamed data in real-time using standard SQL. Store the results in an AWS data store.
7
Q

Kinesis Data Streams?

A
  • Producers
    • Devices which produce data for streaming
    • e.g. IoT device
  • Kinesis Streams
    • Data is stored in shards
    • Data is retained for 24 hours by default; retention can be extended to 7 days (newer AWS limits allow up to 365 days)
  • Consumers
    • Consume stored data and apply business logic
    • e.g. EC2 instances, Lambda functions …
9
Q

AWS Kinesis Shards?

A

Kinesis streams are made up of shards. Each shard is a sequence of one or more data records and provides a fixed unit of capacity:

  • Up to 5 read transactions per second
  • A maximum total read rate of 2 MB per second
  • Up to 1,000 record writes per second
  • A maximum total write rate of 1 MB per second

NB: The data capacity of the stream is determined by the number of shards. If the data rate increases, you can increase the capacity of your stream by increasing the number of shards.
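The per-shard limits above can be turned into a rough sizing calculation. This is a minimal sketch, assuming the expected traffic is known in advance; the function name `shards_needed` and the example traffic figures are hypothetical.

```python
import math

# Per-shard limits as listed on this card.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies every per-shard limit."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
    )


# Hypothetical workload: 4.5 MB/s written as 3,000 records/s, 9 MB/s read.
example = shards_needed(write_mb_per_sec=4.5, records_per_sec=3000, read_mb_per_sec=9.0)
print(example)
```

The tightest of the three limits decides the result, which is why the function takes the maximum rather than sizing on write throughput alone.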

10
Q

AWS Kinesis Data Firehose?

A
  • Producers
    • Devices produce data
    • e.g. IoT
  • Kinesis Firehose
    • No shards
    • No data retention
11
Q

AWS Kinesis Data Analytics?

A
  • Producers
    • Devices produce data
    • e.g. IoT
  • Data is pushed to Firehose
  • You can run SQL queries against incoming data and store the results.
  • Real-time analytics
12
Q

AWS Kinesis Exam Tips?

A
13
Q

AWS Kinesis Video Streams?

A

Securely stream video from connected devices to AWS

  • Videos can be used for analytics and machine learning.
17
Q

AWS Kinesis Shards and Consumers?

A
  • The Kinesis Client Library (KCL) running on your consumers creates a record processor for each shard that is being consumed by your instance.
  • If you increase the number of shards, the KCL will add more record processors on your consumers.
  • CPU utilisation is what should drive the number of consumer instances you have, NOT the number of shards in your Kinesis stream.
  • Use an Auto Scaling group, and base the scaling decisions on the CPU load of your consumers.
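The relationship between shards, record processors, and consumer instances can be sketched with a toy model. This is not the KCL itself: the real library coordinates shard ownership through leases, and the round-robin assignment, the function name, and the instance and shard IDs below are simplifying assumptions for illustration.

```python
def assign_record_processors(shard_ids, consumer_ids):
    """Toy model: one record processor per shard, spread round-robin over consumers."""
    assignment = {consumer: [] for consumer in consumer_ids}
    for index, shard in enumerate(shard_ids):
        owner = consumer_ids[index % len(consumer_ids)]
        assignment[owner].append(shard)
    return assignment


consumers = ["instance-a", "instance-b"]  # hypothetical consumer instances

# Four shards: each instance runs two record processors.
before = assign_record_processors([f"shard-{i}" for i in range(4)], consumers)

# After resharding to eight shards, the same two instances run four processors each.
after = assign_record_processors([f"shard-{i}" for i in range(8)], consumers)

print(before)
print(after)
```

The instance count stays at two in both cases; only the number of record processors per instance changes. Whether two instances are enough depends on their CPU load, which is why the scaling signal is CPU rather than shard count.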
18
Q

What is AWS Athena?

A
  • Athena is an interactive query service that makes it easy to analyze data in S3 using SQL.
  • This allows you to directly query data in your S3 Bucket without loading it into a database.
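As a rough illustration of querying S3 data in place, the sketch below builds the parameters for the `start_query_execution` call on the boto3 Athena client. The table `access_logs`, the database `weblogs`, and the results bucket are hypothetical names; the block only constructs the request so it can run without AWS credentials.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to an S3 location that you choose.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table, database, and bucket names, for illustration only.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="weblogs",
    output_location="s3://example-athena-results/",
)
print(request)

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

The table is assumed to be already defined over files in S3; the query reads those files directly and no data is loaded into a database first.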
19
Q

What is AWS Glue?

A
  • Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data.
  • It allows you to perform ETL workloads without managing underlying servers.
  • It replaces EMR … serverless architecture
  • Glue structures the data
20
Q

Exam Tips: Glue and Athena?

A
  • Serverless SQL
    • Athena lets you directly query data that’s stored in S3 using SQL, without loading it into a database.
  • Both Athena and Glue are fully managed by AWS
  • Glue can help design a schema for your data
21
Q

What is AWS QuickSight (like Power BI)?

A
  • Amazon QuickSight is a fully managed business intelligence (BI) data visualization service.
  • It allows you to easily create dashboards and share them within your company.
22
Q

What is Elasticsearch?

A
  • Amazon Elasticsearch Service is a fully managed version of the open-source application Elasticsearch.
  • It allows you to quickly search over your stored data and analyze the data you get back.
  • It’s commonly used as part of the Elasticsearch, Logstash, Kibana (ELK) stack.