Data Engineering - Streaming data for ML Flashcards

1
Q

Name the 4 Kinesis services

A

Data Streams, Video Streams, Data Firehose and Data Analytics

2
Q

Define a data producer

A

A device or application that produces streaming data, as JSON objects or blobs, as the data is generated

3
Q

Give examples of a data producer

A
  • IoT devices
  • Manufacturing devices
  • User interaction with a website or a video game
4
Q

Describe the two methods data producers can use to write to Kinesis

A
  1. They can use Kinesis Producer Library (KPL) - a Java library for writing to Kinesis
  2. Use the Kinesis API
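
As a rough sketch of the second method, the function below writes one JSON event with the PutRecord API call. It assumes a boto3 Kinesis client is passed in; the `send_event` helper, the stream name and the `device_id` field are made up for illustration.

```python
import json


def send_event(kinesis_client, stream_name, event):
    """Write one JSON event to a stream with the PutRecord API call.

    kinesis_client is assumed to be a boto3 Kinesis client, e.g.
    boto3.client("kinesis"); any object exposing the same put_record
    method works. "device_id" is a hypothetical partition key field.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),  # record payload as bytes
        PartitionKey=str(event["device_id"]),    # decides the target shard
    )
```

With the raw API, retries and batching are left to the caller, which is the gap the KPL fills.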
5
Q

Describe Kinesis Data Streams

A

It ingests data from producers and makes it available to consumers through shards

6
Q

Give examples of data consumers

A

Lambda, EC2, Kinesis Data Analytics and EMR

7
Q

Can data be sent from KDS directly to a data repository?

A

No, it first needs to go to a consumer

8
Q

Define a consumer

A

An AWS service or distributed Kinesis application that retrieves data from KDS

9
Q

Define a shard

A

The base throughput unit of KDS. Data consumers retrieve data from all the shards in a stream as the data is generated

10
Q

Explain partition keys

A

Data producers assign partition keys to records. Partition keys determine which shard ingests the data record from the data stream
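
A minimal sketch of how a partition key can select a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of those hash values. The even split of the hash space and the `shard_for_key` name are assumptions for illustration.

```python
import hashlib

MAX_HASH = 2**128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard whose hash range contains the key.

    Assumes shard_count shards that split the hash space evenly,
    which is a simplification for illustration.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = MAX_HASH // shard_count
    # min() guards the last shard, which absorbs any remainder
    return min(hash_value // range_size, shard_count - 1)
```

Records with the same partition key always land on the same shard, so a key with very high traffic can overload one shard while others sit idle.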

11
Q

Define a data stream

A

A logical grouping of shards. Records are retained for 24 hours by default, which can be extended to 7 days through the retention settings (AWS now allows up to 365 days)

12
Q

What do data consumers use to consume data from KDS?

A

The Kinesis Client Library (KCL), a Java library

13
Q

How many records per second can each shard ingest?

A

1,000 records per second for writes (also capped at 1 MB per second)
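
A worked sizing example based on this limit. The 1 MB per second figure is the documented per-shard write limit rather than something the card states, and treating 1 MB as 1024 KB, along with the `shards_needed` helper, is an assumption for illustration.

```python
import math

RECORDS_PER_SHARD = 1000  # write limit per shard, records per second
MB_PER_SHARD = 1          # write limit per shard, MB per second


def shards_needed(records_per_second, avg_record_kb):
    """Estimate the shards needed to absorb a given write load."""
    by_count = math.ceil(records_per_second / RECORDS_PER_SHARD)
    mb_per_second = records_per_second * avg_record_kb / 1024
    by_size = math.ceil(mb_per_second / MB_PER_SHARD)
    # Whichever limit is hit first sets the shard count
    return max(1, by_count, by_size)
```

For instance, 2,500 small records per second (0.5 KB each) is bound by the record count and needs 3 shards, while 500 records per second at 4 KB each is bound by size and needs 2.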

14
Q

How can an individual shard be identified?

A

Each shard has a unique shard ID. Within a shard, each data record has a partition key and a sequence number

15
Q

What is re-sharding?

A

Changing the number of shards in a data stream after it has been created, by splitting or merging shards

16
Q

When do you use KDS?

A
  • Transferring data into AWS to be processed by data consumers
  • Data must be temporarily stored in case it needs to be reprocessed
  • Data needs to be processed before it can be stored
  • Real time analytics by data consumers
17
Q

Describe Kinesis Video Streams

A

Processes video streams from connected devices. The data can be sent directly to data consumers for processing or to a data repository.

18
Q

When do you use KVS?

A
  • When you need to collect video streaming data for processing and real-time analysis
  • When you need to batch-process and store streaming video
  • When you need to feed streaming data into other AWS services
19
Q

Describe Kinesis Firehose

A

For receiving massive volumes of streaming data and storing them in an AWS data repository

20
Q

When do you use Data Firehose?

A
  • When sending data directly to a data repository without processing
  • When the final destination is S3
  • When data retention is not important
21
Q

Does Data Firehose have internal storage?

A

No, Firehose has no shards and so no data retention or storage

22
Q

Which other Kinesis service can Firehose be combined with?

A

It can be used as a producer for Kinesis Data Analytics

23
Q

Describe Kinesis Data Analytics

A

Used to query and analyse streaming data using SQL

24
Q

When should you use KDA?

A
  • When you need to take action in real time
  • When you need to organise, enrich and transform streaming data
25
Q

Where can KDA accept streams from?

A

  • Kinesis Firehose
  • Kinesis Data Streams

26
Q

Describe Apache Kafka

A

A publish/subscribe messaging system with storage. The sender, or producer, sends a message to Kafka, which then stores the data for a specified amount of time. The receiver subscribes to a topic and then receives the data it wants.
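
The idea of publish/subscribe with storage can be sketched as a toy in-memory model. This is not Kafka's API: the `ToyTopic` class and its methods are invented for illustration, and timestamps are passed in explicitly to keep the behaviour easy to follow.

```python
class ToyTopic:
    """In-memory stand-in for a topic: an append-only log with retention."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.first_offset = 0  # offset of the oldest retained message
        self.log = []          # (timestamp, message) pairs, oldest first

    def publish(self, message, now):
        """Producer side: append a message to the end of the log."""
        self.log.append((now, message))

    def expire(self, now):
        """Drop messages older than the retention period."""
        while self.log and now - self.log[0][0] > self.retention_seconds:
            self.log.pop(0)
            self.first_offset += 1

    def read(self, offset):
        """Consumer side: return (messages, next_offset) from offset onward."""
        start = max(offset, self.first_offset) - self.first_offset
        messages = [message for _, message in self.log[start:]]
        return messages, self.first_offset + len(self.log)
```

Because each consumer keeps its own offset, several consumers can read the same messages independently, and a consumer can re-read anything still inside the retention period.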

27
Q

How can Kafka be set up?

A

It can be installed on EC2 instances or run as a managed Amazon service (Amazon MSK)

28
Q

When should you use Kafka?

A
  • Streaming ingest
  • ETL
  • CDC (change data capture)
  • Big data ingest
29
Q

Which Kinesis service can store data internally?

A

KDS

30
Q

Which Kinesis services cannot write directly to storage?

A

KDA and KDS

31
Q

Which Kinesis services can write directly to storage?

A

KDF and KVS

32
Q

Which Kinesis service can process data using Lambda?

A

KDF

33
Q

Which Kinesis services can change data?

A

KDF and KDA

34
Q

Which data repositories can KDF send data to directly?

A
  • S3
  • Redshift
  • Elasticsearch
  • Splunk
35
Q

Which Kinesis services can perform ETL pre-processing on data?

A
  • KDF using Lambda
  • KDA in real time using SQL
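
A hedged sketch of the first option: a Lambda function that Firehose invokes to transform records before delivery. Firehose passes base64-encoded records and expects each one back with the same `recordId`, a `result` status and re-encoded `data`. The `temp_c` and `temp_f` fields and the conversion itself are hypothetical.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and hand it back for delivery."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical transformation: add a Fahrenheit reading
        payload["temp_f"] = payload["temp_c"] * 9 / 5 + 32
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```

The trailing newline keeps records separated once Firehose concatenates them into a single object in the destination.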
36
Q

What types of data can KVS process?

A

radar, audio, video and images

37
Q

What is the advantage of KPL over Kinesis API?

A

The KPL provides a lot of added features, such as built-in retries for failed transmissions. You need to code these yourself when using the Kinesis API

38
Q

What are the pros of KPL?

A
  • Performance benefits
  • Consumer-side ease of use via the KCL
  • Producer monitoring via CloudWatch
  • Asynchronous architecture: the KPL has a buffer to store records whilst they are processed