NoSQL Databases and DynamoDB Flashcards

1
Q

What is DynamoDB (DDB)?

A

NoSQL (non-relational) DBaaS product within AWS; typically used for Serverless or Web-Scale applications

2
Q

DDB Specs:

A

→ No self-managed servers or infra (unlike Aurora and RDS, which are DB services that sit on servers you manage); delivered as-a-Service (XaaS)

→ You can control performance/capacity manually (Provisioned), or have it done automatically (On-Demand) where the system will scale as needed
○ Adding capacity means adding more SPEED/PERFORMANCE

→ Highly resilient - spans multiple AZs

→ Encrypts data at rest, and performs backups and point-in-time recovery of data

3
Q

What is the base entity that makes up DDB?

What are the 2 key options called?

A

Tables

Partition Key & Sort Key

4
Q

What is the Partition Key?

What is the Sort Key?

What is an Item?

A

Partition Key - the main key; with no Sort Key, it must uniquely identify each item in a table

Sort Key - minor key that further identifies the item. Multiple items can share a Partition Key, but each Partition Key + Sort Key combination must be unique.

Item - basically the unit that gets written to DDB and stored in a table. EX) the day of the week + all associated information for that day, when looking at a Weather Table.

5
Q

What are the 2 types of Backups in DDB?

A

On-Demand - full backup of the table that is retained until you manually remove it.

Point-in-Time - keeps a continuous record of changes, allowing restore to any point within a 35-day recovery window. This is enabled on a table-by-table basis.
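
For illustration, a minimal boto3 sketch of both options; the table and backup names are placeholders:

```python
import boto3

ddb = boto3.client("dynamodb")

# On-Demand: a full backup, retained until you delete it yourself.
ddb.create_backup(TableName="WeatherData", BackupName="weather-manual-backup")

# Point-in-Time Recovery: enables the continuous 35-day change record.
ddb.update_continuous_backups(
    TableName="WeatherData",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)
```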

6
Q

DDB Summary:

A

→ If the use case is NoSQL, then it's likely DDB

→ A relational data use case is NOT going to be a DDB solution (that would be RDS or another SQL product)

→ If you see any mention of Key/Value and DDB is a possible answer, then it's probably DDB

→ Access is via Console, API, or CLI (you cannot use SQL or any relational query language since the DB is not relational)

7
Q

DDB Tables are broken into which 2 types of capacity units?

A

Read (RCU)

Write (WCU)

8
Q

What are the 2 modes that a Table can be created in?

A

On-Demand - used when you have an unknown or unpredictable load on the table.

Provisioned - you set the capacity values yourself. This is for when you know exactly how much load and capacity you'll need for a given table.

9
Q

How many KB is 1 x RCU operation?

How many KB is 1 x WCU operation?

What is the minimum cost for any operation?

A

1 RCU = one read of up to 4 KB per second

1 WCU = one write of up to 1 KB per second

Every operation costs a minimum of 1 RCU / 1 WCU, and consumption always rounds up to the nearest whole RCU/WCU
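
A quick sanity check of those rounding rules as a small Python sketch (per-second throughput and eventually-consistent read discounts are ignored here):

```python
import math

def rcus_needed(item_size_kb: float) -> int:
    # 1 RCU reads up to 4 KB; always round up, minimum 1 RCU.
    return max(1, math.ceil(item_size_kb / 4))

def wcus_needed(item_size_kb: float) -> int:
    # 1 WCU writes up to 1 KB; always round up, minimum 1 WCU.
    return max(1, math.ceil(item_size_kb))

print(rcus_needed(9))    # 3 -> a 9 KB read costs three 4 KB units
print(rcus_needed(0.5))  # 1 -> the 1 RCU minimum still applies
print(wcus_needed(2.5))  # 3 -> a 2.5 KB write costs three 1 KB units
```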

10
Q

What are the 2 operations available for retrieving data from a table?

A

Query Ops

Scan Ops

11
Q

What happens in a Query Operation?

What happens in a Scan Operation?

A

Query

  • when a query is done, you start by picking a SINGLE Partition Key value
  • the operation can return zero, one, or multiple items, but you still only pick one value for the Partition Key
  • the capacity consumed is the total size of all returned items, i.e. how much it costs to READ all the items

Scan

  • less efficient but more flexible; you have complete control over what data gets selected and returned, like a filter. You don't have to pick a single Partition Key and optional Sort Key
  • caveat is that a Scan consumes capacity for the entire table - so while you may only get 2 rows back, you still pay for all 5 if it's a 5-row table, for example
  • it does this because it scans the whole table for the exact value(s) you're looking for before presenting them back
  • very expensive from a Capacity perspective
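
To make the difference concrete, a hedged boto3 sketch against a hypothetical Weather table (Partition Key `day`, Sort Key `time`):

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("WeatherData")  # placeholder table

# Query: one Partition Key value is mandatory; capacity is charged only
# for the items actually returned.
monday = table.query(KeyConditionExpression=Key("day").eq("Monday"))

# Scan: no key needed, filter on any attribute - but capacity is charged
# for every item scanned, even those the filter drops.
sunny = table.scan(FilterExpression=Attr("conditions").eq("sunny"))

print(len(monday["Items"]), len(sunny["Items"]))
```
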
12
Q

What is Consistency?

A

How newly updated data is read.

Is a read performed immediately after an update guaranteed to return the newly written data? OR does the read only catch up eventually, returning the updated data after a short period of time but not immediately?

13
Q

What are the 2 types of Consistency models?

A

Eventual Consistency - easier to implement and scales better; data is scaled out, and a READ operation might not show the updated data instantaneously

Strong Consistency - essential in some types of apps but harder and more costly to achieve - READS are guaranteed to reflect the most recent update
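
In boto3 terms, the choice is made per read with the ConsistentRead flag; a minimal sketch against the same hypothetical Weather table:

```python
import boto3

table = boto3.resource("dynamodb").Table("WeatherData")  # placeholder table

# Default read: eventually consistent (cheaper, may briefly return stale data).
eventual = table.get_item(Key={"day": "Monday", "time": "09:00"})

# Strongly consistent read: guaranteed to reflect all prior writes,
# at double the RCU cost.
strong = table.get_item(Key={"day": "Monday", "time": "09:00"}, ConsistentRead=True)
```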

14
Q

What is a Storage Leader Node in the DDB consistency ARCH?

A

A “Leader Node” is elected from among the Storage Nodes; this Leader Node is where WRITES occur (any change/update to the data set/table).

Once the Leader Node has the data written to it, it becomes “Consistent”; once Consistent, it then starts the replication process out to the other storage nodes.

15
Q

What is a DDB Stream?

A

A time-ordered list of changes made to items in a DDB table, i.e. any inserts/updates/deletes to items get recorded in the stream (a rolling 24-hour window of item changes)

16
Q

What are the 4 view types that a Stream can be configured with?

A
  1. Keys Only - shows only the key attributes (Partition Key, or Partition + Sort Keys) of the changed item
  2. New Image - shows the state of the ITEM after the change
  3. Old Image - shows the state of the ITEM before the change occurred, so you can compare it to the new ITEM to see what changed
  4. New and Old Images - shows both, side by side
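
A minimal boto3 sketch of enabling a stream on an existing table with one of these view types (the table name is a placeholder):

```python
import boto3

boto3.client("dynamodb").update_table(
    TableName="WeatherData",
    StreamSpecification={
        "StreamEnabled": True,
        # One of: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```
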
17
Q

What is a DDB Trigger?

A

An ITEM change within a Table is recorded in a Stream and generates an event, which can result in a corresponding action by Lambda.

DDB Trigger = Streams + Lambda

18
Q

DDB Trigger/Trigger Architecture Summary:

A

→ You use Streams + Lambda so that a Lambda function is invoked whenever specified changes occur on a DDB table

→ The Trigger is the compute action that occurs based on the data change

→ Using Streams and Triggers allows you to respond to an event as it happens, and only consume the minimum amount of compute required to perform the action

→ We use Streams and Lambda together to implement a “Trigger ARCH” for DDB

→ Lambda is the compute piece (like Compute as a Service) that handles the action once it is triggered
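
For illustration, a minimal Lambda handler showing the standard shape of a DDB stream event record (the images present depend on the stream's view type):

```python
def lambda_handler(event, context):
    # Each invocation receives a batch of stream records.
    for record in event["Records"]:
        action = record["eventName"]      # INSERT, MODIFY or REMOVE
        change = record["dynamodb"]
        old = change.get("OldImage")      # present if the view type captures old
        new = change.get("NewImage")      # present if the view type captures new
        print(action, old, new)           # respond to the change here
```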

19
Q

What is a DDB Index?

A

A way to improve the efficiency of retrieval operations within DDB.

Indexes are basically an alternative view on the table data to make Query Operations more efficient; when you perform a query against an index, the data comes back in that alternative view.

This helps you avoid using a Scan Operation, which consumes RCUs for the whole table; different teams within the Org might want different views depending on their job roles.

20
Q

What are the 2 types of Indexes in DDB?

What are the main aspects of each?

A

→ Local Secondary Indexes (LSI) - allow you to view the table with a different Sort Key (same Partition Key)

  • must be created at the time a Table is initially created
  • up to 5 LSIs per table
  • ** uses the Shared Capacity settings of the Table

→ Global Secondary Indexes (GSI) - allow you to view the table with a different Partition Key and Sort Key; can be created at any time

  • up to 20 GSIs per table
  • ** uses its own Capacity settings, separate from the Table
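
As a sketch, creating a table with one LSI and one GSI via boto3; the table, index, and attribute names are all illustrative:

```python
import boto3

boto3.client("dynamodb").create_table(
    TableName="WeatherData",
    AttributeDefinitions=[
        {"AttributeName": "day", "AttributeType": "S"},
        {"AttributeName": "time", "AttributeType": "S"},
        {"AttributeName": "temperature", "AttributeType": "N"},
        {"AttributeName": "conditions", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "day", "KeyType": "HASH"},
        {"AttributeName": "time", "KeyType": "RANGE"},
    ],
    # LSI: same Partition Key, alternative Sort Key; must exist at creation
    # time and shares the table's capacity.
    LocalSecondaryIndexes=[{
        "IndexName": "day-temperature",
        "KeySchema": [
            {"AttributeName": "day", "KeyType": "HASH"},
            {"AttributeName": "temperature", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    # GSI: alternative Partition and Sort Keys, with its own capacity.
    GlobalSecondaryIndexes=[{
        "IndexName": "conditions-time",
        "KeySchema": [
            {"AttributeName": "conditions", "KeyType": "HASH"},
            {"AttributeName": "time", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```
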
21
Q

Index Considerations:

A

→ Use GSIs by default; only use LSIs when strong consistency is required
○ GSIs are a lot more flexible and can be created after the point that a base table is created

→ Use indexes for alternative access patterns
○ When you create a Base Table - you choose the Partition and Sort Keys ahead of time for the primary way you will view and access the data in the table
○ Indexes offer an alternative perspective to that for any alternative access patterns

→ EX) a different team might be interested in different attributes; all data is kept in the same place, but it can be accessed from various perspectives that are more relevant to different teams

22
Q

What is a DDB Global Table?

A

Feature that provides multi-master global replication of DynamoDB tables which can be used for performance, HA or DR/BC reasons

All tables are the same, i.e. there are no Primary and Secondary tables – each table supports full Read/Write, and changes are replicated to all the others.
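
A minimal sketch using the original (2017.11.29) Global Tables API in boto3; the table name and Regions are placeholders, and identical empty tables with NEW_AND_OLD_IMAGES streams must already exist in each listed Region:

```python
import boto3

boto3.client("dynamodb", region_name="us-east-1").create_global_table(
    GlobalTableName="WeatherData",
    ReplicationGroup=[
        {"RegionName": "us-east-1"},  # every replica accepts reads AND writes
        {"RegionName": "eu-west-2"},
    ],
)
```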

23
Q

What is DDB Accelerator (DAX)?

A

An in-memory cache designed specifically for DynamoDB which greatly improves the performance of DDB.

** It should be your default choice for any DynamoDB caching related questions **

** Supports WRITE-THROUGH and READ-CACHING **

24
Q

How is DAX deployed?

A

DAX is a fully managed, in-memory caching cluster that sits inside a VPC and has direct access to DDB.

There is also a piece of SW (the DAX SDK) that gets installed directly into an application.

Instead of the app having to re-send a request to DDB for data after checking the in-memory cache (if the cache doesn't have it), DAX handles that for the application: DAX either returns the data from its cache or fetches it from DDB and caches it on behalf of the App

25
Q

Benefits of DDB Accelerator (DAX)?

A

→ Reduces the time for READ operations; very ideal for data that is read over and over again at high frequency (during a peak season, for example)

→ Read-heavy workloads; these would normally eat up a ton of RCUs, but since the data is cached in memory, we avoid reading from DDB

26
Q

What is Athena?

A

Serverless querying service which allows for ad-hoc questions/queries on large amounts of data

EX) Take data stored in S3 and perform ad hoc queries on that data.

27
Q

What's the process for changing how information is presented as it's read through Athena?

A

By creating a Schema. This is called “Schema on Read”

The Schema basically translates the S3 data, ad-hoc, into the view you want to see it in

28
Q

Athena Summary:

A

If data is stored in S3, and the data is structured/semi-structured/unstructured, and you need to perform ad-hoc queries, then Athena is the service you’ll want to use.

The “Schema-on-Read” feature allows the data to be queried in a relational-style way, using normal SQL-like queries; the output can be saved or sent on to other AWS tools.
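
For illustration, a minimal boto3 sketch of running an ad-hoc Athena query; the database, table, and results bucket are placeholders:

```python
import time
import boto3

athena = boto3.client("athena")

# Kick off an ad-hoc, SQL-like query over data sitting in S3.
run = athena.start_query_execution(
    QueryString="SELECT day, temperature FROM weather WHERE conditions = 'sunny'",
    QueryExecutionContext={"Database": "my_s3_data"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Athena is asynchronous: poll until the query finishes, then fetch rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```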

29
Q

What is ElastiCache?

A

Managed in-memory caching service which allows Apps to scale to very high levels of performance.

30
Q

ElastiCache Overview:

A

** REVIEW ** Databases store data persistently on disks.. so no matter how fast the disks can spin, there is always going to be a performance limit or ceiling for the DB

→ ElastiCache, in comparison, is a cache engine which holds data in MEMORY; it's a DB alternative that's used for apps that require a very high level of performance

→ ElastiCache can be used with DDB and RDS, but DAX is managed in-memory caching that's specific to DDB

○ Much faster in terms of throughput and latency

○ Takes a lot of load off the application itself – if you have thousands of users accessing the same data, they’ll hit Elasticache instead of always going to the App and corresponding DB

31
Q

What are some use cases for ElastiCache?

A

○ Used to cache data for READ-HEAVY workloads

○ Workloads with LOW LATENCY requirements

○ Can be used to store session data for users of an application, which allows the application to then be STATELESS

** EX) Session data is loaded into ElastiCache, externally to the application instances; this allows the app to be Fault Tolerant (a step above HA, where the user will not notice if components of the app fail)

32
Q

Can your apps in AWS use ElastiCache natively?

A

NO.

→ if you want to use ElastiCache, you have to change the code of the application(s) that will be using the service – you can't just start using the service; your app needs to “understand” the caching architecture that's provided.

→ The app needs to check the cache for data first; if the data is there, that's a “Cache Hit”. If it isn't (a “Cache Miss”), the app must fetch the data from the underlying DB (like Aurora or RDS) and then write it into the cache, as in the sketch below.

→ This functionality does NOT come for free
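
A minimal cache-aside sketch using the redis-py client; the endpoint, key scheme, and load_profile_from_rds helper are all hypothetical:

```python
import json
import redis

cache = redis.Redis(host="my-cluster.cache.amazonaws.com", port=6379)

def load_profile_from_rds(user_id: str) -> dict:
    # Placeholder for the real database read (e.g. an RDS/Aurora query).
    return {"id": user_id, "name": "example"}

def get_user_profile(user_id: str) -> dict:
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:                       # Cache Hit: serve from memory
        return json.loads(cached)
    profile = load_profile_from_rds(user_id)     # Cache Miss: hit the real DB
    cache.setex(key, 300, json.dumps(profile))   # cache it for 5 minutes
    return profile
```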

33
Q

What are the 2 options for caching engines in ElastiCache?

A

Redis - advanced data structures + HA + backups

Memcached - simple data structures + no HA + no backups

34
Q

What is Redshift?

A

Redshift is a Petabyte-scale Data Warehousing service.

A Data Warehouse is a location that many different operational DBs from across your business can pump data into for long-term analysis and trending.

Redshift is designed for reporting/analytics, NOT operational usage; data is loaded into it in bulk for analysis, rather than being read from and written to transactionally like an operational DB.

35
Q

What type of DB is Redshift? What type is RDS?

A

OLAP (Online Analytical Processing), which is used for complex queries of historical data.

RDS (and others) are OLTP (Online Transaction Processing), which is used for capturing, storing, and processing data in real time.

36
Q

What is Redshift Spectrum?

A

Feature which allows you to query data on S3 without actually loading any data in beforehand.

Typically with Redshift, you’d load all the data in ahead of time before it’s worked on. This feature avoids having to pre-load data from S3.

37
Q

Redshift ARCH Review

A

○ Redshift is provisioned on servers, i.e. it's not Serverless and not used for ad-hoc queries (that would be Athena)

○ Uses a cluster architecture, which is a PRIVATE network; multiple nodes with high-speed networking between them
§ Runs in ONE AZ, so it's not HA by design

○ Each cluster has a “Leader Node”, which is the one you interact with
§ Anything outside of the cluster that interacts with Redshift will interact with the Leader Node

○ Compute Nodes run the actual queries and are instructed by the Leader Node; the Leader Node manages the distribution of work
§ Compute Nodes are divided up into slices, where each slice is given a portion of the node's overall memory and disk space to process the workload assigned to it

§ Storage is also attached to the Compute Nodes

38
Q

What is Enhanced VPC Routing?

A

Feature turned on in Redshift that routes its traffic through your VPC, allowing for customized networking requirements.

39
Q

Is Redshift resilient?

A

No - it operates out of a single AZ. If that AZ fails, the Redshift cluster fails.

40
Q

How can you design Redshift to be resilient?

A

Back up the data to S3 via Automatic or Manual snapshots.

When a snapshot is restored, a new Redshift cluster is created, and you can pick the AZ or even a different Region.