L9 - Cloud Storage Systems Flashcards

1
Q

4 types of AWS storage

A
  • Amazon Elastic Block Store (EBS)
  • Amazon EC2 Instance Storage
  • Amazon Elastic File System (EFS)
  • Amazon Simple Storage Service (S3)
2
Q

2 types of storage devices for VMs

A
  • Instance volumes
  • EBS volumes
3
Q

Instance volumes

A

Disks/SSDs attached to the physical server
- optimized for high IOPS rates
- lost when the VM is stopped

4
Q

EBS volumes

A

Service providing network volumes (Storage Area Network, SAN)
- can only be mounted to a single VM at a time
- survives stopping or termination of the VM
- boot device is lost when the VM is terminated
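A minimal boto3 sketch of this lifecycle (region, availability zone, and instance ID are placeholder assumptions):

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')

    # Create a volume in the same availability zone as the target VM.
    vol = ec2.create_volume(AvailabilityZone='us-east-1a', Size=100,
                            VolumeType='gp3')
    ec2.get_waiter('volume_available').wait(VolumeIds=[vol['VolumeId']])

    # An EBS volume can be attached to only one instance at a time; it
    # survives a stop of that instance and can be re-attached later.
    ec2.attach_volume(VolumeId=vol['VolumeId'],
                      InstanceId='i-0123456789abcdef0',  # placeholder ID
                      Device='/dev/sdf')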

5
Q

Types of storage

A
  • Object store (S3)
  • Shared file system (NAS) (EFS)
  • Relational Databases (RDS)
  • NoSQL databases
  • data warehouses
6
Q

6 characteristics of Cloud Storage Systems

A
  • voluminous data
  • commodity hardware (discrepancy between processor speed and storage access time)
  • distributed data
  • expect failures
  • processing by applications
  • optimization for dominant usage
7
Q

CAP Theorem

A

Consistency, availability, and partition-tolerance cannot be achieved together in a distributed system

consistency (CP) = read returns the last write value (strict)
availability (AP) = all requests are answered in an acceptable time
partition-tolerance = the system continues working even if some nodes are separated

8
Q

Which of the 3 aspects of the CAP Theorem is essential in large-scale distributed cloud systems?

A

Partition-tolerance
-> storage solutions focus on either availability (AP) or consistency (CP)
-> AP systems apply eventual consistency: consistency is provided only after a certain time

9
Q

What is S3 Object Storage most used for?

A
  • backup
  • data is spread across >= 3 data centers in a region
10
Q

Data management in S3

A

Two-level hierarchy of buckets and data objects
- data objects have a name, a blob of data (up to 5 TB), and metadata
- data objects can be searched by name, bucket name, and metadata BUT NOT CONTENT
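A short boto3 sketch of the two-level hierarchy (the bucket name is a placeholder and must be globally unique):

    import boto3

    s3 = boto3.client('s3')

    # Level 1: the bucket.
    s3.create_bucket(Bucket='my-demo-bucket')

    # Level 2: a named data object = blob plus user-defined metadata.
    s3.put_object(Bucket='my-demo-bucket', Key='reports/2024.csv',
                  Body=b'col1,col2\n1,2\n',
                  Metadata={'owner': 'alice', 'project': 'demo'})

    # Objects are addressed by bucket + name; the metadata can be read
    # back, but the content itself is never searched.
    head = s3.head_object(Bucket='my-demo-bucket', Key='reports/2024.csv')
    print(head['Metadata'])  # {'owner': 'alice', 'project': 'demo'}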

11
Q

5 storage classes in AWS S3

A
  • standard
  • reduced redundancy (for data whose loss is tolerable)
  • intelligent tiering
  • glacier (retrieval in 1-5 min)
  • deep archive (retrieval in 12 h)
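The class is chosen per object at upload time, e.g. with boto3 (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client('s3')
    # Write a rarely needed backup straight into the cheapest class.
    s3.put_object(Bucket='my-demo-bucket', Key='backups/2020.tar',
                  Body=b'archive bytes', StorageClass='DEEP_ARCHIVE')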
12
Q

Data access, versioning, and lifecycle in S3

A
  • via Simple Object Access Protocol (SOAP), REST, BitTorrent
  • data cannot be modified, only uploaded, deleted, retrieved
  • versioning possible
  • lifecycle: rules can be set for transition (migration of objects to another storage class) and expiration (when an object may be deleted)
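A boto3 sketch of one rule combining both lifecycle actions (bucket and prefix are placeholders):

    import boto3

    s3 = boto3.client('s3')
    # Move objects under logs/ to Glacier after 30 days and delete them
    # after one year.
    s3.put_bucket_lifecycle_configuration(
        Bucket='my-demo-bucket',
        LifecycleConfiguration={'Rules': [{
            'ID': 'archive-then-expire-logs',
            'Filter': {'Prefix': 'logs/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
            'Expiration': {'Days': 365},
        }]},
    )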
13
Q

Consistency in S3

A
  • When creating new objects, the key (name) becomes visible only after all replicas have been written (read-after-write)
  • eventual consistency
14
Q

Requirements of Google File System (GFS)

A
  • most writes are appending at the end
  • optimized for long sequential and short random reads/writes
  • bandwidth is more important than latency (batch processing)
  • support for concurrent modifications
15
Q

Is it better to put smaller files or larger ones on GFS?

A

Better to put larger ones because:

GFS uses a single master server and many chunk servers
-> large chunks reduce the metadata (kept at the master) and the number of connections to the chunk servers

16
Q

How is the directory implemented in GFS?

A
  • lookup table mapping full path names to metadata (no per-directory data structures)
17
Q

What is special about the GFS Architecture?

A

Control and data flow are decoupled

–> Client first contacts the master but then interacts directly with the chunk servers (one is selected as primary and updates the replicas)

18
Q

Data integrity, consistency, and metadata in GFS

A

Data integrity:
- each chunk server keeps a checksum
- corrupted chunks are overwritten with a replica
Consistency:
- concurrent writes and appends to chunks are supported
Metadata:
- the master server holds the metadata about all chunks
- each chunk server stores its own metadata and checksums

19
Q

7 system interaction steps in GFS

A
  1. Client asks the master for all chunk servers holding the chunk
  2. Master grants a new lease on the chunk; the version # of all replicas is increased
  3. Client pushes data to all chunk servers
  4. Client sends the write request to the primary
  5. Primary forwards the write request to the secondaries
  6. Secondaries reply to the primary upon completion
  7. Primary replies to the client with success or error (error if the write succeeds at the primary but fails at some secondaries)
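GFS has no public API, so the following is only a plain-Python sketch of steps 1-7 with invented names; it shows how the data flow (step 3) is decoupled from the control flow (steps 4-6):

    class ChunkServer:
        def __init__(self):
            self.staged = {}          # data pushed by the client (step 3)
            self.chunk = bytearray()  # this server's replica of the chunk

        def push(self, op_id, data):  # step 3: data flow to ALL replicas
            self.staged[op_id] = data

        def apply(self, op_id):       # steps 4-6: control flow
            self.chunk += self.staged.pop(op_id)
            return 'ok'

    class Master:
        def __init__(self, replicas):
            self.replicas, self.version = replicas, 0

        def lookup_and_lease(self, path):  # steps 1-2
            self.version += 1              # new lease bumps the version #
            return self.replicas[0], self.replicas[1:]  # primary, secondaries

    def gfs_write(master, data, op_id='op-1'):
        primary, secondaries = master.lookup_and_lease('/logs/app')  # step 1
        for server in [primary] + secondaries:        # step 3
            server.push(op_id, data)
        primary.apply(op_id)                          # step 4
        acks = [s.apply(op_id) for s in secondaries]  # steps 5-6
        # Step 7: an error is reported unless every replica succeeded.
        return 'ok' if all(a == 'ok' for a in acks) else 'error'

    master = Master([ChunkServer() for _ in range(3)])
    print(gfs_write(master, b'appended record'))  # -> ok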
20
Q

How does the system interaction work for appends in GFS?

A

Same as before, but in step 4 (client sends the write request to the primary) the primary checks whether appending to the current chunk would exceed the maximum chunk size of 64 MB. If it would, the chunk is padded to its full size and the append is retried on the next chunk.

21
Q

Limitations of GFS

A
  • scalability of the single master -> partitioning of the file system and development of a distributed master
  • 64 MB chunk size (e.g., Google Mail has much smaller files)
  • no latency guarantees
22
Q

Characteristics of Amazon Elastic File System (EFS)

A
  • distributed
  • capacity: unlimited file system size, individual files up to ~48 TB
  • automatic provisioning of capacity
  • integrated lifecycle management (infrequently used files are moved to cheaper storage)
  • parallel access from up to 1000s of EC2 instances
  • throughput scales with the file system size
  • aggregated IOPS scale with the # of threads accessing EFS
  • many security measures available
23
Q

Consistency in AWS EFS

A
  • close-to-open consistency: Any changes are flushed to the server on closing the file, and a cache revalidation occurs when you re-open it
  • EFS can also provide stronger, read-after-write consistency (strict consistency)
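A sketch of what close-to-open semantics mean in practice (the mount point /mnt/efs is an assumption):

    import os

    # Writer on instance A: close() flushes the change to the EFS servers.
    with open('/mnt/efs/shared.log', 'a') as f:
        f.write('job finished\n')

    # Reader on instance B: re-opening revalidates the cache, so the line
    # written above is now visible (close-to-open consistency).
    with open('/mnt/efs/shared.log') as f:
        print(f.read())

    # For read-after-write behaviour without closing the file, push the
    # data out explicitly:
    fd = os.open('/mnt/efs/shared.log', os.O_WRONLY | os.O_APPEND)
    os.write(fd, b'visible after fsync\n')
    os.fsync(fd)   # flush to the server while the file stays open
    os.close(fd)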
24
Q

What are the ACID properties of Relational Databases (RDB)?

A

ACID:
- Atomicity: the set of operations is executed completely or it does not change anything
- Consistency: a transaction takes the database from one consistent state to another
- Isolation: during the execution of a transaction, no intermediate state is visible to the outside
- Durability: the result is stored persistently
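Atomicity is easy to demonstrate with the relational database shipped with Python (sqlite3); a minimal sketch:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)')
    conn.executemany('INSERT INTO accounts VALUES (?, ?)',
                     [('alice', 100), ('bob', 0)])
    conn.commit()

    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance - 60 "
                         "WHERE name = 'alice'")
            raise RuntimeError('crash in the middle of the transfer')
    except RuntimeError:
        pass

    # Atomicity: the half-finished transfer was rolled back completely.
    print(conn.execute('SELECT * FROM accounts ORDER BY name').fetchall())
    # -> [('alice', 100), ('bob', 0)]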

25
Q

Are RDB designed for vertical scaling?

A

Yes, RDBs scale vertically (a more powerful machine) rather than horizontally

26
Q

Characteristics of Amazon Aurora

A
  • Amazon’s own RDB as an alternative to MySQL
  • fully managed
  • database instances up to 64 TB
  • low price
  • data is replicated in 6 copies
  • automatic backup to S3, automatic scaling
27
Q

NoSQL storage

A
  • Schema-free: Easy to incorporate changes in applications
  • Support for non-relational data
  • designed for horizontal scaling (automatic distribution)
28
Q

Types of NoSQL databases

A
  • Key-value database
  • Document-oriented (JSON)
  • Graph
  • Column-family
29
Q

Amazon’s NoSQL database is?

A

Amazon Dynamo

30
Q

What is Amazon Dynamo?

A

NoSQL, Key-value database
- optimized for small requests, quick access, high availability
- fault-tolerant
- automatic scaling of tables
- support for ACID transactions
- fine-grained access control for tables

31
Q

DynamoDB

A
  • decentralized architecture and eventual consistency
  • stores key-value pairs in a table
  • schema-less
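A minimal boto3 sketch of the key-value interface (table name and key schema are placeholder assumptions; the table is assumed to exist):

    import boto3

    table = boto3.resource('dynamodb').Table('users')

    # Schema-less: apart from the key, items may carry arbitrary attributes.
    table.put_item(Item={'user_id': '42', 'name': 'Alice', 'tags': ['admin']})

    # Reads are eventually consistent by default; opt in to a strongly
    # consistent read when the latest value is required.
    resp = table.get_item(Key={'user_id': '42'}, ConsistentRead=True)
    print(resp['Item'])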
32
Q

Management of Partitions in Dynamo

A
  1. Mapping of keys to partitions
    - keys are hashed
    - the hash space is treated as a ring
  2. Mapping of partitions to nodes
    - the ring is split into segments that are handled by virtual nodes
    - hashing the key and walking clockwise determines the responsible virtual node
  3. Virtual nodes are assigned to physical nodes
    - accounts for the heterogeneity of physical nodes (see the sketch below)
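A minimal plain-Python sketch of steps 1-3 (the hash function and node names are arbitrary choices):

    import bisect
    import hashlib

    def ring_pos(key):
        # Step 1: hash the key; the hash space is treated as a ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, physical_nodes, vnodes=8):
            # Step 3: each physical node hosts several virtual nodes; a more
            # powerful machine could simply register more of them.
            self.ring = sorted((ring_pos(f'{n}#{i}'), n)
                               for n in physical_nodes for i in range(vnodes))
            self.positions = [p for p, _ in self.ring]

        def lookup(self, key):
            # Step 2: walk clockwise from the key's position to the next
            # virtual node, wrapping around the ring.
            i = bisect.bisect(self.positions, ring_pos(key)) % len(self.ring)
            return self.ring[i][1]

    ring = Ring(['node-a', 'node-b', 'node-c'])
    print(ring.lookup('user:42'))  # responsible physical node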
33
Q

What is (N,R,W) replication in Dynamo?

A

Replication (N,R,W)
- to N consecutive nodes
- a read is successful overall if it succeeds on R copies
- likewise, a write is successful if it succeeds on W copies

-> ensures that the replicas are on distinct physical nodes

34
Q

What is the typical replication configuration in Dynamo?

A

(N,R,W) = (3,2,2) -> R+W > N
-> ensures that the most recently written value is returned (strongly consistent reads: the latest value is always returned)
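The guarantee is simple arithmetic; a tiny sketch:

    # With R + W > N every read quorum intersects every write quorum in at
    # least R + W - N replicas, so a read always sees the latest write.
    N, R, W = 3, 2, 2
    assert R + W > N
    print('guaranteed read/write overlap:', R + W - N, 'replica(s)')  # 1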

35
Q

What can N, R, W be used for in Dynamo?

A

To meet the SLA requirements of the service
- N (# of consecutive nodes) determines the durability
- R and W determine the latency

36
Q

How is failure handled in Dynamo?

A
  • Gossip protocol: once a node stops responding, other nodes will eventually propagate knowledge of the failure
  • Admin can replace the node
37
Q

What is used to handle failures?

A

Replication

38
Q

Comparison of S3, EBS, EFS

A

https://cloud.netapp.com/blog/ebs-efs-amazons3-best-cloud-storage-system