L9 - Cloud Storage Systems Flashcards

1
Q

4 types of AWS storage

A
  • Amazon Elastic Block Store (EBS)
  • Amazon EC2 Instance Storage
  • Amazon Elastic File System (EFS)
  • Amazon Simple Storage Service (S3)
2
Q

2 types of storage devices for VMs

A
  • Instance volumes
  • EBS volumes
3
Q

Instance volumes

A

Disks/SSDs attached to the physical server
- optimized for high IOPS rates
- lost when the VM is stopped

4
Q

EBS volumes

A

Service providing network volumes (Storage Area Network, SAN)
- can only be mounted to a single VM at a time
- survives stopping or termination of the VM
- boot device is lost when the VM is terminated
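A minimal boto3 sketch of this lifecycle (region, availability zone, and instance ID are placeholder assumptions):

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')

    # Create a volume in the same availability zone as the target VM.
    vol = ec2.create_volume(AvailabilityZone='us-east-1a', Size=100,
                            VolumeType='gp3')
    ec2.get_waiter('volume_available').wait(VolumeIds=[vol['VolumeId']])

    # An EBS volume can be attached to only one instance at a time; it
    # survives a stop of that instance and can be re-attached later.
    ec2.attach_volume(VolumeId=vol['VolumeId'],
                      InstanceId='i-0123456789abcdef0',  # placeholder ID
                      Device='/dev/sdf')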

5
Q

Types of storage

A
  • Object store (S3)
  • Shared file system (NAS) (EFS)
  • Relational Databases (RDS)
  • NoSQL databases
  • data warehouses
6
Q

6 characteristics of Cloud Storage Systems

A
  • voluminous data
  • commodity hardware (discrepancy between processor speed and storage access time)
  • distributed data
  • expect failures
  • processing by applications
  • optimization for dominant usage
7
Q

CAP Theorem

A

Consistency, availability, and partition-tolerance cannot be achieved together in a distributed system

consistency (CP) = read returns the last write value (strict)
availability (AP) = all requests are answered in an acceptable time
partition-tolerance = the system continues working even if some nodes are separated

8
Q

Which of the 3 aspects of the CAP Theorem is essential in large-scale distributed cloud systems?

A

Partition-tolerance
-> storage solutions focus on either availability (AP) or consistency (CP)
-> AP systems apply eventual consistency: consistency is provided only after a certain time

9
Q

What is S3 Object Storage most used for?

A
  • backup
  • data is spread across >= 3 data centers in a region
10
Q

Data management in S3

A

Two-level hierarchy of buckets and data objects
- data objects have a name, a blob of data (up to 5 TB), and metadata
- data objects can be searched by name, bucket name, and metadata BUT NOT CONTENT
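A short boto3 sketch of the two-level hierarchy (the bucket name is a placeholder and must be globally unique):

    import boto3

    s3 = boto3.client('s3')

    # Level 1: the bucket.
    s3.create_bucket(Bucket='my-demo-bucket')

    # Level 2: a named data object = blob plus user-defined metadata.
    s3.put_object(Bucket='my-demo-bucket', Key='reports/2024.csv',
                  Body=b'col1,col2\n1,2\n',
                  Metadata={'owner': 'alice', 'project': 'demo'})

    # Objects are addressed by bucket + name; the metadata can be read
    # back, but the content itself is never searched.
    head = s3.head_object(Bucket='my-demo-bucket', Key='reports/2024.csv')
    print(head['Metadata'])  # {'owner': 'alice', 'project': 'demo'}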

11
Q

5 storage classes in AWS S3

A
  • standard
  • reduced redundancy (for data whose loss is tolerable)
  • intelligent tiering
  • glacier (retrieval in 1-5 min)
  • deep archive (retrieval in 12 h)
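The class is chosen per object at upload time, e.g. with boto3 (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client('s3')
    # Write a rarely needed backup straight into the cheapest class.
    s3.put_object(Bucket='my-demo-bucket', Key='backups/2020.tar',
                  Body=b'archive bytes', StorageClass='DEEP_ARCHIVE')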
12
Q

Data access, versioning, and lifecycle in S3

A
  • via Simple Object Access Protocol (SOAP), REST, BitTorrent
  • data cannot be modified, only uploaded, deleted, retrieved
  • versioning possible
  • lifecycle: rules can be set for transition (migration of objects to another storage class) and expiration (when an object may be deleted)
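A boto3 sketch of one rule combining both lifecycle actions (bucket and prefix are placeholders):

    import boto3

    s3 = boto3.client('s3')
    # Move objects under logs/ to Glacier after 30 days and delete them
    # after one year.
    s3.put_bucket_lifecycle_configuration(
        Bucket='my-demo-bucket',
        LifecycleConfiguration={'Rules': [{
            'ID': 'archive-then-expire-logs',
            'Filter': {'Prefix': 'logs/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
            'Expiration': {'Days': 365},
        }]},
    )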
13
Q

Consistency in S3

A
  • When creating new objects, the key (name) becomes visible only after all replicas have been written (read-after-write)
  • eventual consistency
14
Q

Requirements of Google File System (GFS)

A
  • most writes are appending at the end
  • optimized for long sequential and short random reads/writes
  • bandwidth is more important than latency (batch processing)
  • support for concurrent modifications
15
Q

Is it better to put smaller files or larger ones on GFS?

A

Better to put larger ones because:

GFS uses a single master server and many chunk servers
-> large chunks reduce the metadata (kept at the master) and the number of connections to the chunk servers

16
Q

How is the directory implemented in GFS?

A
  • lookup table mapping full path names to metadata (no per-directory data structures)
17
Q

What is special about the GFS Architecture?

A

Control and data flow are decoupled

–> Client first contacts the master but then interacts directly with the chunk servers (one is selected as primary and updates the replicas)

18
Q

Data integrity, consistency, and metadata in GFS

A

Data integrity:
- each chunk server keeps a checksum
- corrupted chunks are overwritten with a replica
Consistency:
- concurrent writes and appends to chunks are supported
Metadata:
- the master server holds the metadata about all chunks
- each chunk server stores its own metadata and checksums

19
Q

7 system interaction steps in GFS

A
  1. Client asks the master for all chunk servers holding the chunk
  2. Master grants a new lease on the chunk; the version # of all replicas is increased
  3. Client pushes data to all chunk servers
  4. Client sends the write request to the primary
  5. Primary forwards the write request to the secondaries
  6. Secondaries reply to the primary upon completion
  7. Primary replies to the client with success or error (error if the write succeeds at the primary but fails at some secondaries)
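GFS has no public API, so the following is only a plain-Python sketch of steps 1-7 with invented names; it shows how the data flow (step 3) is decoupled from the control flow (steps 4-6):

    class ChunkServer:
        def __init__(self):
            self.staged = {}          # data pushed by the client (step 3)
            self.chunk = bytearray()  # this server's replica of the chunk

        def push(self, op_id, data):  # step 3: data flow to ALL replicas
            self.staged[op_id] = data

        def apply(self, op_id):       # steps 4-6: control flow
            self.chunk += self.staged.pop(op_id)
            return 'ok'

    class Master:
        def __init__(self, replicas):
            self.replicas, self.version = replicas, 0

        def lookup_and_lease(self, path):  # steps 1-2
            self.version += 1              # new lease bumps the version #
            return self.replicas[0], self.replicas[1:]  # primary, secondaries

    def gfs_write(master, data, op_id='op-1'):
        primary, secondaries = master.lookup_and_lease('/logs/app')  # step 1
        for server in [primary] + secondaries:        # step 3
            server.push(op_id, data)
        primary.apply(op_id)                          # step 4
        acks = [s.apply(op_id) for s in secondaries]  # steps 5-6
        # Step 7: an error is reported unless every replica succeeded.
        return 'ok' if all(a == 'ok' for a in acks) else 'error'

    master = Master([ChunkServer() for _ in range(3)])
    print(gfs_write(master, b'appended record'))  # -> ok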
20
Q

How does the system interaction work for appends in GFS?

A

Same as before, but in step 4 (client sends the write request to the primary) the primary checks whether appending to the current chunk would exceed the maximum chunk size of 64 MB. If it would, the chunk is padded to its full size and the append is retried on the next chunk.

21
Q

Limitations of GFS

A
  • scalability of the single master -> partitioning of the file system and development of a distributed master
  • 64 MB chunk size (e.g., Google Mail has much smaller files)
  • no latency guarantees
22
Q

Characteristics of Amazon Elastic File System (EFS)

A
  • distributed
  • capacity: unlimited file system size, individual files up to ~48 TB
  • automatic provisioning of capacity
  • integrated lifecycle management (infrequently used files are moved to cheaper storage)
  • parallel access from up to 1000s of EC2 instances
  • throughput scales with the file system size
  • aggregated IOPS scale with the # of threads accessing EFS
  • many security measures available
23
Q

Consistency in AWS EFS

A
  • close-to-open consistency: Any changes are flushed to the server on closing the file, and a cache revalidation occurs when you re-open it
  • EFS can also provide stronger, read-after-write consistency (strict consistency)
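A sketch of what close-to-open semantics mean in practice (the mount point /mnt/efs is an assumption):

    import os

    # Writer on instance A: close() flushes the change to the EFS servers.
    with open('/mnt/efs/shared.log', 'a') as f:
        f.write('job finished\n')

    # Reader on instance B: re-opening revalidates the cache, so the line
    # written above is now visible (close-to-open consistency).
    with open('/mnt/efs/shared.log') as f:
        print(f.read())

    # For read-after-write behaviour without closing the file, push the
    # data out explicitly:
    fd = os.open('/mnt/efs/shared.log', os.O_WRONLY | os.O_APPEND)
    os.write(fd, b'visible after fsync\n')
    os.fsync(fd)   # flush to the server while the file stays open
    os.close(fd)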
24
Q

What are the ACID properties of Relational Databases (RDB)?

A

ACID:
- Atomicity: the set of operations is executed completely or it does not change anything
- Consistency: a transaction takes the database from one consistent state to another
- Isolation: during the execution of a transaction, no intermediate state is visible to the outside
- Durability: the result is stored persistently
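Atomicity is easy to demonstrate with the relational database shipped with Python (sqlite3); a minimal sketch:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)')
    conn.executemany('INSERT INTO accounts VALUES (?, ?)',
                     [('alice', 100), ('bob', 0)])
    conn.commit()

    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute("UPDATE accounts SET balance = balance - 60 "
                         "WHERE name = 'alice'")
            raise RuntimeError('crash in the middle of the transfer')
    except RuntimeError:
        pass

    # Atomicity: the half-finished transfer was rolled back completely.
    print(conn.execute('SELECT * FROM accounts ORDER BY name').fetchall())
    # -> [('alice', 100), ('bob', 0)]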

25
Q

Are RDB designed for vertical scaling?

A

Yes, RDBs scale vertically (a more powerful machine) rather than horizontally

26
Q

Characteristics of Amazon Aurora

A
  • Amazon’s own RDB as an alternative to MySQL
  • fully managed
  • database instances up to 64 TB
  • low price
  • data is replicated in 6 copies
  • automatic backup to S3, automatic scaling
27
Q

NoSQL storage

A
  • Schema-free: Easy to incorporate changes in applications
  • Support for non-relational data
  • designed for horizontal scaling (automatic distribution)
28
Q

Types of NoSQL databases

A
  • Key-value database
  • Document-oriented (JSON)
  • Graph
  • Column-family
29
Q

Amazon’s NoSQL database is?

A

Amazon Dynamo

30
Q

What is Amazon Dynamo?

A

NoSQL, Key-value database
- optimized for small requests, quick access, high availability
- fault-tolerant
- automatic scaling of tables
- support for ACID transactions
- fine-grained access control for tables

31
Q

DynamoDB

A
  • decentralized architecture and eventual consistency
  • stores key-value pairs in a table
  • schema-less
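A minimal boto3 sketch of the key-value interface (table name and key schema are placeholder assumptions; the table is assumed to exist):

    import boto3

    table = boto3.resource('dynamodb').Table('users')

    # Schema-less: apart from the key, items may carry arbitrary attributes.
    table.put_item(Item={'user_id': '42', 'name': 'Alice', 'tags': ['admin']})

    # Reads are eventually consistent by default; opt in to a strongly
    # consistent read when the latest value is required.
    resp = table.get_item(Key={'user_id': '42'}, ConsistentRead=True)
    print(resp['Item'])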
32
Q

Management of Partitions in Dynamo

A
  1. Mapping of keys to partitions
    - keys are hashed
    - the hash space is treated as a ring
  2. Mapping of partitions to nodes
    - the ring is split into segments that are handled by virtual nodes
    - hashing the key and walking clockwise determines the responsible virtual node
  3. Virtual nodes are assigned to physical nodes
    - accounts for the heterogeneity of physical nodes (see the sketch below)
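A minimal plain-Python sketch of steps 1-3 (the hash function and node names are arbitrary choices):

    import bisect
    import hashlib

    def ring_pos(key):
        # Step 1: hash the key; the hash space is treated as a ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, physical_nodes, vnodes=8):
            # Step 3: each physical node hosts several virtual nodes; a more
            # powerful machine could simply register more of them.
            self.ring = sorted((ring_pos(f'{n}#{i}'), n)
                               for n in physical_nodes for i in range(vnodes))
            self.positions = [p for p, _ in self.ring]

        def lookup(self, key):
            # Step 2: walk clockwise from the key's position to the next
            # virtual node, wrapping around the ring.
            i = bisect.bisect(self.positions, ring_pos(key)) % len(self.ring)
            return self.ring[i][1]

    ring = Ring(['node-a', 'node-b', 'node-c'])
    print(ring.lookup('user:42'))  # responsible physical node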
33
Q

What is (N,R,W) replication in Dynamo?

A

Replication (N,R,W)
- to N consecutive nodes
- a read is successful overall if it succeeds on R copies
- likewise, a write is successful if it succeeds on W copies

-> ensures that the replicas are on distinct physical nodes

34
Q

What is the typical replication configuration in Dynamo?

A

(N,R,W) = (3,2,2) -> R+W > N
-> ensures that the most recently written value is returned (strongly consistent reads: the latest value is always returned)
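The guarantee is simple arithmetic; a tiny sketch:

    # With R + W > N every read quorum intersects every write quorum in at
    # least R + W - N replicas, so a read always sees the latest write.
    N, R, W = 3, 2, 2
    assert R + W > N
    print('guaranteed read/write overlap:', R + W - N, 'replica(s)')  # 1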

35
Q

What can N, R, W be used for in Dynamo?

A

To meet the SLA requirements of the service
- N (# of consecutive nodes) determines the durability
- R and W determine the latency

36
Q

How is failure handled in Dynamo?

A
  • Gossip protocol: once a node stops responding, other nodes will eventually propagate knowledge of the failure
  • Admin can replace the node
37
Q

What is used to handle failures?

A

Replication

38
Q

Comparison of S3, EBS, EFS

A

https://cloud.netapp.com/blog/ebs-efs-amazons3-best-cloud-storage-system