Chapters 3&4 Knowledge Testers Flashcards
Questions at the end of each section (33 cards)
Do you know how storage and memory technologies (HDD, SSD and RAM) compare in terms of capacity and throughput?
Capacity (least to most): RAM, SSD, HDD
Throughput (slow to fast): HDD, SSD, RAM
Do you know the difference between data and metadata?
Metadata: information about a file, conceptually a small relational table with the following attributes
- name of the file, access rights, owner, group of owner, last modification time, creation time, size
Data: content of the files
Order of magnitude that can be achieved in terms of number of objects, and object size?
S3: up to 100 buckets per account (by default); an object can be up to 5 TB and is uploaded in chunks of up to 5 GB
Can you name a few big players for cloud-based object storage (vendors, consumers)?
Vendors: Amazon (S3), Microsoft (Azure Blob Storage)
Can you describe the features of S3 and Azure Blob Storage on a high level? Do you know what a bucket and object are? What block blob storage, append blob storage and page blob storage are and how they work?
S3: buckets that contain objects (each up to 5 TB, so an object fits on a single disk)
- no hierarchy among objects: a flat key space (see the boto3 sketch after this list)
Azure Blob Storage: publicly documented; a blob is identified by account, container, and blob name
- organized in storage stamps (clusters of 10-20 racks, about 30 PB each)
- exposes more internal details to users than S3
- block blobs: store data, uploaded and committed as lists of blocks
- append blobs: optimized for append-only workloads such as logging
- page blobs: for storing and accessing the memory of virtual machines (random read/write access)
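A minimal sketch of the S3 object model using boto3 (assuming AWS credentials are already configured; the bucket and key names are made up for illustration):

```python
import boto3

s3 = boto3.client("s3")

# A bucket is a flat container of objects; "folders" are just key prefixes.
s3.create_bucket(Bucket="my-example-bucket")   # hypothetical bucket name

# An object = key + value (the bytes) + metadata; the "/" in the key is
# part of the name, not a real directory.
s3.put_object(Bucket="my-example-bucket",
              Key="reports/2024/q1.csv",
              Body=b"id,amount\n1,42\n")

obj = s3.get_object(Bucket="my-example-bucket", Key="reports/2024/q1.csv")
print(obj["Body"].read())
```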
Describe what the most important SLA (Service Level Agreement) parameters mean (e.g., latency, availability, durability) as well as their typical range?
latency: how quickly a request is answered, usually expressed as a response-time percentile (e.g., 99.9% of requests served within 30 seconds); S3 gives no latency guarantees
availability: how often the service is reachable (e.g., 99.99% of the year, i.e. about 52 minutes per year with no availability)
durability: how little data will be lost (e.g., 99.999999999%, nine 9s after the decimal point; the annual probability of losing an object is about 1 in 100 billion) (see the calculation below)
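A quick sanity check of these numbers in Python (the availability and durability figures are the illustrative ones above, not any vendor's actual SLA):

```python
# Downtime allowed per year for a given availability, and the yearly
# object-loss probability implied by a given durability.
MINUTES_PER_YEAR = 365.25 * 24 * 60   # about 525,960 minutes

def max_downtime_minutes(availability_percent):
    """Minutes per year the service may be unavailable."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

def loss_probability(durability_percent):
    """Probability that a given object is lost within a year."""
    return 1 - durability_percent / 100

print(max_downtime_minutes(99.99))       # ~52.6 minutes of downtime per year
print(loss_probability(99.999999999))    # ~1e-11, i.e. ~1 object in 100 billion
```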
Do you know what each letter stands for in CAP?
Consistency (atomic): at any point in time, the same request to any server returns the same result (all nodes see the same data)
Availability: the system is available for requests at all times
Partition tolerance: the system continues to function even if the network linking its machines is occasionally partitioned
Can you explain why, for large amounts of data, CAP becomes relevant over ACID?
ACID: for data stored entirely on one machine, as in relational databases
CAP: for big data, where many machines are used and replicas of the data must be created and kept up to date across nodes
Can you explain what a REST API is, what resources and methods are?
Data stores expose their functionality through APIs. REST = REpresentational State Transfer, often described as "HTTP done right".
Resources: anything (a document, a PDF, a person), referred to with a URI (Uniform Resource Identifier)
Methods: the standard operations applied to resources: GET, PUT, DELETE, POST (see the sketch below)
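A minimal sketch of the four methods applied to a resource, using Python's requests library; the endpoint https://api.example.com/documents and its payloads are hypothetical:

```python
import requests

# A resource is identified by a URI; the methods are the verbs applied to it.
base = "https://api.example.com/documents"        # made-up endpoint

requests.post(base, json={"title": "report"})     # POST: create a new resource
doc_uri = f"{base}/42"                            # URI of one specific resource

requests.get(doc_uri)                             # GET: read the resource
requests.put(doc_uri, json={"title": "final"})    # PUT: replace/update it
requests.delete(doc_uri)                          # DELETE: remove it
```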
Can you describe a typical use case for object storage?
shopping carts for large online stores
Can you explain the difference between block storage and object storage?
object storage -> flat, key-value pairs
-> HUGE numbers (billions/trillions) of large objects (up to 5 TB each)
block storage -> hierarchies
-> fewer (millions of) files, but each can be HUGE (larger than 5 TB)
Can you explain the difference between the (logical) key-value model and a file system?
key-value model: flat, no hierarchies
file system: hierarchical (directories and files); see the illustration below
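A tiny illustration of the difference (the keys and paths are made up): in a key-value store a key like "2024/reports/q1.csv" is just an opaque string, while a file system has real intermediate directories:

```python
from pathlib import Path

# Key-value model: the "/" inside a key has no structural meaning.
object_store = {
    "2024/reports/q1.csv": b"...",
    "2024/reports/q2.csv": b"...",
}
print("2024/reports" in object_store)   # False: there is no directory object

# File system: directories exist as real nodes in a hierarchy.
p = Path("2024") / "reports" / "q1.csv"
print(list(p.parents))                  # [Path('2024/reports'), Path('2024'), Path('.')]
```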
Do you know the order of magnitude of a block size for a local filesystem and for a distributed file system? Can you explain the rationale behind them with respect to latency and throughput?
Local: 4kB
DFS block size: 64 or 128 MB
large enough that little time is lost in latency (seek and request overhead) waiting for a block to arrive, relative to the time spent transferring it
small enough that a large file is conveniently spread over many machines (parallel access) to improve throughput, and small enough that re-sending a block after an error is not too costly (see the back-of-the-envelope calculation below)
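A back-of-the-envelope sketch, assuming roughly 10 ms of fixed latency per block and a sequential transfer rate of about 100 MB/s (illustrative numbers, not measurements):

```python
SEEK_MS = 10          # assumed fixed cost (seek/request) per block
RATE_MB_PER_S = 100   # assumed sequential transfer rate

def latency_overhead(block_size_mb):
    """Fraction of total block-read time spent on the fixed per-block cost."""
    transfer_ms = block_size_mb / RATE_MB_PER_S * 1000
    return SEEK_MS / (SEEK_MS + transfer_ms)

print(latency_overhead(4 / 1024))   # ~4 KB block: ~99.6% of the time is overhead
print(latency_overhead(128))        # 128 MB block: ~0.8% overhead; transfer dominates
```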
Can you contrast centralized architectures to decentralized (peer-to- peer) architectures?
Decentralized: all machines are peers and can talk to each other directly.
Centralized: there is one central machine; all other machines talk to it, so one node is special (the coordinator).
The other nodes are ordinary and interchangeable.
Can you explain the HDFS architecture, what a NameNode is and what a DataNode is, how blocks are replicated?
HDFS: distributed file system; a hierarchy of files spread over multiple machines
NameNode: the special node that all other nodes communicate with; it stores the namespace, the mapping from each file to the list of its blocks, and the mapping from each block to the locations of its replicas
DataNode: stores data blocks on its local disk and communicates with the NameNode through regular heartbeats (the DataNode always initiates contact); DataNodes communicate with each other through replication pipelines
Blocks are replicated through replication pipelines between DataNodes: blocks are sent as smaller packets in a streaming fashion. The first DataNode sends the first copy of the data to DataNode #2, which propagates it onward; the first node does NOT send the data to all the replicas itself (see the sketch below).
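A toy sketch of the pipeline idea (pure illustration, not actual HDFS code): each DataNode stores a packet and forwards it to the next node in the chain, instead of the source sending the whole block to every replica.

```python
class DataNode:
    """Toy DataNode: stores packets and forwards them to the next node."""
    def __init__(self, name, next_node=None):
        self.name, self.next_node, self.packets = name, next_node, []

    def receive(self, packet):
        self.packets.append(packet)         # write the packet to "local disk"
        if self.next_node is not None:
            self.next_node.receive(packet)  # forward it along the pipeline

# The NameNode (not modeled here) chooses the pipeline: dn1 -> dn2 -> dn3.
dn3 = DataNode("dn3")
dn2 = DataNode("dn2", next_node=dn3)
dn1 = DataNode("dn1", next_node=dn2)

# The block is streamed as small packets to the FIRST DataNode only.
for packet in [b"packet-0", b"packet-1", b"packet-2"]:
    dn1.receive(packet)

print(len(dn3.packets))   # 3: the last replica received every packet via the chain
```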
Can you sketch how the various components communicate with each other (client, NameNode, DataNode)?
NameNode: never initiates communication; it responds to the DataNodes' heartbeats (and to client requests)
DataNode: communicates with the NameNode through regular heartbeats (the DataNode always initiates); DataNodes communicate with each other through replication pipelines
Blocks are replicated through replication pipelines between DataNodes, sent as smaller packets in a streaming fashion: the first DataNode passes the data on to the next, rather than sending it to all replicas itself.
The client communicates with both the NameNode (for metadata) and the DataNodes (for the actual data); see the read sketch below.
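A sketch of a client read under these rules, with made-up method names (the real HDFS client API differs): the client asks the NameNode only for metadata and then streams the blocks directly from DataNodes.

```python
def read_file(path, namenode, datanodes):
    """Toy HDFS read: metadata from the NameNode, data from the DataNodes."""
    # 1. Ask the NameNode which blocks make up the file and where the
    #    replicas live (get_block_locations is a made-up method name).
    block_locations = namenode.get_block_locations(path)

    data = b""
    for block_id, replica_hosts in block_locations:
        # 2. Read each block directly from one of its replica DataNodes;
        #    the NameNode never handles the actual data.
        data += datanodes[replica_hosts[0]].read_block(block_id)
    return data
```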
Can you point to the single points of failure of HDFS and explain how they can be addressed?
If the NameNode fails, the whole cluster becomes unusable: the NameNode is the single point of failure. To address this, the NameNode keeps an edit log of changes and periodically merges it into a snapshot (checkpoint). In addition, a Standby NameNode keeps the exact same structures in its memory as the active NameNode and performs the checkpoints; it can be set up so that the Standby NameNode instantly takes over from the NameNode in case of a crash.
Can you explain how the NameNode stores the file system namespace, in memory and on disk? In particular, can you explain how the namespace file and the edit log work together at startup time and how they get modified once the system is up and running?
HDFS does not follow a key-value model: instead, an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system. This namespace is stored on the NameNode. The file namespace, containing the directory and file hierarchy as well as the mapping from files to block IDs, is backed up to a so-called snapshot. The snapshot and the edit log are stored either locally or on a network-attached drive (not on HDFS itself); for more resilience, they can also be copied to further backup locations. If the NameNode crashes, it can be restarted: the snapshot is loaded back into memory to recover the file namespace and the mapping of files to block IDs, and then the edit log is replayed to apply the latest changes (see the sketch below).
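A minimal sketch of how the snapshot and edit log work together (illustrative Python, not the actual NameNode implementation; file formats and operations are made up):

```python
import json

class ToyNameNode:
    """In-memory namespace persisted as a snapshot plus an edit log."""

    def __init__(self, snapshot_path, editlog_path):
        self.snapshot_path, self.editlog_path = snapshot_path, editlog_path
        self.namespace = {}   # file path -> list of block IDs

    def start_up(self):
        # 1. Load the last checkpoint (snapshot) into memory, if there is one.
        try:
            with open(self.snapshot_path) as f:
                self.namespace = json.load(f)
        except FileNotFoundError:
            self.namespace = {}
        # 2. Replay the edit log to re-apply changes made since the checkpoint.
        try:
            with open(self.editlog_path) as f:
                for line in f:
                    op, path, blocks = json.loads(line)
                    if op == "create":
                        self.namespace[path] = blocks
                    elif op == "delete":
                        self.namespace.pop(path, None)
        except FileNotFoundError:
            pass

    def create_file(self, path, blocks):
        # While running: update memory AND append the change to the edit log.
        self.namespace[path] = blocks
        with open(self.editlog_path, "a") as f:
            f.write(json.dumps(["create", path, blocks]) + "\n")

    def checkpoint(self):
        # Periodically: dump the namespace to a fresh snapshot, truncate the log.
        with open(self.snapshot_path, "w") as f:
            json.dump(self.namespace, f)
        open(self.editlog_path, "w").close()
```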
Can you explain what a Standby NameNode is? (Note: it has many predecessors that only have historical relevance in the development of HDFS: Backup NameNode, Secondary NameNode, Checkpoint NameNode, etc, but this is not important for the course)
A Standby NameNode keeps the exact same structures in its memory as the active NameNode and performs checkpoints periodically. It can be set up so that the Standby NameNode instantly takes over from the NameNode in case of a crash.
Where does HDFS shine and why?
Petabytes of data; write-once, read-many access patterns; fault tolerance; high throughput
Do you know that HDFS files are updated by appending atomically and why?
Appending atomically simplifies the consistency model (no concurrent in-place updates to coordinate), fits batch processing, and matches append-only workloads such as logging
Do you know how HDFS performs in terms of throughput and latency?
Optimized for high throughput; not designed for low-latency access
What are the main benefits of HDFS?
Handles massive files; streaming data access; scales out well; fault tolerant; highly available
Describe the limitations of traditional (local) file systems?
Local file systems must fit on one machine, which constrains their size: the disks of a single machine cannot store big datasets.