Big Data Lecture 03 Cloud Storage Flashcards
What is the issue with big datasets?
They do not fit on a single machine. For example, the Sloan Digital Sky Survey dataset has 273 TB of data in 680 000 directories and 176 000 000 files.
What things are broken in the world of NoSQL?
<ul><li>Relational integrity,</li><li>domain integrity,</li><li>atomic integrity (1st normal form),</li><li>2nd/3rd/Boyce-Codd normal form.</li></ul>
What new data properties are added in NoSQL?
<ul><li>Heterogeneous data (no schema),</li><li>nested data (break atomic integrity),</li><li>denormalized data (no normal forms).</li></ul>
Describe the tech stack of big data systems.
UI
Querying
Data Stores
Indexing
Processing
Validation
Data models
Syntax
Encoding
Storage
What is ETL?
When loading data into a traditional database, we need to Extract, Transform and Load it (ETL).
What is a data lake? What file operations is it meant for?
As opposed to a traditional database, it reads data directly from the file system.<br></br><br></br>It is meant for reading and querying, not for editing.
How are files stored in a file system?
File content is stored in blocks, usually of 4 kB. If a file's size is not an exact multiple of the block size, the last block is taken up whole anyway.
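The rounding to whole blocks can be sketched as a small calculation. This is only an illustration: the 4 kB block size is the typical value from the card, and the function names `blocks_needed` and `space_on_disk` are made up for this example.

```python
BLOCK_SIZE = 4 * 1024  # assumed block size: 4 kB, the typical value from the card


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks occupied by a file of file_size bytes."""
    # Ceiling division using integers only, so there are no rounding surprises.
    return -(-file_size // block_size)


def space_on_disk(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes actually taken up on disk, including the unused tail of the last block."""
    return blocks_needed(file_size, block_size) * block_size
```

Under these assumptions a 1-byte file still occupies one full block, and a file one byte larger than a block occupies two.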
What networks does local storage use?
<ul><li>Local machine,</li><li>LAN (local area network), NAS (drive on network),</li><li>not WAN (wide area network).</li></ul>
Principle of scaling from simple data storage into data lake?
<div>Simplify!</div>
<ul><li>Throw away folder structure, use flat objects,</li><li>give data unique ID (key-value model).</li></ul>
How to scale a system expensively and cheaply?
<ul><li>Expensively: scale up - buy a larger, stronger machine,</li><li>cheaply: scale out - buy many cheap machines,</li><li>be smart! Optimize your code!</li></ul>
What are the constraints on the data centers? What are the number of machines and cores?
Due to electricity grid and cooling:<br></br><ul><li>1000 - 100 000 machines in a data center,</li><li>1-200 CPU cores per machine.</li></ul>
How big is a local storage, memory and bandwidth per server?
<ul><li>1-30 TB of storage,</li><li>16GB-24TB of RAM,</li><li>1-200 Gbit/s.</li></ul>
How are servers stored in a data center?
They are in server racks; one rack has 42 rack units.<br></br><br></br>This ensures modularity, as we can stack servers, storage and routers into the same rack.<br></br><br></br>Each module takes up 1-4 rack units.
Describe S3 data storage model.
<ul><li>Data is stored in buckets, each has a (worldwide) unique ID,</li><li>files (max. 5 TB) are stored as objects in the buckets, denoted by (in-the-bucket) unique ID.</li></ul>
What guarantees does S3 offer in SLA (service level agreement)?
<ul><li>Durability: 99.999999999% (lose 1 in 10<sup>11</sup> objects),</li><li>availability: 99.99% (down 1h/year),</li><li>response time: < 10 ms in 99.9% of cases (not mean or average).</li></ul>
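The percentages above translate into concrete numbers with a little arithmetic. The sketch below uses exact fractions to avoid floating-point noise; the helper names `failure_fraction` and `downtime_minutes_per_year` are hypothetical, and a 365-day year is assumed.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def failure_fraction(percent: str) -> Fraction:
    """Turn a guarantee such as '99.99' (percent) into the exact failing share."""
    return 1 - Fraction(percent) / 100


def downtime_minutes_per_year(availability_percent: str) -> Fraction:
    """Allowed downtime per year for a given availability percentage."""
    return failure_fraction(availability_percent) * MINUTES_PER_YEAR


loss_share = failure_fraction("99.999999999")    # share of objects that may be lost
downtime = downtime_minutes_per_year("99.99")    # minutes of allowed downtime
```

With these inputs `loss_share` is exactly 1 in 10^11, and `downtime` is a little under 53 minutes, which the card rounds to 1h/year.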
Explain CAP theorem.
Impossibility triangle - a storage system cannot be:<br></br><ul><li><i>C</i>onsistent (all replicas and versions of the data agree),</li><li><i>A</i>vailable (reachable with low latency),</li><li><i>P</i>artition tolerant (keeps working when the network breaks up into parts),</li></ul><div>all at the same time.</div>
What are REST APIs?
<b>Representational state transfer</b>: an architectural style for web APIs built on HTTP, in which clients read and modify resources identified by URIs using the HTTP methods.
How are resources referred to? What are the parts of such a reference?
Using URI (uniform resource identifier), which has<br></br><ul><li>scheme: https</li><li>domain: www.example.com</li><li>path: api/collection/foo/object/bar</li><li>query: ?id=foobar</li><li>fragment: #head</li></ul>
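The parts listed above can be pulled out of a URI with Python's standard `urllib.parse` module. A minimal sketch, where the URI is simply the card's example pieces joined together:

```python
from urllib.parse import urlsplit, parse_qs

# The example components from the card, assembled into one URI.
uri = "https://www.example.com/api/collection/foo/object/bar?id=foobar#head"

parts = urlsplit(uri)

scheme = parts.scheme            # "https"
domain = parts.netloc            # "www.example.com"
path = parts.path                # "/api/collection/foo/object/bar"
query = parse_qs(parts.query)    # {"id": ["foobar"]}
fragment = parts.fragment        # "head"
```

Note that `urlsplit` reports the path with its leading slash and the query without the `?`, and `parse_qs` returns a list per parameter because a parameter may repeat.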
What HTTP methods are there and what do they do? Are they idempotent?
<ul><li>GET: obtains resource,</li><li>PUT: stores resource,</li><li>DELETE: deletes resource,</li><li>POST: anything else.</li></ul>
<div>Only POST is not idempotent.</div>
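Idempotent means that repeating a request leaves the server in the same state as sending it once. The following is a toy illustration only: `TinyStore` is a hypothetical in-memory class, not a real HTTP server or library, and its `post` is assumed to create a new object under a fresh key each time.

```python
import itertools


class TinyStore:
    """Hypothetical in-memory stand-in for a server holding resources."""

    def __init__(self):
        self.objects = {}
        self._ids = itertools.count(1)

    def get(self, key):
        # Reading never changes the state.
        return self.objects.get(key)

    def put(self, key, value):
        # Storing the same value under the same key twice changes nothing further.
        self.objects[key] = value

    def delete(self, key):
        # Deleting an already deleted key is a no-op.
        self.objects.pop(key, None)

    def post(self, value):
        # Assumed behaviour: every call creates a new object, so repeats pile up.
        key = f"item-{next(self._ids)}"
        self.objects[key] = value
        return key
```

Calling `put("a", 1)` twice leaves one object in the store, while calling `post(1)` twice leaves two, which is why a client can safely retry the first three methods but not POST.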
How does the HTTP protocol work?
<ul><li>Request is sent with header and body,</li><li>Response is issued with status code, header and body.</li></ul>
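The shape of the two messages can be sketched with plain strings. This is a simplified illustration of HTTP/1.1 text framing, not a complete implementation; `build_request` and `parse_response` are hypothetical helpers, and the host and header values are made up.

```python
def build_request(method: str, path: str, host: str, body: str = "") -> str:
    """Assemble a request: request line, headers, blank line, body."""
    lines = [
        f"{method} {path} HTTP/1.1",
        f"Host: {host}",
        f"Content-Length: {len(body.encode())}",
    ]
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_response(raw: str):
    """Split a response into (status code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body
```

In practice a library such as Python's `http.client` handles this framing; the sketch only makes the header/body split and the status code visible.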
Do URIs on Cloud Storage use file structure with slashes?
No, but you can use slashes to create logical structure for yourself.
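A flat key space with slash-separated keys can be pictured as a plain dictionary. In this sketch the bucket contents and the helper `list_keys` are invented for illustration; the "folders" exist only as shared key prefixes.

```python
# A bucket modelled as a flat mapping from key to object content (made-up data).
bucket = {
    "photos/2023/cat.jpg": b"...",
    "photos/2023/dog.jpg": b"...",
    "photos/2024/bird.jpg": b"...",
    "notes.txt": b"...",
}


def list_keys(bucket: dict, prefix: str = "") -> list:
    """Return all keys starting with prefix; there is no real directory to open."""
    return sorted(key for key in bucket if key.startswith(prefix))


keys_2023 = list_keys(bucket, "photos/2023/")
```

Here `keys_2023` contains the two 2023 photos, even though no `photos/2023/` object or folder exists anywhere in the bucket.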
Do data centers get filled to capacity?
No, they are filled up to 70-80%, then resources are reallocated or a new center has to be built.
What is intra-stamp replication?
A synchronous method of duplicating data, done while the client uploads.
What is inter-stamp replication?
After the user has finished uploading, the resources are duplicated to different places in the data center asynchronously.
2. increase resilience to natural catastrophes.
2. but with smaller objects,
3. and no metadata.
- simplicity,
- only eventual consistency,
- increased performance,
- scalability.
The range is cut up / merged and the data is transferred.
The data is redundant, the node ranges overlap (over N ranges).
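The overlapping ranges can be pictured as a hash ring in which each key is stored on the N nodes that follow its position. This is a rough sketch under assumptions not stated on the cards: positions are derived with MD5, node names are made up, and `replicas_for` is a hypothetical helper rather than any system's actual API.

```python
import hashlib
from bisect import bisect_right


def ring_position(name: str) -> int:
    """Map a node name or key to a position on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


def replicas_for(key: str, nodes: list, n: int) -> list:
    """Return the n nodes that follow the key's position clockwise on the ring."""
    ring = sorted(nodes, key=ring_position)
    positions = [ring_position(node) for node in ring]
    # First node past the key's position, wrapping around at the end of the ring.
    start = bisect_right(positions, ring_position(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(min(n, len(ring)))]
```

Because each key lands on N consecutive nodes, removing a node that does not hold a given key leaves that key's replica list unchanged, so only neighbouring ranges need to move data.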
- Chord: finger tables (powers of two of what is where).
- Dynamo: preference lists (every node knows about the ranges of all other nodes).
- N - number of data duplicates,
- R - number of nodes each node reads from,
- W - number of nodes each node writes to (synchronously).
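A commonly cited consequence of these three parameters, not stated on the card itself, is that when R + W > N every set of nodes read from shares at least one node with every set of nodes written to. The brute-force check below is a sketch for small N only; `quorums_always_overlap` is a hypothetical helper and assumes 1 <= R, W <= N.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every choice of r read nodes intersects every choice of w write nodes."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )
```

For example, with N = 3 the choice R = 2, W = 2 always overlaps, whereas R = 1, W = 1 can read from a replica that the last write never reached.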
- First load balancer assigns a random node to be asked,
- the random node figures out who is the coordinator (e.g. the first node with the data) and asks them,
- coordinator redirects the request to N-1 nodes hosting replicas.
-: no lookup or search, data integrity, security issues.
There can be different nodes writing, and each one increments its own counter on every write.
The vectors form a directed acyclic graph (DAG). When two versions are compared and neither dominates the other (both are maximal elements, but neither is a supremum), they need to be merged (done by the user).
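The comparison described above can be sketched with dictionaries that map a node name to its counter. This is a minimal illustration, not any system's actual code: `increment` and `compare` are hypothetical helpers, and a missing entry is assumed to mean a counter of 0.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the writing node's own counter raised by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Classify two versions as 'equal', 'before', 'after' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"   # b descends from a, so b can replace a
    if b_le_a:
        return "after"    # a descends from b, so a can replace b
    return "concurrent"   # neither dominates: the versions must be merged
```

Two versions that both descend from a common ancestor but were written by different nodes come out as "concurrent", which is the case the card says the user has to merge.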
In Azure, all hardware is able to do everything, and each machine does a bit of everything.
- Objects are denoted by account, container and blob,
- there are 3 types of blobs: Block Blobs (files, at most 190.7 TB), Append Blobs (at most 195 GB) and Page Blobs (for storing and accessing the memory of virtual machines).