Data model Flashcards Preview

Distributed System > Data model > Flashcards

Flashcards in Data model Deck (19)
Loading flashcards...
1

Data Model is most important

How to model data (as Data structures/objects) is important

2

Each "layer" component may model same data differently

each layer hides the complexity of the
layers below it by providing a clean data model

For "address", at high level model as class/object Address, at DB level model as Table "Address" at HTTP layer model as "JSON/XML" etc

Data Model reflects the problem and its solution

3

Type of Data Models

(SQL) Relational Data Model : good for many to one (upside-down tree)
(noSQL) Document Data Model: good for one to many (tree)
(noSQL) Graph Data Model: good for many to many (DAG)
(noSQL) Full-Text Search Mode: good for full text search (log files)

0. SQL databases target use cases where there is few one to many relationship (mostly many to one) --> SQL

1. Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare. (I.e. such as log files) --> SQL, MapReduce Query

2. Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything. (social network) ---> Graph QL

3. Full-Text search --> Indexing

4

NoSQL

Not Only SQL

5

polyglot persistence (similar to Polygot Programming)

Adoption of multiple persistence data models (SQL, noSQL etc) in an application.

6

Why NoSQL (Document/JSON model)
--Better for One-to-Many (tree structure)

1 person has many job categories (1 category with 1 category ID)

For a complicated object, the "Normalized" SQL split into multiple tables, using Foreign ID to connect the tables (relationships).

To reconstruct the object, a "denormalization" (joining by foreign Ids) need to happen.

If save a object as JSON text inside a column, then the SQL cannot be used to query the content inside JSON, leaving the job to application.


Thus we need NoSQL, that stores data as JSON data model and can use SQL to query

7

Document Database (NoSQL)
vs
SQL tables (SQL)

SQL supports many-to-one, many-to-many (Normalization, JOIN) well, but do not support one-to-many well -> shifting the query of values inside JSON to application

NoSQL supports one-to-many natively (JSON tree) but does not support JOIN (many-to-one) well --> shifting the JOIN operation to user

***PRO
JSON object has better locality. All information about 1 person is in the same row/document.

JSON Document implies one-to-many tree structure

SQL tables: information of 1 person is scattered among multiple tables, needs a JOIN operation

***CON
Normalization avoids duplication, but implies many-to-one relationship.

A JSON document with normalization will have both
One-To-Many and Many-To-One relationships

8

What is Normalization in Database( SQL and NoSQL)
-- avoid duplication
-- single place change

Normalization in Database is avoid duplication of common constant values:

similar to C/C++ to do:
#define REGION_ID_192 "Philadelphia"
#define REGION_ID_190 "Seattle"

in C where we can
1) avoid duplication
2) change at 1 place will change all references to
e.e.
#define REGION_ID_190 "Greater Seattle"
will change all ppl referencing 190 to from "Greater Seattle" area.

9

To Normalize or not to Normalize?

To #define or not #define

10

Normalization implies Many-to-One

Many ppl reference REGION_ID_192 because many ppl are from Philadelphia

11

A JOIN-free document has no reference to entity IDs (no use of #define values)

A JOIN-free document today may not be JOIN-free tomorrow, if new feature added forcing normalization (using of entity IDs)

12

Since NoSQL weak at JOIN forced by Normalization, the debate of NoSQL vs SQL is also a debate of Normalization vs No Normalization

Normalization is An age old debate

13

Data Locality: Document Model vs Relational Model

Document Model: All data in one (json) document ==> In one storage sector ==> Locality in disk, in memory etc

Relational Model: Data is "shredded" into multiple tables. Need to JOIN to get the full data.
==> Multiple tables ==> Multiple storage sectors ==> Not necessarily in close vicinity in disk or memory

14

Data Locality: Document Model vs Relational Model

Document Model: All data in one (json) document ==> In one storage sector ==> Locality in disk, in memory etc ==> less disk seek, page swap etc

Relational Model: Data is "shredded" into multiple tables. Need to JOIN to get the full data.
==> Multiple tables ==> Multiple storage sectors ==> Not necessarily in close vicinity in disk or memory ==> more disk seek, page swap etc

15

Document Model Locality Cons:
1) Read Whole use Part --> a waste
2) In-place update may require relocation of the document (when it exceeds sizes)

some SQL also support locality

16

SQL is a declarative query language (vs imperative language like C/Java/Python etc)

SELECT (cols) FROM (table) WHERE (cond) {TRANSFORM}

e.g
select * from * where a=b SORT BY a ASEC

Another declaration language is CSS
# highlight background of blue of any paragraph that is direct child of a
  • element whose class is set to "selected" like

  • li.selected > p {
    background-color: blue;
    }
  • An imperative language tells the computer to perform certain operations in a certain order (exactly as the order of the statements)

    In a declarative query language, like SQL or relational algebra, you just specify WHAT not HOW

    I.e. specify the pattern of the data you want—what conditions the results must meet, and how you
    want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal.

    The SQL Optimizer will decide HOW, which allows PARALLEL execution (mapreduce for example)

    17

    SQL is a declarative query language (vs imperative language like C/Java/Python etc)

    SELECT (cols) FROM (table) WHERE (cond) {TRANSFORM}

    ORDER of execution:
    WHERE--> TRANSFORM-->SELECT

    e.g
    select * from * where a=b SORT BY a ASEC

    Another declaration language is CSS
    # highlight background of blue of any paragraph that is direct child of a
  • element whose class is set to "selected" like

  • li.selected > p {
    background-color: blue;
    }
  • An imperative language tells the computer to perform certain operations in a certain order (exactly as the order of the statements)

    In a declarative query language, like SQL or relational algebra, you just specify WHAT not HOW

    I.e. specify the pattern of the data you want—what conditions the results must meet, and how you
    want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal.

    The SQL Optimizer will decide HOW, which allows PARALLEL execution (mapreduce for example)

    18

    SQL engine can be implemented on top of MapReduce (for parallel execution), Similarly Pandas/Spark etc call all use MapReduce for parallel execution on data processing.

    MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines. Higher-level query languages like SQL can be implemented as a
    pipeline of MapReduce operations

    19

    ACID-compliant database (regardless SQL or noSQL)

    (atomicity, consistency, isolation, durability)