Week 4 - Parallel Data Architecture Flashcards
Two types of parallelism in a parallel database system
1) Pipeline Parallelism
2) Partition Parallelism
What is Pipeline Parallelism
Many machines each doing one step in a multi-step process
What is Partition Parallelism
Many machines doing the same thing to different pieces of data
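A minimal single-process sketch of the two ideas. The stage functions `parse` and `double` are hypothetical, and a real system would run each stage (or each partition) on a separate machine rather than in one Python process.

```python
def parse(rows):
    # Stage 1: turn raw strings into integers, one row at a time.
    for r in rows:
        yield int(r)

def double(values):
    # Stage 2: consumes stage 1's output as it is produced.
    for v in values:
        yield v * 2

def pipeline(rows):
    # Pipeline parallelism: each stage does one step of a multi-step process.
    return list(double(parse(rows)))

def partitioned(rows, n_parts):
    # Partition parallelism: every "machine" does the same work
    # on its own slice of the data.
    parts = [rows[i::n_parts] for i in range(n_parts)]
    return [list(double(parse(p))) for p in parts]
```

For example, `partitioned(["1", "2", "3", "4"], 2)` splits the rows into two slices and applies the same two steps to each slice.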
What is Speed up?
More resources means proportionally less time for a given amount of data (ideally a straight 45-degree line on the speed-up graph)
What is scale-up?
If resources are increased in proportion to the increase in data size, time is constant (no diminishing returns)
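The two measures can be written as simple ratios of elapsed times. This is a sketch using the common textbook definitions; the function and parameter names are made up for illustration.

```python
def speed_up(time_small_system, time_large_system):
    # Same amount of data on both systems.
    # Ideal (linear) speed-up equals the factor by which resources grew.
    return time_small_system / time_large_system

def scale_up(time_small_job_small_system, time_big_job_big_system):
    # Data and resources are both grown by the same factor.
    # Ideal scale-up is 1.0, meaning elapsed time stayed constant.
    return time_small_job_small_system / time_big_job_big_system
```

A job that takes 100 seconds on 1 machine and 25 seconds on 4 machines has a speed-up of 4, which is linear. If 4 times the data on 4 machines still takes 100 seconds, the scale-up is 1.0.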
When is scale up used in parallel databases?
1) To implement parallelism in databases for faster processing.
2) To have the same performance levels when workloads increase.
3) To break the processing in a sequential manner.
Answer: 2) To have the same performance levels when workloads increase.
Shared Memory (SMP) means
Multiple CPUs run in parallel but share the same memory space.
Shared Disk
In the shared disk architecture, you have multiple CPUs and
each one has its own memory space, but they all share the same disks.
Shared Nothing
In the shared nothing architecture, each CPU has its own memory space and also its own secondary storage
How do machines communicate in Shared Nothing?
The only way the machines communicate with each other is through the network
Advantage of Shared Memory
Easy to program
2 Disadvantages of Shared Memory
1) expensive to build
2) Difficult to scale
2 Advantages of Shared Nothing
1) cheaper to build
2) easier to scale up
Disadvantage of Shared Nothing
Harder to program
Intra-operator Parallelism
Get all machines working to compute a given operation
(e.g. scan, sort, join)
Inter-operator Parallelism
each operator may run concurrently on a different site
exploits pipelining
Inter-query Parallelism
different queries run on different sites
3 Types of data partitioning
1) Range
2) Hash
3) Round Robin
Range Partitioning means
Assigning each tuple to a machine based on which range its value falls in, and doing the processing on that machine (partitioning follows the logical sort order of the data, e.g. by age)
Hash Partitioning means
Hash partitioning runs a hash function on the partitioning attribute,
and the hash function decides which partition each tuple,
or record, in the table is assigned to.
Round Robin Partitioning means
The first row in the table is assigned to the first partition.
The second row is assigned to the second partition, and so on, wrapping back to the first partition after the last one.
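The three schemes can be sketched as small functions that map a tuple to a partition number. All names here are illustrative, and `zlib.crc32` is only a stand-in for whatever hash function a real system uses (chosen because it gives the same result on every run).

```python
import bisect
import zlib

def range_partition(key, split_points):
    # split_points must be sorted. Keys <= split_points[0] go to
    # partition 0, keys <= split_points[1] go to partition 1, and so on.
    return bisect.bisect_left(split_points, key)

def hash_partition(key, n_parts):
    # The hash of the partitioning attribute decides the partition.
    return zlib.crc32(str(key).encode()) % n_parts

def round_robin_partition(row_index, n_parts):
    # Row 0 -> partition 0, row 1 -> partition 1, ..., then wrap around.
    return row_index % n_parts
```

With ages split at `[30, 60]`, `range_partition(45, [30, 60])` sends a 45-year-old to partition 1, while round robin ignores the values and only looks at the row position.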
3 Steps of Parallel Sorting
1) Scan in parallel and range-partition as you go (on the sort attribute)
2) As tuples come in, begin "local" sorting on each machine
3) Resulting data is sorted and range-partitioned
Parallel Sorting Problem
skew!
Some partitions will have more data than others, unbalanced load
Parallel Sorting Solution:
Sample the data at the start to determine partition points (find the data distribution so the data can be spread evenly across partitions)
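The sorting steps and the sampling fix can be sketched together in one process. This is an illustration only: the function names, `sample_size`, and `seed` are assumptions, the input is assumed to be a non-empty list, and the partitions are plain lists standing in for separate machines.

```python
import bisect
import random

def pick_split_points(data, n_parts, sample_size=100, seed=0):
    # Sample the data up front and use evenly spaced sample values
    # as partition boundaries, so skewed data still spreads out.
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    return [sample[len(sample) * i // n_parts] for i in range(1, n_parts)]

def parallel_sort(data, n_parts):
    splits = pick_split_points(data, n_parts)
    parts = [[] for _ in range(n_parts)]
    for x in data:
        # Step 1: scan and range-partition on the sort attribute.
        parts[bisect.bisect_left(splits, x)].append(x)
    for p in parts:
        # Step 2: "local" sort on each partition.
        p.sort()
    # Step 3: the result is sorted and range-partitioned.
    return parts
```

Reading the partitions in order gives the fully sorted data, because every value in one partition is no larger than every value in the next.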