Distributed databases Flashcards

1
Q

What is the motivation for distributed databases?

A
  • optimises queries
  • Say you make a query about sales in Manchester, you don’t need to specify I want the table for Manchester then I want to make this query on it, give me the output of that one
  • Instead you’re saying I want the output for here and the DBSM will then get all of this information and pass you the output for the query you requested.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What does each site store data about in distributed databases?

A
  • Each site stores only data primarily relevant to it
  • Distributed DBMS provide access to data at all sites. One site needs to be able to access information from another site.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

When are distributed databases needed?

A

More general: large organisations/companies
◦ …with different branches or offices or sub-companies
◦ …or simply so large that one computer can’t handle all the request/transactions you want to do, so distributed databases can handle this.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Why are distributed databases useful for providing access to large datasets to many users?

A

◦ Distribute data over several computers –don’t have to be identical in software or hardware
◦ Computers could be at geographically different physical locations (but also in the same place) depending on what system you’re dealing with.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the advantages of using distributed databases?

A

◦ Balance workload & network traffic, handle multiple queries simultaneously, twice as many operations executed by having two sites.
◦ Easier to extend capacity or scale to higher number of users, just plug in more hardware.
◦ If there’s a physical damage to one site the other locations remain undamaged, for example a fire occurs.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the formal definition of Distributed Databases

A

◦ Collection of multiple logically interrelated databases that are distributed over a computer network.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What does DDBMS stand for?

A
  • a distributed database management system
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

what does a graph representing a DDBMS contain?

A

-different sites are known as nodes - they correspond to where the database is stored.
- lines connecting them represent network links. In general you may not have network links between every pair.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are the advantages of DDBMS?

A
  • Performance improvements (quicker querying as each DB is handling less users)
  • Scalability (Easier extension of the system capacity/performance)
  • Resilience (Data can be replicated at geographically separate sites. Catastrophic failures don’t affect the entire system)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are the advantages of DDBMSs?

A
  • You need more computer hardware.
  • It’s more complicated
  • It doesn’t scale perfectly.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are the two methods of transparency in DDBMS?

A

Fragmentation and replication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

what is transparency in DDBMS?

A

Keeps data hidden from people accessing the database

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is fragmentation?

A
  • Taking a database and splitting it into different parts to store it at different nodes#
  • two types horizontal and vertical fragmentation
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Describe horizontal fragmentation

A
  • Take some rows out and then you store these rows at different sites
  • so you might have one bunch of rows stored on one side and another bunch of rows at another site
  • The sets don’t have to be disjoint, but it’s fine either way.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How do you get the original table back after doing horizontal fragmentation?

A
  • You take the union of these fragments
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Describe vertical fragmentation

A
  • Start with a table and we cut different columns out and store them at different sites.
  • With this method the tables should typically have some overlap because the method for restoring the original table is by taking the natural join of all these columns.
  • A good/common method for vertical fragmentation restoration is that all the sites have an overlap of one specific key attribute.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How is the database transparent due to using fragmentation?

A

The user doesn’t actually see all these fragmentations, they just see the full relation when they query R because the DBMS puts all the fragmentations back together

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Which methods for transparency are the most commonly used?

A

Fragmentation and replication transparency

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is typical with fragmentation so a DBMS can put all the fragments back together?

A

Typically tuples are stored at a particular site according to a common value of a specific attribute
- so all rows that have a as their type may be stored at the site in Liverpool

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What is special about fragmentation?

A
  • You can mix vertical and horizontal fragmentation
  • You can start with vertical then mix in horizontal, or the other way around and do it multiple times
  • it’s most common to only do a few steps of this because whenever you do a vertical fragmentation you have fewer columns, so you can only split it so many times.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Why does redundancy improve resilience?

A
  • If we keep multiple copies of the same item, also known as application transparency
  • This improves resilience in case of failures
    Example: we have a distributed db: we run a query about sales in manchester, but if the manchester branch fails we can’t run the query:
  • If we have other sites keep copies of the fragments stored at Manchester, this allows us to answer queries involving data from there.
  • So it allows us to answer the queries of data from systems that have failed by storing copies
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Why does redundancy improve efficiency?

A

Example: we have a query about suppliers, if other sites keep copies of data about the suppliers, then we may be able to execute it faster by getting different parts of the query from different sites.
- Allows stores to answer queries involving suppliers without establishing a connection to the central office

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

How is replication used to provide transparency?

A
  • how many copies you have in your distributed database
  • Controls how many sites keep a copy of a fragment
24
Q

What is full replication?

A

Where a copy of the whole database is stored at every site.

25
Q

What is No replication?

A

◦ Each fragment is stored at a unique site

26
Q

What are the advantages of Full replication?

A

◦ Very faster query answering, because you can connect to a random server and that server will have everything you need to answer the query
- don’t need to wait for a particular server to be available
◦ Very slow updates: because every time there is an update, you have to update every single server

27
Q

What are the advantages of No replication?

A

◦ Updates are very fast as you’re only updating one place.

28
Q

What are the disadvantages of No replication?

A

◦ Crashes are a big problem because then there’s no way to get the answer to a query.
◦ Queries are slower however because if you need information stored on another server you must wait to for this information

29
Q

What is partial replication?

A

◦ Limit number of copies of each fragment, you may not be able to answer every query.
◦ Replicate only some fragments, so you may choose to replicate the most important fragments or the most frequently accessed fragments

30
Q

What are the four levels of transparency in order of highest to lowest?

A

Fragmentation transparency (Highest level)
◦ is transparent to users
◦ Users pose queries against the entire database
◦ The distributed DBMS translates this into a query plan that fetches the required information from appropriate nodes

Replication transparency
◦ Ability to store copies of data items / fragments at different sites
◦ Replication is transparent to users

Location transparency
◦ The location where data is stored is transparent to the user, this is similar to replication transparency but it’s more specialised than fragmentation transparency.

Naming transparency
◦ You need to ensure that a given name (of a relation) has the same meaning everywhere in the system, so for each instance of a relation the name must be the same.

31
Q

What is the issue of Concurrency control in DDBMS

A

Concurrency control is the part for ensuring ACID properties of Isolation and consistency are retained. One way of doing this is using locks for full isolation/consistency.

32
Q

What is one way of handling locks in for concurrency control in DDBMS?

A

To have one master computer doing all the locks so this one computer determines should I grant this request for lock or not.

33
Q

What are the two issues with having a master computer dish out locks for concurrency control in DBMSs

A

1: if master computer fails then you’ll need to restart the entire system and all running
transactions because you don’t know who had the locks when this computer failed.
- Can have a backup system running. If this is in place we won’t need to restart everything. You must keep the primary and backup computers synchronised or you can have inaccurate backup info. However synchronisation is very expensive

  1. If too many transactions require locks at the same time it will make it too difficult for one master computer to handle.
    - One way of dealing with too many transactions is to have more computers, each being the authority of a different item (locks)
    - However it may not be clear anymore who you need to ask to get a lock.
34
Q

How is Concurrency control based on voting carried out?

A

◦ Each site with a copy of an item has a local lock it can grant transactions for that item
◦ If a transaction gets over half the local locks for an item, it has a global lock on the item
◦ If so it must tell the sites with a copy that it has the lock
◦ If it takes too long to get/announce it has the lock, it must stop trying to get the lock and abort this part.

35
Q

What is the advantage of Concurrency control based on voting ?

A

Much more distributed than the non-voting approach as it doesn’t matter how many go down because if at least half are running you can still access each item

36
Q

What is the disadvantage of Concurrency control based on voting ?

A

Requires much more communication between computers in networks and this takes a long time

37
Q

What is recovery responsible for handling in ACID?

A

atomicity and durability

38
Q

How do we do a global transaction (a transaction that uses information from multiple sites)

Example query: at central office you want to figure out how much inventory there is for some product x at each site, then move this product between the different stores to balance inventory:

A
  • Start our local transaction at the central office T0
  • Then we instruct the other sites to start local transactions 1,2 & 3
  • These smaller transactions, 1,2 & 3 then figure out how much inventory they have on site and send this information back to T0
  • T0 then determines how to move the product between the different sites, and tells them how to move the product (not physically but how the database is changed)
39
Q

What is the difference between atomicity and global atomicity?

A

◦ Can assume Atomicity is enforced at each node locally, so can say that T1 is either fully executed or not executed at all
◦ Could be violated globally
At individual site, atomicity could be satisfied but this can’t ensure overall because T1 and T3 could work but T2 could fail and have to be rolled back, but this would then mean that T1 and T3 even despite being run successfully also have to be rolled back because we want the global transactions to be fully executed or not executed at all

40
Q

What is Distributed Commit

A

A protocol designed to ensure that either everybody commits or no-one commits

41
Q

What is the Phase Commit Protocol (different to the two phase lock)

A
  • This protocols job is to commit actions globally
  • We have a designated node that decides if/when local transactions can commit
  • We do logging at each node locally and you also lock the messages that get sent from other nodes and the ones we send to other nodes
42
Q

What is the first phase of the Phase Commit Protocol?

A

Phase 1: Decide when to commit or abort
- Coordinator sends “prepare T” to all involved nodes.
- Each node decides if right now they are ready to commit or not.
- If a node is ready it goes into the pre-committed state, sends “ready T” to coordinator
- When in a pre-committed state, you’re not allowed to abort
- If a node isn’t ready to commit, sends “don’t commit T” and aborts the local transaction
- If one node aborts then all the other nodes must abort too
- Timeout is used in the case of a delay that states if a node doesn’t answer within this time then default to failed and abort.

43
Q

What is the second phase of the Phase Commit Protocol?

A

Coordinator waits for responses of the nodes
- assumes a timeout means a node wishes to abort
- If node responds “don’t commit” or there’s a timeout, Coordinator sends “abort T” to all nodes.
- If every node responds with “Ready T”, every node has decided to commit
- Coordinator sends “commit T” to every node

44
Q

What is the first logging phase of the Phase Commit Protocol?

A
  • Before coordinator sends “prepare T” you write it in log file i buffer
  • If a node sends “ready T” then it must first enter the pre-committed state, you have to make sure all log entries are written to disk
  • If we see that the last entry in the log is ‘Don’t commit T’ then we know we wanted to abort, so we just abort afterwards.
45
Q

What is the second logging phase of the Phase Commit Protocol?

A
  • Two cases, either all nodes are committing or aborting
  • In both cases we write to the log file what we’re doing, either COMMIT T or ABORT T
  • If COMMIT T is the last thing in the log file and there’s a failure then you should redo the transactions both locally and other places as well
46
Q

What is the three phases commit protocol?

A
  • Fixes a small issue of Two Phase Commit
  • ensures DDBMS are consistent and reliable
47
Q

What are the three phases of the Three-Phase Commit Protocol?

A
  • the prepare phase
  • the commit phase
  • the finish phase
48
Q

Describe the prepare phase in the three phase commit protocol

A
  • each node involved writes its changes to a log file in buffer and sends <ready> to the coordinator</ready>
49
Q

Describe the commit phase in the three phase commit protocol

A
  • if all nodes involved send <ready> the coordinator sends <commit> to all nodes</commit></ready>
50
Q

Describe the finish phase in the three phase commit protocol

A
  • each node that receives <commit> writes changes to the disk and sends ACK to confirm success</commit>
  • if at this point a node does not send an ACK, coordinator sends <abort> transaction is rolled back</abort>
51
Q

What problem does 3 phase commit solve in 2 phase commit?

A
  • If in phase 2, the coordinator and some transaction crash, while everybody else is in the pre-committed state, we have the problem that nothing can be done until the coordinator or this crashed transaction recover
  • We can either try to abort all transactions or try to commit all transactions but this doesn’t work.
  • If we try to get all transactions to commit, then the crashed transaction might be in the middle of or already have aborted and this can break durability
  • If the durability of a database is compromised, it may be difficult or impossible to recover lost data, which can lead to data loss, corruption
  • And if everyone tried to abort (apart from the crashed nodes) the issue could then be that the crashed transaction/s might have previously been told to commit and they might have done so but when they come back online they have to rollback and we’re not allowed to rollback committed transactions, so we’ll be breaking durability
    In either case we’re breaking durability so we have to:
    Leave everything in the pre-committed phase with all the items staying locked, this is extremely expensive:
52
Q

What is an example of doing a naive join in a distributed database?

A

-we have two relations each placed at different sites, R stored at A and B stores S
- At B, we make the query of R natural join S
- have to send over all of R from site A
-the connection between the two sites might be very slow, so the more data we have to send over the longer this is going to take. So we want to minimise the data we’re transferring by only sending what we need.

53
Q

What is a semi join if R on S?

A

R semijoin S is the set of all tuples in R that NATURAL JOIN at least one tuple in S

54
Q

How do you compute a natural join between two sites in a DDBMS storing a table at each site

A
  • So we have site A storing R and site B storing S and we want to compute the left semijoin at site A and then send this over to site B
  • So we send over the distinct common attributes of R and S from S to site A from B
  • then site A computes the left semijoin of R on S = R’ and we send this back to site B. These are the relevant tuples needed.
  • It then sends these tuples back over to site B where it can compute R’ natural join S

Runtime cost:
Costs time of first sending S’ times however big each tuple in S’ is
+ R’ times however big each tuple in R’ is

55
Q

How does a semijoin reduction improve efficiency of joins in DDBMS?

A
  • So is this more efficient than the other method of just sending the entire table over.
  • It depends on if for instance the projection(distinct) is much smaller than the full relation, so many duplicates get eliminated during the process then this can be more efficient
  • It can also be that the size of the semi join is much smaller than the size of R and again this will typically be more efficient
  • In general, the size of the common attribute in S + the size of the semijoin R on S should be smaller than R and in that case, it’s more efficient than sending over all of S