Databases Flashcards
Databases (43 cards)
What is an efficient method to jointly index multiple attributes in a database, and how does it improve query performance?
An efficient method is a composite index: a single index created over multiple columns, for example (Department, Salary). It improves performance by reducing the query's search space, so irrelevant rows do not have to be scanned. Placing the most frequently queried attribute first in the index speeds queries up even more, and multiple attributes can now be accessed together. (A sketch of the idea follows below.)
The trade-off is increased storage space and redundancy; updates and insertions become slower because the data has to be changed in multiple places.
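To make the idea concrete, here is a minimal Python sketch (illustrative table and names, not any real DBMS API) of a composite index on (Department, Salary): the index is a sorted list of key tuples, so a query on both attributes is answered with a binary search rather than a full scan.

```python
import bisect

# Illustrative rows: (Department, Salary, EmployeeName)
rows = [
    ("Engineering", 70000, "Avery"),
    ("Engineering", 85000, "Blake"),
    ("Sales", 50000, "Casey"),
    ("Sales", 62000, "Drew"),
]

# Composite index on (Department, Salary): sorted key tuples plus row positions.
index = sorted((dept, salary, row_id) for row_id, (dept, salary, _) in enumerate(rows))
keys = [(dept, salary) for dept, salary, _ in index]

def find_by_dept_and_salary_range(dept, low, high):
    """Use the composite index to fetch matching rows without scanning the whole table."""
    start = bisect.bisect_left(keys, (dept, low))
    end = bisect.bisect_right(keys, (dept, high))
    return [rows[index[pos][2]] for pos in range(start, end)]

print(find_by_dept_and_salary_range("Engineering", 60000, 90000))
# [('Engineering', 70000, 'Avery'), ('Engineering', 85000, 'Blake')]
```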
What indexing strategy can be used to jointly index attributes
A1, A2, …, An
for efficient queries involving selection criteria on multiple attributes?
Use a composite index that combines attributes A1, A2, …, An into a single index. You can also build separate indexes for the most frequently queried combinations, such as (A1, A2, A3). The index is most effective when the most frequently queried attribute comes first, or when a query involves several of the indexed attributes. Search times are quicker, but the index uses more storage and updates become slower because multiple index entries need to be maintained.
In hashing what is a collision?
A collision occurs when hashing two different keys produces the same index in the hash table.
Why can the simple collision resolution policy (CRP) result in poor search performance?
Collisions can occur due to a badly designed hash function.
Hash collisions increase the time it takes to search, insert and delete.
This is because when a collision occurs, the algorithm has to probe for the next available slot, which adds extra steps.
Collisions also create clusters of data that degrade performance: when a key hashes into one of these clusters, the probe has to walk past the cluster before it can insert or find an item.
Hash collisions therefore create an uneven distribution of the data.
- What is Linear Probing?
And why do buckets need to be marked as tombstones when items are deleted from them?
Linear probing is a collision resolution strategy used in hashing. When two keys hash to the same index, linear probing resolves the collision by scanning forward for the next available slot (bucket) in the hash table, wrapping around to the start when it reaches the end of the table. It is prone to clustering: where data is grouped together it takes longer to find a free slot, which increases the time to insert and search.
Buckets need to be marked with tombstones when items are deleted from them so that a later search does not stop early at an apparently empty slot and conclude the key is missing; the value being searched for might lie beyond the tombstoned bucket. (A sketch follows below.)
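Below is a minimal sketch of linear probing with tombstone deletion, assuming a small fixed-size table; the class and sentinel names are illustrative. Note how search() stops at a never-used slot but keeps probing past a tombstone.

```python
EMPTY = object()      # slot never used
TOMBSTONE = object()  # slot whose item was deleted

class LinearProbingTable:
    def __init__(self, size=8):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indices starting at the hash position, wrapping around."""
        start = hash(key) % len(self.slots)
        for offset in range(len(self.slots)):
            yield (start + offset) % len(self.slots)

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is TOMBSTONE:
                self.slots[i] = key
                return
        raise RuntimeError("table is full")

    def search(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:      # never-used slot: the key cannot be further on
                return False
            if self.slots[i] is TOMBSTONE:  # deleted slot: keep probing past it
                continue
            if self.slots[i] == key:
                return True
        return False

    def delete(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return
            if self.slots[i] == key:
                self.slots[i] = TOMBSTONE   # mark, don't empty, so searches keep going
                return

table = LinearProbingTable()
for k in (1, 9, 17):      # all three hash to slot 1 in a table of size 8
    table.insert(k)
table.delete(9)
print(table.search(17))   # True: the tombstone lets the probe continue past slot 2
```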
- What is Dynamic About Extendible and Linear Hashing?
Dynamic hashing techniques such as extendible hashing and linear hashing adapt the hash table's size dynamically to handle a growing number of keys. Unlike static hashing, where the hash table size is fixed, these methods grow or shrink the hash table as needed.
How do extendible and linear hashing differ?
Extendible hashing:
uses a directory to map hash keys to buckets.
The directory doubles in size when a bucket overflows and its local depth already equals the global depth.
Only the overflowing bucket splits.
Requires more memory for the directory but allows fast access.
Linear hashing:
does not use a directory.
Expands the hash table incrementally, splitting buckets one at a time in sequence when an overflow occurs.
Offers a more gradual growth process, consuming less memory during reorganization. (A linear hashing sketch follows below.)
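To make the contrast concrete, here is a minimal Python sketch of linear hashing, showing the incremental, directory-less growth described above: buckets are split one at a time, in order, as overflows occur. The class name, bucket capacity and split trigger are simplifying assumptions, and lookup/delete are omitted.

```python
class LinearHashTable:
    """Sketch of linear hashing: the table grows one bucket at a time, no directory."""

    def __init__(self, n_initial=4, capacity=2):
        self.n0 = n_initial                      # number of buckets at level 0
        self.level = 0                           # current splitting round
        self.next_to_split = 0                   # next bucket to split
        self.capacity = capacity                 # keys per bucket before a split is triggered
        self.buckets = [[] for _ in range(n_initial)]

    def _index(self, key):
        i = hash(key) % (self.n0 * 2 ** self.level)
        if i < self.next_to_split:               # bucket already split this round
            i = hash(key) % (self.n0 * 2 ** (self.level + 1))
        return i

    def insert(self, key):
        bucket = self.buckets[self._index(key)]
        bucket.append(key)
        if len(bucket) > self.capacity:          # overflow triggers exactly one split
            self._split_next_bucket()

    def _split_next_bucket(self):
        # Split only the bucket the split pointer points at (not necessarily the one
        # that overflowed), redistributing its keys with the next hash function.
        self.buckets.append([])
        old_keys = self.buckets[self.next_to_split]
        self.buckets[self.next_to_split] = []
        new_modulus = self.n0 * 2 ** (self.level + 1)
        for key in old_keys:
            self.buckets[hash(key) % new_modulus].append(key)
        self.next_to_split += 1
        if self.next_to_split == self.n0 * 2 ** self.level:
            self.level += 1                      # every bucket has split: new round
            self.next_to_split = 0

table = LinearHashTable()
for key in range(20):
    table.insert(key)
print(len(table.buckets))   # the table has grown incrementally, with no directory
```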
Outline an algorithm for deleting an item from a linearly hashed file.
Deleting an item from a linearly hashed file requires careful handling to maintain the table's integrity and the sequence of the buckets so that later lookups remain correct.
Steps to delete
Use the hash function to locate the bucket.
Search for the item in that bucket (and any overflow chain).
If found, remove it and mark the slot as a tombstone so that subsequent searches still work.
Describe the structure of a B+ tree.
A B+ tree is a self-balancing tree extended from the B tree. All values are stored in the leaf nodes, while the internal nodes are used only for indexing and traversal.
Root Node:
Contains pointers to child nodes
Internal nodes:
Contain keys and pointers to child nodes.
Serve to shrink the search space.
Leaf Nodes:
Store the actual data (or pointers to the records).
Linked together in a doubly linked list to allow traversal in both directions.
The keys in the leaf nodes are in sorted order
Properties of a B+ tree:
Balanced structure
Efficient range queries
Search efficiency (a structural sketch follows below)
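The following is a minimal structural sketch of these three kinds of node using Python dataclasses; the field names are assumptions for illustration, not taken from any particular implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LeafNode:
    keys: List[int] = field(default_factory=list)        # kept in sorted order
    values: List[object] = field(default_factory=list)   # the actual data (or record pointers)
    next_leaf: Optional["LeafNode"] = None                # sibling links used for range scans
    prev_leaf: Optional["LeafNode"] = None

@dataclass
class InternalNode:
    keys: List[int] = field(default_factory=list)         # separator keys only, no data
    children: List[object] = field(default_factory=list)  # child nodes (one more than keys)

# The root is simply whichever node is currently at the top:
# a LeafNode while the tree is small, an InternalNode once it has split.
```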
In relation to a B+ tree, write high-level pseudocode for returning all values of Ai for some defined range.
- Start at the root of the B+ tree.
- Traverse down the tree:
- Follow the correct child pointers in internal nodes based on the lower bound of the range (low).
- Stop when a leaf node containing keys near “low” is reached.
- Scan the leaf nodes:
- Collect all keys in the current leaf that fall within the range [low, high].
- If the current leaf contains a key greater than “high,” stop.
- Move to the next leaf node if needed:
- Use the linked list of leaf nodes to continue scanning for more keys within the range.
- Stop when all keys up to “high” have been collected.
- Return the list of keys found within the range. (A concrete version follows below.)
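As a concrete (simplified) version of the pseudocode above, here is a Python sketch that walks down to the leaf holding the lower bound and then follows the leaf links; the tiny hand-built tree and the class names are illustrative assumptions.

```python
class Leaf:
    def __init__(self, keys, values, next_leaf=None):
        self.keys, self.values, self.next_leaf = keys, values, next_leaf

class Internal:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children

def range_query(root, low, high):
    """Return all (key, value) pairs with low <= key <= high."""
    # 1. Traverse down to the leaf that could contain 'low'.
    node = root
    while isinstance(node, Internal):
        i = 0
        while i < len(node.keys) and low >= node.keys[i]:
            i += 1
        node = node.children[i]
    # 2. Scan leaves left to right via the sibling links.
    results = []
    while node is not None:
        for key, value in zip(node.keys, node.values):
            if key > high:
                return results              # past the upper bound: stop
            if key >= low:
                results.append((key, value))
        node = node.next_leaf
    return results

# A tiny hand-built two-level tree: leaves [1,3], [5,7], [9,11] with separators 5 and 9.
leaf3 = Leaf([9, 11], ["i", "k"])
leaf2 = Leaf([5, 7], ["e", "g"], next_leaf=leaf3)
leaf1 = Leaf([1, 3], ["a", "c"], next_leaf=leaf2)
root = Internal([5, 9], [leaf1, leaf2, leaf3])
print(range_query(root, 3, 9))   # [(3, 'c'), (5, 'e'), (7, 'g'), (9, 'i')]
```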
What are the motivations for using a dynamic hashing approach?
Also name any approach to hashing a dynamic file
Dynamic hashing can adapt the hash table size at run time, handling growing datasets, as opposed to the fixed table sizes used in static hashing.
Space efficiency
Adaptability
Minimized overhead, since it reduces the manual intervention needed to resize or rehash.
Type of Dynamic hashing:
Linear hashing is one such approach (extendible hashing is another).
Explain what the algorithm for the B+ tree does to insert values.
First, locate the leaf node:
Start at the root and traverse down to the appropriate leaf node where the key should be inserted.
Use the internal nodes to decide which child pointer to follow.
Insert into the leaf node:
Insert the key and value in sorted order within the leaf node if there is space.
If there is no space, the node overflows:
Split the node into two nodes.
Copy the middle key up to the parent (for an internal-node split the middle key is moved up rather than copied).
If the parent overflows, recursively split it and propagate the middle key upwards. (A simplified leaf-split sketch follows below.)
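Below is a deliberately simplified sketch of the leaf-level part of this algorithm, assuming a leaf holds at most ORDER keys: it does the sorted insert and, on overflow, one split, returning the separator key that the caller would push up to the parent. Recursive internal-node splits are omitted and all names are illustrative.

```python
import bisect

ORDER = 4  # assumed maximum number of keys a leaf may hold

def insert_into_leaf(leaf_keys, leaf_values, key, value):
    """Insert (key, value) into a sorted leaf; split the leaf if it overflows.

    Returns (new_right_leaf, separator_key) when a split occurs, otherwise (None, None).
    The separator is the smallest key of the new right leaf, which the caller would
    copy up into the parent internal node.
    """
    pos = bisect.bisect_left(leaf_keys, key)
    leaf_keys.insert(pos, key)
    leaf_values.insert(pos, value)

    if len(leaf_keys) <= ORDER:
        return None, None                      # still fits: no split needed

    mid = len(leaf_keys) // 2                  # split roughly in half
    right = (leaf_keys[mid:], leaf_values[mid:])
    del leaf_keys[mid:]
    del leaf_values[mid:]
    return right, right[0][0]

keys, values = [10, 20, 30, 40], ["a", "b", "c", "d"]
print(insert_into_leaf(keys, values, 25, "x"))
# (([25, 30, 40], ['x', 'c', 'd']), 25): the leaf split and 25 is copied up to the parent
print(keys)   # [10, 20]  -- the original leaf keeps the left half
```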
Explain what is meant by the lost update problem and show with an example how a problem could arise
A lost update problem occurs when two or more transactions modify a single piece of shared data without proper synchronization. It typically happens when there is a lack of concurrency control:
multiple transactions read a value and then update it based on the old value they read. (A minimal sketch follows below.)
T1 reads 100
T2 reads 100
T1 adds 85 and writes 185
T2 adds 50 and writes 150, overwriting T1's change. Instead of both updates going through, only one survives and the other is lost (a serial execution would end at 235).
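A minimal sketch of this schedule in plain Python (sequential code standing in for the interleaving; no real transaction manager is involved):

```python
balance = 100

# Both "transactions" read before either writes.
t1_read = balance          # T1 reads 100
t2_read = balance          # T2 reads 100

balance = t1_read + 85     # T1 writes 185
balance = t2_read + 50     # T2 writes 150, clobbering T1's update

print(balance)             # 150, not the expected 235: T1's update is lost
```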
What is Two-Phase Locking (2PL)? With an example of transactions, show how a schedule would proceed under 2PL.
2PL is a concurrency control protocol.
2 Phases
Growing phase:
The transaction acquires the locks on the data items it needs.
No lock is released during this phase.
Shrinking phase:
The transaction releases its locks as it finishes using the items.
Once a transaction has released a lock it cannot acquire any new locks.
Transactions must acquire shared or exclusive locks on data items in order to read or write them. The locks are then released, after which other transactions can access the items for reading and writing.
T1 starts: Acquires an exclusive lock on the value (100).
Reads the value (100).
Adds 85 → New value is 185.
Writes 185.
Releases the lock.
T2 starts (it blocks until T1 releases the lock):
Acquires an exclusive lock on the value (now 185).
Reads the value (185).
Adds 50 → New value is 235.
Writes 235.
Releases the lock. (A threading sketch of this schedule follows below.)
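Here is a minimal sketch of the same two transactions under (strict) two-phase locking, using Python's threading module as a stand-in for a lock manager; the account value and names are illustrative.

```python
import threading

balance = 100                      # shared data item
balance_lock = threading.Lock()    # exclusive lock on that item

def transaction(amount):
    global balance
    # Growing phase: acquire every lock the transaction needs before releasing any.
    with balance_lock:
        current = balance          # read under the exclusive lock
        balance = current + amount # write under the exclusive lock
    # Shrinking phase: the lock is released on leaving the with-block;
    # no new locks may be acquired from this point on.

t1 = threading.Thread(target=transaction, args=(85,))
t2 = threading.Thread(target=transaction, args=(50,))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)   # always 235: the lost update from the previous card cannot happen
```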
Prove that Two Phase Locking guarantees Conflict-serializability.
In 2PL there are two phases:
Growing phase:
The transaction acquires all required locks without releasing any.
Shrinking phase: once a transaction releases a lock it cannot acquire any new locks.
This two-phase discipline imposes a partial order on transactions based on the order in which they acquire locks (their lock points).
That partial order prevents cycles in the precedence/conflict graph, and being acyclic (containing no cycles) is exactly the condition for a schedule to be conflict-serializable.
The locking creates a dependency: T2 can only proceed once T1 has released its lock, so every edge in the dependency graph follows the lock-point order.
This yields a dependency graph in which cycles are not possible.
Explain database recovery including system log, commit points, checkpoints.
Database recovery is returning the database to a correct, consistent state after a failure such as a power cut or crash.
Recovery ensures all committed transactions are preserved (REDO) and all incomplete transactions are rolled back (UNDO).
General Approach to recovery
1 System Log:
It records
When a transaction starts
What data is being changed (its before value and after value).
What operations have been done
When a transaction finishes
System log helps the database fix errors if it crashes by undoing or redoing changes
2 Commit point:
This is the moment when a transaction is officially finished
Once a transaction reaches its commit point, all its changes are permanent and even if there is a system crash the changes are safe.
3 Checkpoint:
A save point for the database
It helps speed up recovery after a crash by reducing how much of the system log needs to be checked
Steps to create checkpoint:
Pause all current transactions
save all unfinished changes to disk
write a checkpoint marker in the system log
Resume the paused transactions.
After a crash the database can start recovery from the last checkpoint instead of replaying the entire log. (A minimal recovery sketch follows below.)
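A minimal sketch of log-based recovery, assuming an in-memory list of (transaction, action, item, before, after) records rather than a real DBMS log format: committed transactions are redone in forward order and uncommitted ones undone in reverse order.

```python
log = [
    ("T1", "start", None, None, None),
    ("T1", "write", "X", 100, 185),
    ("T2", "start", None, None, None),
    ("T2", "write", "Y", 200, 250),
    ("T1", "commit", None, None, None),
    # crash happens here: T2 never reached its commit point
]

def recover(database, log):
    committed = {rec[0] for rec in log if rec[1] == "commit"}
    # REDO the writes of committed transactions, in forward order.
    for txn, action, item, before, after in log:
        if action == "write" and txn in committed:
            database[item] = after
    # UNDO the writes of uncommitted transactions, in reverse order.
    for txn, action, item, before, after in reversed(log):
        if action == "write" and txn not in committed:
            database[item] = before
    return database

db = {"X": 100, "Y": 250}    # Y still reflects T2's uncommitted write on disk
print(recover(db, log))       # {'X': 185, 'Y': 200}
```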
Giving an example, explain what is meant by a temporary update problem.
This happens when one transaction updates a value and another transaction reads that value before it is committed. If the first transaction is rolled back, the second transaction is working with an incorrect value that it originally read from T1.
Example
T1 Reads 100 and +50
T2 Reads 150 and +50
T1 Rolls back
T2 Outputs 200
This result is wrong; reading the uncommitted value is known as a dirty read.
Define the term minimal cover set and outline an algorithm for finding a minimal cover set of a set of functional dependencies.
Hence find the minimal cover set of A->B, C -> B, D -> ABC, AC ->D.
If F is the original set of functional dependencies, then a minimal cover is a simplified, equivalent set of dependencies that satisfies these criteria:
1. Every dependency in the set has a single attribute on the right-hand side.
2. No left-hand side contains a redundant (extraneous) attribute.
3. No functional dependency in the set is redundant, i.e. derivable from the others (for example, A->C is redundant given A->B and B->C). (A worked sketch for the given FDs follows below.)
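To make the algorithm concrete, here is a small Python sketch of the standard attribute-closure routine and the three minimal-cover steps, applied to the FDs from the question. The function names are illustrative; with this removal order the result is A->B, C->B, D->A, D->C, AC->D (a minimal cover is not necessarily unique, so other removal orders can give a different but equivalent answer).

```python
def closure(attrs, fds):
    """Closure of a set of attributes under FDs given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def minimal_cover(fds):
    # Step 1: a single attribute on every right-hand side.
    fds = [(lhs, a) for lhs, rhs in fds for a in rhs]
    # Step 2: drop extraneous attributes from left-hand sides.
    reduced = []
    for lhs, rhs in fds:
        lhs = set(lhs)
        for a in sorted(lhs):
            if len(lhs) > 1 and rhs in closure(lhs - {a}, fds):
                lhs = lhs - {a}          # 'a' is extraneous on this LHS
        reduced.append((frozenset(lhs), rhs))
    fds = reduced
    # Step 3: drop whole FDs that the remaining ones already imply.
    minimal = list(fds)
    for fd in fds:
        rest = [f for f in minimal if f != fd]
        if fd[1] in closure(fd[0], rest):
            minimal = rest               # fd is redundant
    return minimal

# The FDs from the question: A->B, C->B, D->ABC, AC->D.
fds = [("A", "B"), ("C", "B"), ("D", "ABC"), ("AC", "D")]
for lhs, rhs in minimal_cover(fds):
    print("".join(sorted(lhs)), "->", rhs)   # A->B, C->B, D->A, D->C, AC->D
```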
Decomposing into BCNF does not guarantee dependency-preservation. Explain this term.
Dependency-preservation means that all functional dependencies in the original relation are preserved in the decomposed relations,
meaning for every X -> Y in the original relation you should be able to enforce it on one of the decomposed relations without needing to compute a join.
BCNF requires that the left-hand side of every functional dependency is a superkey.
During the decomposition some of the functional dependencies may end up split across relations, so they cannot be enforced without computing a join.
Explain what denormalization is. Outline an example. Discuss the tradeoffs and challenges. Talk about data integrity efficiency and storage costs.
Denormalization is the process of combining normalized database tables to improve query performance by reducing the number of joins.
Example:
Normalized example:
Orders(OrderID, CustomerID)
Customers(CustomerID, Name, Address)
Denormalized table:
Orders(OrderID, CustomerID, Name, Address)
Advantages:
Faster query processing
Simplified queries
Disadvantages:
Introduces data redundancy.
Harder to update and keep the database consistent when updates need to occur in multiple tables.
Increased storage because the same attribute is stored in multiple tables.
In relation to functional dependencies what is a “KEY” and how can it be found for a set of relations.
A key is the minimal set of attributes in a relation that can determine all the other attributes in the relation.
How to find the key:
Compute the closure of a set of attributes and check whether it determines all attributes in the relation.
Test for minimality by removing attributes and checking whether the closure still covers the whole set of attributes. (A closure-based sketch follows below.)
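A small Python sketch of this procedure (illustrative names; brute force over attribute subsets, so only suitable for small relations): it computes closures and keeps the minimal attribute sets whose closure is the whole relation.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of a set of attributes under FDs given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attributes, fds):
    """Return all minimal attribute sets whose closure is the whole relation."""
    attributes = set(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(sorted(attributes), size):
            if closure(subset, fds) == attributes:
                # keep only minimal sets: skip supersets of keys already found
                if not any(set(k) <= set(subset) for k in keys):
                    keys.append(subset)
    return keys

# Illustrative relation R(A, B, C, D) with the FDs from the minimal-cover card.
fds = [("A", "B"), ("C", "B"), ("D", "ABC"), ("AC", "D")]
print(candidate_keys("ABCD", fds))   # [('D',), ('A', 'C')]
```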
What are the rules for 2NF, 3NF and BCNF?
2NF: An attribute that is not part of the candidate key must be functionally dependent on the WHOLE candidate key.
3NF: No transitive dependencies.
BCNF: Every functional dependency has the left hand side as a super-key.
So if A->B and B->C hold, A is a super-key but B is not, so we need to decompose.
The decomposition is R1(A, B) and R2(B, C).
What is the incorrect summary problem in relation to concurrency control?
The incorrect summary problem occurs when a transaction reads and calculates a summary (aggregate) of a dataset while another transaction is concurrently modifying the data.
This can result in an incorrect summary of the data.
Register example:
T1 reads register 1 as $100.
T2 moves $50: register 1 becomes $100 - $50 = $50 and register 2 becomes $200 + $50 = $250.
T1 reads register 2 as $250.
T1's summary of the two registers is $350, but the true end-of-day total is $300, so the summary is incorrect.
What is timestamping, and how would a simple temporary update problem progress with timestamping?
Timestamping is a method in concurrency control that ensures transactions occur in a consistent and logical order. Every transaction gets a timestamp when it starts; this timestamp decides its priority (older transactions, with smaller timestamps, go first).
An older transaction cannot update data that a transaction with a newer timestamp has already read or written; if it tries, the older transaction is aborted.
Temporary update example
T1 $100 +$50 = $150
T2 Reads the temp value of $150
T1 Rolls back
T2 Writes the value of $150 which is incorrect.
Here, when T1 rolls back, T2 has read a value written by the aborted T1, so T2 cannot be allowed to commit that value.
Therefore T2 will be aborted and restarted.
How it Works:
Older transactions go first:
Transactions with smaller timestamps (older ones) have priority.
Rules for Conflicts:
If a newer transaction (with a larger timestamp) updates data, older transactions can’t touch that data anymore.
If an older transaction tries to update something that a newer one has already changed, the older transaction is aborted and restarted.
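To make these rules concrete, here is a minimal sketch of basic timestamp ordering in Python: each data item keeps a read timestamp and a write timestamp, and an operation that arrives "too late" causes the transaction to be aborted. Class and method names are illustrative, and restart handling is reduced to raising an exception.

```python
class TimestampOrderingItem:
    """One data item under basic timestamp ordering."""

    def __init__(self, value):
        self.value = value
        self.read_ts = 0    # largest timestamp that has read this item
        self.write_ts = 0   # largest timestamp that has written this item

    def read(self, ts):
        # A transaction may not read a value already overwritten by a newer one.
        if ts < self.write_ts:
            raise RuntimeError(f"abort transaction {ts}: item written by newer txn")
        self.read_ts = max(self.read_ts, ts)
        return self.value

    def write(self, ts, new_value):
        # A transaction may not overwrite data a newer transaction has read or written.
        if ts < self.read_ts or ts < self.write_ts:
            raise RuntimeError(f"abort transaction {ts}: newer txn already used item")
        self.value = new_value
        self.write_ts = ts

# T1 (timestamp 1) and T2 (timestamp 2) both try to update the same balance.
item = TimestampOrderingItem(100)
v = item.read(2)            # T2 (newer) reads 100
item.write(2, v + 50)       # T2 writes 150
try:
    item.write(1, item.read(1) + 85)   # T1 (older) arrives late and is aborted
except RuntimeError as e:
    print(e)                # abort transaction 1: item written by newer txn
```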