Advanced Databases Flashcards

(612 cards)

1
Q

What are the types of data?

A

Numeric, character, temporal, spatial, image, text, audio, video

These types represent different formats in which data can exist and be manipulated.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is temporal data?

A

Data with a dimension of time

Temporal data can be structured in various ways and can be bounded or unbounded.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are the types of time density in temporal data?

A

Discrete, Dense, Continuous

Each type has different properties regarding how time is represented and compared.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What does a discrete timeline isomorphic to integers imply?

A

Total order with a finite number of chronons

In a discrete timeline, every pair of integers can be compared.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What does a dense timeline isomorphic to rational numbers imply?

A

Partial order with an infinite number of chronons

Not every pair of elements can be directly compared in a dense timeline.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What does a continuous timeline isomorphic to real numbers imply?

A

Total order with an infinite number of chronons

In a continuous timeline, real numbers maintain a complete ordering.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is granularity in the context of temporal data?

A

Determining the precedence of events

Granularity affects how events are ordered in time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are the two types of time that can be stored in databases?

A

Valid time, Transaction time

Bitemporal data tracks both valid and transaction time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are the types of spatial data?

A

Points, regions, vectors

These types define how spatial data is represented.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What operations can be performed on spatial data?

A

Length, intersect, containment, overlap, centre

These operations allow for manipulation and analysis of spatial relationships.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are some applications of spatial data?

A

Computer Aided Design (CAD), Computer generated graphics, Geographic Information Systems (GIS)

These applications utilize spatial data for various purposes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What key properties are important in spatial data applications?

A
  • Connectivity
  • Adjacency
  • Order
  • Metric relations

These properties help in understanding spatial relationships and structures.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the characteristics of systems dealing with spatial data?

A
  • Data objects may be highly complex
  • Data volumes may be very large
  • Data may be held in real time
  • Performance is not easy to achieve
  • Access is likely through specialised graphical front ends
  • Query processing will probably not be performed using SQL

These characteristics present challenges in managing and analyzing spatial data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How is textual data obtained?

A
  • From machine-readable formats
  • Using OCR techniques

These methods facilitate the conversion of text into a usable format.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the nature of text data?

A

Essentially unstructured

An index is needed to make it searchable and usable.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are examples of image data?

A
  • X-rays
  • Maps
  • Photographs

Image data can vary greatly in type and application.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What are binary large objects (BLOBs)?

A

Large data objects stored without attached semantics

BLOBs do not have inherent meaning or context.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What are essential features of image databases?

A
  • Image analysis and pattern recognition
  • Image structuring and understanding
  • Spatial reasoning and image information retrieval

These features enhance the usability of image data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What are common techniques used in audio data?

A

Compression techniques

Compression is necessary for efficient storage and transmission.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What is MIDI?

A

A more compact alternative to digitised audio

MIDI consists of a sequence of instructions interpreted by a synthesizer.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

How is video data structured?

A

Images stored as a sequence of frames

Video data integrates video and audio through interleaved file structures.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What does DDL stand for and what is its primary function?

A

Data Definition Language; for creating tables and manipulating the database structure

DDL is essential for defining the database schema.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What is DML used for?

A

Data Manipulation Language; utilised for querying & manipulating data within the database

DML allows users to retrieve or modify the content of the database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What components are included on the users’ side of a DBMS?

A

Components that process queries from user or application programs

This includes the user interface and query execution components.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
What is the role of the System Catalogue in a DBMS?
Contains metadata about stored data & schemas ## Footnote Metadata includes details like file names, sizes, storage details, and constraints.
26
List some types of information stored in the System Catalogue.
* Names & sizes of files * Storage details of files * Names & data types of data items * Mappings between schemas * Constraints * Statistical information ## Footnote This information is crucial for the management and optimization of the database.
27
What does the DDL Compiler do?
Handles schema definitions by processing and translating them into a format the DBMS can understand ## Footnote It also stores schema descriptions in the system catalogue.
28
What are the main functions of the Query Compiler?
* Parses & validates queries * Compiles queries to an internal form called 'query plan' ## Footnote This ensures that queries are correctly formatted before execution.
29
What is the purpose of the Query Optimiser?
Improves the query plan for efficient execution ## Footnote It optimizes the sequence of operations and eliminates redundant steps.
30
True or False: The Query Optimiser interacts directly with physical storage.
False ## Footnote The Query Optimiser generates executable code based on the optimised query plan without direct interaction with physical storage.
31
What does the Precompiler handle?
A tool that processes embedded SQL statements in a host programming language (like C, Java, or COBOL) before the actual compilation of the program ## Footnote It extracts DML commands and sends them to the DML compiler.
32
What is the role of the DML Compiler?
Processes DML commands and compiles them into executable code ## Footnote This conversion allows commands to be executed by the runtime database processor.
33
What does the Runtime Database Processor do?
Executes queries and manages access to data stored in a database during program execution ## Footnote It interprets high-level database operations (like SQL queries), optimizes them, and coordinates the retrieval or modification of data by interacting with storage and memory subsystems in real time.
34
What is the function of the Stored Data Manager?
Manages the physical storage of data & controls access to it ## Footnote It interacts with the physical storage using basic OS services.
35
List some utility components of a DBMS.
* Loading utility * Backup utility * Recovery utility * File reorganization utility * Performance monitoring ## Footnote These utilities assist with data management and system performance.
36
Fill in the blank: The _______ utility handles failure using backup information.
Recovery ## Footnote This utility is crucial for maintaining data integrity after failures.
37
What is the volatility of Cache?
Volatile ## Footnote Cache is temporary storage that loses its contents when power is turned off.
38
How fast is Cache compared to other storage types?
Very fast ## Footnote Cache operates at speeds that are significantly faster than main memory and secondary storage.
39
What is the cost classification of Cache?
Very expensive ## Footnote The high performance of Cache comes at a high cost.
40
What is the capacity range of Main Memory?
10^9 - 10^10 bytes ## Footnote This indicates the typical storage capacity of RAM.
41
What is the access time for Main Memory?
10^-8 s (20-30 cycles) ## Footnote This is the time it takes to access data in Main Memory.
42
What defines Secondary Storage?
Non-volatile ## Footnote Secondary storage retains data even when the power is off.
43
What is the speed of Secondary Storage?
Slow ## Footnote Compared to Cache and Main Memory, Secondary Storage is much slower.
44
What is the capacity range of Secondary Storage?
10^11 - 10^12 bytes ## Footnote This indicates the typical storage capacity of hard drives.
45
What is the access time for Secondary Storage?
10^-3 s (10^6 cycles) ## Footnote This reflects the time it takes to access data on hard disk drives.
46
What is Tertiary Storage characterized as?
Non-volatile ## Footnote Tertiary storage also retains data without power, like Secondary Storage.
47
How fast is Tertiary Storage?
Very Slow ## Footnote Tertiary storage is the slowest type of storage compared to Cache and Main Memory.
48
What is the capacity range of Tertiary Storage?
10^13 - 10^17 bytes ## Footnote This indicates a very large storage capacity typical of archival systems.
49
What is the access time for Tertiary Storage?
10^1 - 10^2 s ## Footnote This shows the long wait times typically associated with Tertiary Storage.
50
What are Hard Disk Drives primarily used for?
Secondary storage medium for databases ## Footnote HDDs serve as the main storage solution for many database applications.
51
What are tracks on a hard disk?
Data stored in concentric circles ## Footnote Each track can contain multiple sectors.
52
What is a cylinder in the context of hard disks?
The same track on all surfaces of the disk ## Footnote This concept helps in understanding how data is organized across multiple platters.
53
What is Zone Bit Recording?
Tracks closer to the disc edge are longer than those closer to the axis ## Footnote This technique improves data storage density.
54
What are the components of a disk sector format?
* Gap * Sync * Address mark * ECC ## Footnote These components help in organizing and error-checking data on the disk.
55
What is access time in the context of HDDs?
Time taken to access data on a hard disk
56
How does access time differ when reading and writing data?
For writing, must verify the data
57
How does the time to modify a block differ from accessing it?
Process * Read block * Modify in memory * Write block * Verify block
58
What is seek time in the context of HDDs?
Time taken for head assembly to move to a given track ## Footnote Seek time is a significant factor in overall disk access time.
59
Define rotational delay.
Time taken for the desired sector to rotate under the read/write head ## Footnote This is a component of the total access time for HDDs.
60
What is transfer time?
Time to transfer requested data from the disk to memory ## Footnote Transfer time is calculated as block size divided by transfer rate.
61
What is sequential access?
Reading data blocks in a contiguous manner ## Footnote This method minimizes seek time and rotational delay.
62
What does block addressing refer to?
How data blocks are located & accessed on storage devices ## Footnote It is essential for efficient data retrieval.
63
What are the two primary methods of block addressing?
* Cylinder-head-sector * Logical Block Addressing ## Footnote Each method has its advantages and disadvantages in data retrieval.
64
What is the effect of block size selection on I/O efficiency?
* Large blocks reduce access costs but increase irrelevant data * Small blocks increase access costs but reduce irrelevant data ## Footnote Choosing the right block size is crucial for optimizing performance.
65
What distinguishes Solid State Drives from Hard Disk Drives?
SSD doesn't have moving parts ## Footnote This fundamental difference leads to various performance characteristics.
66
How do SSDs compare in cost to HDDs?
More expensive than HDD ## Footnote The technology and performance of SSDs contribute to their higher cost.
67
What is a key performance advantage of SSDs?
Higher I/O performance ## Footnote SSDs typically offer faster read and write speeds compared to HDDs.
68
How do write speeds of SSDs compare to read speeds?
Writes are slower than reads ## Footnote This is a common characteristic of SSD technology.
69
What is buffer management?
Deals with how data is moved between secondary storage (HDD/SSD) and main memory
70
Why is buffer management important?
Main memory is much smaller than secondary storage, but faster, so managing data improves performance
71
What is a buffer pool?
A region of memory used by a database system to cache frequently accessed data pages, reducing the need to read from disk and improving performance.
72
What is the size of each frame in a buffer pool?
The size of a database block plus some metadata
73
What does the pin count in metadata indicate?
How many users are currently accessing the block in that frame
74
What does the dirty flag in metadata signify?
Set to 1 if the copy of the block in the buffer has been changed and not written back to disk; 0 otherwise
75
What is the purpose of access time in metadata?
Used for LRU replacement (OPTIONAL)
76
What is loading time used for in metadata?
Used for FIFO replacement (OPTIONAL)
77
What does the clock flag indicate in metadata?
Used for Clock replacement (OPTIONAL)
78
What happens when a block is requested?
The system checks if it's already in the buffer pool
79
What occurs if the requested block is not in the buffer pool and there is an empty frame?
The block is read into the empty frame and pin count is set to 1
80
What happens if the buffer pool is full when a block is requested?
A frame is selected for replacement using a replacement strategy
81
What must be true for a frame to be replaced?
Its pin count must be 0
82
What is the LRU replacement strategy?
Selects the frame with the oldest access time for replacement
83
What is the FIFO replacement strategy?
Selects the frame with the oldest loading time for replacement
84
What is the Clock replacement strategy?
An approximation of LRU where each buffer is checked in turn and marked for replacement if it hasn't been accessed in a full cycle
85
What is single buffering?
Used to read in one block, process it, read another block into the same buffer, causing potential delays
86
What is the formula for single buffer time?
Single buffer time = n(P + R)
87
What do the variables in the single buffer time formula represent?
* P = time to process a block * R = time to read a block * n = number of blocks
88
What is double buffering?
Uses a pair of buffers to optimize data processing and reading
89
What is the process of double buffering?
While reading a block into buffer A, process block from buffer B; after reading into A, process A and read next block into B
90
What is the formula for double buffer time?
Double buffer time = R + nP
91
What do the variables in the double buffer time formula represent?
* P = time to process a block * R = time to read a block * n = number of blocks
92
What is the Five Minute Rule?
Data referenced every five minutes should be memory resident (stored in main memory)
93
What is the trade-off described by the Five Minute Rule?
Between the cost of keeping a block on disk and the cost of keeping it in RAM
94
Who originally proposed the Five Minute Rule?
Jim Gray & Franco Putzolu in 1987
95
What is the break-even point?
Occurs when the cost of keeping a block on disk equals the cost of keeping it in RAM
96
What is the formula to calculate the break-even point?
X = ($D x P) / (I x $M) ## Footnote X = time interval in seconds $D = cost of a disk unit P = number of pages in 1MB of RAM I = number of I/Os that a disk unit can perform per second $M = cost of 1MB of RAM
97
What do the variables in the break-even point formula represent?
* X = time interval in seconds * $D = cost of a disk unit * P = number of pages in 1MB of RAM * I = number of I/Os that a disk unit can perform per second * $M = cost of 1MB of RAM
98
How has the Five Minute Rule evolved over time?
Changes in technology & cost: 1997: 5 minutes; 2007: 1.5 hours; 2016: 13.5 minutes; Current: DRAM-SSD: 5 minutes, SSD-HDD: 1.5 days
99
What are the energy consumption differences between DRAM, SSDs, and HDDs?
Energy costs of DRAM are much greater than SSDs, and HDDs have greater energy costs than both
100
What are data items in a DBMS?
The basic units of information that a DBMS stores
101
How are integers represented in a DBMS?
Using short integer of 2 bytes
102
What standard is used to represent real numbers?
IEEE 754 floating point standard
103
What coding schemes are used to represent characters?
* ASCII * UTF-8
104
How are booleans represented in a DBMS?
Using 1 byte per value; multiple values can be packed per byte
105
How are dates represented in a DBMS?
As days since a given date or using ISO8601 formats such as YYYYMMDD or YYYYDDD
106
How are times represented in a DBMS?
As seconds since midnight or in ISO8601 formats such as HHMMSS or HHMMSSFF
107
How are strings represented in a DBMS?
In different ways
108
What is the difference between fixed length and variable length data items?
Fixed length: integers & characters; Variable length: strings & bit arrays
109
What are records in a DBMS?
Collections of related data items, aka fields
110
Give an example of a record in a DBMS.
An employee record might consist of a name field, a salary field, and an employment start date field
111
What are the two types of lengths records can have?
* Fixed length * Variable length
112
Give an example of a fixed length record.
An employee record with a 2-byte integer employee number, a 10-character name, and a 2-byte department code
113
Give an example of a variable length record
An employee record containing a number of fields, a code identifying the field (e.g. as an employee number), a type (e.g. integer), a string length and value, and a code identifying another field (e.g. name)
114
What is a header in a record?
The data at the beginning of a record that provides descriptive information about the record
115
What is the smallest unit of data transfer between disk & memory in a DBMS?
Blocks
116
What does a block contain?
* File ID (DB ID, or Relation ID) * Block ID * Record directory * Pointer to free space * Type of block * Pointer to other blocks * Timestamps
117
What are the considerations for placing records in blocks?
* Separating records in a block * Spanned vs unspanned * Sequencing * Indirection
118
What are the three approaches for separating records in a block?
* Use fixed length records * Use a special marker to indicate record end * Give record lengths (or offsets)
119
What is the difference between spanned and unspanned records?
* Unspanned: each record must fit within a single block * Spanned: records may be split between blocks
120
What is sequencing in the context of records?
Ordering records in file (and block) by some key value to optimise disk I/O
121
What is physical addressing?
Record ID directly specifies the physical location of the record
122
What are the advantages of physical addressing?
* Quick access to memory/disk location * Low overhead due to no extra data structures or lookups
123
What are the disadvantages of physical addressing?
* Inflexible if record needs to be moved * Extensive updates required for address changes
124
What is indirect addressing?
Uses pointers to refer to records
125
What are the advantages of indirect addressing?
* Easier to move & rearrange records * More efficient insertion & deletion operations
126
What are the disadvantages of indirect addressing?
* Extra overhead due to pointers * Requires extra step to dereference pointers
127
What is address management in a DBMS?
Handling how data is located & accessed in secondary storage & main memory
128
What does a translation table do in address management?
Records the mapping between DB addresses & memory addresses
129
What is pointer swizzling?
Technique that translates DB addresses to virtual memory addresses
130
What are the three strategies for pointer swizzling?
* Automatic * On-demand swizzling * No swizzling
131
What is unswizzling?
The reverse operation of swizzling
132
What are the two primary approaches to deletion in a DBMS?
* Immediate Space Reclaim * Marking as Deleted
133
What is the challenge with deletion in a DBMS?
Handling dangling pointers
134
What is a tombstone approach in deletion?
Leaves a MARK in the map or old location; physical space is never reused
135
What are the two types of tombstones?
* Physical tombstones * Logical tombstones
136
What does access structure refer to?
The way data is stored, indexed, and retrieved efficiently from the database.
137
What is the purpose of indexing?
Improves query performance & enables efficient data retrieval.
138
What do indexes help locate in a database?
Tuples (records) that match specific criteria.
139
How are relations stored in a database?
In files, which are collections of blocks.
140
What do blocks contain in a database?
Records corresponding to tuples in the relation.
141
What is a sequential file in the context of databases?
Tuples of a relation sorted by their primary key.
142
What is a key characteristic of the storage of tuples in sequential files?
Tuples are distributed among blocks in the order of their primary key.
143
Why is free space often left in each block of a sequential file?
To allow for later insertions.
144
What is the trade-off when maintaining an index?
Query performance vs update cost
145
What is a dense index?
A sequence of blocks holding keys & pointers to all records in the data file.
146
How many key/pointer pairs does a dense index have?
One key/pointer pair for every record in the data file.
147
What is the benefit of having blocks of the index in the same order as those of the data file?
Fewer disk accesses need to be made.
148
What searching method can be used with a dense index?
Binary search for keys as they are sorted.
149
What is a sparse index?
One key/pointer pair for every block in the data file. Uses sequential search with the block to find specific record.
150
What is a key requirement for using a sparse index?
The data file must be sorted by search key.
151
When is a sparse index less suitable?
When precise, quick access to individual records is needed.
152
What is the main advantage of a sparse index over a dense index?
Uses less space (and better for insertions). Also good for low cardinality datasets (duplicate values).
153
What is a multi-level index?
A sparse index of the first index.
154
What is preferred to 3-level indexes?
B-trees
155
What is the benefit of block pointers in sparse indexes?
They can be smaller than record pointers.
156
What are the 2 approaches to handle duplicate keys in a dense index?
Pair for each key/pointer & record. Just point at first instance of key.
157
What are the 2 approaches to handle duplicate keys in a sparse index?
Index contains first key from each block. Index contains first new key from each block (may mean block gets excluded).
158
What happens when deleting a record from a sparse index?
Update the entry in the index after deleting the record.
159
What is a secondary index?
Unlike primary indexes, it doesn't determine placement of records in the data file.
160
What type of pointers do secondary indexes use?
Record pointers, not block pointers.
161
What is a disadvantage of conventional indexes?
Inserts are expensive, and/or lose sequentiality & balance.
162
Fill in the blank: A dense index approach can include a pair for each key/pointer & _______.
record.
163
True or False: Sparse indexes work well for queries retrieving large ranges of data.
True.
164
What is the formula to calculate the offset in a contiguous file?
Offset = (Key Position Difference) x Block Size.
165
What performance characteristic do B+ trees provide?
Consistent performance ## Footnote B+ trees maintain predictable performance regardless of the data distribution.
166
What is sacrificed in B+ trees for the sake of performance?
Sequentiality ## Footnote B+ trees may not maintain the sequential order of data access, but they provide faster search times.
167
Where is the root node of a B+ tree typically kept?
Main memory ## Footnote Keeping the root node in memory allows for quick access to the tree structure.
168
What is the purpose of leaf nodes in a primary index of a B+ tree?
Contain actual data records stored in the order of the primary key ## Footnote This allows for efficient data retrieval without a sequential scan.
169
What do leaf nodes contain in a secondary index of a B+ tree?
Pointers to the data records ## Footnote This allows access to data in the sequence of the secondary key.
170
What kind of node size is there in B+ trees?
Fixed size ## Footnote Fixed size helps in maintaining a balanced structure and efficient space usage.
171
Why is there a minimum node requirement in B+ trees?
To ensure nodes aren't too empty ## Footnote This helps in efficient space utilization.
172
What is the rule regarding the distance of leaves from the root in B+ trees?
All leaves are the same distance from the root ## Footnote This characteristic maintains the balance of the tree.
173
What is the formula for the number of records a leaf node in a primary index can hold?
lp = ⌊(b - p)/r⌋ ## Footnote Where b is the block size, p is the block pointer size, and r is the record length.
174
What is the initial node occupancy typically assumed in B+ trees?
Around 0.6 ## Footnote This allows for some level of expansion in the tree structure.
175
What is the formula for the number of records a leaf node in a secondary index can hold?
ls = ⌊(b - p)/(k + p)⌋ ## Footnote Where b is the block size, p is the block pointer size, and k is the key length.
176
What does the B+ tree insertion case #1 involve?
Space available in leaf (insert 32) ## Footnote This is a straightforward insertion where no restructuring is needed.
177
What happens in B+ tree insertion case #2?
Leaf overflow (insert 7) ## Footnote This requires splitting the leaf node and redistributing keys.
178
What happens in B+ tree insertion case #3?
Non-leaf overflow (insert 160)
179
What happens in B+ tree deletion case #4?
Coalesce with sibling ## Footnote This involves merging with a neighboring node when a node underflows.
180
What happens in B+ tree insertion case #4?
New root (insert 45)
181
What happens in B+ tree deletion case #2?
Coalesce with sibling ## Footnote This involves merging with a neighboring node when a node underflows.
182
What is often preferred over coalescing in B+ tree deletion?
Redistribution ## Footnote Borrowing a key from a sibling is simpler than merging nodes.
183
How are B+ trees different to regular indexing (3)?
* Better performance * Consume more space * Harder concurrency control ## Footnote These features make B+ trees suitable for certain types of database applications.
184
What does a DBA often not know regarding indexing?
* When to reorganize * How full to load pages of new index ## Footnote These uncertainties can lead to inefficient indexing practices.
185
What is a hash function?
A function that takes a key as input and computes an integer value to determine appropriate storage location ## Footnote This integer value is used to index into a hash table.
186
What are buckets in a hash table?
Storage locations that hold records; they point to linked lists of records ## Footnote Keys are sorted in buckets to improve efficiency.
187
What are collisions in hashing?
Instances where different keys hash to the same storage location due to more possible keys than available buckets
188
What is a main memory hash table?
A hash table where data is stored entirely in memory ## Footnote The hash function computes an integer value from a key to determine a bucket.
189
What is a secondary storage hash table?
A hash table where data is stored on disk, allowing for larger datasets ## Footnote Requires disk I/O to access the bucket array.
190
What are the two approaches for using a hash function in secondary storage hash tables?
1. **Direct Mapping**: hash function gives location of bucket on disk, making access fast and predictable 2. **Use a Directory**: hash function gives index into a directory, which points to the actual disk blocks - more flexible
191
What is the recommended space utilization for hash tables?
Between 50% and 80% ## Footnote Lower than 50% indicates wasted space, while higher than 80% indicates significant overflows.
192
What are the two ways to cope with growth in hashing?
1. Extensible hashing 2. Linear hashing
193
What is extensible hashing?
A method that adapts to database growth by adjusting the directory size and splitting buckets as needed
194
What are the key ideas combined in extensible hashing?
1. Using a variable number of bits from the hash function's output 2. Using a directory (array of pointers to buckets)
195
What happens during deletion in extensible hashing?
Blocks can be merged and the directory can be reduced if possible, without merging blocks directly
196
What are overflow chains in hashing?
Structures used when inserting records to handle cases when buckets are full
197
What are the advantages of extensible hashing?
1. Handles growing files without excessive wasted space 2. Avoids reorganizations, reducing overhead
198
What are the disadvantages of extensible hashing?
1. Directory introduces indirection, leading to extra memory access 2. Directory size doubles occasionally, increasing disk accesses and decreasing performance
199
What is linear hashing?
A method that improves on extensible hashing by incrementally expanding the hash file
200
How does linear hashing determine the initial bucket address?
The algorithm looks at a certain number of the smallest bits from the hash result, called i, to decide where to start storing the data.
201
What is the lookup rule in linear hashing?
If h(k)[i] <= m, look at bucket h(k)[i]; otherwise, look at bucket h(k)[i] - 2^(i-1)
202
When should a file be expanded in linear hashing?
When the utilization U exceeds a certain threshold ## Footnote U = #used slots / total #slots
203
What are the advantages of linear hashing?
1. Can handle growing files with less wasted space 2. No full reorganizations required 3. No indirection unlike extensible hashing
204
What is a disadvantage of linear hashing?
Still has overflow chains
205
What is the main advantage of hashing over indexing?
Optimal for single-value lookups as it provides direct access to data based on a key
206
Why is hashing not suitable for range queries?
Hashing does not preserve the order of data
207
What is the advantage of indexing for range searches?
B-trees store data in sorted order, allowing for efficient range retrievals
208
What is the time complexity for lookups in B-trees?
O(log n)
209
What is a characteristic of conventional indexes?
Simple and good for scans as the index is a sequential file; inserts can be expensive
210
What are multidimensional access structures used for?
To efficiently retrieve data based on multiple search keys.
211
What type of queries can multidimensional access structures handle?
* Partial match queries * Range queries * Nearest-neighbour queries
212
In the context of conventional indexes, what is Approach #1?
Get all matching records using an index on one attribute and check values of the other attribute.
213
What is the main drawback of Approach #1 in conventional indexing?
Requires scanning twice.
214
Describe Approach #2 for conventional indexing.
Use secondary indexes on each attribute to get two sets of record pointers and take the intersection of sets.
215
What is a key disadvantage of Approach #2?
The comparison could be quite expensive.
216
What is Approach #3 in conventional indexing?
Use a secondary index on one attribute to select a suitable index on the other attribute.
217
What are grid files similar to?
Like a hash table.
218
How do grid files partition multidimensional space?
Partitions the space with a grid divided into stripes.
219
What is associated with each region in a grid file?
A pointer to a bucket that contains record pointers.
220
What is one advantage of grid files?
Good for multiple-key search.
221
List some disadvantages of grid files.
* Space management overhead * Need partitioning ranges that evenly split keys.
222
What does a partitioned hash do?
Divides a large dataset into smaller subsets (partitions) using a hash function — often to improve performance
223
What is the structure of a kd-tree?
Each node contains an attribute-value pair and 2 pointers.
224
What does each node in a kd-tree do?
Splits the k-dimensional space along a hyperplane.
225
What is a significant consideration when adapting kd-trees to secondary storage?
Keep disk accesses to a minimum.
226
What are the two main types of quad-trees?
* Region quad-trees * Point quad-trees
227
What does an R-tree represent?
Data that consists of k-dimensional data regions.
228
What defines the regions in an R-tree?
Typically defined by top-right and bottom-left coordinates.
229
What is Z-index in the context of UB-trees?
Domain of each attribute mapped onto an n-bit integer.
230
What is a strength of the Z-index?
It converts multi-dimensional data into a single value that can be indexed using conventional methods.
231
What is a limitation of the Z-index?
Small differences in one dimension might not appear as 'close' in the Z-value.
232
What do bitmap indexes represent?
Attribute values using bit-vectors.
233
How does querying with bitmap indexes work?
Combines bit-vectors with bitwise operators (&, |).
234
What is a pro of bitmap indexes?
Efficient answering of partial-match queries.
235
What is a con of bitmap indexes?
Requires fixed record numbers.
236
What is relational algebra?
A mathematical system with operands and operators
237
What are operands in relational algebra?
Variables/values from which new values can be constructed
238
What are operators in relational algebra?
Symbols denoting procedures that construct new values from given values
239
What is the input of a relational algebra operator?
One or more relations
240
What is the result of an operation in relational algebra?
Always a relation
241
How should relational algebra be used?
In a cascaded manner
242
What is a relation in the context of relational algebra?
A subset of the Cartesian product of sets
243
How can a relation R be represented?
In a table with k columns
244
What values are allowed in the ith column of a relation?
Values from domain Di
245
What are attributes in relational data models?
Names of the columns
246
Explain this example (in the context of relations).
* D1: IDs * D2: Names * D3: DeptIDs
247
What are the properties of relations?
* Represents a relation with k attributes * Each value comes from the domain of the corresponding attribute * Each tuple is distinct * Each attribute contains a single atomic value * Ordering of tuples is immaterial
248
What is the named perspective in relational data?
Refers to the idea that every relation (or table) in a relational database has a name and a defined schema (i.e., a set of named attributes or columns)
249
What does the unnamed perspective in relational data suggest?
The ordering of attributes is significant
250
What are the main relational operators?
D1 is the set of IDs, D2 is Names, and D3 is DeptIDs The relation R is a subset of the Cartesian product of these sets and can be represented by this table with 3 columns Values from domain D3 can only be values allowed in the DeptID column (in the DeptID set), like 'ECS'
251
Does commutativity hold for union in relational algebra?
Yes, R ∪ S = S ∪ R
252
Does associativity hold for union in relational algebra?
Yes, R ∪ (S ∪ T) = (R ∪ S) ∪ T
253
Does commutativity hold for difference in relational algebra?
No, R - S ≠ S - R (except if R=S)
254
Does associativity hold for difference in relational algebra?
No, R - (S - T) ≠ (R - S) - T
255
Does commutativity hold for Cartesian product in relational algebra?
No as locations are different, R X S ≠ S X R
256
Does associativity hold for Cartesian product in relational algebra?
Yes, R X (S X T) = (R X S) X T
257
Does distributivity across union hold in relational algebra?
Yes, R X (S ∪ T) = (R X S) ∪ (R X T)
258
What is the renaming operator in relational algebra?
Changes attribute names in a relation
259
How is the renaming operator denoted?
ρ_NewRelationName(R) ## Footnote For relation Students(name,age), ρ_S(n,a)(S) renames Students to S and name and age to n and a
260
What does the projection operator do?
Removes attributes not in the specified list
261
What is the notation for the projection operator?
πL(R)
262
What does the selection operator return?
A subset of the relation where tuples satisfy a predicate
263
What is the notation for the selection operator?
σ_Θ(R)
264
What types of operators can be used in selection predicates?
* Comparison operators (=, <, >, /=, <=, >=) * Boolean logic operators (∧, ∨)
265
What is a multiset?
Like a set, but elements may appear more than once
266
What do relational database binary operations do?
Combine information from two relations into a new relation ## Footnote Core to relational databases, including Θ-Join, Natural Join, Left Outer Join, Semijoin, Antijoin.
267
What is a Θ-Join?
Combines 2 relations using a predicate ## Footnote Equivalent to the Cartesian product followed by a selection using the predicate.
268
What is an Equijoin?
A theta join that only uses the operator = ## Footnote Example: Food's Shop attribute equals Locations' Name attribute.
269
What is a Natural Join?
A join where no predicate is specified ## Footnote Common attributes appear only once.
270
How is a Natural Join formalized?
Cartesian Product of two relations followed by a selection on equality and a projection ## Footnote Involves relations R & S.
271
What is a Left Outer Join?
Includes tuples from R that don't have corresponding tuples in S ## Footnote Missing values are set to null.
272
What is a Right Outer Join?
Includes tuples from S that don't have corresponding tuples in R ## Footnote A type of outer join.
273
What is a Full Outer Join?
Includes tuples from both relations that don't have corresponding tuples in the other relation ## Footnote A type of outer join.
274
What is a Semijoin?
Similar to a natural join, but resulting attributes are only from the left relation ## Footnote Useful for reducing the size of the result.
275
What is an Antijoin?
Contains tuples from R that have no match in S ## Footnote The reverse of a semijoin.
276
What are relational transformations?
Involves transforming one query expression into another equivalent query expression ## Footnote Aims to enhance efficiency during query processing.
277
What are Cascading Projections?
Only the last projection in nested projections is required if not extended ## Footnote Reduces complexity in query execution.
278
What are Conjunctive Selections?
Selections with conjunctive terms can cascade into individual selections ## Footnote Eliminates unnecessary data sooner.
279
What is the Commutativity of Selection and Theta-Join?
Selection and theta-join are commutative operations ## Footnote Applying the most restrictive filter first reduces rows early.
280
What is the Associativity of Joins?
Joins exhibit associativity ## Footnote Reordering joins helps find the most efficient execution plan.
281
What is the Distribution of Selection over Theta-Join?
Selection can be performed on both relations prior to the theta-join ## Footnote If the predicate only involves attributes being joined.
282
What is Relational Algebra?
A mathematical system consisting of operands & operators ## Footnote Includes set operations & relation-specific operators.
283
What is SQL?
A relationally complete database query language ## Footnote Can express all relational algebra queries.
284
What is the basic form of a SQL statement?
SELECT, FROM, and WHERE ## Footnote Fundamental components of SQL queries.
285
What does SELECT correspond to in Relational Algebra?
Projection π ## Footnote Used to specify the columns to be returned.
286
What does FROM correspond to in Relational Algebra?
Cartesian Product ## Footnote Specifies the tables involved in the query.
287
What does WHERE correspond to in Relational Algebra?
Selection σ ## Footnote Used to filter records based on conditions.
288
What is a Self-Join in SQL?
A join where a table is joined with itself ## Footnote Often requires aliases for clarity.
289
Why are aliases necessary in self-joins?
They improve readability and clarity ## Footnote Reduce typing effort in complex queries.
290
What are the rules for alias creation in SQL?
* Created in the FROM list * `FROM AS ` * New names referenced in the select list (SELECT * FROM R AS S ## Footnote Ensures clarity in queries.
291
What are the initial steps involved in processing a SQL query?
Scanning, Parsing & Validating ## Footnote Involves parsing the SQL query into an abstract syntax tree and validating table and attribute names against the system catalogue.
292
What is the output of the scanning, parsing, and validating stage?
An intermediate form
293
What is the purpose of the Query Optimiser?
To find the best way to execute the query
294
What does the Query Code Generator do?
Turns the optimised query into executable code
295
What is executed by the Runtime Database Processor?
The executable code generated from the optimised query
296
What are the steps involved in creating a Query Plan?
* Generating an initial query plan * Optimising it * Translating it into executable code
297
What is the purpose of rewriting and optimizing the initial query plan?
To improve plan's efficiency without specifying the exact algorithms to be used.
298
Name a technique for logical optimization.
* Using predicate decomposition (breaking down selections with AND conditions) * This technique rewrites a selection with conjunctive predicates (e.g., A AND B) into a sequence of simpler selections. * Example: Instead of `σ_{A AND B}(R)`, we write `σ_A(σ_B(R))`. * This helps in reordering and optimizing the query more effectively. ## Footnote Other techniques include reordering subtrees and combining Cartesian products with adjacent selections.
299
What is the goal of logical optimization?
To minimize the size of intermediate relations, thus reducing the overall cost of the query.
300
What is a useful approach when considering query plans?
Considering only left-deep trees.
301
What does the physical query plan involve?
Selecting algorithms for each operator in the logical query plan.
302
What happens during the transition from logical to physical query plan?
The query plan transitions from an abstract representation to a concrete, executable strategy.
303
What factors are considered during cost estimation in physical query plans?
I/O costs and memory usage.
304
True or False: Logical and physical optimization cannot influence each other.
False.
305
What can the existence of an index influence?
It might make a particular join order more efficient, leading to a different logical plan.
306
What characterizes a canonical form of a relational algebra query tree?
A left-deep tree of products, a conjunctive selection above the products, and a project of the output attributes above the selection.
307
Fill in the blank: A canonical form serves as an initial, unoptimised representation of a SQL query that a query optimiser can then transform into a more _______.
efficient execution plan.
308
What is the primary purpose of cost estimation in query optimization?
To predict the resources needed to execute a query plan.
309
What is the aim of choosing a logical query plan?
To minimize the sizes of the intermediate relations, thus minimizing the overall cost of the plan.
310
How is the cost of an operator defined?
As the cardinality of its output relation.
311
What contributes to the overall query plan cost?
The sum of the cardinalities of the intermediate relations, excluding the input relations and the final result.
312
What does the system catalogue store?
Statistics about each relation, used to make informed decisions about query costs.
313
What does T(R) represent?
The number of tuples in relation R (cardinality of R).
314
What does V(R, A) indicate?
The number of distinct values for attribute (column) A in relation R.
315
Fill in the blank: V(R, A) ≤ _______ for all attributes A on R.
T(R)
316
What is the formula for the scan operation in cost estimation?
T(scan(R)) = T(R) and V(scan(R), A) = V(R, A) for all attributes A in R.
317
How is the cost estimated for the Cartesian product of two relations?
T(R × S) = T(R)T(S) and V(R × S, A) = V(R, A) for all attributes A in R.
318
What is assumed in the projection operation regarding duplicate tuples?
The projection does not remove duplicate tuples.
319
In selection, what happens when the attribute equals a constant?
It leads to specific filtering of tuples based on the constant value.
320
What are the 4 types of selections for cost estimation?
Inequality, not equals, conjunction, disjunction
321
For join, what happens to T(R ⋈ S) if R and S do not have any value in common?
T(R ⋈ S) = 0.
322
If S is the key and a foreign key of R, what is T(R ⋈ S)?
T(R ⋈ S) = T(R).
323
What is the formula for estimating the number of tuples in the join of relations R and S?
Divides the product of the number of tuples in R and S by the maximum number of distinct values in either R.A or S.B.
324
What are further statistical methods used for improving cost estimation?
Methods include histograms that provide a more detailed view of data distribution within attributes.
325
What assumption does the approach of using distinct values make?
That each attribute value appears with equal frequency.
326
What type of histograms divides the attribute domain into equal parts?
Equal-Width Histograms.
327
How do Equal-Height Histograms work?
Sort tuples by attribute, divide them into equal-sized sets, and provide the maximum value for each set.
328
What do Most-Frequent Values histograms track?
The top-n most common values for each attribute, along with their frequency counts.
329
What is required to find the optimal query plan?
To compare all the possible plans.
330
What determines the number of possible binary trees with n leaves?
The Catalan numbers using the formula: ## Footnote Catalan numbers are a sequence of natural numbers that have many applications in combinatorial mathematics.
331
How many permutations exist over n leaves?
n! permutations. So total possible niary join trees with permuations = Catalan number x n! ## Footnote The factorial of n (n!) represents the product of all positive integers up to n.
332
What are the primary factors determining query cost?
The number of joins and the order they are executed in.
333
True or False: The number of query plans increases exponentially with each additional relation.
True.
334
Why are joins preferred over Cartesian products?
They produce smaller output relations.
335
What are query graphs used for?
Managing the complexity of join ordering and visualising relationships between relations.
336
What structure do query graphs have?
Undirected graph.
337
In a query graph, what do vertices represent?
The relations (R1, R2, …, Rn) used in the query.
338
What do edges indicate in a query graph?
Join predicates between relations.
339
What creates an edge between two vertices (Ri and Rj) in a query graph?
A predicate of the form a = aj, where a belongs to relation Ri and aj belongs to relation Rj.
340
What creates a self-loop on a vertex (representing Ri) in a query graph?
A predicate of the form a = const, where a belongs to relation Ri.
341
What is the purpose of join trees?
To constrain the search space by excluding join orderings that would lead to cross products.
342
What are the two main classes of join trees?
Linear join trees and bushy join trees.
343
What characterizes linear join trees?
Every join introduces at least one base relation.
344
What is materialisation in query processing?
Where output signals from an operator are written to disk in a semi-permanent fashion.
345
What is pipelining in query processing?
Where data is read, processed, and propagated to the next operator in the query plan without materialisation.
346
What are the possible shapes of join trees?
Left-deep, right-deep, or zig-zag.
347
What is the formula for the number of possible bushy trees?
n!C(n-1) = (2n)!/n!.
348
What is the heuristic approach in query optimisation?
Use transformation rules to modify the query plan. ## Footnote 1. Start with canonical form 2. Push σ operators down the tree 3. Introduce joins (combine x and σ to create ⋈) 4. Determine join order 5. Push π operators down the tree
349
What does the canonical form represent in query optimisation?
A structured representation of a query upon which optimisation techniques can be applied.
350
Fill in the blank: The canonical form should be a _______ tree of products.
left-deep.
351
What is the process of pushing σ operators down the tree?
To move selection conditions closer to the base relations.
352
What is the significance of reordering joins?
To potentially reduce the number of tuples processed based on join selectivity. ## Footnote How many projects called 'Aquarius'? Probably 1 How many people born after 1957? Quite a lot So move the PROJECT subtree down so it's at the bottom of the query plan
353
How do you create joins from canonical form?
354
Talk about moving pi (the project operator) down the tree.
If intermediate relations are to be kept in buffers, reducing the degree of those relations (no. attributes) allows us to use fewer buffer frames
355
What must be present to turn a Cartesian product into a join?
A predicate that relates something in one relation to something in another relation.
356
What do execution models define?
The interface that connects operators, how data is propagated between operators, how operators are scheduled within a query plan.
357
What is the goal of execution models?
To pipeline data through the operators, processing tuples one at a time to minimize materialization and maximize in-memory operations.
358
True or False: All operators can be pipelined.
False.
359
What is an iterator in query processing?
A standard interface for operators that facilitates data flow and pipelining.
360
List the methods associated with iterators.
* open() * getNext() * close()
361
What happens when getNext() is called in synchronous execution?
Operators don't generate tuples until getNext() is called.
362
What is the purpose of buffering in asynchronous implementations?
To minimize blocking.
363
What are the three models that can achieve asynchrony?
* Push model * Pull model * Stream model
364
What are the key traits of the Push Model?
* Data flows from leaves upward. * Producer pushes data as soon as it's ready. * Consumer buffers data until it calls getNext(). * Minimizes idle time; good for pipelining.
365
What are the key traits of the Pull Model?
* Data flow starts from the root. * Producer waits for getNext() before sending data. * Synchronous style but can include some asynchrony.
366
What are the key traits of the Stream Model?
* Uses FIFO queues between operators. * Producer adds tuples to queue immediately. * Consumer reads from queue if data is available (non-blocking). * Supports parallelism; asynchronous operators, synchronous streams.
367
What are physical-plan operators (PPOs)?
Algorithms that *implement* the basic relational operations used in query plans.
368
What factors influence the choice of algorithm for PPOs?
* Structure and size of the relations * Presence of indexes & hashes
369
What is a clustering file?
Refers to storing tuples from different relations that can be joined (based on shared attributes) in blocks together.
370
What is a clustered relation?
Tuples from the same relation are stored together in blocks, though not necessarily contiguous on disk.
371
What is a clustering index?
An index that allows tuples to be read in an order that corresponds to their physical order on disk.
372
What is scanning in the context of PPOs?
The operation of reading all tuples of a relation that satisfy some predicate.
373
What are the two main types of scanning?
* Table Scan * Index Scan
374
What is the I/O cost for a table scan if the relation is clustered?
B(R) disk accesses.
375
What is the I/O cost for an index scan if the relation is clustered?
B(R) + B(I_R) disk accesses.
376
What are one-pass algorithms?
Algorithms that read data from the disk only once and require that at least one argument fits in main memory.
377
What are the three broad categories of one-pass algorithms?
* Unary, tuple at a time * Unary, full-relation * Binary, full-relation
378
What are unary, tuple-at-a-time algorithms?
Algorithms that can be applied to a single tuple at a time and are non-blocking.
379
What is the cost for unary, tuple-at-a-time algorithms?
Generally B(R) or T(R) depending on clustering.
380
What are unary, full-relation algorithms?
Algorithms that require the entire relation to be examined before producing output and may be blocking
381
What is required for duplicate elimination in one-pass algorithms?
M ≥ B(δ(R)) + 1 blocks of main memory. Costs B(R)
382
What are binary, full-relation algorithms typically used for?
Union, intersection, difference, product, and join.
383
What is the general cost for binary, full-relation algorithms?
B(R) + B(S).
384
What is a nested-loop join (aka iteration join)?
A basic join algorithm that typically requires multiple passes over the inner relation for each tuple in the outer relation.
385
How costly is a tuple-based nested-loop join?
Very expensive, as each tuple requires a separate disk access.
386
What is the advantage of a block-based nested-loop join?
The outer relation is read in chunks, improving performance.
387
What is the benefit of join reordering?
Performance can be improved by reversing the order of the relations in the join.
388
What happens when both relations in a join are contiguous and sorted?
The cost is reduced as it uses the number of blocks in the cost calculation.
389
What is the merge join?
An efficient join method that reads each block of R1 and R2 once only.
390
What is a two-pass algorithm used for?
When the data is too large to fit in main memory, requiring multiple passes over the data to perform an operation. ## Footnote Two-pass algorithms are essential in scenarios involving large datasets, such as external sorting or joining operations.
391
Describe the process of Merge Sort in a two-pass algorithm.
Tuples are read, written, read, and written. Each (e.g.) 100 block chunk of R is: * Read * Sorted in memory * Written to disk * Read all the chunks * Merge the chunks * Write them out to a file. ## Footnote Merge Sort is efficient for external sorting due to its sequential access pattern.
392
How does the Hash-Join process function?
Relation is partitioned into M-1 buckets. * Read relation a tuple at a time * Hash each tuple to a bucket * Move full buckets to disk and reinitialize * Join R1, R2. ## Footnote This method reduces the overall memory requirement by processing smaller partitions.
393
What is the key assumption in index-based algorithms?
The attribute on which we're joining has an index that fits in memory. ## Footnote Index-based algorithms optimize join operations by leveraging existing indexes to minimize disk accesses.
394
What is the expected number of matching tuples when R2.a is a key and R1 has a foreign key?
1 expected matching tuple for each tuple in R1. ## Footnote This scenario is common in relational databases where foreign key relationships are established.
395
Calculate the expected matching tuples given V(R2,C) = 5000 and T(R2) = 10,000.
Expected matching tuples = 10,000 / 5,000 = 2. ## Footnote This calculation assumes a uniform distribution of values in R2.
396
What is the expected number of matching tuples if domain(R2, C) = 1,000,000 and T(R2) = 10,000?
Expected matching tuples = 10,000 / 1,000,000 = 1/100. ## Footnote This reflects a scenario with a much larger domain, indicating sparser matches.
397
What does concurrency relate to in a DBMS?
How a DBMS supports multiple users accessing the system simultaneously ## Footnote Concurrency is essential for multi-user environments.
398
What happens when client programs concurrently access data items?
They create a copy, modify it, and then commit the changes back to the central database ## Footnote This process can lead to potential consistency issues if not managed correctly.
399
What is consistency in the context of a database?
A database is in a consistent state when all constraints are satisfied ## Footnote Constraints are derived from application requirements.
400
Give an example of inconsistency in a database.
Multiple clients modifying the same database can lead to problems ## Footnote If not managed properly, it can compromise consistency and isolation.
401
Define 'operation' in relation to a database.
A basic action on a database by a client program (read, write) ## Footnote Examples include read(X) and write(X).
402
What does read(X) do?
Reads a database item Xd into a program variable XT in transaction T ## Footnote This operation retrieves data for processing.
403
What does write(X) do?
Writes the value of program variable XT in transaction T into the database item Xd ## Footnote This operation updates the database with new information.
404
What is a conflict in database operations?
Two operations on a database conflict if their execution doesn't commute ## Footnote Conflicts can lead to inconsistencies.
405
What is a transaction?
A set of operations meant to be executed together (like in a block) ## Footnote Transactions ensure data integrity.
406
What happens when a transaction commits?
It terminates successfully; otherwise, it aborts/fails ## Footnote This is critical for maintaining database integrity.
407
What are the ACID properties of transactions?
Atomic, Consistent, Isolated, Durable ## Footnote These properties ensure reliable transaction processing.
408
Define 'Atomic' in the context of ACID properties.
All operations within a transaction are executed, or none are ## Footnote This ensures that partial transactions do not affect the database.
409
What does 'Consistent' mean regarding transactions?
A transaction must leave the database in a consistent state, adhering to defined constraints ## Footnote This is vital for maintaining data integrity.
410
Explain 'Isolated' in the context of transactions.
Each transaction must read from a consistent database state, isolated from the 'work in progress' of other transactions ## Footnote Isolation helps prevent conflicts between transactions.
411
What does 'Durable' mean in ACID properties?
The effects of a committed transaction aren't lost ## Footnote This ensures that once a transaction is completed, its changes are permanent.
412
What is a naïve solution to ensure consistency and isolation in transactions?
To execute transactions serially ## Footnote However, this can lead to longer wait times for users.
413
What is the downside of executing transactions serially?
Users may have to wait a long time before getting an answer from the system ## Footnote This can decrease system throughput.
414
What are concurrent transaction problems?
Operations from different transactions may be interleaved, potentially breaking consistency/isolation ## Footnote This is a challenge in multi-user database environments.
415
Define the Lost Update Problem.
Occurs when concurrent transactions overwrite each other's updates without proper management ## Footnote This leads to data loss or inconsistency.
416
What is the Inconsistent Read Problem?
Occurs when one transaction reads data that is modified by another ongoing transaction ## Footnote For example, checking a balance while a transfer is in progress.
417
What is the Dirty Read Problem?
Occurs when a transaction reads data written by another transaction that has not yet been committed ## Footnote This can lead to inconsistencies if the first transaction rolls back.
418
How are transactions ordered?
Each transaction defines a partial order of operations that indicates what needs to be executed before what.
419
Define a transaction schedule.
A schedule is a partial order, constructed from the union of operations performed by multiple transactions.
420
What does a transaction schedule preserve?
The order of operations as defined within each individual transaction.
421
What is the goal of defining a transaction schedule?
To ensure database consistency.
422
What is schedule equivalence?
A transaction schedule is considered serialisable if its effect on the database is equivalent to that of a serial schedule.
423
What is a serialisable schedule?
A schedule that produces the same result as executing transactions one after the other, without any overlap.
424
What is a non-serialisable schedule?
A schedule that cannot be rearranged to produce a serial effect on the database.
425
What does the transaction management problem involve?
Receiving a set of concurrent transactions as input and producing a serialisable schedule as output.
426
What factors must be considered in transaction management?
* Number of concurrent transactions * Throughput * Consistency
427
What are the two main methods for creating serialisable schedules?
* Locking (and 2PL) * Timestamp Ordering
428
What is the purpose of locking in transactions?
Locks data items that are currently being accessed by a transaction to prevent others from accessing/modifying the same data until the lock is released.
429
What happens when a transaction attempts to access a locked data item?
The requesting transaction waits until the lock is released or if it is aborted.
430
What is a shared lock?
A lock for reading that allows multiple transactions to hold shared locks on the same data item simultaneously.
431
What is an exclusive lock?
A lock for writing that allows only one transaction to hold an exclusive lock on a data item.
432
What are the lock operations?
433
What is a lock table?
A table maintained by the DBMS to track the lock status of each data item.
434
What does lock compatibility determine?
Whether a lock request can be granted or not.
435
What is Two-Phase Locking (2PL)?
A concurrency control technique that guarantees serialisable transactions.
436
What are the two phases of 2PL?
* Growing Phase: Transactions obtain locks on data items * Shrinking Phase: Transactions release the locks they have acquired
437
What is a deadlock?
A situation where two or more transactions are waiting for each other to release a lock on an item.
438
What are the conditions for a deadlock to occur?
* Concurrency * Hold * Wait * Mutual dependency
439
What is deadlock prevention?
Every transaction locks all items it needs in advance; if an item can't be obtained, no items are locked.
440
What is deadlock detection and reversal?
Detecting the deadlock and aborting one of the involved transactions.
441
What is a Wait-For Graph?
A directed graph where each executing transaction is a vertex, and an edge exists if one transaction is waiting on another.
442
What happens if a transaction waits too long for a resource?
The system assumes that the transaction is deadlocked and aborts it.
443
How does the granularity of data items influence deadlocks?
Locking larger units reduces concurrency but can prevent deadlocks; smaller units increase concurrency but can lead to more deadlocks.
444
What is timestamp ordering?
An alternative to locking that aims to ensure serialisable schedules based on transaction timestamps.
445
What is the fundamental principle of timestamp ordering?
Conflicting operations are executed in order of their transaction timestamps.
446
What happens when a transaction attempts to perform a read operation when using timestamp ordering?
If TS(T) >= write-TS(X), the read operation is executed; otherwise, the transaction is aborted.
447
What happens when a transaction attempts to perform a write operation?
If TS(T) >= read-TS(X) and TS(T) >= write-TS(X), the write operation is executed; otherwise, the transaction is aborted.
448
What is the purpose of recovery in database systems?
To ensure Atomicity and Durability in the presence of failures/crashes.
449
What is Logging in the context of database operations?
A persistent record of changes made during DB operation.
450
What type of files does the log manager use for Logging?
Append-only files.
451
How is recovery made?
By reading from the log.
452
What is Undo Logging?
A type of logging used to repair a DB by undoing incomplete transactions after a crash.
453
What must be written to disk before a transaction modifies a database item?
must be written to disk.
454
What happens during recovery if `` hasn't been seen?
The value of X is set back to old_value, and incomplete transactions are undone.
455
List the two rules of Undo Logging.
* U1: A log record `` must be written before outputting new value. * U2: A `` log record must be written after all changes are output.
456
What is the issue with rule U2 in Undo Logging?
It potentially causes more disk I/O operations.
457
What does Checkpointing with Undo Logging do?
Creates a point ensuring all transactions before that point have committed or aborted.
458
What is the new log record type introduced in Checkpointing?
``.
459
What are the steps in Quiescent Checkpointing?
* Stop accepting new transactions. * Wait for active transactions to commit or abort. * Flush the log. * Write a checkpoint record `` to the log. * Flush the log again. * Resume accepting transactions.
460
What is Nonquiescent Checkpointing?
Allows new transactions to enter during the checkpoint process.
461
What are the new record types in Nonquiescent Checkpointing?
* `` * ``.
462
What happens if appears latest during recovery?
Disregard the log before the previous ``.
463
What must be done if `` appears latest?
Disregard the log before the start of the earliest incomplete transaction.
464
What is Redo Logging?
Reapplies changes made by transactions that successfully committed before a crash.
465
What is the purpose of Redo Logging?
Ensures that the effects of committed transactions aren’t lost.
466
What is the record type used in Redo Logging?
.
467
What is the rule for Redo Logging?
Before modifying a database item X on disk, all log records related to the modification (``, ``) must be written to disk
468
What are the steps in Checkpointing with Redo Logging?
* Write log record `` and flush log. * Write to disk all database items that have been written to buffers by committed transactions. * Write log record `` and flush log.
469
What is the recovery process if `` is found?
Ignore changes made by transactions that committed before the corresponding ``.
470
What should be done if `` appears during recovery?
Search back to the previous ``.
471
What is an I/O Bottleneck?
The situation where DBMS performance is primarily limited by the time taken to access secondary storage (HDD/SSD) ## Footnote Latency of storage, particularly writing to the disk, is the most expensive part of all DB operations.
472
Define Monolithic Architecture in DBMS.
Has a single processor, tasks may be interleaved but does not achieve true parallelism/concurrency, a single bank of memory, a single buffer pool, and a single disc ## Footnote DB stored on a single storage device.
473
What characterizes Shared Memory Architecture?
Tightly coupled, employs a symmetric multiprocessor (SMP) system, multiple processors for executing tasks in parallel, single global memory, and a single buffer ## Footnote Creates a single point of failure.
474
List the key features of Shared Disc Architecture.
* Multiple processors for parallel execution * Distributed memory, each processor has its own local memory * Loosely coupled, processors communicate through an interconnection network or a switch * Multiple discs
475
What are the key features of Shared-Nothing Architectures?
* Massively parallel * Loosely coupled, processors operate independently * Distributed memory, each processor has its own private memory * Each processor owns part of the data * A data page resides in the buffer pool of only one local memory
476
What challenges are associated with Shared-Nothing Architectures?
* How to divide a query/transaction among nodes * How to partition data across nodes * How to keep the data partition balanced * How to control concurrency and avoid deadlocks * What to do if a node fails/crashes
477
What is the primary motivation behind Parallel Query Processing?
To overcome the I/O bottleneck by splitting the processing and accessing of data across multiple processors and disks.
478
How is work divided in Parallel Query Processing?
Work is divided among several processors, usually with a coordinator process managing execution and multiple worker processes performing parts of the query in parallel.
479
Define Inter-Query Parallelism.
Involves executing different queries concurrently on different processors.
480
Define Intra-Query Parallelism.
Involves decomposing a single query into sub-queries to be executed in parallel on different processors.
481
What is Horizontal Parallelism?
An operator is executed in parallel on different subsets of the data.
482
What is Vertical Parallelism?
The output of one sub-operator is passed as input to another sub-operator, allowing them to execute concurrently.
483
What is Intra-operator (horizontal) parallelism?
Operators decomposed into independent operator instances, performing the same operation on different subsets of data.
484
What is Inter-operator (vertical) parallelism?
Operations are overlapped, allowing data to be pipelined from one stage to the next without materialisation.
485
Define Bushy (independent) parallelism.
Subtrees in query plan executed concurrently.
486
What is partitioning?
Refers to how we divide data ## Footnote Aims to reduce execution time and balance I/O load evenly across processors
487
What factors are considered in partitioning?
* Number and capacity of available nodes (not processors) * Query workload * Network topology and distance
488
What is Round Robin Partitioning?
Uses modulo calculation to distribute data randomly ## Footnote Can lead to related data being spread across many nodes, making certain queries slow
489
What is Hash Partitioning?
Applies a hash function to a subset of attributes of a row to determine storage node ## Footnote Guarantees balanced distribution of rows across nodes
490
When is Hash Partitioning efficient?
Efficient for exact-match queries ## Footnote Example: Fast for WHERE StudentId=X, slow for WHERE Exam > 40
491
What is Range Partitioning?
Allows more control over row placement and can optimize queries on the chosen attribute ## Footnote Effectiveness depends on having stable, predictable data ranges
492
What are the potential downsides of Range Partitioning?
Can lead to imbalanced partitions depending on data distribution
493
What is the purpose of rebalancing data?
To keep partitions balanced
494
What is Data Shipping?
Worker nodes send relevant data to a central coordinator process ## Footnote The coordinator performs main query operations on aggregated data
495
How does Data Shipping work with the query SELECT StudentID WHERE Exam>50?
Each worker sends its 25,000 tuples to the coordinator, which processes the combined 100,000 tuples
496
What is Query Shipping?
Coordinator sends the query to worker nodes, which execute it locally and send results back ## Footnote The coordinator combines partial results to produce the final answer
497
What is a key advantage of Query Shipping?
Can significantly reduce network traffic if the query is selective
498
What is a challenge associated with Query Shipping?
Decomposition for some operators or queries can be difficult
499
How does data partitioning affect Query Shipping?
Effectiveness heavily depends on how the data is partitioned
500
What is the importance of cardinality estimation in query processing?
Crucial for making informed decisions about shipping strategy efficiency
501
What is a key difference in parallelism between Data Shipping and Query Shipping?
Query shipping leverages parallelism of worker nodes, while data shipping may bottleneck at the coordinator
502
What are the processing cost assumptions in a parallel architecture?
* Scan and union are free * Select costs are Card(R) * Aggregate costs are Card(R) * Join costs depend on indexing
503
What does Ship(R) cost in a parallel architecture?
Costs Card(R) to ship relation R to another node
504
What does Broadcast(R) cost in a parallel architecture?
Costs Card(R) due to parallel shipment
505
What is the significance of visualisation caveats in parallel architecture?
Operations taking longer should be depicted with wider blocks for clarity
506
What does cost estimation in parallel databases involve?
Optimizing a query plan as if on a single processor
507
What is a No-Join Query?
Involves operations where no join between relations is needed, such as aggregations or selections within a single table.
508
Define Co-Located Join.
Used wwhen the relations that need to be joined are partitioned on the same key attribute, so corresponding tuples from the two relations reside on the same node.
509
What is a Directed Join?
Used when the relations to be joined are partitioned on their respective join key attribute, but not necessarily the same key, so co-location is not guaranteed.
510
What is a Broadcast Join?
Used when one of the relations to be joined is relatively small compared to the other.
511
Explain Repartitioned Join.
Used for when the relations are large and not partitioned on the join key.
512
What does reliability in a system refer to?
The ability of a system to maintain data integrity and consistency and to recover from failures.
513
What is Durability in the context of transactions?
Ensures that once a transaction is committed, the changes to the database are persistent.
514
Define Atomicity.
Requires that for any transaction, either all operations are executed successfully, or none are.
515
Why are logging and recovery mechanisms important?
They are vital for achieving durability and atomicity.
516
What does `` indicate?
Transaction T has started execution.
517
What does `` mean?
Transaction T has completed successfully and will make no further changes to database items.
518
What does signify?
Transaction T could not complete successfully. No changes made by T will be copied to disk.
519
What is the role of the coordinator in 2PC?
Involves a coordinator and one or more worker nodes to ensure atomicity of transactions.
520
Describe the Voting Phase in 2PC.
Coordinator sends a 'prepare T' message to all worker nodes, which execute their part and send back a 'vote-commit T'/'vote-abort T' message.
521
What happens in the Decision Phase of 2PC?
Coordinator analyzes votes from workers and makes a commit or abort decision.
522
What is a Commit Decision?
If all workers vote-commit, the coordinator commits the transaction and sends commit messages to all workers.
523
What is an Abort Decision?
If any worker votes-abort, the coordinator aborts the transaction and informs all workers.
524
What triggers the Termination Protocol?
Activated when a timeout occurs.
525
What is the Cooperative Termination Protocol?
Assumes participants are aware of each other and tries to find out the coordinator's decision after a timeout.
526
What is the Recovery Protocol?
Initiated when a coordinator or participant restarts after a crash.
527
Define Presumed-Abort variant of 2PC.
Allows the coordinator to forget about transactions if the global decision is to abort.
528
What is the Presumed-Commit variant?
Assumes that if no information about a transaction is in memory, it must have been committed.
529
Fill in the blank: A log is a _______.
persistent, append-only record of changes.
530
True or False: The Voting Phase is the second phase of 2PC.
False.
531
What does the log record indicate?
It indicates that a worker is ready to commit or abort changes locally.
532
What is a distributed system?
A system consisting of multiple machines that are far away from each other, controlled by the same organization, typically in different data centers.
533
What is a characteristic of a distributed system?
Characteristics include: * Multiple machines (>50) * Homogeneous data format (relational) * Same hardware across machines * No reliance on a central site.
534
What is the goal of fragmentation in a distributed system?
To break down large databases into smaller, more manageable units.
535
What are the requirements for fragments in fragmentation?
The fragments should be: * Disjoint * Complete.
536
What does disjoint mean in the context of fragmentation?
No tuple of the global relation appears in more than one fragment.
537
What does complete mean in the context of fragmentation?
Every tuple in the original global relation must belong to at least one of the fragments.
538
What are the goals of distributing a system?
Goals include: * Reduce overall execution time * Balance I/O load evenly across nodes.
539
What are the 2 approaches for fragmentation?
540
What factors should be considered when distributing a system?
Factors include: * Number and capacity of available nodes * Query workload * Network topology and distance.
541
What is a reconstruction program?
542
What is localisation, in the context of distributed query processing?
It deals with how a query formulated against a global, non-distributed database schema operates on fragments of data across different nodes.
543
What is a naive approach in localisation, in the context of distributed query processing?
Involves substituting localization programs for relations in the original query, potentially leading to expensive operations.
544
What are reduction techniques in distributed databases?
Techniques employed to optimize queries by reducing the amount of data transferred across the network.
545
What reduction techniques are used for horizontal fragmentation?
Reduction with selection Reduction with join
546
What reduction techniques are used for vertical fragmentation?
Redution with projection
547
What is selection reduction?
Applied when a query includes a selection operation (WHERE), identifying irrelevant fragments to avoid accessing them.
548
What is join reduction?
Optimizes join operations by identifying pairs of fragments whose join will be empty and avoiding unnecessary joins.
549
What is projection reduction?
Aims to reduce data processed and transferred by eliminating unnecessary fragments for a query involving a projection operation.
550
For distributed joins, where do we perform the join?
551
What is semijoin reduction?
Moves only the part of one relation needed for the join.
552
Fill in the blank: The machines in a distributed system should be _______.
self-sufficient.
553
True or False: In a distributed system, there should be reliance on a central site.
False.
554
What is the purpose of a localization program in distributed databases?
To reconstruct global relations using expressions derived from fragments.
555
What does Two-Phase Lock (2PL) guarantee in distributed databases?
Serialisability, the highest isolation level ## Footnote 2PL is essential for preserving isolation in distributed DBs.
556
What are the two types of Two-Phase Locking?
* Centralised 2PL (C2PL) * Distributed 2PL (D2PL) ## Footnote C2PL uses a single site for lock management, while D2PL has lock managers at each site.
557
How does Centralised 2PL (C2PL) manage lock requests?
The transaction manager sends a lock request to the central lock manager ## Footnote The central lock manager decides whether to grant the lock.
558
What is a major drawback of Centralised 2PL?
Single point of failure and potential bottleneck ## Footnote Every lock request and release must go through the central lock manager.
559
In Distributed 2PL (D2PL), where are lock requests sent?
To the local lock manager at each participant site ## Footnote Each participant site has its own lock manager.
560
What occurs once a data processor finishes executing an operation in D2PL?
It sends an 'end of operation' message back to the transaction manager ## Footnote This informs the transaction manager that the operation is complete.
561
What is deadlock in the context of distributed systems?
Occurs when 2 or more transactions are waiting for each other to release a lock on an item ## Footnote Deadlocks can severely impact system performance.
562
What are the three conditions for deadlock occurrence?
* Concurrency * Hold * Wait ## Footnote These conditions must be met for a deadlock to occur.
563
What is a Wait-For Graph (WFG)?
A directed graph that represents the dependencies between transactions ## Footnote It helps in detecting deadlocks.
564
What indicates the presence of a deadlock in a Wait-For Graph?
The presence of a cycle ## Footnote A cycle in the WFG signifies that transactions are waiting indefinitely.
565
What are the two types of Wait-For Graphs?
* Local WFG * Global WFG ## Footnote Local WFGs are per site, while Global WFGs consider all transactions.
566
What is one technique for deadlock prevention?
Transaction pre-declaration ## Footnote Transactions declare all accessed data items in advance. TM only locks if all items are available.
567
What is resource ordering in deadlock avoidance?
Requiring transactions to always access resources in a predefined order ## Footnote This can be challenging in dynamic databases.
568
What does transaction prioritisation involve?
Using timestamps to decide which transaction to abort when a lock request is denied ## Footnote Rules like WAIT-DIE and WOUND-WAIT are examples of this approach.
569
What happens in the WAIT-DIE rule?
An older transaction waits for a younger one, while a younger one is aborted if it requests a lock held by an older one ## Footnote This helps to manage transaction priorities.
570
What is Centralised Deadlock Detection?
One site acts as the deadlock detector, checking for cycles in the Global Wait-For Graph ## Footnote Each site sends its Local WFG to the central detector.
571
What is one drawback of Centralised Deadlock Detection?
Single point of failure ## Footnote If the deadlock detector fails, deadlocks may go undetected.
572
What is the function of Hierarchical Deadlock Detection?
Organizing deadlock detectors in a hierarchy for monitoring ## Footnote Local detectors report to higher-level detectors.
573
What is Distributed Deadlock Detection?
Responsibility for deadlock detection is shared among sites ## Footnote This allows for a more robust detection mechanism.
574
What is a simple method for deadlock resolution?
Using timeouts to abort transactions waiting too long for a resource ## Footnote This method assumes that long wait times indicate a deadlock.
575
What is replication in the context of data management?
Replication is an extension of the fragmentation problem that involves creating multiple copies of data across different geographical locations.
576
List the reasons for using replication in data management.
* Latency reduction * Availability * Resilience * Performance
577
How does replication reduce latency?
By storing copies of the data closer to users in different geographical areas.
578
What is the impact of replication on system availability?
It increases availability by having multiple copies; if one replica fails, data can be accessed from other replicas.
579
Explain how replication contributes to resilience.
If a node containing a replica fails, transactions can be rerouted to other nodes with copies of the required data.
580
What benefit does replication provide in terms of performance?
It balances the read workload across multiple replicas, reducing bottlenecks and increasing throughput.
581
What is a key feature of each replica in a replicated system?
Each replica has its own transaction management system.
582
Define Strong Mutual Consistency.
All copies of an item have the same value after the execution of an update transaction.
583
Define Weak Mutual Consistency.
All copies of an item will eventually have the same value after the execution of an update transaction.
584
What does the term 'epsilon' refer to in data consistency?
Epsilon defines a bound on the allowed inconsistency, specifically the number of missing writes.
585
True or False: Inconsistent reads are never allowed in a replicated system.
False
586
Fill in the blank: The first read is made before A is updated in the right DB but is allowed because only _______ write missed.
1
587
What is the Time Bound consistency criterion?
It allows reads as long as they are within a defined bound of time units.
588
What is Time Drift in the context of data consistency?
It considers the average/combined temporal difference across multiple items accessed in the same transaction.
589
What is the significance of mutual consistency not being equal to serialisability?
Mutual consistency means replicas have the same value, but does not ensure that updates occurred in a single, step-by-step order.
590
Define Conflict-Free Replicated Data Types (CRDT).
A type of data object that can be used in replicated systems to facilitate lazy distribution of updates without leading to conflicts.
591
List the three key properties that make operations on CRDTs conflict-free.
* Associative * Commutative * Idempotent
592
How do CRDTs benefit replicated systems?
Updates can be lazily distributed without immediate propagation, ensuring the final state is consistent regardless of update order.
593
What is the advantage of lazy distributed replication?
It is practical for systems with network latency or intermittent connectivity.
594
What is churn in the context of decentralised systems?
The dynamic rate of participation of nodes within the network.
595
How does churn affect decentralised systems?
It makes fragmentation and replication of data more challenging.
596
What characterizes unstructured P2P networks?
Extreme autonomy with no control of the network topology.
597
What is a key design principle of unstructured P2P networks?
The lack of a central overseer.
598
List two advantages of unstructured P2P networks.
* Very resilient * Supports maximum autonomy
599
List two disadvantages of unstructured P2P networks.
* Unpopular items are not replicated enough * Enormous communication cost
600
What is the purpose of a routing table in decentralised systems?
To manage data and queries.
601
What does a routing table describe?
The range of hash keys stored in a specific node.
602
What is the challenge of maintaining a full routing table at each node?
It becomes expensive due to the need for synchronisation.
603
What is an overlay network?
A network where nodes connect to virtual nodes to control topology.
604
Name three overlay network geometries.
* Tree * Hypercube * Ring
605
What is the Patry Algorithm?
Each node maintains a routing table that stores the address of one node representative of a different prefix.
606
What is a Sybil attack?
Creation of multiple identities by an attacker to manipulate voting processes.
607
How does Proof of Work (PoW) counter Sybil attacks?
By shifting validation from identity count to computational effort.
608
What is a drawback of Proof of Work?
Difficulty tuning the puzzle can affect transaction confirmation speed.
609
What is Proof of Stake (PoS)?
A system where participants stake their own economic value to become validators.
610
What happens to validators in PoS if they are found cheating?
They lose their stake.
611
What are some requirements of Proof of Stake?
* Requires validators to reveal identity * Openly auditable transactions
612
What is needed to mitigate channel attacks in PoS?
Cryptography.