Query processing Flashcards

1
Q

How are each of the acid properties fulfilled?

A

A: undo/redo logging
C: serialisable schedules
I: serialisable schedules and isolation levels
D: redo/undo logging, recoverable schedules

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is query processing used for?

A

Here we’re taking in queries and making a sequence of operations we can actually do on a database and then executing these on the physical hard disk in the execution engine

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What does it mean that SQL queries are declarative

A

They tell the DBMS what we want, not how to get it

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How Does Query Processing Work?

A
  • Preprocessing might involve figuring out how tables are involved etc..
  • Logical query (for the purpose of this course is relational algebra)

-Optimising by modifying it.

-physical query plan involves choosing between different algorithms

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a query plan?

A

A relational algebra expression that is obtained from an SQL Query
- they tell us exactly how to compute the result to a query

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How are query plans typically represented?

A
  • As trees
  • leaves are input relations
  • inner nodes are operators
  • they are evaluated from the leaves to the root (so bottom-up approach)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Why do DBMSs use relational algebra over SQL for query plans?

A
  • because we can then use equivalence laws to generate more optimised query plans
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How do we apply the select operation on a relation R

A

For each tuple t in R, if t satisfies condition, output t
- this requires reading the whole file
- so is very slow

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How is a naive computation of a natural join carried out?

A
  • A nested loop is used to do the natural join, so for each tuple in a table you look at each tuple in the other table and if they have the same value for all the common attributes you output the tuple obtained by combining these two tuples (rows)
  • for each tuple you have to read the entire file on the other relation to check if they match
  • this is very very slow
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is the run time of the Naïve Computation of Joins on relations S and R

A

|R| * |S| = number of tuples in R times number of tuples in S

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are equijoins?

A

Equijoins are just a variant of natural joins, but equijoins allow you to have two columns with the same output whereas natural join removes one of them

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the run time of an equijoin?

A
  • |R| + |S| + run time equal to size of the output
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the rules for doing an equijoin?

A
  • start at first row in attribute A, if the values of A and B match, we output this to the joined table, if they don’t match, we increase A (as in go to the new row down) until it’s equal to B
  • when A and B are equal, we output this to the joined table, then we increase B
  • if when we increase B they are still equal we output this to the joined table. If they’re not equal we go back to increasing A.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What happens if we’re doing the equijoin of R and S and column A has a lot of duplicated?

A

The size of the output will be very big (not efficient)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What sorting algorithm do we use to sort R by A and S by B and how does this affect the run time?

A
  • we use a merge sort which gives a run time of O(log2(R))
  • much faster than a nested loop
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is the run time of an equijoin (including sorting) if there’s not too many duplicates in A?

A

O(|R|log2|R| + |S|log2|S|)
- we can leave out the size of the output because this is linear so will be insignificant as the size of R and S increase (only if there aren’t too many duplicate rows in A)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Which algorithm is typically faster? merging sort algorithm or nested loop?

A

Merging sort algorithm
- exponential as opposed to quadratic

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Why is it important that we take into account that relations are stored on the hard disk?

A
  • RAM is often no big enough to hold all of a database in it at once
  • though each time you access the disk it takes a lot of time because you have to find the exact location of the item you need
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Can we read an entire relation in a faster way?

A

Yes, selection can be done faster if we know where to find the rows for G401
- this can be done by sorting and indexing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What does an index do?

A

an index addresses one or more attributes and corresponds to some table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is a primary index?

A

It points to the location of record on the disk and the data is sorted based on value
- typically the primary index is exactly what you define with primary keys

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What is a secondary index?

A
  • It points to the location of records on a disk
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What is the run time in regards to searching for an item in the table of indexes?

A
  • We can use a binary search for this which has time log2K where K is the size of the table
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What is the run time for then going through each row that is listed in the index?

A
  • this is just the number of items associated to the index specified
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

How can we make querying even faster if we’re using an index and the condition specifies two different attributes?

A
  • instead of just using an index for only one of the items then having to go through all of those rows
  • we can search through two different index tables, each one applying to one of the two specified attributes, then take the intersection of the union of the rows selected from both index tables.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What are advanced indicies?

A

Index tables that feature indexes that apply to multiple attributes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What are the two common types of indexes? and what are the SQL commands used for them?

A
  • B+ Trees
    CREATE INDEX ON students USING btree (programme, year)
  • Hash Tables
    CREATE INDEX ON students USING hash (lower(name));
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

When is it good to use B+trees for indexing?

A
  • If the selection we are after specifies the range (and this is one of the most widely used index)
  • so for instance a range such as having the attributes program = G401 and year < 1
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

B+ trees are a specific kind of index, what index is this and what does it mean?

A
  • B+ trees are a specific kind of index known as a multilevel index which means there are indexes over indexes.
  • In a multilevel index you can have any number of these levels and just branch out more and more.
  • Each level i index points to i+1 and the final level of index points out to the hard disk with the actual records where the information is stored.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

What is stored in the leaves of B+ trees?

A
  • a leaf is formed of two rows
  • On the top row we have values and on the bottom row we have their corresponding pointers which point to tuples with that value. - We also have a next leaf pointer
  • These values, that was also the case for the multilevel index, are sorted increasingly in order.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

How is the value n chosen for B+ trees and what does it represent?

A

n = chosen so that a node fits into a single disk block

32
Q

What is the size of n, if the disk block size is 512 bytes, the values are 4 byte integers and the pointers are 8 bytes?

A

for every value there must be a pointer and then we must add on the next leaf pointer (which doesn’t have a value)
- so we do:
n can be 42 because
42 * (8+4)

33
Q

What is stored in the inner nodes of B+trees?

A
  • same as leaves in the fact they have two rows
  • however instead of the first row being values they are pointers, so the pointer point to pointers (which eventually whittle down to values)
  • the pointers point to other nodes instead of tuples
34
Q

in what order are fields in nodes filled?

A

From left to right
- not every field has to be filled

35
Q

What rule ensures that all nodes are at least half full?

A

A node must have at least (n+1)/2 pointers (rounded down) used (we always count the next leaf pointer even if there are no other nodes)

36
Q

How many pointers must the root node have?

A

At least two (or more)

37
Q

If n = 3 what is the least amount of nodes that must be in each node?

A

(n+1)/2 = 3+1 / 2 = 2

38
Q

If the max degree = 5 then what is the max number of values you can have in each index table?

A

max-1
- so 5 minus 1 = 4

39
Q

how do we look up values in a b+tree?

A

We just follow the values down until we reach the lowest level (which has pointers to tuples)
- once we find the value on the lowest level we can output the resulting tuple/tuples

40
Q

What happens if a range is specified when looking for values in a b+tree?

A

Multiple tuples that all satisfy the condition will be outputted in order (so from left to right, which is smallest to biggest)

41
Q

What do we do if we get a pointer that is not a value in the B+tree?

A
  • What we do is traverse down the tree normally until we reach the last level
  • We see that the pointer is not in the tree.
  • So we return nothing
42
Q

What is the run time for looking up a value in a B+tree?

A
  • O(hlog2n) real running time = O(HD)
  • h being the height of the tree
  • n is number of nodes
  • running time is to do with disk accesses so this is the height of the tree multiplied by d which is the time for a single disk operation to be carried out.
43
Q

How do we insert a Value/Pointer Pair if there is space in the node?

A

First we search for that value in the table, starting at the root, we traverse down to the bottom level, we insert it there if there is space in the node
- if it’s the smallest value in a node, we need to update the pointers in all the nodes above

44
Q

How do we insert a Value/Pointer Pair if there isn’t space in the node?

A

First we search for that value in the table, starting at the root, we traverse down to the bottom level
- since there’s no space in the node, we need to split it into two
- to do this you take half the nodes from the old node and put them in the new node
- however we now need a pointer to this node, so we update all the nodes in the levels above, add the smallest value of the new node to every level above

45
Q

How do we delete a node from a B+ tree if there are enough pointers left in each node after we delete the value?

A
  • We traverse down the tree as normal starting at the root, using a binary search to find the position of the value we want to delete
  • Then we remove the value. Now we need to run some checks that each node still contains at least two pointers.
  • if it does you only need to check that you didn’t delete the smallest value in a node, if you did then you need to adjust the pointers in the higher nodes
46
Q

How do we delete a node from a B+ tree if there aren’t enough pointers left in each node after we delete the value?

A
  • We traverse down the tree as normal starting at the root, using a binary search to find the position of the value we want to delete
  • Then we remove the value. We check that each node still contains at least two pointers.
  • since it doesn’t we need to steal a value from an adjacent node if it has more than the minimum amount of pointers required
47
Q

How do we delete a node from a B+ tree if there aren’t enough pointers left in each node after we delete the value?

A
  • We traverse down the tree as normal starting at the root, using a binary search to find the position of the value we want to delete
  • Then we remove the value. We check that each node still contains at least two pointers.
  • since it doesn’t we need to steal a value from an adjacent node if it has more than the minimum amount of pointers required
  • then we just update the higher level nodes if necessary
48
Q

What happens if we don’t have a pointer we can steal from an adjacent node when we’ve deleted a value from the B+tree?

A
  • we remove the value as usual and are left with a node that doesn’t satisfy the leaf condition
  • In this case there is only one adjacent node and it is not above the minimum pointer amount.
  • So we pick an adjacent node and merge them together
  • we then remove the empty node
  • we then update all the nodes in the high level to remove pointers pointing to the empty node
49
Q

What is the run time for deleting a value in a B+tree?

A
  • O(hlog2n) real running time = O(HD)
  • h being the height of the tree
  • n is number of nodes
  • running time is to do with disk accesses so this is the height of the tree multiplied by d which is the time for a single disk operation to be carried out.
50
Q

What is ensured since we keep satisfying the requirements on each internal node of the B+tree?

A
  • that the height is log2(n) of however many records we have.
51
Q

Why are B+trees a really useful method for indexes?

A
  • because most of the B+tree can be kept in memory so it’s efficient/quick to fetch from
52
Q

How do you calculate the block size of each level in the B+tree?

A

if it’s level 2 you do (n+1) x block size
if it’s level 3 you do (n+1) x (n+1) x block size
if it’s level 4 you do (n+1) x (n+1) x (n+1) x block size
and so on

53
Q

Why optimise query plans?

A
  • The initial query plan might not be the optimal one.
54
Q

Finish the sentence: The more tuples in a table ___

A

The the longer it will take to complete the cross product.
- that’s why the method of selecting the tuples with stores in Liverpool first is a faster, more efficient approach

55
Q

What two operations can be combined to make querying more efficient?

A

Can combine a select operator together with a cross product into an equijoin then use a sort join to join them even faster.

56
Q

So how do we optimise query plans?

A
  • first we evaluate a query plan (bottom up) because efficiency depends on size of intermediate results
  • so we rewrite the query plan so intermediate results are as small as they can be, based on equivalence laws
57
Q

If you have two selection queries joined by an AND or an OR operator how can you make them more efficient?

A

you can split them up and move them around, such as moving one inside the other and this can improve the query speed (especially if you have indexes on both)
- so push selections as far down the tree as possible, this removes as many irrelevant tuples earlier on in execution and we can use indexing which is very fast

58
Q

If you have a selection on a cross-product then you can move the selection into the part this selection is about, so if you have a selection of A = a of R cross product with S and A is in R how can you make these operations more efficient?

A

You can move the selection into the part of R and this is quite a bit more efficient.
- push protections as far down the tree as well

59
Q

then if you have a = b where a is an attribute from R and b is an attribute from S and you’re doing the cross product of R times S

A

You can combine them into an equijoin

60
Q

What is the purpose of a physical query plan?

A
  • A physical query plan adds information required to execute the optimised query plan.
    such as:
  • Which algorithm to use for execution of operators?
  • How to pass information from one operator to the other?
  • May also insert additional operations such as sorting, in some cases as it can make things much more efficient (if done early).
61
Q

How do we go from Logical Plans To Physical Plans?

A
  • first we generate many different physical query plans
  • then we estimate the cost of execution of each plan
  • we select the physical query plan with the lowest estimated cost
62
Q

How do we estimate the Cost of Execution?

A
  • we focus on the number of disk access operations
  • and the important parameters of the database
63
Q

What factors influence the number of disk accesses

A

◦ Selection of algorithms for the individual operators (equijoin is faster than natural join etc)
◦ Method for passing information around between the different parts of your query
◦ Size of intermediate results - one of the most critical factors

64
Q

What are some important parameters of the database

A

◦ Size of relations
◦ Number of distinct items per attribute in each relation (we can estimate this using statistics or go through the whole database)

65
Q

What is the formula for estimating the Size of a Selection?

A

You can estimate the size of A = a where a is some constant of R as:
- the size of R divided by the number of distinct values of this column A in R

66
Q

When is the formula for estimating the Size of a Selection
accurate?

A
  • if each value in A occurs equally often
67
Q

What is the formula for estimating the Size of joins?

A
  • A simple estimate for a natural join based on the sizes of R and S and a max number of distinct values in the common attributes could be the size of R multiplied by the size of S, divided by the maximum number of distinct values for A in R or S
68
Q

How to generate physical query plans?

A

◦ one way: explore all possibilities? So for each algorithm see how well it runs then select the best of them. This will take a long time!
◦ More sensible approache: top-down/bottom-up method, you start from the bottom and move to the top and whatever had been best so far is likely to still be the most efficient so we continue with that.

69
Q

Why does join order matter?

A

Because it’s a variable in the size of the runtime

70
Q

What are the two general ways to pass information from one operator to another?

A

Materialisation: write intermediate results out to the hard disk, then read it in again when you need to use it
Pipelining - “stream-based processing algorithm”- basically two bars where you can combine one thing with another

71
Q

How does pipelining work to pass information from one operation to another?

A

◦ First operation passes the resulting tuples directly to the next operation without reading things out to the hard disk
◦ To do with we need extra buffer space for each pair of adjacent operations to hold tuples we’re currently passing along from one relation to the other

72
Q

Why is pipelining faster than materialisation?

A
  • Because it doesn’t involve writing to or reading from the hard disk which takes longer because it’s further away than from main memory and it stores more information you have to search through
73
Q

Which side is it more efficient to select for an equijoin on if one side has a primary key as the attribute to sort on?

A

Better to do it on the side with the primary key as this means there’s no duplicate values which results in less intermediate results

74
Q

Where is it best to insert a sort algorithm?

A

t’s smart to do the sort algorithm after a relation has been first queried with a select operation, as this will leave less tuples to sort through.

75
Q

How do we improve a protection operation?

A
  • By using pipelining (because this lessens disk accesses)