Query Processing Flashcards by Josh Kay

Query Processing

This is where we look at the query compiler and the execution engine. Together they are responsible for transforming SQL queries into sequences of database operations and then executing them.

Queries

In queries, we typically tell the DBMS what we want, not how to compute it.
We will mostly look at SELECT statements.

Relational Algebra

A set of operations that can be applied to relations to compute new relations (selection, projection, natural join, union, etc.).

Relational Algebra Expression

Often called a logical query plan, and often represented as a tree.

Logical Query Plan example

SELECT department, name
FROM Stores, Employees
WHERE department=worksAt AND city='Liverpool'

can be converted to

π_(department, name)(σ_(department=worksAt AND city='Liverpool')(Stores × Employees))

Selection (σ): keeps only the tuples that satisfy a given condition.

Projection (π): restricts each tuple to the attributes in the attribute list.

Renaming (ρ): represents one thing under another name (like using AS in SQL).

Cartesian product (×): pairs every tuple of one relation with every tuple of the other (like using SELECT DISTINCT * FROM R, S).

Query Plan steps

Compute an intermediate result for each node
For a leaf labeled with relation R, the intermediate result is R
For an inner node labeled with operator op, get the intermediate result by applying op to the children’s intermediate results.
Result of query -> intermediate result of the root.
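
The steps above can be sketched as a small recursive evaluator. This is only an illustration: the node classes (Leaf, Select, Project, Product), the evaluate function and the sample rows are all assumptions made for this sketch, with relations held as in-memory lists of dicts rather than on disk.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable


@dataclass
class Leaf:
    relation: list  # leaf labelled with a stored relation R


@dataclass
class Select:
    condition: Callable[[dict], bool]
    child: object


@dataclass
class Project:
    attributes: list
    child: object


@dataclass
class Product:
    left: object
    right: object


def evaluate(node):
    """Return the intermediate result of a plan node (hypothetical helper)."""
    if isinstance(node, Leaf):
        # Leaf: the intermediate result is the relation itself.
        return node.relation
    if isinstance(node, Select):
        return [row for row in evaluate(node.child) if node.condition(row)]
    if isinstance(node, Project):
        result = []
        for row in evaluate(node.child):
            projected = {a: row[a] for a in node.attributes}
            if projected not in result:  # set semantics: drop duplicates
                result.append(projected)
        return result
    if isinstance(node, Product):
        left_rows, right_rows = evaluate(node.left), evaluate(node.right)
        return [{**l, **r} for l, r in product(left_rows, right_rows)]
    raise TypeError(f"unknown plan node: {node!r}")


# Made-up sample data for the Stores/Employees query.
stores = [
    {"department": "Toys", "city": "Liverpool"},
    {"department": "Books", "city": "Leeds"},
]
employees = [
    {"name": "Ann", "worksAt": "Toys"},
    {"name": "Bob", "worksAt": "Books"},
]

plan = Project(
    ["department", "name"],
    Select(
        lambda r: r["department"] == r["worksAt"] and r["city"] == "Liverpool",
        Product(Leaf(stores), Leaf(employees)),
    ),
)

# The result of the query is the intermediate result of the root.
result = evaluate(plan)
print(result)
```

With this sample data the only surviving row pairs the Toys store in Liverpool with Ann.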

Equijoins

Written as R ⋈_a=b S.
Joins each tuple of R with every tuple of S that agrees on the given attributes (in this case, R.a = S.b).

Merging (merge sort)

Way of understanding equijoin (with R sorted on a and S sorted on b):
1) Compare the current values of a and b. If a is smaller than b, move down in R; if b is smaller than a, move down in S.
2) If a and b are equal, merge the two tuples into one output tuple and move down in S.
3) Repeat until one of the tables is exhausted.
If the value of a is duplicated in R, remember where the matching run in S started and go through that run again for each duplicate.
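
A minimal sketch of the merging idea, assuming both relations fit in memory as lists of dicts. The function name sort_merge_join and the sample rows are made up for illustration; a real DBMS would work block by block on disk.

```python
def sort_merge_join(R, S, a, b):
    """Compute the equijoin of R and S on R.a = S.b by sorting, then merging."""
    R = sorted(R, key=lambda t: t[a])
    S = sorted(S, key=lambda t: t[b])
    result = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][a] < S[j][b]:
            i += 1  # a is smaller: move down in R
        elif R[i][a] > S[j][b]:
            j += 1  # b is smaller: move down in S
        else:
            value = R[i][a]
            # Find the end of the run of S tuples that share this value.
            j_end = j
            while j_end < len(S) and S[j_end][b] == value:
                j_end += 1
            # Each duplicate in R goes through the same run of S again.
            while i < len(R) and R[i][a] == value:
                for k in range(j, j_end):
                    result.append({**R[i], **S[k]})
                i += 1
            j = j_end
    return result


R = [{"a": 2, "x": "r2"}, {"a": 1, "x": "r1"}, {"a": 2, "x": "r2b"}]
S = [{"b": 2, "y": "s2"}, {"b": 3, "y": "s3"}, {"b": 2, "y": "s2b"}]

joined = sort_merge_join(R, S, "a", "b")
for row in joined:
    print(row)
```

Here the value 2 occurs twice in both tables, so the join produces four rows; the tuples with a=1 and b=3 have no partner and are skipped.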

Fast Join Algorithms

The sort join algorithm is a fast way of computing R ⋈_a=b S: sort R on a and S on b, then merge them.

Other Join Algorithms

Index join
Hash join
Multiway join

Relations on disk

Relations are stored on disk because RAM is usually not big enough to hold everything, but accessing data on disk takes much longer than accessing it in RAM.

Disk parameters

B = size of the disk block (512 to 4096 bytes)
M = number of disk blocks that fit into available RAM.

Reading a relation R

No. of elementary operations = O(|R|)
No. of disk accesses = O(|R|/B)

Sorting R on attribute A

No. of elementary operations = O(|R| log_2 |R|)
No. of disk accesses = O((|R|/B) log_2 (|R|/B))
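
To get a feel for these bounds, the sketch below plugs numbers into the formulas with the big-O constants dropped. It assumes constant-size records, so that B can be read as the number of tuples per block; the function names and the sample sizes are invented for illustration.

```python
import math


def read_cost(num_tuples, tuples_per_block):
    """Rough counts for reading R once (constants ignored)."""
    return {
        "elementary_ops": num_tuples,
        "disk_accesses": math.ceil(num_tuples / tuples_per_block),
    }


def sort_cost(num_tuples, tuples_per_block):
    """Rough counts for sorting R (constants ignored)."""
    blocks = math.ceil(num_tuples / tuples_per_block)
    return {
        "elementary_ops": num_tuples * math.log2(num_tuples),
        "disk_accesses": blocks * math.log2(blocks),
    }


# Assumed example: |R| = 2^20 tuples, 2^6 tuples per block.
reading = read_cost(2**20, 2**6)
sorting = sort_cost(2**20, 2**6)
print(reading)
print(sorting)
```

In this example the relation occupies 16384 blocks, so counting blocks rather than tuples cuts the number of disk accesses by a factor of 64 for reading and by more than that for sorting.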

O(|R|)

Since we are reading them one by one linearly.

O(|R| log_2 |R|)

O(n log n), the cost of a comparison-based sort such as merge sort (the sorting step of the sort join algorithm).

O(|R|/B)

We read whole blocks rather than individual records. This assumes that each record has a constant size, so each block holds a fixed number of records.

O((|R|/ B) log_2 (|R|/ B))

We are sorting all disk blocks, not individual records.

Index

Given the values for one or more attributes of a relation R, provides quick access to the tuples with these values.

Types of index

Primary -> defines how data is sorted on disk.
Secondary -> merely points to the location of records on disk.

Forms of index

B+ Trees
Hash Tables

B+ Trees

Good if the select condition specifies a range; the most widely used form of index.

Hash Tables

Good if selection involves equality only.
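
The difference between the two forms can be illustrated in memory. This is a sketch under loud assumptions: a Python dict stands in for a hash table index, and a sorted list searched with bisect stands in for a B+ tree (it is not a real B+ tree, it only shares the property that keys are kept in order). The sample rows and function names are made up.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

rows = [
    {"name": "Ann", "year": 2019},
    {"name": "Bob", "year": 2021},
    {"name": "Cat", "year": 2020},
    {"name": "Dan", "year": 2021},
]

# Hash-style index: year -> positions of matching rows (equality lookups only).
hash_index = defaultdict(list)
for pos, row in enumerate(rows):
    hash_index[row["year"]].append(pos)

# Ordered index: (year, position) pairs kept sorted, standing in for a B+ tree.
sorted_index = sorted((row["year"], pos) for pos, row in enumerate(rows))
keys = [key for key, _ in sorted_index]


def lookup_equal(year):
    """Equality selection via the hash-style index."""
    return [rows[p]["name"] for p in hash_index.get(year, [])]


def lookup_range(low, high):
    """Range selection low <= year <= high via the ordered index."""
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return [rows[p]["name"] for _, p in sorted_index[start:end]]


print(lookup_equal(2021))        # ['Bob', 'Dan']
print(lookup_range(2020, 2021))  # ['Cat', 'Bob', 'Dan']
```

The hash-style index has no notion of neighbouring keys, so a range query would have to probe every possible year; the ordered index finds both ends of the range and reads everything in between.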

Virtual Columns

For an index over multiple columns (for example year and programme, rather than year alone), we create a virtual column that concatenates the values of those columns. The first column takes priority over the second in the ordering (similar to listing several columns in GROUP BY).
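
One way to picture the virtual column is as a tuple key, since Python compares tuples by the first component and only then by the second. The sketch below is an assumption-laden illustration: the student rows, the helper names and the use of a sorted list as the index are all made up.

```python
from bisect import bisect_left

# (name, year, programme): invented sample data.
students = [
    ("Ann", 2021, "CS"),
    ("Bob", 2020, "Maths"),
    ("Cat", 2021, "AI"),
    ("Dan", 2020, "CS"),
]

# Virtual column (year, programme): year has priority, programme breaks ties.
composite_index = sorted((year, programme, name) for name, year, programme in students)


def names_for(year, programme):
    """Lookup on both columns of the composite key."""
    start = bisect_left(composite_index, (year, programme))
    names = []
    for entry in composite_index[start:]:
        if entry[:2] != (year, programme):
            break
        names.append(entry[2])
    return names


def names_for_year(year):
    """Lookup on the first column only: still a contiguous run in the index."""
    start = bisect_left(composite_index, (year,))
    names = []
    for entry in composite_index[start:]:
        if entry[0] != year:
            break
        names.append(entry[2])
    return names


print(names_for(2021, "CS"))  # ['Ann']
print(names_for_year(2020))   # ['Dan', 'Bob']
```

A lookup on programme alone gets no help from this ordering, because entries for one programme are scattered across the different years.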

Heuristics

- Push selections as far down the tree as possible.
- Push projections as far down as possible, or insert projections where appropriate.
- Where possible, introduce equijoins in place of a cross product followed by a selection.
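
The effect of these heuristics can be seen by comparing two ways of answering the Stores/Employees query from earlier. The sketch assumes small in-memory lists with made-up rows, and the function names naive_plan and optimised_plan are hypothetical; each returns the query result together with the size of its largest intermediate result.

```python
from itertools import product

stores = [
    {"department": "Toys", "city": "Liverpool"},
    {"department": "Books", "city": "Leeds"},
    {"department": "Games", "city": "Liverpool"},
]
employees = [
    {"name": "Ann", "worksAt": "Toys"},
    {"name": "Bob", "worksAt": "Books"},
    {"name": "Cat", "worksAt": "Games"},
    {"name": "Dan", "worksAt": "Toys"},
]


def naive_plan():
    # Cross product first, then selection, then projection.
    pairs = [{**s, **e} for s, e in product(stores, employees)]
    selected = [
        r for r in pairs
        if r["department"] == r["worksAt"] and r["city"] == "Liverpool"
    ]
    result = {(r["department"], r["name"]) for r in selected}
    return result, len(pairs)


def optimised_plan():
    # Selection and projection pushed down onto Stores before joining.
    liverpool_departments = [s["department"] for s in stores if s["city"] == "Liverpool"]
    # Equijoin on department = worksAt instead of product + selection.
    joined = [
        (d, e["name"])
        for d in liverpool_departments
        for e in employees
        if d == e["worksAt"]
    ]
    return set(joined), len(joined)


naive_result, naive_size = naive_plan()
optimised_result, optimised_size = optimised_plan()
print(naive_size, optimised_size)  # 12 3
```

Both plans return the same three (department, name) pairs, but the naive plan builds all 12 combinations first. The nested loop here still compares pairs of tuples; the saving shown is only in the size of the stored intermediate result.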

Physical Query Plan

Adds information required to execute the optimised query plan. Things like:
- Which algorithm to use for execution of operators (naive selection or index selection; nested block join, sort join or hash join).
- How to pass information from one operator to another (write to disk, keep in memory, pipelining operators).
- A good ordering for computing joins, unions, etc.
- Additional operations such as sorting.

Using physical query plans for cost

We generate many different physical query plans and estimate the cost of execution for each plan (time, disk access, memory, etc.). We then select the plan with the lowest estimated cost.

Estimating cost of execution

Relies on the number of disk access operations.

What influences the number of disk operations

The number of disk accesses is influenced by many factors:
- Selection of algorithms for the individual operators.
- Method for passing information.
- Size of intermediate results.

Disk access operation parameters

- Size of relations.
- Number of distinct items per attribute per relation.

Selection example

For σ_a=b(R), where b is taken to be a constant value, the estimated size of the result is |R| / number of distinct values in column a of relation R.

Joins example

For R ⋈ S joined on attribute A, the estimated size of the result is (|R| × |S|) / max(number of distinct values for A in R, number of distinct values for A in S).
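
The two estimates can be written as small helper functions. The function names and the sample statistics below are invented for illustration; the formulas are the ones on the cards, and they give estimates rather than exact sizes.

```python
def estimate_selection_size(size_r, distinct_a_in_r):
    """Estimated size of a selection on column a: |R| / distinct values of a."""
    return size_r / distinct_a_in_r


def estimate_join_size(size_r, size_s, distinct_a_in_r, distinct_a_in_s):
    """Estimated size of R join S on A: |R| * |S| / max(distinct A in R, distinct A in S)."""
    return (size_r * size_s) / max(distinct_a_in_r, distinct_a_in_s)


# Assumed statistics: |R| = 1000 with 50 distinct values of A,
# |S| = 200 with 100 distinct values of A.
selection_estimate = estimate_selection_size(1000, 50)
join_estimate = estimate_join_size(1000, 200, 50, 100)
print(selection_estimate)  # 20.0
print(join_estimate)       # 2000.0
```

The join estimate of 2000 tuples is far below the 200000 tuples of the full cross product, which is why the number of distinct values matters for choosing a plan.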

How to generate physical query plans?

A sensible approach is to go either top-down or bottom-up:
- Selection of a suitable algorithm for each operator based upon the size of intermediate results.
- Selection of a good join order based upon the size of intermediate results.

Passing Information

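Either materialisation (write the intermediate result to disk before the next operator reads it) or pipelining (pass tuples directly to the next operator as they are produced).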
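
The contrast can be sketched with Python generators. This is an in-memory analogy under stated assumptions: a fully built list stands in for an intermediate result written to disk, a chain of generators stands in for pipelined operators, and the operator functions and sample rows are made up.

```python
def scan(relation):
    """Leaf operator: yields the tuples of a stored relation one at a time."""
    for row in relation:
        yield row


def select(rows, condition):
    """Selection operator: passes on only the tuples satisfying the condition."""
    for row in rows:
        if condition(row):
            yield row


def project(rows, attributes):
    """Projection operator: keeps only the listed attributes of each tuple."""
    for row in rows:
        yield {a: row[a] for a in attributes}


employees = [
    {"name": "Ann", "worksAt": "Toys"},
    {"name": "Bob", "worksAt": "Books"},
    {"name": "Dan", "worksAt": "Toys"},
]


def works_at_toys(row):
    return row["worksAt"] == "Toys"


# Materialisation: the whole output of the selection is stored before
# the projection starts (the list stands in for a file on disk).
materialised_selection = list(select(scan(employees), works_at_toys))
materialised_result = list(project(materialised_selection, ["name"]))

# Pipelining: operators are chained and each tuple flows through all of
# them before the next tuple is read; no intermediate result is stored.
pipeline = project(select(scan(employees), works_at_toys), ["name"])
pipelined_result = list(pipeline)

print(materialised_result)
print(pipelined_result)
```

Both approaches return the same tuples; they differ in whether the intermediate result ever exists in full.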
