How is the reduction size calculated in database projections?
Tip: The reduction factor depends on the number of columns selected and their individual sizes.
What is the Selection Operation in databases and how is its reduction size calculated?
Note: Reduction factor calculation assumes uniform distribution of attribute values.
What are the different types of Selection Operations and their reduction factors in databases?
How do you estimate the result size for the union of two tables in a database?
How is the result size estimated for the intersection of two tables in a database?
How is the result size estimated for the difference of two tables (R−S) in a database?
How is the result size estimated for a join on attribute Y between two tables? Which assumptions are made about the data?
Basic Estimation (Uniformity & Independence):
- For tables R and S joined on attribute Y (R.y = S.y).
- Formula (for a single value of Y): (TR / V(Y, R)) * (TS / V(Y, S)) = (1 / V(Y, R)) * TR * (1 / V(Y, S)) * TS
- V(Y, R) and V(Y, S): Column cardinalities of Y in R and S.
- TR and TS: Total number of tuples in R and S.
Summing Over All Values of Y:
- Extend calculation over all joinable values of Y by summing the formula.
Primary Key - Foreign Key Join (PK-FK Join):
- If the join on Y is between a primary key and a foreign key, the result size is determined by the distinct values of Y in the FK relation.
- Estimated Result Size: min(V(Y,R), V(Y,S)) * (1 / V(Y, R)) * TR * (1 / V(Y, S)) * TS, which simplifies to TR * TS / max(V(Y,R), V(Y,S)).
- Assumes FK values in one table are a subset of PK values in the other.
Formalization for PK-FK Join:
- |R ⋈ S| = min{ TS * TR * (1 / V(Y, R)), TR * TS * (1 / V(Y, S)) } = TR * TS / max(V(Y, R), V(Y, S))
Assumptions:
- Independence of columns.
- Uniform distribution of values in a single column.
- Inclusion assumption for FK: all FK values present in PK column.
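The estimate above can be sketched in a few lines. This is an illustrative helper (the function name and arguments are not from the source): it sums the per-value match count over the joinable values of Y under the uniformity, independence, and inclusion assumptions.

```python
def join_size_estimate(t_r: int, v_r: int, t_s: int, v_s: int) -> float:
    """Estimated |R join S| on Y.

    t_r, t_s: total tuple counts TR and TS.
    v_r, v_s: column cardinalities V(Y, R) and V(Y, S).
    """
    joinable = min(v_r, v_s)            # inclusion assumption: the smaller
                                        # value set joins with the larger one
    per_value = (t_r / v_r) * (t_s / v_s)  # expected matches for one Y value
    return joinable * per_value         # = TR * TS / max(V(Y,R), V(Y,S))

# PK-FK case: Y is the primary key of R, so V(Y, R) = TR and the
# estimate collapses to the size of the FK relation, TS.
print(join_size_estimate(1000, 1000, 5000, 800))  # → 5000.0
```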
What are histograms and their key characteristics in database systems?
What are Equi-Width Histograms and how can they be used for cardinality estimation?
Description: Histograms that divide the range of an attribute A into equal sub-ranges or buckets.
Implementation:
- Example: For a range 1-100, create 10 buckets each covering a range of 10 units (1-10, 11-20, etc.).
Cardinality Estimation:
- Estimate output cardinality of a range query by summing the number of tuples in relevant buckets.
- Assumes uniform distribution within each bucket.
- Example: For a query like ‘age > 35’, if the bucket 30-40 has 100 tuples, estimate 50 tuples for the range 35-40.
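A minimal equi-width histogram sketch (class and method names are illustrative, not from the source): B equal-width buckets over the value range, O(1) insertion by direct index computation, and range estimates that pro-rate each bucket's count by its overlap with the query, per the uniform-within-bucket assumption.

```python
class EquiWidthHistogram:
    def __init__(self, lo: float, hi: float, num_buckets: int):
        self.lo, self.hi, self.b = lo, hi, num_buckets
        self.width = (hi - lo) / num_buckets
        self.counts = [0] * num_buckets

    def insert(self, value: float) -> None:
        # O(1) maintenance: bucket index is computed directly from the value.
        idx = min(int((value - self.lo) / self.width), self.b - 1)
        self.counts[idx] += 1

    def estimate_range(self, a: float, b: float) -> float:
        # Sum each bucket's count weighted by its overlap with [a, b].
        total = 0.0
        for i, count in enumerate(self.counts):
            b_lo = self.lo + i * self.width
            b_hi = b_lo + self.width
            overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
            total += count * overlap / self.width
        return total

# The 'age > 35' example: the 30-40 bucket holds 100 tuples,
# so half of it is attributed to the sub-range 35-40.
h = EquiWidthHistogram(0, 100, 10)
h.counts[3] = 100  # bucket covering 30-40
print(h.estimate_range(35, 40))  # → 50.0
```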
What is the complexity and error of Equi-Width histograms?
Complexity:
- Construction: Single pass, O(N) time, O(B) size.
- Range-query Processing: Scan the histogram, O(B) time.
- Maintenance: Determine the bucket and increment/decrement in O(1) time.
Error:
- Can be arbitrarily bad; the error cannot be bounded.
What are the limitations and potential errors in using Equi-Width Histograms in databases?
Limitations:
- Assumes uniform distribution within each bucket, which might not always hold true in real data distributions.
Potential Error:
- Error in cardinality estimation can be arbitrarily bad due to uneven distribution of values within buckets.
- Unable to limit the error margin effectively in diverse data distributions.
What are Equi-depth Histograms and how can they be used for cardinality estimation?
- Description: Histograms dividing the range of an attribute A into sub-ranges where each bucket contains approximately the same number of tuples.
- Cardinality Estimation:
- Use buckets to approximate the number of tuples within a query range.
- Difference from Equi-Width:
- Buckets are defined based on tuple count, not value range.
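The construction and estimation steps can be sketched as follows (function names are illustrative): building the histogram sorts the data, O(N log N), and cuts it into B buckets of roughly equal tuple count; range estimates then pro-rate each bucket under the continuous-value assumption.

```python
def build_equi_depth(values, num_buckets):
    """Return a list of (lo, hi, count) buckets of ~equal depth."""
    data = sorted(values)               # O(N log N) construction
    n = len(data)
    buckets = []
    for i in range(num_buckets):
        start = i * n // num_buckets
        end = (i + 1) * n // num_buckets
        if start < end:
            buckets.append((data[start], data[end - 1], end - start))
    return buckets

def estimate_range(buckets, a, b):
    # Continuous-value assumption within each bucket: pro-rate by overlap.
    total = 0.0
    for lo, hi, count in buckets:
        span = max(hi - lo, 1e-9)       # guard against single-value buckets
        overlap = max(0.0, min(b, hi) - max(a, lo))
        total += count * min(overlap / span, 1.0)
    return total

# 100 uniform values in 4 buckets: each bucket holds 25 tuples.
buckets = build_equi_depth(range(100), 4)
print(estimate_range(buckets, 0, 100))  # → 100.0
```

Note the difference from the equi-width case: here bucket boundaries move with the data, so skewed regions get more, narrower buckets.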
What is the complexity and error of Equi-depth histograms?
Complexity:
- Construction: Requires sorting, O(N log N) time, O(B) size.
- Range-query Processing: Scan the histogram, O(B).
- Maintenance: Either rebuild the histogram periodically or adjust bucket counts incrementally, merging/splitting buckets as needed.
Error:
- Estimated error can be at most ±|R|/B, providing a more predictable error margin.
What are the important considerations when using histograms in database systems?
Bucketing Scheme:
- How to partition: Buckets can overlap or be recursive to better reflect data structure.
Bucket Statistics:
- What to store: Beyond bucket boundaries and counts, additional information can be included.
Intra-Bucket Approximation Scheme:
- How to estimate answers: Typically uses a continuous-value assumption for frequency counts, but other methods are also viable.
Class of Queries Answered:
- Suitable for range-count and point-count queries.
- Not applicable for general SQL; only specific query types are supported.