Indexing Flashcards

(46 cards)

1
Q

true or false: each index is associated with a search key

A

true

2
Q

true or false: search keys can consist of a set rather than a single value

A

true

3
Q

true or false: multiple records cannot have the same search key value

A

false

4
Q

true or false: you can have multiple indexes on a file, each with its own search key

A

true

5
Q

an index will have index entries, data entries and data records

what are the differences

A

the index entries direct the search: they filter on the search key and lead you to the data entries that match it. The data entries then either hold the data records themselves or point to them, depending on which of the 3 alternatives is used for storing data entries. The data records are the actual records in the file

6
Q

Data entries have 3 alternative structures; what are they and what are their properties?

A
  1. <k, data record> (by value)
    - no need to follow pointers
    - you have the full data stored in the data entry
    - at most one index on a given collection of data records can use this method, otherwise data records are duplicated and there will be redundant storage and potential inconsistencies
  2. <k, rid> (by reference)
    - data entries are typically smaller than data records, so this is better than alternative 1 when records are large
  3. <k, list of rids> (by list of references)
    - more compact than alternative 2, but leads to variable-sized data entries even if the search keys are fixed length
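
To make the three layouts concrete, here is a tiny Python sketch (not part of the card; the record, rid and key values are made up):

```python
# Toy illustration of the three data-entry alternatives; the rid is an assumed
# (page_id, slot_number) pair and the record is a plain dict.
record = {"id": 42, "name": "Alice"}      # a full data record
rid = (7, 3)                              # record id: (page id, slot number)

alt1 = (42, record)                       # 1. <k, data record>   -- by value
alt2 = (42, rid)                          # 2. <k, rid>           -- by reference
alt3 = (42, [(7, 3), (9, 0)])             # 3. <k, list of rids>  -- by list of references

print(alt1, alt2, alt3)
```
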
7
Q

indexed files can be sorted or hashed, how does that affect the way k* (data entries) are stored

A

k* are sorted by k for sorted indexed files and hashed by k for hashed indexed files

8
Q

what does it mean to be a unique index

A

Search key contains a candidate key (a key that can uniquely identify a record)

i.e. we get back at most one record for any search key value with this index

9
Q

sorted index files: what is a clustered vs unclustered index, define the differences

A

For a sorted index, data entries k* are always sorted by search key k

If the data records are also sorted by k, it is a clustered index; otherwise it is an unclustered index

Note
- A file has at most one clustered index (a table can only have one because the data can be physically ordered in only one way)
- Alternative 1 leads to a clustered index

10
Q

between clustered and unclustered indexes, which will have the lowest I/O

A

clustered, because data records with the same k are likely to be found on the same page, so fewer pages need to be read

11
Q

what is the difference between dense and sparse indexes

A

dense: there is at least one data entry for every record

sparse: not every record has its own data entry; typically there is one data entry per data page

12
Q

why does a sparse index always need to be clustered

A

because not every record has a data entry pointing to it; to find the records that lack pointers, their position must be implied, which is only possible if the data file is sorted on the search key
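
A small illustrative sketch, assuming one (first key, page number) entry per data page: the lookup only works because the file is sorted by the search key.

```python
# Sparse-index sketch (assumed page layout): one entry per data page, holding
# that page's first key. Finding a record relies on the file being sorted by k.
import bisect

pages = [[3, 7, 12], [15, 18, 22], [30, 41, 55]]        # data file, sorted by k
sparse_index = [(page[0], pno) for pno, page in enumerate(pages)]

def lookup(key):
    first_keys = [k for k, _ in sparse_index]
    pno = bisect.bisect_right(first_keys, key) - 1      # last page starting <= key
    if pno < 0:
        return False
    return key in pages[sparse_index[pno][1]]           # scan just that page

print(lookup(18), lookup(19))   # True False
```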

13
Q

true or false: alternative 1 leads to a dense index

A

true

14
Q

can a file have more than one sparse index, why or why not

A

no; the data file must be sorted on the search key for a sparse index to work, and a file cannot be sorted in 2 different ways at once

15
Q

what is a composite search key

A

a search key that contains more than one field

16
Q

for a sorted indexed file why does binary search not work when doing a range search

A

because the data pages are not guaranteed to be stored sequentially (contiguously) on disk, so we cannot jump straight to the "middle" page

17
Q

how do you do a range search with a sorted indexed file, what is the point of the values and pointers

A

build an index file over the (sorted) data entries; its pages contain values and pointers, and by comparing your search key against the values you follow the appropriate pointer, repeating until you reach the first qualifying data entry, then scan to the right for the rest of the range
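
A rough sketch of the idea, assuming a single index level whose values are the first key on each page of sorted data entries (page sizes and keys are invented):

```python
# Range-search sketch over sorted, paged data entries; the single index level
# and the page layout here are assumptions for illustration.
import bisect

entry_pages = [[5, 10, 11], [12, 20, 27], [33, 37, 40]]   # sorted data entries
index_values = [page[0] for page in entry_pages]          # value -> pointer to page

def range_search(lo, hi):
    start = max(bisect.bisect_right(index_values, lo) - 1, 0)
    result = []
    for page in entry_pages[start:]:                      # scan pages to the right
        for k in page:
            if k > hi:
                return result
            if k >= lo:
                result.append(k)
    return result

print(range_search(11, 33))   # [11, 12, 20, 27, 33]
```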

18
Q

in an ISAM tree index file, what is an overflow page

A

if the leaf page where a new entry belongs is full, the new entry is added to an overflow page chained to that leaf

these pages are not sorted

19
Q

true or false: in an ISAM tree, even after a deletion, the index pages remain unchanged

A

true; the primary file structure never changes, but an overflow page that becomes empty is deallocated

20
Q

true or false: overflow pages in an ISAM tree keep their allocated space after all their data has been deleted

A

false, space gets deallocated

21
Q

in an index file, where are the data entries contained

A

in the leaf pages

22
Q

in an ISAM tree, why do we not need next-leaf pointers?

A

because leaf pages are allocated sequentially, so the next leaf is simply the next page

23
Q

are ISAM trees height balanced?

24
Q

How is
1. file creation
2. searching
3. inserting
4. deleting

done in an ISAM tree structure

A

file creation: leaf pages are allocated sequentially and sorted by search key; then the index pages are allocated, then space for overflow pages

search: start at the root and use key comparisons to navigate to the correct leaf

insert: find the leaf the data entry belongs to and put it there, using overflow pages if needed

delete: find and remove the entry from its leaf; if this empties an overflow page, deallocate the space
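
A toy sketch of these operations, with invented capacities: leaves are built once, sorted and never restructured, and inserts that don't fit go into an unordered per-leaf overflow list.

```python
# ISAM sketch: static leaves created sorted by the search key; inserts use
# per-leaf overflow lists (unsorted). Capacities and layout are assumptions.
import bisect

LEAF_CAPACITY = 2
leaves = [[5, 10], [20, 27], [33, 40]]            # allocated sequentially, sorted
low_keys = [leaf[0] for leaf in leaves]           # index built at file creation
overflow = {i: [] for i in range(len(leaves))}

def leaf_for(key):
    return max(bisect.bisect_right(low_keys, key) - 1, 0)

def insert(key):
    i = leaf_for(key)
    if len(leaves[i]) < LEAF_CAPACITY:
        bisect.insort(leaves[i], key)
    else:
        overflow[i].append(key)                   # overflow pages are not sorted

def search(key):
    i = leaf_for(key)
    return key in leaves[i] or key in overflow[i]

insert(22); insert(25)
print(leaves, overflow, search(25))               # primary pages never change
```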

25
Q

is an ISAM a dynamic or static tree structure?

A

static; insert and delete only affect overflow pages, the tree itself never changes

26
Q

are overflow pages ordered or unordered in an ISAM tree

A

unordered

27
Q

are overflow pages in an ISAM tree sequentially allocated

A

no

28
Q

is a B+ tree a dynamic or static tree structure

A

dynamic; it changes and adjusts in response to changes in the data size

29
Q

what are the properties of a B+ tree

A

the tree must:
  - be height balanced
  - maintain at least 50% occupancy in each node except for the root

30
Q

true or false: B+ trees can support both equality and range searches efficiently

A

true

31
Q

what is d referring to in a B+ tree

A

d is the order of the tree. It determines the occupancy allowed: each node must hold m entries with d <= m <= 2d (the root is exempt from the minimum)

32
Q

why do we aim to keep at least 50% occupancy in every node of a B+ tree

A

so that space is not wasted

33
Q

In a B+ tree, what does the number of I/Os depend on?

A

the height of the tree, whether data entries use alternative 1 or alternatives 2/3, and whether the index is clustered or unclustered

34
Q

are B+ trees typically taller than they are fat, or fatter than they are tall

A

trees are fatter than they are tall

35
Q

describe the process of inserting a data entry into a B+ tree

A

  1. find the correct leaf L
  2. put the data entry into L
    2a. if L has enough room, we are done
    2b. if there is not enough room, split the leaf into L and L2
      - redistribute the entries evenly between L and L2 and copy the middle key up to the parent
      - check whether the parent also needs to be split and follow the same logic
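
A simplified sketch of case 2b at the leaf level, assuming order d = 2 and plain Python lists (not a full B+ tree):

```python
# Leaf split on insert for an assumed order d: the leaf overflows past 2d
# entries, so it is split evenly and the middle key is *copied up* to the parent.
d = 2                                   # each leaf holds between d and 2d entries

def insert_into_leaf(leaf, key):
    leaf.append(key)
    leaf.sort()
    if len(leaf) <= 2 * d:              # 2a: enough room, done
        return leaf, None, None
    mid = len(leaf) // 2                # 2b: split, redistribute evenly
    left, right = leaf[:mid], leaf[mid:]
    copy_up = right[0]                  # middle key is copied (not pushed) up
    return left, right, copy_up

print(insert_into_leaf([3, 5, 8, 11], 6))   # ([3, 5], [6, 8, 11], 6)
```
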
36
Q

what is the difference between splitting leaf nodes and splitting parent nodes after an insert in a B+ tree

A

when splitting a leaf node, redistribute the entries evenly but copy up the middle key; when splitting a parent (index) node, redistribute the entries evenly but push up the middle key

37
Q

describe the process of deleting a data entry from a B+ tree

A

  1. start at the root and find the leaf L where the entry belongs
  2. remove the entry; if L is still at least half full, done
  3. otherwise, redistribute by borrowing entries from a sibling node
  4. if redistribution fails, merge L and its sibling
  5. if merged, delete the entry pointing to L or its sibling from the parent
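
A leaf-level sketch of steps 2-5, assuming order d = 2 and a single right sibling (a full implementation would also fix the parent's separator keys):

```python
# Hypothetical leaf-level delete helper, not a complete B+ tree: stay half
# full, else borrow from the sibling, else merge with it.
d = 2                                            # each leaf must keep >= d entries

def delete_from_leaf(leaf, sibling, key):
    leaf.remove(key)
    if len(leaf) >= d:                           # step 2: still half full, done
        return leaf, sibling, "ok"
    if len(sibling) > d:                         # step 3: redistribute (borrow one;
        leaf.append(sibling.pop(0))              #   parent separator also updates)
        return leaf, sibling, "redistributed"
    merged = sorted(leaf + sibling)              # step 4: merge L and its sibling;
    return merged, None, "merged"                # step 5: parent drops a separator

print(delete_from_leaf([4, 9], [12, 15, 20], 9))   # borrows 12 from the sibling
print(delete_from_leaf([4, 9], [12, 15], 9))       # sibling too small: merge
```
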
38
Q

what is bulk loading of a B+ tree

A

when you create a tree, loading records one at a time takes a while and will not give you sequential storage for your leaves. Bulk loading creates the B+ tree efficiently instead:
  - sort all data entries
  - insert a pointer to the first leaf page into the root page
  - keep inserting into the right-most index page
  - when that page fills up, split it; splits happen only along the right-most path
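
A minimal sketch of the sorting-and-packing part, with an assumed leaf size; index-page splitting along the right-most path is left out:

```python
# Bulk-loading sketch: sort all data entries, then pack them into leaves left
# to right; the first key of every leaf after the first feeds the index pages.
LEAF_SIZE = 4

entries = [27, 3, 40, 11, 5, 33, 12, 20, 37, 10]
entries.sort()                                           # step 1: sort everything

leaves = [entries[i:i + LEAF_SIZE] for i in range(0, len(entries), LEAF_SIZE)]
separators = [leaf[0] for leaf in leaves[1:]]            # keys fed to index pages

print(leaves)        # [[3, 5, 10, 11], [12, 20, 27, 33], [37, 40]]
print(separators)    # [12, 37]
```
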
39
Q

index pages can typically hold many more entries than leaf pages, why?

A

index entries don't have the *; remember that * marks a data entry. A data entry must also carry a pointer to the data, i.e. a rid comprised of a page id and slot number (or the record itself). Pages holding entries with * simply need to store more information per entry

40
Q

if data entries are alternative 1, what could possibly happen to rids

A

they can change: a rid is (page number, slot number), and inserts, deletes, merges and splits can move records onto different pages, changing their page numbers

41
Q

the pages in an index file will contain values and pointers, what is the relationship between the number of values in a page and the fanout (number of branches)?

A

fanout is equal to the number of values + 1, e.g. if you have 9 values, you have 10 pointers

42
Q

describe the static hashing method in hash-based indexes for hashed files, including how insertion works

A

  - the number of primary pages is fixed; they are allocated sequentially and never deallocated
  - overflow pages are added if needed
  - a key belongs to bucket h(k) mod N, where h(k) is a hash function applied to the search key k and N is the number of buckets
  - long overflow chains can develop, which degrades performance: to find an entry in a bucket with a long chain, we need to look at all of its overflow pages
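
A minimal static-hashing sketch with invented capacities; h is taken to be the identity on integer keys, and the overflow chain is modelled as a second list per bucket.

```python
# Static hashing: N primary buckets fixed up front; a full primary page pushes
# new entries onto that bucket's overflow chain.
N, PAGE_CAPACITY = 4, 2
primary = [[] for _ in range(N)]
overflow = [[] for _ in range(N)]

def h(k):
    return k                        # assume integer keys; h is identity here

def insert(k):
    b = h(k) % N                    # bucket = h(k) mod N
    target = primary[b] if len(primary[b]) < PAGE_CAPACITY else overflow[b]
    target.append(k)

def lookup(k):
    b = h(k) % N                    # a long chain means scanning its overflow pages
    return k in primary[b] or k in overflow[b]

for key in [1, 5, 9, 13, 2]:
    insert(key)
print(primary)    # [[], [1, 5], [2], []]
print(overflow)   # [[], [9, 13], [], []]
print(lookup(13)) # True
```
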
43
Q

describe the dynamic hashing method extendible hashing, including how insertion and deletion work

A

extendible hashing uses a directory of pointers to buckets. A bucket that overflows is split and rehashed, which may or may not lead to directory doubling.

good because:
  - it is cheaper to double only the directory
  - only one page is split and rehashed
  - there are no overflow pages

we need to keep track of a global depth (of the directory) and a local depth (of each bucket); the directory must be doubled when an insert causes a bucket's local depth to exceed the global depth

for delete: if a deletion leaves a bucket empty, the bucket is merged with its 'split image' (the pair that differs only in the left-most bit)
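
A compact sketch of insertion with directory doubling; bucket capacity, the identity hash and the use of low-order directory bits are all assumptions made for the example.

```python
# Extendible hashing: directory of 2**global_depth pointers to buckets; a full
# bucket is split and rehashed, and the directory doubles only when the
# overflowing bucket's local depth equals the global depth.
BUCKET_SIZE = 2

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

global_depth = 1
directory = [Bucket(1), Bucket(1)]                # indexed by low global_depth bits

def insert(k):
    global global_depth, directory
    bucket = directory[k & ((1 << global_depth) - 1)]
    if len(bucket.keys) < BUCKET_SIZE:
        bucket.keys.append(k)
        return
    if bucket.local_depth == global_depth:        # local depth would exceed global:
        directory = directory + directory         # double only the directory
        global_depth += 1
    bucket.local_depth += 1                       # split and rehash just one bucket
    new_bucket = Bucket(bucket.local_depth)
    pending, bucket.keys = bucket.keys + [k], []
    bit = 1 << (bucket.local_depth - 1)
    for i, slot in enumerate(directory):          # repoint half of its directory entries
        if slot is bucket and i & bit:
            directory[i] = new_bucket
    for key in pending:
        insert(key)

for key in [4, 6, 1, 3, 5]:
    insert(key)
print(global_depth, [b.keys for b in directory])  # 2 [[4, 6], [1, 5], [4, 6], [3]]
```
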
44
Q

when will we face issues with extendible hashing

A

if we have a skewed distribution where many entries share the same hash value: the directory can grow very large, and entries with identical hash values cannot be separated by splitting, so overflow chains are still needed

45
Q

describe linear hashing

A

linear hashing splits buckets in round-robin fashion, without using a directory:
  - a family of hash functions is used, swapping in the next one as more bits are needed to distinguish buckets
  - a 'next' pointer indicates the next bucket in line to be split
  - regardless of which bucket overflowed, only the bucket indicated by 'next' is split
  - performance degrades if the distribution is skewed
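
A sketch of linear-hashing insertion with assumed parameters (N = 4 primary buckets, capacity 2); h_level(k) is taken to be k mod (N * 2**level), and 'nxt' plays the role of the next pointer.

```python
# Linear hashing: an overflow always splits the bucket at `nxt`, whichever
# bucket actually overflowed; overflow entries just pile up in their bucket.
N, CAPACITY = 4, 2
level, nxt = 0, 0
buckets = [[] for _ in range(N)]

def bucket_of(k):
    b = k % (N * 2 ** level)              # try h_level first
    if b < nxt:                           # that bucket was already split this round
        b = k % (N * 2 ** (level + 1))
    return b

def insert(k):
    global level, nxt
    b = bucket_of(k)
    buckets[b].append(k)
    if len(buckets[b]) > CAPACITY:        # overflow: split the bucket at `nxt`
        buckets.append([])                # allocate its new split image
        old, buckets[nxt] = buckets[nxt], []
        for key in old:                   # rehash only that bucket's keys
            buckets[key % (N * 2 ** (level + 1))].append(key)
        nxt += 1
        if nxt == N * 2 ** level:         # round finished: move to the next level
            level, nxt = level + 1, 0

for key in [3, 7, 0, 4, 8, 11]:
    insert(key)
print(level, nxt, buckets)   # 0 2 [[0, 8], [], [], [3, 7, 11], [4], []]
```
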
46
Q

yes or no: in LH, if a new overflow is created while we are handling a current overflow, will this new overflow trigger another split?

A

no