Data-warehousing/analysis/mining Flashcards

1
Q

What do queries run by big stores or hospitals tend to feature?

A
  • they tend to be complex: use aggregates and other advanced features
  • they tend to examine large parts of the data
  • they tend to block large parts of the database
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What types of queries should we avoid executing on DBMS that serve a lot of customers simultaneously?

A
  • Running very complex queries that examine and block large parts of the database
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are data warehouses?

A
  • Systems designed to support data analysis to answer very complex queries that examine large parts of the database
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How are data warehouses updated?

A
  • By getting the updated/new information from the smaller servers
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is OLAP?

A

(Online Analytic processing) OLAP refers to the process of analysing complex data stored in a data warehouse

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

what is an OLAP query?

A
  • An SQL query that is for an OLAP process?
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is OLTP?

A
  • Online transaction processing is used in traditional DBMS
  • OLTP queries only touch small (and often different) parts of the database that probably don’t overlap so you can therefore make a lot of fast queries simultaneously
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What do OLAP applications typically feature?

A
  • A unique fact cube that represents events and objects of interest for the analysis.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Describe an example fact table with schema Sales(productNo, date, store, price)?

A
  • if it only has 3 attributes it has 3 dimensions so we can represent it at a data cube
  • so on one side of the cube the height could be productNo, the length could be date and the width could be price
  • this way, each point in the cube corresponds to the sale of a product
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What do we get back when we query a data cube (fact table)with a range for each edge?

A
  • as it’s represented as a cube, we get a slice of the cube back that all satisfies the ranges in the query
  • we can then further query this smaller section of information
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are star schemas?

A
  • they have unique fact tables (contains points in the data cube)
  • Dimension tables: describe values along each axis of the cube
  • use star schemas to represent dimension tables
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are star schemas made up of?

A
  • name of the table
  • keys to virtual fact tables that answer queries (known as dimensions)
  • a dependent attribute that there is no other information about elsewhere
  • dimensions are attributes we do have other information about elsewhere
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What exactly does a star schema a describe?

A

A database consisting of:
a fact table R(A1,…An, B1…Bm)
-What are star schemas?
- In essence A1 to An are foreign keys
- Each dimension can also have further dimensions of their own
- B1…Bm are dependent attributes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are the Characteristics of Star Schemas?

A
  • They are denormalised (they have duplicate data)
  • gains: they don’t require many joins
  • faster and easier aggregation of data
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Why don’t we mind that the data is denormalised in star schemas?

A

We don’t mind storing data in multiple places because we care more about making these type of queries fast

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is Dicing?

A
  • Dicing: When we query the schema and get this small slice of the cube back, then we cut it into even smaller pieces. Each small piece telling you something about a specific model
17
Q

What is the difference between slicing and dicing?

A
  • Slicing is done by the where clause to get a slice of the data cube
  • Dicing is done by the group by clause to cut up the slice of data into groups
18
Q

What is data mining?

A

Can be seen as extended form of OLAP
- Is more that you have a lot of data and you’re trying to find answers to questions that you care about. For instance instead of making sequences of queries, we just want a way to get this information directly.

19
Q

Why do we use data mining?

A

To discover patterns/knowledge in data
- can use data mining to determine the reason behind a relationship that may not be obvious (for example nappies and beer)

20
Q

If we have a table of films and which watchers enjoyed the films what could this help us to predict?

A
  • Streaming services may be interested in which viewers frequently like the same films.
  • As this can allow them to accurately recommend films to people
21
Q

What are some applications of data mining?

A
  • Deviation Detection: Identify anomalies (e.g., intruders trying to break into a system)
  • Link Analysis: Try to discover links between attributes (e.g., association rules)
  • Predictive Modelling - Try to predict future behaviour of certain attributes in the data based on past behaviour
  • Database Segmentation: Group data by similar “behaviour” to better target their needs
22
Q

What are the types of Discovered Knowledge from data mining?

A
  • Association rules
  • Classification hierarchies
    Uses tree structures - if you’ve a similar path to others then you may behave similarly in the future (and continue following the same path of leaves)
  • Sequential patterns: Example: “If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he/she is likely to suffer from kidney failure within the next 18 months.”
  • Clustering - Example: Group treatment data on a disease based on similarity of side effects.
23
Q

What are the basics of the market-basket model

A
  • Data can be described as a set of all the different items I
  • and a set of baskets B: Each basket b is subset of the items in I
24
Q

What is Frequent-Itemset Mining?

A
  • frequence depends on the application in question and what information is wanted
  • but a support threshold (percentage) is set which defines if an object is frequent.
  • we want to check whether two items appear frequently together and if they do we can exploit this fact in some way
25
Q

How do we define whether an item appears frequently?

A

Use formula to determine how frequently a subset appears
- frequence = times item appears / number of rows

26
Q

If the support threshold = 1/2 and an item appears in 3/4 baskets what does this mean?

A
  • It means that the item is frequent because it appears >= support threshold (frequency level)
27
Q

How does Frequent-Itemset Mining work with subsets of items?

A
  • if we also want to check if a subset is frequent we can use the same formula:
  • number of times subset appears / number of rows (baskets)
  • and then check if this percentage is >= to the frequency level
28
Q

How do we calculate baskets if we have more complex information on the data such as the identities of the people who bought a basket of shopping (not just all the different baskets of shopping)?

A

A basket is the union of all of a specific customers’ shops, so if they bought {eggs, bread} in one shop and {milk} in the next shop, then their basket = {eggs, bread, milk}
- we can do group by on the customers’ identities and then discard the purchase ID column

29
Q

Why is a table with baskets and specific customers more useful than a table with just every separate purchase?

A
  • It’s more accurate since we can calculate how many different people bought a certain item
  • we can also recommend items to specific customers using this method
  • We can also say whether items are frequent with respect to customers, not just frequent with all shops
30
Q

What are association rules?

A
  • if a subset of items appear very frequently bought together but not bought at all on their own, we can say the two items have an association
  • because if one is bought, the other is more likely to be bought
31
Q

What is the confidence?

A

the percentage of baskets of subsets of items containing an item

32
Q

How do we calculate the confidence?

A

number of times the subset appears / number of times the item appears on it’s own

33
Q

Why is it important that the association is high and that the support of {i} on it’s own is much lower that the support of {i,j}?

A
  • This association should be high (their link) for it to be worth exploiting (acting upon it to try to improve sales for example)
  • It should differ significantly from the fraction of baskets containing J because if it doesn’t then it’s more likely that J and I are independent of each other and not worth trying to exploit
34
Q

If {BS} appears 4/8 times and {BS} appears 5/8 times, what is the confidence of

A

confidence for {BS} –> HP1 is (4/8) / (5/8) = 4/5

35
Q

What is the goal of the A-Priori Algorithm?

A

Compute all the itemsets J with support >= s

36
Q

What rule do we use to compute the A-Priori Algorithm?

A
  • that if J has a support >= s then all subsets of J have support >= S
  • You just write all the possibilities of subsets and then do the same with the subsets
    You just keep doing this until you reach however a large subset you’re interested in
37
Q

What does the A-Priori Algorithm have poor dependency on?

A

It has quite poor dependency on Q as it will take an exponential time and very many queries to complete