Data-warehousing/analysis/mining Flashcards

Question 1

Q

What do queries run by big stores or hospitals tend to feature?

Answer

A

they tend to be complex: use aggregates and other advanced features
they tend to examine large parts of the data
they tend to block large parts of the database

Question 2

Q

What types of queries should we avoid executing on DBMS that serve a lot of customers simultaneously?

Answer

A

Running very complex queries that examine and block large parts of the database

Question 3

Q

What are data warehouses?

Answer

A

Systems designed to support data analysis to answer very complex queries that examine large parts of the database

Question 4

Q

How are data warehouses updated?

Answer

A

By getting the updated/new information from the smaller servers

Question 5

Q

What is OLAP?

Answer

A

(Online Analytic processing) OLAP refers to the process of analysing complex data stored in a data warehouse

Question 6

Q

what is an OLAP query?

Answer

A

An SQL query that is for an OLAP process?

Question 7

Q

What is OLTP?

Answer

A

Online transaction processing is used in traditional DBMS
OLTP queries only touch small (and often different) parts of the database that probably don’t overlap so you can therefore make a lot of fast queries simultaneously

Question 8

Q

What do OLAP applications typically feature?

Answer

A

A unique fact cube that represents events and objects of interest for the analysis.

Question 9

Q

Describe an example fact table with schema Sales(productNo, date, store, price)?

Answer

A

if it only has 3 attributes it has 3 dimensions so we can represent it at a data cube
so on one side of the cube the height could be productNo, the length could be date and the width could be price
this way, each point in the cube corresponds to the sale of a product

Question 10

Q

What do we get back when we query a data cube (fact table)with a range for each edge?

Answer

A

as it’s represented as a cube, we get a slice of the cube back that all satisfies the ranges in the query
we can then further query this smaller section of information

Question 11

Q

What are star schemas?

Answer

A

they have unique fact tables (contains points in the data cube)
Dimension tables: describe values along each axis of the cube
use star schemas to represent dimension tables

Question 12

Q

What are star schemas made up of?

Answer

A

name of the table
keys to virtual fact tables that answer queries (known as dimensions)
a dependent attribute that there is no other information about elsewhere
dimensions are attributes we do have other information about elsewhere

Question 13

Q

What exactly does a star schema a describe?

Answer

A

A database consisting of:
a fact table R(A1,…An, B1…Bm)
-What are star schemas?
- In essence A1 to An are foreign keys
- Each dimension can also have further dimensions of their own
- B1…Bm are dependent attributes

Question 14

Q

What are the Characteristics of Star Schemas?

Answer

A

They are denormalised (they have duplicate data)
gains: they don’t require many joins
faster and easier aggregation of data

Question 15

Q

Why don’t we mind that the data is denormalised in star schemas?

Answer

A

We don’t mind storing data in multiple places because we care more about making these type of queries fast

Question 16

Q

What is Dicing?

Answer

A

Dicing: When we query the schema and get this small slice of the cube back, then we cut it into even smaller pieces. Each small piece telling you something about a specific model

Question 17

Q

What is the difference between slicing and dicing?

Answer

A

Slicing is done by the where clause to get a slice of the data cube
Dicing is done by the group by clause to cut up the slice of data into groups

Question 18

Q

What is data mining?

Answer

A

Can be seen as extended form of OLAP
- Is more that you have a lot of data and you’re trying to find answers to questions that you care about. For instance instead of making sequences of queries, we just want a way to get this information directly.

Question 19

Q

Why do we use data mining?

Answer

A

To discover patterns/knowledge in data
- can use data mining to determine the reason behind a relationship that may not be obvious (for example nappies and beer)

Question 20

Q

If we have a table of films and which watchers enjoyed the films what could this help us to predict?

Answer

A

Streaming services may be interested in which viewers frequently like the same films.
As this can allow them to accurately recommend films to people

Question 21

Q

What are some applications of data mining?

Answer

A

Deviation Detection: Identify anomalies (e.g., intruders trying to break into a system)
Link Analysis: Try to discover links between attributes (e.g., association rules)
Predictive Modelling - Try to predict future behaviour of certain attributes in the data based on past behaviour
Database Segmentation: Group data by similar “behaviour” to better target their needs

Question 22

Q

What are the types of Discovered Knowledge from data mining?

Answer

A

Association rules
Classification hierarchies
Uses tree structures - if you’ve a similar path to others then you may behave similarly in the future (and continue following the same path of leaves)
Sequential patterns: Example: “If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he/she is likely to suffer from kidney failure within the next 18 months.”
Clustering - Example: Group treatment data on a disease based on similarity of side effects.

Question 23

Q

What are the basics of the market-basket model

Answer

A

Data can be described as a set of all the different items I
and a set of baskets B: Each basket b is subset of the items in I

Question 24

Q

What is Frequent-Itemset Mining?

Answer

A

frequence depends on the application in question and what information is wanted
but a support threshold (percentage) is set which defines if an object is frequent.
we want to check whether two items appear frequently together and if they do we can exploit this fact in some way

Question 25

Q

How do we define whether an item appears frequently?

Answer

A

Use formula to determine how frequently a subset appears
- frequence = times item appears / number of rows

Question 26

Q

If the support threshold = 1/2 and an item appears in 3/4 baskets what does this mean?

Answer

A

It means that the item is frequent because it appears >= support threshold (frequency level)

Question 27

Q

How does Frequent-Itemset Mining work with subsets of items?

Answer

A

if we also want to check if a subset is frequent we can use the same formula:
number of times subset appears / number of rows (baskets)
and then check if this percentage is >= to the frequency level

Question 28

Q

How do we calculate baskets if we have more complex information on the data such as the identities of the people who bought a basket of shopping (not just all the different baskets of shopping)?

Answer

A

A basket is the union of all of a specific customers’ shops, so if they bought {eggs, bread} in one shop and {milk} in the next shop, then their basket = {eggs, bread, milk}
- we can do group by on the customers’ identities and then discard the purchase ID column

Question 29

Q

Why is a table with baskets and specific customers more useful than a table with just every separate purchase?

Answer

A

It’s more accurate since we can calculate how many different people bought a certain item
we can also recommend items to specific customers using this method
We can also say whether items are frequent with respect to customers, not just frequent with all shops

Question 30

Q

What are association rules?

Answer

A

if a subset of items appear very frequently bought together but not bought at all on their own, we can say the two items have an association
because if one is bought, the other is more likely to be bought

Question 31

Q

What is the confidence?

Answer

A

the percentage of baskets of subsets of items containing an item

Question 32

Q

How do we calculate the confidence?

Answer

A

number of times the subset appears / number of times the item appears on it’s own

Question 33

Q

Why is it important that the association is high and that the support of {i} on it’s own is much lower that the support of {i,j}?

Answer

A

This association should be high (their link) for it to be worth exploiting (acting upon it to try to improve sales for example)
It should differ significantly from the fraction of baskets containing J because if it doesn’t then it’s more likely that J and I are independent of each other and not worth trying to exploit

Question 34

Q

If {BS} appears 4/8 times and {BS} appears 5/8 times, what is the confidence of

Answer

A

confidence for {BS} –> HP1 is (4/8) / (5/8) = 4/5

Question 35

Q

What is the goal of the A-Priori Algorithm?

Answer

A

Compute all the itemsets J with support >= s

Question 36

Q

What rule do we use to compute the A-Priori Algorithm?

Answer

A

that if J has a support >= s then all subsets of J have support >= S
You just write all the possibilities of subsets and then do the same with the subsets
You just keep doing this until you reach however a large subset you’re interested in

Question 37

Q

What does the A-Priori Algorithm have poor dependency on?

Answer

A

It has quite poor dependency on Q as it will take an exponential time and very many queries to complete

Brainscape's Knowledge GenomeTM

Data-warehousing/analysis/mining Flashcards

Brainscape's Knowledge Genome^TM