Data Analysis and Mining Flashcards

1
Q

Use Cases

A

Prompt the use of data analysis.

2
Q

Data Technology

A

Prompts the use of data warehouses.

3
Q

Aggregates in SQL

A
  • count()
  • max()
  • min()
  • avg()
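
A minimal sketch of these aggregates in use via Python's sqlite3 module; the sales table and its columns are made up for illustration:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE sales (product TEXT, price REAL)")
  conn.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("milk", 1.20), ("bread", 0.90), ("milk", 1.10)])

  # Each aggregate collapses many rows into a single value.
  row = conn.execute(
      "SELECT count(*), max(price), min(price), avg(price) FROM sales"
  ).fetchone()
  print(row)  # (3, 1.2, 0.9, 1.0666...)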
4
Q

Complex SQL Queries

A

Typically contain aggregates and examine a large portion of the database.

5
Q

Data warehouses

A

A type of database system that supports data analysis.
Typically not updated constantly, but refreshed periodically, e.g. once a day.

6
Q

Extract transform load

A

The process of extracting data from source systems, transforming it into a specific schema/pattern, and loading it into the data warehouse.

7
Q

OLAP

A

Online analytic processing.
Refers to the process of analysing complex data stored in a data warehouse.

8
Q

OLAP Query

A

Complicated queries that touch a lot of data, discover trends and patterns in the data, and typically take a long time to compute.

9
Q

OLTP

A

Online transaction processing.
Typical DBMS tasks, where queries are fast and touch a small portion of the database.

10
Q

Unique Fact Tables

A

Used in OLAP to store the events and objects of interest for the analysis.
May be thought of as representing a data cube, where the length, width and depth each represent a different dimension, such as product, date or store.

11
Q

Star Schemas

A

One of the more common data warehouse architectures: a unique fact table whose rows are points in a data cube.
We can also have dimensional tables.

12
Q

Dimensional Tables

A

Describe values along each axis.

13
Q

Star Schema Example

A

Let's say we have the fact table Sales(productNo, date, store, price). We can split this further:
- productNo comes from Products(productNo, type, model)
- date comes from Days(date, day, week, month, year)
- store comes from Stores(name, city, country, phone), where "name" is renamed to store.
- price is a dependent attribute with no dimension table of its own.
productNo, date and store are called dimensions, and their dimension tables are what the fact table references.
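
A minimal sketch of this star schema as table definitions, again using Python's sqlite3; the column types are assumptions, only the table and column names come from the card:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  # Dimension tables: describe the values along each axis of the cube.
  conn.execute("CREATE TABLE Products (productNo INTEGER PRIMARY KEY, type TEXT, model TEXT)")
  conn.execute("CREATE TABLE Days (date TEXT PRIMARY KEY, day INTEGER, week INTEGER, month INTEGER, year INTEGER)")
  conn.execute("CREATE TABLE Stores (name TEXT PRIMARY KEY, city TEXT, country TEXT, phone TEXT)")
  # Fact table: one row per point in the data cube; productNo, date and store
  # reference the dimension tables, price is the dependent attribute.
  conn.execute("""CREATE TABLE Sales (
      productNo INTEGER REFERENCES Products(productNo),
      date      TEXT    REFERENCES Days(date),
      store     TEXT    REFERENCES Stores(name),
      price     REAL)""")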

14
Q

Denormalised Schemas

A

We store the data redundantly because we care more about making complex queries fast.
The main data is in one table, the fact table, and the rest of the data can be joined with the fact table very quickly.
All in all, fewer joins are required.

15
Q

Slice and Dice technique

A

A way of exploring a data cube by progressively narrowing down the data of interest.

16
Q

Slicing

A

Narrows down our search of the data cube by slicing a section.

17
Q

Dicing

A

Divides the selected slice further into smaller sections of data.

18
Q

Data mining

A

Extended form of OLAP.

19
Q

Data mining objective

A

Data mining takes a lot of data and tries to get answers to questions that you care about.

20
Q

Typical data mining queries

A

For example, “Find the factors that have had the most influence on sales of product X”, rather than precise SQL queries.
Essentially, they are plain-English questions where we do not specify exactly what we want; instead we discover things directly from the data.

21
Q

Data mining applications

A
  • Deviation detection -> identify anomalies.
  • Link analysis -> discover links between attributes.
  • Predictive modelling -> predict the future behaviour of certain attributes based upon past behaviour.
  • Database segmentation -> group data by similar behaviour.
22
Q

Types of discovered knowledge

A
  • association rules
  • classification hierarchies
  • sequential patterns
  • clustering
23
Q

Classification hierarchies

A

An example could be classifying mutual funds based on performance characteristics such as growth, income, etc. Essentially, it is an ordering/hierarchy of the things we care about.

24
Q

Sequential Patterns

A

A specific sequence of patterns, let's say A, B and C, leading to an outcome D.

25
Q

Clustering

A

Grouping data together in clusters.

26
Q

Market-basket model

A

A type of data mining technique, which uses Market-Basket Data.

27
Q

Market-basket Data

A

Market-Basket Data can be described by a set of items I, and a set of baskets B, with each basket being a subset of I.

28
Q

Market-basket data example

A

Purchase ID ; Items Bought
101 ; milk, bread, cookies, juice
792 ; milk, juice
1130 ; milk, bread, eggs
1735 ; bread, cookies, coffee
Items (I) = {milk, bread, cookies, juice, eggs, coffee}
Baskets (B) = {b1, b2, b3, b4}, where each bn is the set of items bought in one purchase.
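
A small sketch of this data as Python sets, matching the table above:

  # One basket per purchase, keyed by Purchase ID.
  baskets = {
      101:  {"milk", "bread", "cookies", "juice"},
      792:  {"milk", "juice"},
      1130: {"milk", "bread", "eggs"},
      1735: {"bread", "cookies", "coffee"},
  }
  # The item set I is just the union of all baskets.
  items = set().union(*baskets.values())
  print(items)  # {'milk', 'bread', 'cookies', 'juice', 'eggs', 'coffee'}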

29
Q

Frequent-itemset mining

A

Questions like “Which items occur frequently together in a basket?”

30
Q

Frequent-itemset mining support

A

support(J) = (number of baskets in B containing all items in J) / (total number of baskets in B).
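
A minimal Python sketch of this definition, assuming baskets is a collection of item sets as in the earlier example:

  def support(J, baskets):
      # Fraction of baskets that contain every item in J.
      J = set(J)
      containing = sum(1 for b in baskets if J <= set(b))
      return containing / len(baskets)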

31
Q

Frequent-itemset mining example part 1

A

We first define “how frequent is frequent?”
We do this by choosing a support threshold s. For example, if we want to say that bread should appear in at least “one in every two” baskets, we set s = 0.5.

32
Q

Frequent-itemset mining example part 2

A

So let's say we have s = 0.5, which says that whatever we specify should appear in at least one of every two baskets. Now let's go back to our earlier example:
101 ; milk, bread, cookies, juice
792 ; milk, juice
1130 ; milk, bread, eggs
1735 ; bread, cookies, coffee

If we were to ask “is buying milk and juice together frequent?”, we would work out the support for J = {milk, juice}, which is 2/4, since baskets 101 and 792 both contain all of its items.
This meets our threshold of 0.5, so we can deem it frequent.
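
Checking this worked example with the support() sketch from the previous card (the basket representation is an assumption):

  baskets = [
      {"milk", "bread", "cookies", "juice"},  # 101
      {"milk", "juice"},                      # 792
      {"milk", "bread", "eggs"},              # 1130
      {"bread", "cookies", "coffee"},         # 1735
  ]
  s = 0.5
  print(support({"milk", "juice"}, baskets))       # 0.5
  print(support({"milk", "juice"}, baskets) >= s)  # True, so the itemset is frequent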

33
Q

More complex tables

A

Data in real tables is more complex, often containing information that is irrelevant to the analysis. For example, our previous table could instead look like this:

Purchase ID ; Customer ID ; Items Bought
101 ; A ; milk, bread, cookies, juice
792 ; B ; milk, juice
1130 ; A ; milk, bread, eggs
1735 ; C ; bread, cookies, coffee

where we have added an extra Customer ID column.

34
Q

WRT (with example)

A

Now we can specify what the baskets are With Respect To, i.e. which column defines a basket.
Using our ongoing example, grouping with respect to Purchase ID gives the same baskets as before (apart from dropping the Customer ID column), since no purchase ID appears twice.
However, grouping with respect to Customer ID means A appears twice, changing the table to
A ; milk, bread, cookies, juice, eggs
B ; milk, juice
C ; bread, cookies, coffee
Using this table can give different frequency answers. For example, {milk, juice} now has support 2/3, making it even more frequent.
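
A small sketch of regrouping the baskets with respect to Customer ID; the variable names are my own:

  from collections import defaultdict

  purchases = [
      (101, "A", {"milk", "bread", "cookies", "juice"}),
      (792, "B", {"milk", "juice"}),
      (1130, "A", {"milk", "bread", "eggs"}),
      (1735, "C", {"bread", "cookies", "coffee"}),
  ]

  # With respect to Customer ID: union all items bought by the same customer.
  by_customer = defaultdict(set)
  for _purchase_id, customer, items in purchases:
      by_customer[customer] |= items

  baskets = list(by_customer.values())  # three baskets: A, B, C
  # support({"milk", "juice"}, baskets) is now 2/3 rather than 2/4.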

35
Q

Association Rules

A

General rules of the form {i1, i2, …} -> j.
In plain English, for example, we could have “customers who buy diapers frequently also buy beer”, or “people who buy Game of Thrones and Harry Potter also buy Twin Peaks”.

36
Q

Association Rule Properties

A
  • support of {i1, i2, …, j} -> ideally, we want a high support.
  • confidence
37
Q

Confidence

A

Confidence is the fraction of baskets containing {i1, i2, …} that also contain j, and is written as:

support of {i1, …, in, j} / support of {i1, …, in}

This should also be high, and should differ significantly from the fraction of all baskets containing j. If the two are similar, then buying j is close to independent of buying the other items.
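
As a sketch, confidence can be written on top of the support() helper from the earlier card (the names are assumptions):

  def confidence(I, j, baskets):
      # Fraction of baskets containing all of I that also contain j.
      return support(set(I) | {j}, baskets) / support(I, baskets)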

38
Q

Association Rule Example

A

Question is {milk} -> juice
Using:
101 ; A ; milk, bread, cookies, juice
792 ; B ; milk, juice
1130 ; A ; milk, bread, eggs
1735 ; C ; bread, cookies, coffee

Support:
{milk, juice} = 2/4 = 0.5

Confidence:
support of {milk, juice} / support of {milk}
(2/4) / (3/4) = 2/3 = 0.67

Result:
67 percent of all customers who bought milk also bought juice.
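
Reproducing this card's numbers with the support() and confidence() sketches from the earlier cards:

  print(support({"milk", "juice"}, baskets))     # 0.5
  print(confidence({"milk"}, "juice", baskets))  # 0.666..., i.e. about 67 percent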

39
Q

A-Priori Algorithm

A

Computes all itemsets J with support >= s, working level by level (size 1, then size 2, …).
Key property (monotonicity): if J has support >= s, then all subsets of J also have support >= s, so candidates of size k only need to be built from frequent itemsets of size k-1.
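
A minimal level-wise sketch of this idea in Python; the function name and candidate-generation details are my own, not a definitive implementation:

  def apriori(baskets, s):
      # Return all itemsets with support >= s, built level by level.
      n = len(baskets)
      items = set().union(*baskets)
      # Level 1: frequent single items.
      frequent = {frozenset([i]) for i in items
                  if sum(i in b for b in baskets) / n >= s}
      result = set(frequent)
      k = 2
      while frequent:
          # Candidates of size k are unions of frequent (k-1)-itemsets;
          # by monotonicity no other itemset of size k can be frequent.
          candidates = {a | b for a in frequent for b in frequent
                        if len(a | b) == k}
          frequent = {c for c in candidates
                      if sum(c <= b for b in baskets) / n >= s}
          result |= frequent
          k += 1
      return result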