# Data Mining: Association Rules Flashcards

1
Q

Knowledge Discovery Process

A
2
Q

Analysis techniques, methods

A

Descriptive methods:

Extract interpretable models describing data, for example client segmentation.

Predictive methods:

Exploidt some known variables to predict unknown or future values of variables, for example spam emails.

3
Q

Attributes types

A

Nominal: ID, eye color, zip codes

Ordinal: Rankings, grades, size in {tall, medium, short}

Interval: calendar dates, temperatures in celsius

Ratio: temperature in Kelvin, length, time, counts

4
Q

Nominal attributes possesses

A

Distinctness

5
Q

Ordinal attribute possesses

A

distinctness, order

6
Q

Interval attribute

A

7
Q

Ratio attribute

A

8
Q

Data quality problems

A

Noise, outliers, missing values, duplicate data

9
Q

Important characteristics of structured data

A

Dimensionality: curse of dimensionality

Sparsity: Only presence counts

Resolution: Patterns depend on the scale

10
Q

Aggregation

A

Combining attributes into single one.

Purpose:

Data reduction: reducing attribute number ( sampling, feature selection, discretization)

Change of scale: from regions into states

Stability: aggregated data tends to be more stable (less deviation)

11
Q

Sampling

A

Samping is necessary because employing the entire data set is too expensive.

Sampling works if the sample set is representative of the entire dataset.

A sample is representative if it has approximately the same property as the original set of data.

Types:

Simple Random: Randomly selected

Without replacement: An object can be taken only once

With replacement: Same object can be takes more than once.

Stratified: Split data into several partitions, take random samples from each partition

12
Q

Dimensionality reduction

A

When dimensionality increases, data becomes more sparse in the space it occupies; Definition of distance and density between points become less meaningful. To prevent this we have dim. reduction:

Principal comp. analysis: find projection that captures largest amount of variation in data.

Singular value decomp.

Feature subset selection: remove redundant features and irrelevant features

13
Q

Feature subset selection techniques

A

Bruteforce: try all possible subsets as input of data mining algo

Emebedded: features are selected naturally by the data mining algo

Filter: features selected before algorithm is run

Wrapper: use algorithms as black box to get best subset

14
Q

Feature Creation

A

Create new attribute that represent better the inormation in the data set.

Feature Extraction: domain specific

Mapping Data to new space: for example Fourier Transform

Feature Construction: combine features

15
Q

Discretization

A

Split attribute domain from continuos into discrete.

Reduces cardinality of attribute domain.

Techniques:

• N intervals with same width (Incremental, easy to do, can be badly affected by outliers and sparse data)
• N intervals approx same cardinality (non incremental approach, good for sparse data and outliers)
• Clustering (fits wel sparse data)
16
Q

Attribute transformation

A

Function that maps attribute values to a new set of values;

Example: Normalization ( min-max, z-score, decimal scaling)

17
Q

Similarity/Dissimilarity for simple attributes

A
18
Q

Minkowski distance

A

r=1: city block (hamming distance)

r = 2: euclidean distance

r -> ∞: maximum distance between any component of the vectors.

19
Q

Properties of metrics

A
20
Q

Common properties of similarities

A

s(p,q) = 1 only if p = q

s(p, q) = s(q, p) for all p and q.

21
Q

Similarity between binary vectors

A

Simple matching:

SMC = matches/(number of attributes)

Jaccoard Coefficients:

J = (11 matches) / (number not both zero attribute values) = (M11) / (M01 + M10 + M11)

22
Q

Cosine similarity

A
23
Q

Combining similarities and combining weighted similarities

A
24
Q

Goal of association rules

A

Expoloratory technique.

Extraction of frequent correlations or pattern form a transactional database

Example:

diapers => beer

• 2% of transactions contains both items
• 30% of the ones containing diapers contais also beer
25
Q

transactional formats

A

A transaction can be a set of:

• Textual data
• Structured data
26
Q

Association rule: Itemset

A

Set including one or more items

27
Q

Association rule: k-itemset

A

Itemset with cardinality of k

28
Q

Association rule: Support count(#)

A

Frequency of occurrence of itemset

example #{Diapers, Beer} = 2

29
Q

Association rule: support

A

is the fraction of transactions that contain an itemset

example

sup({beer, diapers}) = 2/5

30
Q

Association rule, Frequent itemset

A

Is an itemset whose support is greater than or equal to a minsup threshold

31
Q

Association rule, rule quality metrics

A

Given A=>B

Support := #{A,B}/ |T| where |T| is the cardinality of the transactional database.

Confidence := sup(A,B)/sup(A), conditional probability of finding B having found A.

32
Q

Association rule extraction

A

Given a set of transaction, association rule mining is the extraction of rules satisfying the constraints:

• support >= minsup threshold
• confidence >= minconf threshold

Result is

• complete (all rules satisfying both contraints)
• correct (only the rules satisfying both contraints)
33
Q

Association rule, frequent itemset generation and extraction of association rules

A

Frequent itemsets

• many different techniques
• level-wise approaches (apriori)
• without candidate generation (fp-growth)
• other
• this is the most computationally expensive step

Extraction of association rules

• generation of all possible binary partitioning of each frequent itemset
• possibly enforcing a confidence threshold
34
Q

Association rule, Apriori principle

A

If an itemset is frequent, then all of its subsets must also be frequent.

• The support of an itemset can never exceed the support of any of its subsets
• it reduces then number of candidates.
35
Q

Association rule, apriori algo

A
1. Candidate generation
- Join step: generate candidates of len k+1 by joining frequent itemsets of len k
- Prune step: apply apriori principle := prune len k+ 1 that contain at least on k-itemset that is not frequent.
2. Frequent itemset generation
- ScanDB to count support for k+1 candidates
- prune candidates below minsup

Counting support of candidates:

• candidate itemsets are stored in a hash-tree
• subset function finds all candidates contain

Performance issues:

Candidate sets may be huge

• 2-itemset generation is the most critical
• extracting long frequent itemsets requires generating all frequent subsets

Multiple database scans: n+1 scans when longest frequent pattern len is n

Factors affecting performance:

• min support threshold
• dimensionality
36
Q

Association rule, FP-growth algo

A

Exploits compressed representation of the database = FP-TREE -> high compression fo dense data distributions, complete representation for frequent pattern mining

Frequent pattern mining:

1. recursive visit of fp-tree
2. applies divide and conquer approach

Only 2 DB scans: cout item support + fp-tree build

Fp-tree is just a Trie that counts how many childs a node has. Also each node with key k points to another node with key k and to the corresponding element on the header table.

The header table counts the frequency of each single item.

Alog:

Scan header table from lowest support item up.

For each item in header table extract freqeunt itemsets including item i and items preceding it in header table.

37
Q

Association rule, vertical data layout

A
38
Q

Association rule, Frequent Itemset

A
39
Q

Association rule, closed itemset

A

An itemset is closed if none of its immediatesupersets has the same support as the itemset

40
Q

Association rule, maximal, closed itemsets set

A
41
Q

Association rule, effect of support threshold

A

minsup too high: itemsets including rare but interesting items may be lost

minsup is too low: too much data -> computationally expensive.

42
Q

Confidence

A

Objective measure.

Not always reliable, results are influenced by the cardinality of a set of data points.

43
Q

Correlation

A

Subjective measure:

r: A => B

correlation = P(A|B) / (P(A)P(B)) = conf(r) / sup(B)

correlation = 1 => statistical independence

correlation > 1 => positive correlation

correlation < 1 => negative correlation.

44
Q

Weighted association rules

A

Consider item/transaction weights during association rule estraction

Extend rule quality measures -> weighted support, weigheted confidence.

Apply ad-hoc weight aggregation -> min, max, avg.

45
Q

Hierarchical association rules

A

Enable aggregation over attributes in a dataset…

Typically user provided

Example: time hierarchy, product category, location hierarchy.

46
Q

Generalized itemsets association rules

A

Sets of item at different generalization levels.

Generalized itemset covers a transaction when:

• all its generalized items are ancestors of items included in the transaction
• its data items are included in the transaction

Generalized itemset support:

• ratio between number of covered transactions and dataset cardinality.
47
Q

High level, cross level, low level rules

A

High level -> Only generalized itemsets (high level info)

Cross level -> generalized items and data items combined

Low-level reles -> only data itemsets.