Chapter 4 - Association analysis Flashcards

1
Q

What is a “market basket transaction”?

A

In its simplest form, it is the set of items bought together in a single purchase, along with a transaction identifier.

2
Q

Briefly, what is association analysis all about?

A

Association analysis is all about discovering “hidden” relationships in large data sets.

These relationships can be represented either as sets of items present together in many transactions, known as frequent itemsets, OR as association rules.

3
Q

What is a “frequent itemset”?

A

A frequent itemset is a set of items that appears together in many transactions; formally, an itemset whose support is at least the minsup threshold.

4
Q

What is “association rule”?

A

An association rule suggests a relationship between itemsets. For instance:
{Burger buns} –> {Burger patties}.

The rule suggests that there is a relationship between these two items: transactions containing burger buns also tend to contain burger patties.

5
Q

Is association rule mining limited to market basket transactions?

A

No. Market basket data is just a simple, canonical example; association rule mining can be applied to many other kinds of data sets.

6
Q

What are the two key issues in association rule mining for market basket transactions?

A

1) Discovering patterns in a large data set is computationally expensive.

2) Some of the discovered “rules” hold merely by chance and are of no importance. Therefore, some rules are more interesting than others.

In this course we are primarily concerned with the first issue.

7
Q

What is “binary representation”?

A

Using one binary variable/attribute per possible item, indicating whether the item is present in the basket or not. Since the variables are binary, we only record presence, not quantity.

In market basket data, the binary representation is asymmetric: we care far more about the 1’s (items present) than the 0’s (items absent). A customer buys only a tiny subset of all the items the store offers, so the absent items carry little information; the interesting patterns are the combinations of items that are actually bought together.
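
A minimal Python sketch of the idea (the item names and transactions are made up for illustration):

# Each hypothetical transaction is the set of items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# Asymmetric binary representation: one column per item, 1 = present, 0 = absent.
items = sorted(set().union(*transactions))
binary = [[1 if item in t else 0 for item in items] for t in transactions]

for row in binary:
    print(dict(zip(items, row)))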

8
Q

What is “itemset and support count”?

What is support?

A

In association analysis, a collection of zero or more items is called an itemset.

More specifically, an itemset with k items is called a k-itemset.

The support count is the number of transactions containing a given itemset.
In other words, it is a simple count: the number of times we find the specific itemset in the entire list of transactions.

Often we are also interested in the “support”, which is the support count divided by the total number of transactions.
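
A tiny Python illustration of both quantities, with made-up transactions:

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"bread", "milk", "cola"},
    {"diapers", "beer"},
]

itemset = {"bread", "milk"}
support_count = sum(1 for t in transactions if itemset <= t)  # sigma(X) = 2
support = support_count / len(transactions)                   # s(X) = sigma(X) / N = 0.5

print(support_count, support)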

9
Q

What is a k-itemset?

A

Simply an itemset that contains exactly k items.

10
Q

When do we call an itemset “frequent”?

A

An itemset is frequent if its support is greater than or equal to a pre-specified threshold called minsup.

11
Q

We say that a transaction tj contains an itemset X if…?

A

The transaction tj contains an itemset X if X is a subset of the set of items in tj.

12
Q

More technically, what is an “association rule”?

A

An association rule is an implication of the form X –> Y, where X and Y are disjoint itemsets, i.e. they share no items.

We measure the STRENGTH of an association rule in terms of its SUPPORT and its CONFIDENCE.

Support determines how often the rule applies in a data set, while confidence determines how frequently items in Y appear in transactions that contain X.

13
Q

Define the support of an association rule

A

The support of an association rule determines how frequently the rule is applicable in a data set:

s(X –> Y) = σ(X U Y) / N

where σ(X U Y) is the number of transactions in which the union X U Y occurs, and N is the total number of transactions.

14
Q

Define the confidence of an association rule

A

Confidence determines how frequently items in Y appear in transactions that contain X:

c(X –> Y) = σ(X U Y) / σ(X)

That is, the number of transactions in which both X and Y occur, divided by the number of transactions in which X occurs. A high confidence therefore means “if X appears, Y is very likely to appear as well”.

15
Q

Danger of low support is …?

A

That the rule happened merely by chance.

16
Q

What can we say about association rules and causality?

A

Nothing.
Association rules tell us about relationships that tend to co-occur. They tell us nothing about what causes what, or what effects will follow.

If we buy a burger patty, does that cause us to buy a burger bun? Or is it the burger bun that causes us to buy the patty? The rule cannot distinguish these.

17
Q

Formally define the association rule mining problem

A

Given a set of transactions T, find all association rules with support and confidence greater than or equal to our thresholds, minsup and minconf.

18
Q

How do we increase performance of association rule mining?

A

We need to prune. This is essential because the number of possible rules is exponential in the number of items.

The key observation is that the support of a rule X –> Y equals the support of the itemset X U Y, and many different rules share the same itemset X U Y. Therefore, if we find that this itemset has low support, we can automatically prune all the rules derived from it.

19
Q

Elaborate on the common strategy in association rule mining

A

Decompose the problem into 2 major subtasks:
1) Frequent itemset generation
2) Rule generation

Frequent itemset generation is about finding all itemsets whose support meets the minsup threshold.

Rule generation aims to extract all high-confidence rules from the frequent itemsets found in the first step.

20
Q

Why is it useful to decouple the association rule mining strategy into two parts?

A

We take advantage of the fact that, due to how support is defined, many rules share the support of the same itemset. All rules derived from an itemset can be pruned away at once if that itemset turns out to be infrequent.

21
Q

What structure can we use to enumerate all possible itemsets?

A

A lattice structure.
A data set that contains k items can generate at most 2^k − 1 frequent itemsets, excluding the null set.

22
Q

What is a candidate itemset?

A

A candidate itemset is a potential frequent itemset. It is simply any itemset we generate; we call it a candidate until we have counted its support and determined whether it is frequent or not.

23
Q

Define “transaction width”

A

The number of items present in a transaction

24
Q

Explain brute force frequent itemset generation

A

We check every possible candidate itemset against every transaction. This is terrible, because with d items there are 2^d − 1 candidates, so the computation time explodes.

25
Q

How can we reduce computations in “frequent itemset generation”?

A

3 ways:

1) Reduce the number of candidate sets (Apriori)
2) Reduce the number of comparisons
3) Reduce the number of transactions

26
Q

Elaborate on the Apriori principle

A

If an itemset is frequent, then all of its subsets must also be frequent.

Recall the support-count definition: the number of times an itemset occurs in the transaction list. Support is the support count divided by the total number of transactions.

Equivalently, the Apriori principle states that if some itemset is infrequent, then all of its supersets must also be infrequent.
Directly from this result, if we discover that {a,b} is INFREQUENT, then we can PRUNE every superset that contains {a,b}. This is called support-based pruning.

27
Q

What is “support-based pruning”?

A

Using support levels to determine whether an itemset is frequent or infrequent, and using that result to prune away supersets in the space of all itemsets.

Support-based pruning is possible because the support of a superset never exceeds the support of any of its subsets. This is called the anti-monotone property of the support measure.

28
Q

Statement (not question):

The support for an itemset NEVER exceeds the support for any of its subsets.

A
29
Q

What is the “anti-monotone property” of the support measure?

A

A measure “f” possesses the anti-monotone property if for every itemset X that is a proper subset of an itemset Y, we have f(Y) <= f(X).

It is the anti-monotone property of the support measure that allows support-based pruning: a superset can never have higher support than any of its subsets.

30
Q

How does Apriori “start”?

A

Apriori starts with the candidate 1-itemsets, i.e. the individual items, and counts their individual supports.

If an item’s support is lower than minsup, it is discarded immediately.

31
Q

Elaborate on how Apriori runs its “frequent itemset generation”

A

We build up larger and larger itemsets.

We start with size 1, meaning we set up all possible 1-itemsets.

Then we sweep the transaction list to count the support of the 1-itemsets.
We keep the frequent ones and discard the infrequent ones.

We use the frequent 1-itemsets to generate new candidate 2-itemsets, using a merging technique.

For each new candidate 2-itemset, we use the Apriori principle to prune. That is, we check ALL immediate subsets of each 2-itemset, not just the ones we generated it from, and require that every one of them is frequent. If any immediate subset is infrequent, we prune the 2-itemset.

After pruning with the Apriori principle, we sweep the transaction list to count the support of the surviving candidates. We again keep the frequent ones and use them to generate candidate 3-itemsets.

Then we again find all immediate subsets of the newly generated 3-itemsets and check these subsets for frequency, pruning any candidate that has one or more infrequent immediate subsets.

We repeat these steps until no more candidate itemsets can be generated.
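
A compact Python sketch of this level-wise loop, under simplifying assumptions: transactions fit in memory, minsup is given as an absolute count, and candidate generation simply merges pairs of frequent (k-1)-itemsets whose union has k items (a looser variant than the Fk-1 x Fk-1 merge described later).

from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup_count):
    # Level 1: count individual items, keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation + candidate pruning:
        # keep a k-itemset only if every (k-1)-subset is frequent.
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == k and all(frozenset(sub) in frequent
                                       for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Support counting: one sweep over the transaction list.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

print(apriori_frequent_itemsets([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 2))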

32
Q

What is “candidate generation”?

A

Candidate generation generates new k-itemsets based on the frequent (k-1)-itemsets found in the previous iteration.

33
Q

How can we visually illustrate all itemsets in a list of items?

A

We can use a lattice. We have the 0-itemset (empty set) at the top, then all the 1-itemsets, then all the 2-itemsets, and so on, and at the bottom the single itemset containing all the items.

34
Q

Excluding the 0-itemset (empty set), how many different frequent itemsets can a data set consisting of k items generate?

A

2^k − 1 frequent itemsets. This maximum is reached when the full k-itemset itself (and hence every non-empty subset) is frequent.

36
Q

How do we generate new candidate itemsets?

A

There are many ways to do this, but we require the procedure to have the following properties:
1) Completeness
2) Non-redundancy

A candidate generation procedure is complete if it does not omit any frequent itemsets.

A candidate generation procedure is non-redundant if it never generates the same candidate itemset more than once.

Regarding concrete procedures for candidate generation, there are a few choices:
1) Brute force: generate every possible combination, then prune the unnecessary ones.

2) Fk-1 x F1 method: extend each FREQUENT (k-1)-itemset with FREQUENT items that are not already part of it. EX: if {Bread, Juice} is frequent and {Milk} is frequent, we generate the candidate {Bread, Juice, Milk}. Note that the candidate is not guaranteed to be frequent; we still have to count its support.
This is COMPLETE because every frequent k-itemset consists of a frequent (k-1)-itemset and a frequent item, so systematically extending frequent (k-1)-itemsets with frequent items cannot miss anything.

3) Fk-1 x Fk-1 method: merge two FREQUENT (k-1)-itemsets IF their first (k-2) items are identical when the items are arranged in lexicographic order. The result is the set consisting of the common first k-2 items, followed by the (k-1)’th item of each of the two sets, in lexicographic order (see the sketch below).
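
A minimal Python sketch of the Fk-1 x Fk-1 merge, assuming each frequent (k-1)-itemset is stored as a lexicographically sorted tuple (the example data is made up):

def merge_fk1_fk1(frequent_k_minus_1):
    # Merge two frequent (k-1)-itemsets only if their first k-2 items agree.
    prev = sorted(frequent_k_minus_1)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:
                # Common prefix + the two last items, already in lexicographic order.
                candidates.append(a[:-1] + (a[-1], b[-1]))
    return candidates

# Frequent 2-itemsets -> candidate 3-itemsets:
print(merge_fk1_fk1([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]))
# -> [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd')]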

37
Q

Define an unnecessary itemset

A

We say that a generated candidate itemset is unnecessary if it contains even one infrequent subset. By the Apriori principle such a candidate cannot be frequent, so counting its support would be wasted work.

38
Q

How can we make sure to avoid duplicates with the Fk-1 x F1 method?

A

Lexicographic order: we only accept extensions where the F1 item is lexicographically greater than the largest (last) item of the Fk-1 part.

39
Q

What can affect the computational complexity of the Apriori algorithm?

A

Low support thresholds.

Number of items (dimensions).

Number of transactions.

Average transaction width.

40
Q

Elaborate on rule generation.

A

Extracting association rules from a given frequent itemset.

Each frequent k-itemset Y can produce a number of association rules: we partition Y into a non-empty X and Y − X such that the rule X –> Y − X SATISFIES THE CONFIDENCE THRESHOLD.

Each frequent k-itemset can produce 2^k − 2 rules, since we exclude the rules with an empty antecedent or an empty consequent.

Rule generation NEVER scans the transaction table. It works only on the frequent itemsets, so we only need to check confidence, and recall that confidence uses only support counts, which were already computed during frequent itemset generation.

THE RULE GENERATION PROBLEM: Although we could generate all possible rules and check them against minconf, the exponential number of rules makes this infeasible in larger cases. Therefore we need what is called confidence-based pruning.

The confidence theorem says that if a rule X –> Y − X is not confident, then no rule X’ –> Y − X’ with X’ a subset of X can be confident either.

How can we use this theorem to prune? Since a failure for a large antecedent X rules out all rules with smaller antecedents (subsets of X), we start with X as large as possible, i.e. with only one item in the consequent (an empty consequent carries no information). If X –> {single item} is not confident, we prune the entire path of rules whose antecedent is a subset of X. If we are that lucky on the first level, we save a lot of work.
If the rule IS confident, we create new rules by merging the current rules, giving a level-wise strategy.

THE MERGING: We merge pairs of confident rules by combining their consequents and then removing from the antecedent any item that appears in the new consequent. When the merging is done, we again check confidence (using the pre-computed support counts) and prune.

The end result is the list of high-confidence rules obtainable from the k-itemset.
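
A simplified Python sketch of this level-wise rule generation for a single frequent itemset. It assumes a dict support that maps frozensets to pre-computed support counts (so no transaction scans happen here), and for brevity the next-level consequents are built by merging pairs of confident consequents only:

from itertools import combinations

def rules_from_itemset(itemset, support, minconf):
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]  # level 1: size-1 consequents
    m = 1
    while consequents and m < len(itemset):
        confident = []
        for cons in consequents:
            ante = itemset - cons
            conf = support[itemset] / support[ante]
            if conf >= minconf:
                rules.append((ante, cons, conf))
                confident.append(cons)
            # else: pruned; shrinking the antecedent further cannot help.
        # Merge confident consequents into candidate consequents of size m+1.
        consequents = list({a | b for a, b in combinations(confident, 2)
                            if len(a | b) == m + 1})
        m += 1
    return rules

# Example usage (hypothetical pre-computed support counts):
support = {frozenset("abc"): 2, frozenset("ab"): 3, frozenset("ac"): 2,
           frozenset("bc"): 2, frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 3}
print(rules_from_itemset("abc", support, minconf=0.6))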

41
Q

Give the confidence theorem

A

If a rule X –> Y − X does not satisfy the confidence threshold, then no rule X’ –> Y − X’, where X’ is a subset of X, can satisfy the confidence threshold either.

42
Q

Elaborate on how Apriori generates rules

A

Apriori uses a level-wise approach to generate rules. Each level corresponds to the number of items in the consequents of the rules, so at the first level the consequents contain only 1 item.

Initially, all high-confidence rules with exactly one item in the consequent are extracted. Apriori then uses these rules to generate new rules:

Say {a,b,c,d} is a frequent itemset. Suppose we have extracted the rules {a,c,d} –> {b} and {a,b,d} –> {c}; we can generate a new candidate rule by merging the consequents: {a,d} –> {b,c}.

This level-wise approach is useful because of the confidence theorem: if X –> Y − X has low confidence, then X’ –> Y − X’ has low confidence for every subset X’ of X. So by starting with X as large as possible (Y − X of size 1), we can quickly prune away large parts of the rule lattice whenever a first-level rule has low confidence.

43
Q

What is a maximal frequent itemset?

A

A frequent itemset is said to be maximal IF none of its immediate supersets are frequent. In other words, if {a,b} is frequent, it is maximal frequent if we cannot add any single item to it and still get a frequent itemset.

We use maximal frequent itemsets as a compact representation of the set of frequent itemsets.

44
Q

What are “closed itemsets”?

A

An itemset X is closed if none of its immediate supersets has exactly the same support count as X.

Recall that a superset can never have higher support than its subsets. Therefore, if an itemset is closed, all of its immediate supersets have strictly lower support.

If we have a list of ALL closed itemsets (with their support counts), we can derive the support count of every itemset.

NB: this collection of itemsets can still be extremely large.

45
Q

What is a closed frequent itemset?

A

An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to the threshold value minsup.

46
Q

What is the strength and weakness of the APRIORI algorithm?

A

The strength is that it prunes a great deal.

The weakness is that it requires significant overhead in the form of many scans over the transaction table.

47
Q

What is the difference between support count of X –> Y and XUY?

A

Nothing, the support count is the same.

This is an important result: finding the support of an itemset is the same as finding the support of any association rule derived from it. Therefore, we can solve the association rule problem by first finding the support of the itemsets, checking them against the minsup threshold, pruning the ones that fail, and computing confidence for the ones that pass.

48
Q

Informally describe how Apriori runs its frequent itemset generation

A

Apriori starts by considering the 1-itemsets (the individual items). It counts them by looping through the entire transaction list, which reveals the frequency of each 1-itemset.

At this point, Apriori DISCARDS the 1-itemsets that do not meet the minsup threshold. This is the appropriate action because of the Apriori principle, which states that if some itemset X is infrequent, all of its supersets are infrequent as well.

Building on the remaining 1-itemsets, it then creates candidate 2-itemsets. Say we have 4 remaining 1-itemsets; then we can create 4C2 = 6 candidate 2-itemsets. Then we do the same thing again: we count their support and discard the ones below the minsup threshold.

This algorithm has 2 primary areas of concern:
1) The generation of new candidate itemsets.
2) The counting of support.

49
Q

Elaborate on “candidate generation” from Apriori

A

We use the frequent (k-1)-itemsets from the previous iteration to create the new candidate k-itemsets.

There are multiple strategies for generating candidate itemsets; we consider 3:
1) Brute force
2) Fk-1 x F1
3) Fk-1 x Fk-1

IMPORTANT: every candidate generation method must possess the following properties:
1) Completeness: it must NOT omit any frequent itemsets.
2) Non-redundancy: it must NEVER generate the same itemset more than once.

Brute force serves mainly as a baseline of what one could do. It would consider every combination, dCk, where d is the total number of items and k is the current level.

Fk-1 x F1 is a merging technique that takes Fk-1, the set of frequent (k-1)-itemsets, and merges each of them with items from F1. In other words, it extends a (k-1)-itemset with one frequent item to create a k-itemset. Since itemsets are sets, we only extend with items that are not already in the set.
This procedure is complete, because every frequent k-itemset is composed of a frequent (k-1)-itemset and a frequent 1-itemset.
However, it is NOT non-redundant by default. We can make it non-redundant by keeping the items in lexicographic order and only extending with items that come after the last item of the (k-1)-itemset.

Fk-1 x Fk-1 is the one Apriori actually uses. It merges two frequent (k-1)-itemsets IF their first (k-2) items (sorted in lexicographic order) are identical.

50
Q

How many rules can a single k-itemset generate?

A

2^k − 2.

The minus 2 removes the two trivial rules X –> Ø and Ø –> X.

51
Q

Regarding the candidate itemsets that survive candidate pruning, what happens next?

A

We move on to support counting. In this course, we just loop through the transaction list and count occurrences, but note that there are other, much better options.

52
Q

Say we have an itemset, Y. How does this go on to become rules?

A

IF the itemset Y is frequent, it can generate rules by being partitioned into two non-empty sets, X and Y − X, such that the rule X –> Y − X satisfies the confidence threshold.

In other words, we take the frequent itemsets and partition each of them in all ways that also satisfy the minconf threshold.

For instance, the itemset {a,b,c} has 6 CANDIDATE association rules: {a,b} –> {c}, {a,c} –> {b}, {b,c} –> {a}, {a} –> {b,c}, {b} –> {a,c}, {c} –> {a,b}. The support of each of these is IDENTICAL to the support of the itemset.

IMPORTANT: computing the confidence of these rules does NOT require additional scans of the data. Confidence is defined as the support count of X U Y divided by the support count of X, and these support counts were already calculated when we generated the frequent itemsets.

53
Q

How does pruning work in rule generation?

A

Rule generation is a level-wise approach where we start by finding the rules with a single item as the consequent. If such a rule has low confidence, we discard it; if it has high confidence, we keep it and continue with it.
To get to the next level, we merge two high-confidence rules together.

NB: this is a systematic way of checking rules that lets us prune away rules we know can never work. In principle we check all rules from the itemset for confidence, but by first checking the rules with a single item on the RHS (consequent), any rule that fails the confidence test lets us prune the entire subtree below it: we simply exclude that rule from further merging.

So, in other words, we could in theory use brute force and check every possible rule, but the merging technique lets us avoid most of that work.

54
Q

This is an important question.

Elaborate on the APRIORI algorithm, and on the distinction between pruning-methods used in it.

A

The algorithm computes the set of frequent itemsets.

It follows these steps:
1) Count the 1-itemsets and discard those below the threshold. This is the first type of pruning, and it is really about “not entering paths that can never be frequent”. These paths can never become frequent because of the Apriori principle: if a set is infrequent, adding elements to it cannot make it more frequent, only possibly less frequent.

2) We now have a list of 1-itemsets that all pass the minsup test. We want to use these 1-itemsets to generate the candidate 2-itemsets. How? Generation of candidate k-itemsets from frequent (k-1)-itemsets is typically done with Fk-1 x Fk-1, which is a merging procedure. We perform this merge for EVERY pair of frequent (k-1)-itemsets that share their first (k-2) items. NB: this requires the items to be kept in lexicographic order.

3) Candidate pruning. Candidate pruning is done on a candidate k-itemset X by considering its immediate proper subsets X − {ij} for every item ij in X. IMPORTANT: candidate pruning is simply the Apriori principle applied so that we reduce the amount of counting we have to do against the transaction table. It is extremely cheap, because it only uses support counts we have already computed. Here is an example: say {a,b} and {a,c} are frequent, and {b,c} is infrequent. Fk-1 x Fk-1 merges {a,b} and {a,c} into the candidate {a,b,c}. The next step would be support counting, but BEFORE THAT, WE DO CANDIDATE PRUNING: we examine all immediate proper subsets, {a,b}, {a,c} and {b,c}. Because {b,c} was found earlier to be infrequent, the Apriori principle tells us the candidate cannot be frequent, so we prune this path. Since it was the only candidate, the frequent itemset generation is done, with {a}, {b}, {c}, {a,b}, {a,c} as the frequent itemsets.

4) Now we do the counting. This involves checking each surviving candidate k-itemset against every transaction, finding its support count, and testing it against minsup. (There are better ways to do the counting, though.)

Note that candidate pruning is done WITHOUT computing any new support values.

IMPORTANT

Candidate pruning is about doing whatever we can to reduce the number of candidates we have to count in the transaction list. If we can cut the candidates in half, we massively increase performance, especially on long transaction lists: either we scan the transaction list once per candidate, or we spend more time on each row. Candidate pruning reduces this load significantly.

55
Q

SHORT:

What “exactly” is candidate pruning?

A

Candidate pruning examines a newly generated candidate k-itemset’s k “immediate proper subsets”, using previously computed support counts, and tests them against minsup. Then the Apriori principle is used to prune; recall that the principle says “if a set is infrequent, its supersets must be infrequent as well”.

Worth noting:
If we used brute force to generate the candidate, we need to check all k immediate proper subsets.
If we used Fk-1 x F1, we only need to check k − 1 of the immediate proper subsets, since the support of the Fk-1 part was already verified in the previous iteration.
If we used Fk-1 x Fk-1, we only need to check k − 2 of the immediate proper subsets, for the same reason.

56
Q

Elaborate on Maximal frequent itemset

A

The point is that on very large data sets we may get an extremely large collection of frequent itemsets. It is therefore useful to represent them with a smaller set, and this is where maximal frequent itemsets come into play.

Definition: a frequent itemset is “maximal” if none of its immediate supersets are frequent.

This definition means that given a maximal frequent itemset, we cannot add any item (or set of items) to it and still have a frequent itemset. It is as large as it can get.

The useful property is that the maximal frequent itemsets let us derive all other frequent itemsets: by the Apriori principle, every subset of a frequent itemset is frequent, so every frequent itemset is a subset of some maximal frequent itemset.

57
Q

What can we say about support and maximal frequent itemsets?

A

Maximal frequent itemsets do not carry information about the exact support counts of their subsets. The only information is that all subsets meet the threshold.

It is, in some cases, desirable to have a structure of minimal size that still preserves the support values of its subsets. This is where closed itemsets come into play.

58
Q

Elaborate on ECLAT

A

ECLAT uses a vertical data layout rather than a horizontal one. Each item gets its own column, and the column holds the transaction IDs (TIDs) of the transactions that contain the item.

With this layout, we can quickly find the support count of any itemset by intersecting the TID-lists of the items involved and counting the size of the result.
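
A tiny Python sketch of the idea, with made-up transactions (the vertical layout maps each item to its set of TIDs):

# Hypothetical horizontal data: TID -> items
transactions = {
    1: {"a", "b", "e"},
    2: {"b", "c", "d"},
    3: {"a", "b", "d"},
    4: {"a", "c"},
}

# Build the vertical layout: item -> set of TIDs containing it.
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

# Support count of {a, b} = size of the intersection of the TID-lists.
support_ab = len(tidsets["a"] & tidsets["b"])
print(support_ab)  # 2 (transactions 1 and 3)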

59
Q

Elaborate on FP growth

(I don’t think this answer is entirely correct)

A

FP-growth is an algorithm that runs on an FP-tree and generates the frequent itemsets.

Here is how it works:

The algorithm, in its most basic form, does one of two things:
1) If the input FP-tree is a single straight path, it finds all subsets of the items on this path, sets the support of each subset to the lowest count among the nodes of that subset, and for every subset stores the union of the subset and the current PREFIX as a frequent itemset.

2) If the input is not a single path, it goes deeper and solves a subproblem: it builds a smaller, conditional FP-tree consisting of a subset of the paths, with counts updated according to the item that was appended to the prefix. It then makes a recursive call with the current prefix, the conditional FP-tree and the set of frequent itemsets F as arguments.

Before either of these happens, the algorithm uses the node-link pointers in the FP-tree to remove infrequent items.

A couple of important points:
When a single path is reached, the algorithm UNIONs the current prefix with the different subsets of the path and adds each combination to the set of frequent itemsets, with the support set to the lowest count among the nodes included in that subset.

When we add a frequent itemset at a point where we did not have a single path, we add just a single itemset per loop iteration, and its support equals the support of the item we just unioned with the current prefix.

60
Q

Drawback of maximal frequent itemsets?

A

No information regarding the support of the subsets; we need to perform additional scans if we want to find it.
That is, the subsets of the maximal frequent itemsets come with no support information.

61
Q

Benefit of closed itemsets?

A

Closed itemsets provide a compact representation of all itemsets without losing support information.

62
Q

Define closed itemset.

A

An itemset X is closed if none of its immediate supersets has exactly the same support count as X.

63
Q

Give the correct and complete procedure of FP growth

A

First we build the frequent pattern tree (FP-tree).
We do this by first finding the support of all individual items. Then we sort the items within each transaction in decreasing order of support (and drop the infrequent items).
We then build the FP-tree incrementally by inserting the internally sorted transactions one by one, updating the count values along the way.
When all transactions have been inserted into the FP-tree, this phase is done.

The next step is to create the projected/conditional FP-trees.
We do this for every item, and recursively; the recursion bottoms out when we reach a single path.
The procedure is to take a suffix item, add it to the prefix, and build the conditional FP-tree for that prefix, updating the count values accordingly.
When we reach a single path, we mine the frequent patterns by taking all combinations of the nodes on the path whose count meets the support threshold.
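
A minimal Python sketch of the first phase only (building the FP-tree), assuming in-memory transactions and a plain dict-based node; the header/node-link table and the conditional-tree mining phase are omitted:

from collections import Counter

def build_fp_tree(transactions, minsup_count):
    # 1) Support of individual items; drop the infrequent ones.
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in item_counts.items() if c >= minsup_count}

    # 2) Tree nodes are plain dicts: {"count": int, "children": {item: node}}.
    root = {"count": 0, "children": {}}
    for t in transactions:
        # Keep frequent items, sorted by decreasing support (ties broken by name).
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node["children"].setdefault(item, {"count": 0, "children": {}})
            child["count"] += 1  # shared prefixes share nodes and accumulate counts
            node = child
    return root

print(build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "b", "d"}], minsup_count=2))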

64
Q

Elaborate on ECLAT

A

A method for finding support counts of itemsets.

We use a vertical layout: for each item, a column of the transaction IDs of the transactions that contain it.

If we want the support count of a specific itemset, we intersect the TID-lists of its individual items and count the size of the result.

65
Q

Downside of ECLAT?

A

Although it is very fast, it requires a lot of memory: the TID-lists and intermediate intersections can become too large to hold.

66
Q

Downside of association rules?

A

The algorithms typically produce far too many rules, and most of them are either redundant or uninteresting.

67
Q

What is a measure we can use for “interestingness” regarding association rules?

A

Lift:

Lift(A –> B) = c(A –> B) / s(B)

A lift greater than 1 means A and B occur together more often than expected if they were independent; a lift below 1 means less often.
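
A quick worked example with made-up numbers: if c(A –> B) = 0.75 but B appears in 90% of all transactions, then Lift = 0.75 / 0.9 ≈ 0.83 < 1. Despite the high confidence, the rule is uninteresting; A actually makes B slightly less likely than it is on average.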

68
Q

Drawbacks of Apriori?

A

Even though it prunes a lot, it still requires many scans of the data and generates a large number of candidates and rules.

69
Q

Elaborate on how Apriori actually generates rules

A

Each frequent itemset is a candidate that can generate many rules. To be precise, each frequent k-itemset can generate 2^k − 2 rules, excluding rules involving empty sets.
So, for each frequent itemset, we need a procedure for extracting those rules.

We do the following:
1) We make use of the confidence theorem. It states that if the confidence of a rule X –> Y − X is too low, then no rule whose antecedent is a subset of X can be confident either. Recall that Y is the entire frequent itemset, so shrinking X does not change which items are involved in the rule; it only moves items from the antecedent to the consequent.

2) With the confidence theorem, we use a level-wise approach that starts with rules that have a single item as the consequent, because this offers the greatest pruning potential.

3) We generate rules, check confidence, and prune the subtree below a rule if its confidence is too low. If the confidence is good, we keep the rule. In the next iteration we merge the remaining high-confidence rules. Merging two rules is done by merging their consequents and then removing from the antecedent any item that is also in the consequent.

4) The merging of rules is actually quite easy. Each rule is merged once with every other rule at the same level; with 4 rules we get 3+2+1 = 6 new candidate rules.
THEN, after the candidate rules have been generated, we perform candidate pruning, as discussed earlier. So the correct order is to generate all candidate rules at a level first, and then prune.

5) After candidate pruning, we measure the confidence of the new rules (recall that the candidate pruning was based purely on the previous level). If the confidence is good enough, we store the rule; if it is too low, we discard it.

70
Q

By using closed itemsets, if some itemset is not closed, its support is equal to…?

How does this work with “closed frequent itemset”?

A

Equal to the maximum support count among its immediate supersets (applied recursively).

When working with closed FREQUENT itemsets, the same idea applies, but we only need to consider the support of the frequent supersets.

71
Q

Elaborate on how we actually find the support counts using only the closed frequent itemsets

A

Given the list of closed frequent itemsets, we do the following:

1) First we need to enumerate the set of all frequent itemsets; otherwise we would not know which itemsets to assign a count value to.

2) We can find the set of all frequent itemsets in different ways. One way is to take the closed frequent itemset with the most items and write down all of its subsets, then do the same for the second largest, and so on for every closed frequent itemset. This will produce a lot of overlap, so the next step is to go over the list and remove duplicates.

3) With this list in hand, we can start the support-assignment procedure. We loop over the itemset sizes, starting with k = largest itemset size and iterating down to k = 1.
In each iteration we go through all frequent itemsets of that size. If an itemset is closed, it already has the correct support count, so we move on. If it is not closed, we must assign it a count: we give it the maximum support count found among the itemsets of the previous iteration (size k + 1) that are supersets of it. We do this for all non-closed frequent itemsets, then move on to the next size.

4) When we have looped through all sizes, we are essentially done.

x) Some clarifications: we never actually “count” support values; physical counting would require scanning the transaction list, which takes time. Instead we use the closed frequent itemsets and the fact that only subsets of closed frequent itemsets can be frequent (if some frequent itemset were not such a subset, it would itself have to appear in the closed list, a contradiction). So we only enumerate combinations of items, which is much cheaper than scanning a possibly vast number of transactions. For instance, we could have a billion transactions with itemsets of maximum size 3; enumerating the combinations of 3 items is far easier than counting over a billion records.
After solving this “combinatorial problem”, we make a simple pass over the enumerated frequent itemsets and assign each non-closed one the maximum support count among its supersets of size one larger. This can be extremely fast compared to the alternative of scanning millions and millions of records.
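
A small Python sketch of the derivation. For brevity it uses the equivalent direct rule (support of a frequent itemset = maximum support among its closed supersets) rather than the size-by-size pass; the example data is made up:

from itertools import combinations

def supports_from_closed(closed_support):
    # closed_support maps frozenset (closed frequent itemset) -> support count.
    # 1) Enumerate all frequent itemsets: the non-empty subsets of the closed ones.
    frequent = set()
    for c in closed_support:
        for size in range(1, len(c) + 1):
            frequent.update(frozenset(s) for s in combinations(c, size))
    # 2) Support of any frequent itemset = max support among its closed supersets.
    return {x: max(cnt for c, cnt in closed_support.items() if x <= c)
            for x in frequent}

closed = {frozenset({"a", "b", "c"}): 2, frozenset({"a", "b"}): 3, frozenset({"b"}): 5}
derived = supports_from_closed(closed)
print(derived[frozenset({"a"})], derived[frozenset({"c"})])  # 3 2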

72
Q

What is the relationship between closed frequent itemsets and maximal frequent itemsets?

A

The set of maximal frequent itemsets is ALWAYS a subset of the set of closed frequent itemsets.

Recall that maximal frequent itemsets are the frequent itemsets none of whose immediate supersets are frequent, i.e. all immediate supersets are infrequent.

Recall that closed frequent itemsets are the frequent itemsets with no immediate superset of the same support count. Since the support of a superset can never be higher, it must then be strictly lower.

Therefore, an itemset that is maximal frequent is also closed frequent: its immediate supersets are infrequent, so they certainly cannot have the same support count.

73
Q

In the rule generation procedure, the lattice structure shows varying numbers of rules being merged together to create a new rule at the next level. Explain this.

A

It is just a way of removing redundant rules. At some point, combining rules no longer always produces unique rules. We only keep one copy, and the lattice illustrates this by making it look as if, say, 3 rules were merged into a new one; in reality each merge is still of 2 rules, but several different pairs can produce the same rule.

74
Q

Given a brute force approach, how many rules will be generated from d unique items?

A

There are 2^d possible itemsets.

The total number of rules is R = 3^d − 2^(d+1) + 1.
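
A quick sanity check for d = 3: the formula gives 3^3 − 2^4 + 1 = 27 − 16 + 1 = 12. Counting directly: each of the three 2-itemsets gives 2^2 − 2 = 2 rules and the single 3-itemset gives 2^3 − 2 = 6 rules, so 3·2 + 6 = 12. The two counts agree.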

75
Q

What are the downsides of Apriori for frequent itemset generation?

A

It requires the generation of huge numbers of candidate itemsets.

It requires many database scans.

76
Q

Given a frequent k-itemset, how many frequent itemsets does it contain?

A

2^k − 1: all of its non-empty subsets, including itself. By the Apriori principle, every subset of a frequent itemset is also frequent.
