Chapter 4 - Association analysis Flashcards
What is a “market basket transaction”
In its simple form, it is a list of items along with an identifier of purchase.
Briefly, what is association analysys all about?
Association analysis is all about discovering “Hidden” relationships among large data sets.
These relationships can be represented as sets of items present in many transactions, known as frequent transactions, OR as association rules.
What is “frequent itemset”?
Frequent itemset refers to a potential discovery of what items are purchased most.
What is “association rule”?
Association rules suggests a relationship between itemsets. For instance:
{Burgerbrød} –> {Burgerkjøtt}.
The association rules suggests that there is a relation betwene these two items.
Is associaiton rule mining limited to market basket transactions?
Of course not. It is just a simple example. We use association rules in every type of data set.
What are the two issues regarding association rule mining used for market basket transactions?
1) Discovering the patterns from a large data set is computationally expensive.
2) Some of the “rules” happen merely by chance and are of no importance. Therefore, some of the “rules” are more important than others.
We are primarily interested in the first part in this course.
What is “binary representation”?
using binary variables/attributes for each possible item, indicating whether it is present in the basket or not. Since binary, we only care about presence, not magnitude.
In the case of market basket transactions, the binary representation is likely to be asymmetric. Asymmetric is about the fact that we dont give a fuck about the “not present 0”, but rather are much more interested in the “1’s”. This is partly because of the fact that, for instance, customers in a store only buy a very small subset of the items that the store has to offer. Therefore it makes little sense to actually care about the items they do not buy. It is much more interesting to watch patterns of items bought together as a direct consequence of the fact that that customers only buy small subsets.
What is “itemset and support count”?
What is support?
In association, a collection of zero or more items is called an itemset.
We can be more specific, and say that if the itemset has k items, it is called a k-itemset.
Support-count is the number of transactions supporting/containing a certain itemset.
In other words, it is a “count” datatype, and it holds the number of times that we find the specific itemset in the entire list of transactions.
Often, we are also interested in “support”, which is the ratio of the support count divided by the total number of transactions.
what is a k-itemset+
Just an itemset (set of 1 or more items) that acutally has k items.
When do we call an itemset “frequent”?
It is frequent if its “support-value” is greater than some pre-specified value that we use as a threshold, called minsup.
We say that a transaction tj contains an itemset X if…?
The transaction tj contains an itemset X if X is a subset of the itemset in the transaction.
More technically, what is an “association rule”?
An association rule is an implication on the form X –> Y, where X and Y are disjoint sets, meaning they contain no same value.
We measure the STRENGTH of an association rule in terms of SUPPORT and its CONFIDENCE levels.
Support determines how often the association rule applies in a data set, while confidence determines how frequently items in Y appear in transactions that contain X.
Define the support of an association rule
Support of an association rule determines how frequently the rule is applicable in a data set:
The sigma value indicates the number of times the union XUY happens in the data set.
Define the confidence of an association rule
Confidence determines how frequently items in Y appear in transactions that contain X:
It looks like “the number of times both happen together, divided by the number of times only X happen”. Therefore, a high confidence means that “if X happens, Y happens”
Danger of low support is …?
That the rule happened merely by chance.
What can we say about association rules and causality?
Nothing.
The association rules tells us relationships that are likely to occur. It tells us nothing about what causes something, or what effects will happen.
if we buy burger, does this cause us to buy burger bread? Or is it hte burger bread that cause us to buy the burger?
Formally define the association rule mining problem
given a dataset containing a list of transactions T. Find all association rules that have support and confidence greater than our thresholds, minsup and minconf.
How do we increase performance of association rule mining?
We need to prune. This is essential because of how the number of possible rules is actually exponential in the number of items.
The pruning is about recognizing that the support of a rule X->Y is the same as the support for XUY, which is actually the same for many rules. Therefore, if we find that one is low in support, we can automatically prune all the others that share the same support expression.
Elaborate on the common strategy in association rule mining
Decompose the problem into 2 major subtasks:
1) Frequent itemset generation
2) Rule generation
Frequent itemset generation is about finding all itemsets that satisfy the support level of our threshold, minsup.
Rule generation aims to extract all the high-confidence rules from the itemset that we generated in “frequent itemset generation”.
Why is it useful to decouple the associaiton rule mining strategy into two parts?
We take advantage of the fact that, due to how the support is defined, we will have many subsets that have the same level of support. All of these subsets can be pruned away if the itemset is infrequent.
What structure can we use to enumerate all possible itemsets?
A lattice structure.
A data set that contains k items can at max generate 2^k-1 frequent itemsets, excluding the null set.
What is a candidate itemset?
A candidate itemset is a possible frequent itemset. It is just any itemset. We call it candidate before we have counted the support and dtermined if it is frequent or not.
Define “transaction width”
The number of items present in a transaction
Explian brute force frequent item generation
We check every possible candidate set against every transaction. This is terrible, because it requires insane computation time.
How can we reduce computations in “frequent item generation”?
3 ways:
1) Reduce the number of candidate sets (Apriori)
2) Reduce the number of comparisons
3) Reduce the number of transactions
Elaborate on the Apriori principle
If an itemset is frequent, then all of its subsets must also be requent.
recall the support-count definition: Number of times an itemset occurs in the transaction list. Support is defined as the support-count divided by the total number of transactions.
The apriori principle ALSO states that if some subset is infrequent, then all of its supersets must also be infrequent.
Directly from this result, we know that if we discover that {a,b} is INFREQUENT; then we can PRUNE every superset that contains {a,b}. This is called support-based pruning.
What is “support-based pruning”?
Using support levels to determine if a subset is frequent or infrequent, and use the result to prune away supersets in the entire space of itemsets.
Support based pruning is possible because the support of a superset NEVER exceed the support for its subsets. This is called the anti monotone property of support measure.
Statement (not question):
The support for an itemset NEVER exceeds the support for any of its subsets.
What is the “anti-monotone property” of the support measure?
A measure “f” possesses the anti-monotone property if for every itemset X that is a proper subset of Y, we have f(Y) <= f(X).
It is the anti-monotone property of support-measure that allows us to use support-based pruning. The statement says that a superset can never be more frequent than a regular set, or a subset.
How does Apriori “start”?
Apriori considers every itemset to be 1-itemsets. That is, it first looks at individual support-counts, meaning that it looks at individual items and their support counts.
If a support is lower than minsup, discard immediatly.