Final Exam Flashcards

Question

How does the FP growth algorithm mine frequent itemsets?

Answer 1

It uses a recursive divide-and-conquer approach. It uses pointers to assist frequent itemset generation. It requires preprocessing.

Answer 2

A graphical plot with the TP rate on the y-axis and the FP rate on the x-axis. Each point corresponds to a model induced by the classifier.

Answer 3

An effective clustering algorithm
An abstraction of the input data

Answer 4

Used to compare the performance of models.
# of correct predictions / # of predictions = (TP + TN) / (TP + TN + FP + FN)
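As a quick check of the formula, here is a minimal Python sketch. The function name `accuracy` and the example counts are made up for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN (100 predictions in total).
example_accuracy = accuracy(tp=40, tn=45, fp=5, fn=10)
```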

Answer 5

1. Statistical (Grubbs' test and linear regression)
2. Density-based (DBSCAN)
3. Proximity-based (kNN)

Answer 6

The fraction of negative examples predicted as the positive class.
FPR = FP / (FP + TN)

Answer 7

Reads the data set one transaction at a time.
Maps each transaction onto a path in the FP-tree.
Paths may overlap when transactions share items.
The more the paths overlap, the more compression is achieved.
Sometimes this makes the tree small enough to fit into main memory.
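The construction can be sketched in Python as follows. This is only an illustration: the names `FPNode`, `build_fp_tree` and `count_nodes` are made up, the node-link pointers used during mining are omitted, and ordering items by descending support (after a first counting pass) is the usual FP-growth convention rather than something stated on this card.

```python
from collections import Counter


class FPNode:
    """One node of the tree: an item, a count, and child nodes keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, minsup_count):
    # First pass: count how many transactions contain each item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in support.items() if c >= minsup_count}

    root = FPNode(None)
    # Second pass: map each transaction onto a path, one transaction at a time.
    for t in transactions:
        # Keep frequent items only; sort by descending support (ties by name)
        # so that similar transactions share a common prefix.
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1  # overlapping paths just increment the count
            node = child
    return root


def count_nodes(node):
    """Number of item nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())


# Hypothetical data: 11 item occurrences compress into 6 tree nodes.
tree = build_fp_tree([["a", "b"], ["b", "c", "d"],
                      ["a", "b", "c"], ["a", "b", "d"]], minsup_count=2)
```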

Answer 8

An itemset is maximal frequent if none of its immediate supersets is frequent
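A small Python sketch of the definition, assuming the full collection of frequent itemsets is already known (the function name `maximal_frequent` is made up). Because every subset of a frequent itemset is frequent, checking for any frequent proper superset is equivalent to checking the immediate supersets.

```python
def maximal_frequent(frequent_itemsets):
    """Return the frequent itemsets that have no frequent proper superset."""
    frequent = {frozenset(s) for s in frequent_itemsets}
    # `s < other` is the proper-subset test on frozensets.
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical frequent itemsets: {a}, {b}, {c}, {a, b}.
example_maximal = maximal_frequent([{"a"}, {"b"}, {"c"}, {"a", "b"}])
```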

Answer 9

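1. Point: a single data point that is anomalous with respect to the rest of the data
2. Contextual: a point that is anomalous only in the context you are looking at
3. Collective: a group of related points (for example a sequence) that is anomalous as a whole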

Answer 10

FP-tree
Hash tree

Answer 11

The fraction of truly positive examples among the examples declared as positive.
p = TP / (TP + FP)
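A minimal Python sketch that counts TP and FP directly from label lists; the function name `precision` and the example labels are illustrative only.

```python
def precision(y_true, y_pred, positive=1):
    """TP / (TP + FP), computed over the examples predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        raise ValueError("no examples were predicted positive")
    return tp / (tp + fp)


# Hypothetical labels: three examples are declared positive, two of them correctly.
example_precision = precision([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
```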

Answer 12

1. Feature selection
2. Dimensionality reduction
3. Normalization

Answer 13

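Bagging (bootstrap aggregating): sample repeatedly with replacement from the original data to create new training sets, train a model on each, and combine their predictions. The examples left out of a given bootstrap sample can be used for out-of-bag error estimation. Reduces variance and helps to avoid overfitting.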
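A rough Python sketch of the procedure, using only the standard library. All names here (`bootstrap_indices`, `bagging_fit`, `bagging_predict`, `fit_majority_class`) are hypothetical, and the base learner is deliberately trivial; in practice it would be something like a decision tree.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def bagging_fit(X, y, fit_base, n_models=11, seed=0):
    """Train one base model per bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = bootstrap_indices(len(X), rng)
        models.append(fit_base([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Aggregate the base models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def fit_majority_class(X, y):
    """Toy base learner (assumption): always predict the most common label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label
```

Averaging the base models' outputs instead of voting gives the regression variant.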

Answer 14

If an itemset is frequent, then all of its subsets are frequent. Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too.

Answer 15

* A generalization of the Euclidean distance
* It's the same as the Euclidean distance, but with a parameter r instead of 2
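In Python this might look like the sketch below, where `minkowski_distance` is an illustrative name. With r = 2 it gives the Euclidean distance and with r = 1 the Manhattan (city block) distance.

```python
def minkowski_distance(x, y, r=2.0):
    """(sum_k |x_k - y_k|^r)^(1/r) for two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


euclidean = minkowski_distance((0, 0), (3, 4), r=2)  # the familiar 3-4-5 triangle
manhattan = minkowski_distance((0, 0), (3, 4), r=1)
```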

Answer 16

deterministic? yes
susceptible to falling into a local minimum? no
need to specify # of clusters? no
is noise removed? yes
partition-based? no
can handle varying densities? yes

Answer 17

Let k = 1.
Generate frequent itemsets of length 1.
Repeat until no new frequent itemsets are found:
1. Generate candidate itemsets of length k+1 from the frequent itemsets of length k.
2. Prune candidate itemsets containing subsets of length k that are infrequent.
3. Count the support of each candidate by scanning the DB.
4. Eliminate candidates that are infrequent.
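The steps above can be sketched in Python roughly as follows. This is an unoptimized illustration: the function name `apriori`, the use of an absolute support count, and the market-basket data are assumptions, and candidates are generated by naive pairwise unions rather than a hash tree.

```python
from itertools import combinations


def apriori(transactions, minsup_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # One scan of the database per candidate.
        return sum(1 for t in transactions if itemset <= t)

    # k = 1: frequent itemsets of length 1.
    current = {}
    for item in {i for t in transactions for i in t}:
        single = frozenset([item])
        count = support(single)
        if count >= minsup_count:
            current[single] = count
    frequent = dict(current)

    k = 1
    while current:
        # 1. Generate (k+1)-candidates by joining frequent k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # 2. Prune candidates that have an infrequent subset of length k.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # 3. Count support, and 4. eliminate infrequent candidates.
        current = {}
        for c in candidates:
            count = support(c)
            if count >= minsup_count:
                current[c] = count
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent_itemsets = apriori(baskets, minsup_count=3)
```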

Answer 18

The fraction of negative examples predicted correctly by the classifier.
TNR = TN / (TN + FP)

Answer 19

An itemset that meets the minsup threshold.

Answer 20

deterministic? no (the map is randomly initialized)
susceptible to falling into a local minimum? yes
need to specify # of clusters?
is noise removed? no
partition-based? not sure...
can handle varying densities? not sure...

Answer 21

create 1 new attribute with a unique representation for each ordinal value

Answer 22

The fraction of positive examples predicted correctly by the classifier.
TPR = TP / (TP + FN)

Answer 23

Sort the samples by their threshold values (predicted scores) from lowest to highest. Using each value in turn as the cutoff, compute the TPR and the FPR, and plot the TPR against the FPR.
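A minimal Python sketch of this procedure. The function name `roc_points`, the rule "predict positive when score >= threshold", and the example scores are assumptions made for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, lowest threshold first."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")

    points = []
    for threshold in sorted(set(scores)):  # lowest to highest
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y != 1)
        points.append((fp / neg, tp / pos))
    # A threshold above every score predicts nothing as positive.
    points.append((0.0, 0.0))
    return points


# Hypothetical scores for 3 positive and 2 negative examples.
example_curve = roc_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2])
```

With the lowest threshold everything is predicted positive, so the curve starts at (1, 1) and moves toward (0, 0) as the threshold rises.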

Answer 24

deterministic? yes
susceptible to falling into a local minimum? no
need to specify # of clusters? no
is noise removed? yes
partition-based? yes
can handle varying densities? no

Answer 25

* AdaBoost creates many classifiers (models) in sequence and repeatedly draws samples according to their weights.
* Samples that are easy to classify get a lower weight, and ones that are harder to classify get a higher weight.
* If any intermediate round produces an error rate higher than 50%, the weights are reverted to their initial uniform values and the resampling procedure is repeated.
* Each classifier also gets a weight.
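One boosting round can be sketched in Python as below. The function name `adaboost_round` is made up, and the formula alpha = 0.5 * ln((1 - error) / error) is the common textbook form of the classifier weight, which this card does not spell out.

```python
import math


def adaboost_round(weights, correct):
    """Return (new_sample_weights, classifier_weight) after one round.

    weights -- current sample weights, assumed to sum to 1
    correct -- one bool per sample: was it classified correctly this round?
    """
    n = len(weights)
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error > 0.5:
        # Worse than chance: discard the round and revert to uniform weights.
        return [1.0 / n] * n, None
    error = max(error, 1e-10)  # guard against log of infinity for a perfect round
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Correctly classified samples shrink, misclassified samples grow.
    scaled = [w * math.exp(-alpha if ok else alpha)
              for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], alpha


# Hypothetical round: four equally weighted samples, the last one misclassified.
new_weights, alpha = adaboost_round([0.25] * 4, [True, True, True, False])
```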

Answer 26

1) The different classifiers make different mistakes on the data
2) The different classifiers perform better than random guessing

Answer 27

1. Simple random sampling: each object is selected with equal probability
2. Sampling with replacement: do not remove the object from the population when it is selected
3. Sampling without replacement: remove the object from the population when it is selected
4. Stratified sampling: split the data into several partitions and sample randomly from each one
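These schemes map onto Python's standard `random` module roughly as follows. The wrapper names and the `key` and `per_stratum` parameters are illustrative choices, not part of any library.

```python
import random
from collections import defaultdict


def sample_with_replacement(data, k, rng):
    """The same object may be picked more than once."""
    return rng.choices(data, k=k)


def sample_without_replacement(data, k, rng):
    """Each object can be picked at most once."""
    return rng.sample(data, k)


def stratified_sample(data, key, per_stratum, rng):
    """Partition the data by key(obj), then sample randomly inside each partition."""
    strata = defaultdict(list)
    for obj in data:
        strata[key(obj)].append(obj)
    picked = []
    for group in strata.values():
        picked.extend(rng.sample(group, min(per_stratum, len(group))))
    return picked
```

For example, `stratified_sample(list(range(20)), lambda x: x % 2, 3, random.Random(42))` draws three even and three odd numbers.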

Answer 28

1. Normalization techniques
   - scaling
   - min-max
   - decimal scaling
2. Choose an algorithm that is not affected by different ranges (e.g. decision trees)
3. Mahalanobis distance
4. PCA: captures the largest variation and reduces dimensions
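Two of the normalization techniques can be sketched in plain Python. The function names are made up, and decimal scaling is taken here to mean dividing by the smallest power of 10 that brings every absolute value below 1.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map the values onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no range information.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def decimal_scale(values):
    """Divide by 10^j, with j the smallest integer such that max |v| / 10^j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


scaled = min_max_scale([10, 20, 30])
shifted = decimal_scale([-500, 250, 999])
```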

Answer 29

An itemset is closed if none of its immediate supersets has the same support as the itemset. Closed itemsets give a compressed representation of the support information.
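A small Python sketch of the definition, assuming the support counts of all frequent itemsets are available in a dictionary (the name `closed_itemsets` is made up). Support never increases as items are added, so comparing against every proper superset in the dictionary gives the same result as comparing against the immediate supersets only.

```python
def closed_itemsets(support):
    """support maps frozenset -> support count; return the closed itemsets."""
    return {
        itemset
        for itemset, count in support.items()
        if not any(itemset < other and count == other_count
                   for other, other_count in support.items())
    }


# Hypothetical supports: {a} is not closed because {a, b} has the same support.
example_closed = closed_itemsets({
    frozenset({"a"}): 3,
    frozenset({"b"}): 4,
    frozenset({"a", "b"}): 3,
})
```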

(53 cards)