Pre-exam Flashcards

Question

What is partitional clustering?

Answer 1

A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Simply put, a division of the data into groups or subsets.

Answer 2

Generate two matrices (a proximity matrix and an incidence matrix), each with one row and one column for each data point. An entry of the incidence matrix is 1 if the associated pair of points belongs to the same cluster and 0 if they belong to different clusters. Compute the correlation between the two matrices. High correlation indicates that points that belong to the same cluster are close together.
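
A minimal sketch of this check in plain Python, assuming the proximity matrix holds similarities (here 1 minus the scaled Euclidean distance, so that a high correlation means a good clustering). The function names and the toy points are illustrative and not part of the original card.

```python
from itertools import combinations
from math import dist, sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def incidence_proximity_correlation(points, labels):
    """Correlate pairwise similarity with same-cluster membership.

    Only the entries above the diagonal are used, since both matrices
    are symmetric.
    """
    pairs = list(combinations(range(len(points)), 2))
    distances = [dist(points[i], points[j]) for i, j in pairs]
    max_d = max(distances)
    similarity = [1 - d / max_d for d in distances]  # assumed similarity measure
    incidence = [1 if labels[i] == labels[j] else 0 for i, j in pairs]
    return pearson(similarity, incidence)


# Hypothetical data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = incidence_proximity_correlation(points, [0, 0, 0, 1, 1, 1])
poor = incidence_proximity_correlation(points, [0, 1, 0, 1, 0, 1])
print(good, poor)
```

The labelling that follows the two groups should give a correlation close to 1, while the alternating labelling gives a much lower value.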

Answer 3

Recall measures the fraction of positive examples correctly predicted by the classifier. r = TP / ( TP + FN )

Answer 4

A collection of one or more items. E.g: (Milk, Bread, Diaper)

Answer 5

It can become computationally expensive to process, and the number of itemsets will be very large.

Answer 6

Just get rid of them.

Answer 7

Data sets with imbalanced class distributions. E.g. credit card fraud: fraud is very rare, but when it does occur it is important and should be given a higher weight.

Answer 8

A frequent itemset for which none of its immediate supersets is frequent. \* When a border is drawn in a lattice to distinguish between frequent and infrequent itemsets, the itemsets residing next to the border (on the frequent side) are maximal frequent itemsets. Their immediate supersets are infrequent. This is the same for non-maximal itemsets (on the other side of the border).

Answer 9

Using a dendrogram. The heights of the horizontal linkages represent the order in which clusters are formed.

Answer 10

Type of proximity or density measure; sparseness; attribute type; dimensionality; noise and outliers; type of distribution.

Answer 11

Tends to break large clusters; biased toward globular clusters.

Answer 12

Given a database of sequences and a user-specified minimum support threshold, minsup, find all subsequences with support \>= minsup

Answer 13

In the basic k-means algorithm, centroids are updated after all points are assigned to a centroid. An alternative is to update the centroids after each assignment (incremental approach).

Answer 14

If the scales differ, then yes standardization is necessary.

Answer 15

The validation of clustering is tricky. The result can look good, but this may not always be the case.

Answer 16

Used to measure the extent to which cluster labels match externally supplied class labels.

Answer 17

More expensive; introduces an order dependency; never gets an empty cluster (this approach is not used often).

Answer 18

Confidence measures the reliability of the inference made by a rule. The higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.

Answer 19

Less susceptible to noise and outliers; biased towards globular clusters; can be used to initialize k-means.

Answer 20

Takes the average values of pairs of attributes and subtracts them from the mean of values; uses the covariance to scale the distance measure; an alternative to normalization (scaling is built in); takes into account the spread of the data in a direction.

Answer 21

A full application of a complete data set to a neural network.

Answer 22

SSE can be used to compare two clusterings or two clusters (average SSE). It can also be used to estimate the # of clusters.

Answer 23

AdaBoost creates many classifiers / models and repeatedly draws from samples. Samples that are easy to classify get a lower weight, and ones that are harder to classify get a higher weight. If any intermediate round produces an error rate higher than 50%, the weights are reverted back and the resampling procedure is repeated. The classifier also gets a weight.

Answer 24

Brain imaging Weather

Answer 25

Goal of finding clusters that minimize or maximize an objective function. Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function.

Answer 26

Clusters that share some common property or represent a particular concept (for example, two overlapping rings).

Answer 27

Allow for some contamination by using slack variables.

Answer 28

Determining whether non-random structures exist in the data; comparing results to externally known results (e.g. class labels); determining which clustering technique is better; determining the "correct" number of clusters.

Answer 29

A model constructed with low variance and high bias tends to underfit training data. Both underfitting and overfitting can lead to a model that performs poorly.

Answer 30

Bias is a systematic shift in ground truth. The stronger the assumptions made by a classifier about the nature of its decision boundary, the larger the classifier's bias will be. Design choices such as choice of algorithm can introduce bias too.

Answer 31

Rules that have low support may only occur by chance. A low support rule is likely to be uninteresting from a business perspective.

Answer 32

1. Threshold (density) MinPts 2. Radius (how far out) Eps

Answer 33

A border point has fewer than MinPts (a user-specified threshold) points within Eps (a user-specified radius) but lies within the neighbourhood of a core point.

Answer 34

The fraction of negative examples correctly predicted by the model. TNR = TN / (TN + FP)

Answer 35

A matrix of items that were bought and items that were not bought.

Answer 36

To ensure that their worst-case generalization errors are minimized.

Answer 37

1. supervised class labels 2. unsupervised no class labels 3. semisupervised some class labels

Answer 38

The similarity of two clusters is based on the increase in squared error when the two clusters are merged.

Answer 39

The number of items decreases. It can be difficult to find an appropriate support threshold because most items in a store will have a low support count.

Answer 40

It's less susceptible to noise and outliers.

Answer 41

It shows how accuracy changes with varying sample size. Requires a sampling schedule. In the graph there is a horizontal bar near the top which is an upper limit on accuracy. You can never surpass this bar due to noise, etc. in the data.

Answer 42

One of the disadvantages is that, what should be one cluster, is often split into two.

Answer 43

Consider the problem to be the same as an imbalanced class problem. Two approaches: cost-sensitive; re-sampling.

Answer 44

The data must be linearly separable.

Answer 45

1) the different classifiers make different mistakes in the data 2) the different classifiers perform better than random guessing

Answer 46

It's a metric used to compare the performance of classifiers. Accuracy = (TP + TN) / (TP + TN + FP + FN)

Answer 47

1. A binary table 2. A lattice with a boundary where we found frequent items (all subsets will be frequent).

Answer 48

The items at the lower levels of the hierarchy may not have enough support to appear in the frequent itemsets (this is good because rules at the bottom tend to be overly specific and may not be as interesting).

Answer 49

partitional

Answer 50

1. Agglomerative 2. Divisive

Answer 51

If d is too small, then many normal points may have low density and thus a high outlier score. If d is too large, then many outliers may have densities (and outlier scores) that are similar to normal points.

Answer 52

In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1. The weights must sum to 1.

Answer 53

Can handle an unknown # of clusters; requires two user-defined parameters to be specified; partition-based (fixed # of clusters); partial clustering (doesn't get rid of all points); deals best with noise because it gets rid of it; worst of 3 approaches for clustering data with varying densities; deterministic.

Answer 54

No, but the confidence of rules generated from the same itemset does.

Answer 55

It divides the range into N intervals, each containing approximately the same number of samples.

Answer 56

Antecedent: the condition for a rule (left side). Consequent: the end result for a rule (a class) (right side).

Answer 57

Smaller data sets mean rows have more influence; need repeated passes over the data.

Answer 58

Cuts down on the number of rules by grouping.

Answer 59

The sum of Squared Error (SSE)

Answer 60

Minimize the maximum distance from clusters (tends to produce "round" things).

Answer 61

Agglomerative clustering algorithm

Answer 62

r = 1 is the city block (Manhattan, L1) distance, which is the Hamming distance for binary vectors; r = 2 is the Euclidean distance; r → ∞ is the supremum distance.
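
A small sketch of the Minkowski distance for the three values of r named above. The function name `minkowski` and the sample points are illustrative assumptions.

```python
def minkowski(p, q, r):
    """Minkowski distance between two equal-length points.

    r = 1 gives the city block distance, r = 2 the Euclidean distance,
    and r = float("inf") the supremum distance.
    """
    if r == float("inf"):
        # Limit case: the largest per-attribute difference.
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)


p, q = (0, 0), (3, 4)
print(minkowski(p, q, 1))             # city block
print(minkowski(p, q, 2))             # Euclidean
print(minkowski(p, q, float("inf")))  # supremum
```

For these two points the per-attribute differences are 3 and 4, so the three distances are 7, 5, and 4 respectively.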

Answer 63

Plot the SSE vs k (the # of clusters) and look for the knee in the curve. \* The knee method is good for spherical clusters because the knee may not be as pronounced otherwise.

Answer 64

computational chemistry, bioinformatics, spatial data sets, etc.

Answer 65

It tells you how accurate your model is by showing you the TP FN FP and TN instances for a given classifier in a matrix format.

Answer 66

- It's a lazy learner - It does not build a model explicitly - Classifying unknown records can be relatively expensive

Answer 67

Measures how often items in Y appear in transactions that contain X. { X } → { Y }

Answer 68

A set of nested clusters organized as a hierarchical tree. Simply put, each cluster can have a sub-cluster.

Answer 69

a model is an abstract representation of a real system

Answer 70

-The set of stored records -Distance metric to compute the distance between records -The value of k, the number of nearest neighbours to retrieve.

Answer 71

The fraction of transactions that contain both X and Y. Support determines how often a rule is applicable to a given data set. { X } → { Y }

Answer 72

If you increase the # of nodes (the complexity of the tree) the variance will increase and the bias will decrease.

Answer 73

to obtain k clusters, split the set of all points into two clusters, select one of these clusters to split and so on, until k clusters have been produced.

Answer 74

Box plots, histograms, and scatter plots

Answer 75

1. how to identify interesting infrequent patterns 2. how to efficiently discover them in large data sets

Answer 76

Measurement error; collection error; natural variation.

Answer 77

Items residing at the higher levels tend to have higher support counts than those residing at the lower levels. If the support threshold is set too high, only the patterns involving items at the higher levels are likely to be extracted. Conversely, if the threshold is set too low, you get too many rules, which increases computation time and may lead to redundant rules.

Answer 78

Give more emphasis on specific examples that are difficult to classify. Assign a higher weight, greater probability of being selected to them. Records that are wrongly classified will have their weights increased. Records that are classified correctly will have their weights decreased.

Answer 79

They do not have to assume any particular number of clusters; they correspond to meaningful taxonomies (e.g. biological sciences -- animal kingdom).

Answer 80

Apriori pruning strategy: If an itemset is infrequent, then all of its supersets must also be infrequent. All of these infrequent itemsets can be pruned. Multiple scans of DB. FP-Growth: encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure. Divide and conquer approach. Outperforms the standard Apriori algorithm by several orders of magnitude. Requires less memory. Scan DB only twice.

Answer 81

Extends association rule mining to find frequent subgraphs. Graphs that extend to subgraphs. Looking for common subgraphs.

Answer 82

Every item is considered as a candidate 1-itemset {cola}. Discard itemsets that don't meet the support threshold. Do the same with 2-itemsets using only the frequent 1-itemsets (because the Apriori principle holds). Repeat for the maximum number of items in an itemset.

Answer 83

They don't get stuck in a local minimum like ANNs and KNNs. Especially when the data is transformed to a new space.

Answer 84

1. Start with a tree that consists of any point 2. In successive steps, look for the closest pair of points such that one point is in the current tree and the other is not. 3. Add the point that is not in the tree to the tree and put an edge between these two points (this is a top down approach; it is uncommon and rarely used)

Answer 85

1. Label all points as core, border, or noise points 2. Eliminate noise points 3. Put an edge between all core points that are within the user specified radius (Eps) of each other 4. Make each group of connected core points into a separate cluster 5. Assign each border point to one of the clusters of its associated core points
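
Step 1 of the procedure above (labelling points) can be sketched as follows. This is only the labelling step, not a full DBSCAN. It assumes a point counts as its own neighbour and that "at least MinPts" neighbours makes a core point; conventions differ on both. The name `label_points` and the sample data are illustrative.

```python
from math import dist


def label_points(points, eps, min_pts):
    """Label each point index as 'core', 'border', or 'noise'."""
    # Neighbourhood of each point: indices within distance eps (self included).
    neighbours = {
        i: [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for i, p in enumerate(points)
    }
    core = {i for i, ns in neighbours.items() if len(ns) >= min_pts}
    labels = {}
    for i in range(len(points)):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighbours[i]):
            # Not dense enough itself, but inside a core point's neighbourhood.
            labels[i] = "border"
        else:
            labels[i] = "noise"
    return labels


# Hypothetical data: a dense square, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.4, 1), (10, 10)]
labels = label_points(points, eps=1.5, min_pts=4)
print(labels)
```

Here the four corner points are core points, (2.4, 1) is a border point because it only reaches the core point (1, 1), and (10, 10) is noise.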

Answer 86

1. multiple runs (not the best approach) 2. sample and use hierarchical clustering to determine initial centroids 3. select more than k initial centroids and then select among these initial centroids 4. postprocessing 5. Bisecting k-means (split clusters into multiple clusters)

Answer 87

the anomaly completely depends on the context that you are looking at.

Answer 88

An implication expression of the form X → Y where X and Y are itemsets. e.g. { milk, diaper } → { beer }

Answer 89

Cannot undo a decision once two clusters have been combined; no objective function is directly minimized; have problems with one or more of the following: sensitivity to noise, difficulty handling different sized clusters and convex shapes, breaking large clusters.

Answer 90

There are infinitely many hyperplanes (ways to split classes in a model). The SVM must choose one of the hyperplanes to represent its decision boundary, based on how well they are expected to perform on test examples. Trying to maximize the width of the "road" / margin.

Answer 91

In some cases, we only want to cluster some of the data.

Answer 92

A measure used to evaluate interestingness in association rule analysis that takes prior probabilities into account. It's much more reliable than other measures such as confidence. If the lift value is around 1, we can assume that X and Y are statistically independent.
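
As a rough illustration, lift can be computed from the counts in a contingency table as P(X, Y) / (P(X) P(Y)). The function name and the counts below are made up for the example and do not come from the card.

```python
def lift(n_xy, n_x, n_y, n_total):
    """Lift of the rule X -> Y from raw counts.

    n_xy: transactions containing both X and Y
    n_x, n_y: transactions containing X, respectively Y
    n_total: all transactions
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return p_xy / (p_x * p_y)


# Hypothetical counts: the rule has confidence 150/200 = 0.75, but Y
# already occurs in 80% of all transactions.
value = lift(n_xy=150, n_x=200, n_y=800, n_total=1000)
print(value)
```

With these counts the lift is slightly below 1 even though the confidence looks high, which is the situation where confidence alone is misleading.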

Answer 93

1. Min 2. Max 3. Group average 4. Distance between centroids 5. Other methods with a defined method

Answer 94

Works if the # of clusters is unknown (a horizontal cut in the dendrogram specifies the # of clusters); can set a max distance for threshold; complete clustering; noise depends on linkage criteria; deterministic.

Answer 95

ECG patterns

Answer 96

Reads the data set one transaction at a time; maps each transaction onto a path in the FP-tree; paths may overlap as transactions are similar; the more paths overlap, the more compression; sometimes makes the tree small enough to fit into main memory.

Answer 97

Variance that you would expect to see. Specifically, it refers to the distribution of numbers for one variable in relation to the distribution of numbers for another variable. Find points that would defy this for outlier detection.

Answer 98

Binning and discretization.

Answer 99

Confidence is an indication of how often the rule has been found to be true. conf(X -\> Y) = supp(X,Y) / supp(X). E.g.: **supp(milk) = 5/10 = 0.5; conf(milk -\> diapers) = 4/5 = 0.8**
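
The numbers in this example can be reproduced with a short sketch. The ten transactions below are invented so that milk appears in 5 of them and milk together with diapers in 4; the helper names are illustrative.

```python
# Hypothetical transaction database (10 transactions).
transactions = [
    {"milk", "diapers"},
    {"milk", "diapers", "bread"},
    {"milk", "diapers"},
    {"milk", "diapers", "beer"},
    {"milk", "bread"},
    {"bread"},
    {"beer"},
    {"bread", "beer"},
    {"diapers"},
    {"bread", "diapers"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """conf(X -> Y) = supp(X and Y together) / supp(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


supp_milk = support({"milk"}, transactions)
conf_rule = confidence({"milk"}, {"diapers"}, transactions)
print(supp_milk, conf_rule)
```

Of the 5 transactions that contain milk, 4 also contain diapers, which gives the confidence of 0.8.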

Answer 100

used to compare two different clusterings or clusters

Answer 101

They all come down to preprocessing the data so that the APRIORI algorithm can be applied to them.

Answer 102

A model constructed with high bias and low variance tends to underfit training data.

Answer 103

For each point, the error is the distance to the nearest cluster centroid. To get the SSE, we square the errors and then sum them.
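
A minimal sketch of that computation, assuming Euclidean distance and that each point is charged to its nearest centroid. The function name and the toy data are illustrative.

```python
from math import dist


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


# Hypothetical data: every point sits exactly 1 unit from its centroid.
points = [(1, 1), (1, 3), (9, 9), (9, 11)]
centroids = [(1, 2), (9, 10)]
total = sse(points, centroids)
print(total)
```

Each of the four points contributes a squared error of 1, so the total is 4.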

Answer 104

Measures how closely related objects in a cluster are; measured by the within-cluster sum of squares (SSE).

Answer 105

In non-exclusive clusterings, points may belong to multiple clusters.

Answer 106

If an itemset is frequent, then all of its subsets must also be frequent. {A B C} = frequent then {A B }, {A C}, {B C} = frequent Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too. {A B } infrequent itemsets can be pruned

Answer 107

K-means (and its variants); hierarchical clustering; density-based clustering.

Answer 108

Variance is a measure of spread of data

Answer 109

The goal is to maximize the width of the margin "road".

Answer 110

Still incurs considerable I/O overhead since it requires making several passes over the transaction data set. May degrade significantly for dense data sets because of the increasing width of transactions.

Answer 111

Understanding: group related items that have similarities Summarization: reduce the size of large data sets

Answer 112

Age(21,35) ^ Salary(70k,120k) -\> buy

Answer 113

A model with high variance and low bias tends to generalize new test instances well, but is susceptible to overfitting noisy data. If data is noisy, perhaps it's better to have high bias, and lower variance. Choice of classifier is important. Bagging and Boosting can help.

Answer 114

preprocessing

Answer 115

The Apriori principle holds for sequential data because any data sequence that contains a particular k-sequence must also contain all of its (k-1)-subsequences. An Apriori-like algorithm can be used to extract sequential patterns from a sequence data set.

Answer 116

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) objects in other groups.

Answer 117

user specified number of clusters

Answer 118

1. use a parameter based model which describes the distribution of the data (e.g. Gaussian distribution etc.) 2. apply statistical tests which depend on the: distribution of the data; population parameters or sample statistics; number of expected outliers

Answer 119

Values are mapped to integers 0 to n-1, where n is the # of values; dissimilarity = | p - q | / (n - 1); similarity = 1 - | p - q | / (n - 1)

Answer 120

The median is often more appropriate because outliers affect the mean greatly.

Answer 121

Stores the training records Uses the training records to predict the class label of unseen cases

Answer 122

divides the range into N intervals of equal size

Answer 123

Can handle non-elliptical shapes.

Answer 124

It's computationally prohibitive. It would involve: listing all possible association rules; computing the support and confidence for each rule; pruning rules that fail the minsup and minconf thresholds.

Answer 125

Improves classification accuracy by aggregating the predictions of multiple classifiers.

Answer 126

Transform each graph into a transaction-like format so that existing algorithms such as Apriori can be applied.

Answer 127

Normalize the data (it's distance based) Eliminate outliers

Answer 128

Plot the distance of every point to its kth nearest neighbour, then find the biggest change or "knee" in the curve.

Answer 129

A momentum term to avoid falling in a local minimum. A noise term used over multiple iterations. At first the term is very noisy and gets less noisy as tree becomes more accurate. Tries to prevent a local minimum.

Answer 130

A point, pattern, or set of patterns that does not conform to what we define as normal within the data.

Answer 131

Determining if each enumerated k-itemset corresponds to an existing candidate itemset.

Answer 132

Repeatedly removing an edge from the candidate k-subgraph and checking whether the corresponding (k-1)-subgraph is connected and frequent.

Answer 133

1. outlier label: a point or group of points is labeled as an anomaly or normal 2. outlier score: assign an outlier score to a data point or group of data points; represents the degree of outlierness; can create a ranked list or use a threshold

Answer 134

1. Generate all itemsets whose support \>= minsup 2. Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Answer 135

Find all non-empty subsets in a given frequent itemset such that the subset satisfies the minimum confidence requirement.

Answer 136

If one classifier is better than others, you can weight the vote of the classifiers and make the better performing classifier have a higher weight.

Answer 137

a simple linear regression is basically an equation of a line

Answer 138

Test error rate increases Training error rate continues to decrease

Answer 139

Eliminate small clusters that may represent outliers; split 'loose' clusters with relatively high SSE; merge clusters that are 'close' and have relatively low SSE.

Answer 140

- remove redundant or irrelevant features - transform to numerical values - normalize data to the range of 0 to 1 or -1 to 1

Answer 141

1. Compute the proximity matrix 2. Let each data point be a cluster 3. repeat 4. merge the two closest clusters 5. update the proximity matrix 6. until only a single cluster remains
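
The loop above can be sketched in a few lines of plain Python. This version assumes single-link (MIN) proximity and, for brevity, recomputes cluster distances on every pass instead of maintaining and updating a proximity matrix. The function names and sample points are illustrative.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """MIN linkage: distance between the closest pair across two clusters."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points, linkage=single_link):
    """Merge the two closest clusters until one remains.

    Returns the list of merges as (cluster_a, cluster_b, distance).
    """
    clusters = [[p] for p in points]  # let each data point be a cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)])
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 2))
```

The recorded merge distances are what a dendrogram would draw as linkage heights; a real implementation would update the proximity matrix rather than recompute it.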

Answer 142

a numerical measure of how alike two data objects are it is higher when objects are more alike often falls in the range of 0 and 1

Answer 143

The fraction of transactions that contain an itemset. s ( { milk, bread, diaper } ) = 2/5

Answer 144

The support for an itemset never exceeds the support for its subsets.

Answer 145

This introduces bias -- a data miner could simply pick the iteration that produced the best result if they wanted. It's better to do some form of smart sampling or preprocessing instead.

Answer 146

1. Present sample to input nodes. 2. Propagate data through layers 3. Calculate results at output nodes 4. Determine error at output nodes 5. Propagate error backwards to adjust the weights. Repeat until stopping criterion is satisfied.

Answer 147

Correct behavior should be rewarded by increasing weights.

Answer 148

Receiver Operating Characteristic (ROC). It can be used to visualize the TP rate (y) vs the FP rate (x) of a given model. It also allows for relative comparison across different models (each model is represented by a curve).

Answer 149

Related to anomaly detection but is a field of its own; the outlier hasn't occurred yet, and is only an outlier for a brief period of time before it becomes a part of the normal model.

Answer 150

The support for an itemset never exceeds the support for its subsets.

Answer 151

The outlier exists as a sequence: AAABBBAAABBB**ABABAB**AAABBBAAABBB

Answer 152

Transform to a matrix using adjacency lists. You can add edge weights, but this blows up the matrix even more.

Answer 153

Can't directly visualize -- very difficult

Answer 154

A generalization of the Euclidean distance. It's the same as the Euclidean distance, but with a parameter r instead of 2. This r parameter gives more flexibility. n is the # of dimensions (attributes).

Answer 155

The # of attributes determines the # of input nodes, so it's important to remove redundant or irrelevant attributes to keep down the # of connections and avoid a local minimum.

Answer 156

Automatically determines the minority class; finds the k nearest neighbours for each minority class point; randomly chooses from the k nearest neighbours depending on the level of oversampling required; creates a synthetic point along the line segment connecting that point to its chosen nearest neighbour.

Answer 157

An object in a cluster is closer (more similar) to the center of its cluster than to the center of any other cluster. These types of clusters tend to be globular.

Answer 158

The density around an object is equal to the # of objects that are within a specified distance d of the object.

Answer 159

**Combinatoric explosion**: how many different transactions could possibly occur in the database? Formula is 2^n (e.g. 2^3 = 8 for 3 items) (includes empty set) **Binomial coefficient (when it's itemsets)** **n!/(k!(n-k)!)** (k is the number of items in the itemset, n is the number of items in D) How many transactions with exactly 2 items can we have when the database contains 3 and 5 items? ``` 3!/(2!(3-2)!) = 3 5!/(2!(5-2)!) = 10 ``` **Putting it together: what is the maximum size of frequent itemsets that can be extracted (assuming sigma \> 0)?** See pic - binomial \* 2^n -1
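
The counts in this card can be checked with Python's standard library. The helper name `count_itemsets` is an illustrative assumption; `math.comb` computes the binomial coefficient n!/(k!(n-k)!).

```python
from math import comb


def count_itemsets(n_items, k):
    """Number of distinct k-itemsets that can be formed from n_items items."""
    return comb(n_items, k)


# Number of possible transactions over 3 items, including the empty set.
all_subsets_3 = 2 ** 3

# 2-itemsets from 3 items and from 5 items.
pairs_from_3 = count_itemsets(3, 2)
pairs_from_5 = count_itemsets(5, 2)

# Summing the binomial coefficients over every k recovers 2^n.
total_from_5 = sum(count_itemsets(5, k) for k in range(0, 6))

print(all_subsets_3, pairs_from_3, pairs_from_5, total_from_5)
```

The last line illustrates why the search space explodes: the per-size counts add up to 2^n, so excluding the empty set leaves 2^n - 1 candidate itemsets.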

Answer 160

Clusters where points are closer to points in the same cluster than to points in every other cluster.

Answer 161

Work before vs work after Lazy learner vs eager learner

Answer 162

compute the distance between all pairs of points outlier score can be defined as the distance from the kth nearest neighbour or the average distance from the k-nearest neighbours

Answer 163

similar if p = q and dissimilar if p \<\> q

Answer 164

The number of training sets has to be larger than the # of weights divided by (1 - accuracy). Training sets = weights / (1 - accuracy)

Answer 165

There are over 20... Gini index Jaccard Cosine Intersect Laplace...

Answer 166

- If k is too small, the model is sensitive to noise points - If k is too large, the neighbourhood may include points from other classes

Answer 167

sensitive to noise and outliers

Answer 168

The error term is normally distributed.

Answer 169

Measures how distinct or well-separated a cluster is from other clusters; measured by the between-cluster sum of squares.

Answer 170

Defining normal regions; evolving definition of normal; differing notions of anomalies; number of attributes used to define an anomaly; noise.

Answer 171

The frequency of occurrence of an itemset. Denoted by σ, e.g. σ ( { milk, bread, diaper } ) = 2

Answer 172

DBSCAN does not work well for clusters that have varying densities.

Answer 173

Support ( s ) Confidence ( c )

Answer 174

Clustering tendency

Answer 175

Interestingness can be computed using a contingency table.

Answer 176

Overfitting is modelling a random noise component in the data (model is too complex). Increasing the complexity of the model means you have to estimate more parameters and there is a greater probability for error. Simpler models tend to have low variance and potentially higher bias. Visualization can help you to pick a good model.

Answer 177

It's basically a model for KNN. It splits the solution space: on a line = equal distance to 2 parents; on an intersection = equal distance to 3 parents.

Answer 178

Confidence can only decrease. If a parent doesn't meet the confidence threshold, the children won't meet the threshold either.

Answer 179

Two objects are connected only if they are within a specified distance of each other.

Answer 180

intrusion detection fraud, medical, image processing

Answer 181

A noise point is any point that is not a core point or a border point

Answer 182

1. point 2. contextual 3. collective

Answer 183

pattern association character recognition image compression classification forecasting optimization etc...

Answer 184

Compute the distance between two points (Euclidean distance) Determine the class from nearest neighbour list - take the majority vote of class labels among the k-nearest neighbours - weight the vote according to distance
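
A compact sketch of this procedure, assuming Euclidean distance and an unweighted majority vote (the distance-weighted variant is not shown). The name `knn_predict` and the training records are illustrative.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training records.

    train is a list of (point, label) pairs.
    """
    neighbours = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training records: two well-separated classes.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 7)))
```

Because nothing is learned up front, every prediction sorts the whole training set by distance, which is why classifying unknown records is comparatively expensive for a lazy learner.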

Answer 185

Cluster cohesion is the sum of the weights of all the links within the cluster; cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.

Answer 186

a point within a graph

Answer 187

Produces a set of nested clusters organized as a hierarchical tree; these clusters can be visualized as a dendrogram (a tree-like structure that records the sequences of merges and splits).

Answer 188

Determines the fraction of records that actually turn out to be positive in the group the classifier has declared as a positive class. TP / ( TP + FP )

Answer 189

Clusters are in the eye of the beholder (they are subjective)

Answer 190

Computationally expensive; sensitive to the chosen value of k; meaningless in high-dimensional space.

Answer 191

dissimilarity = | p - q |; similarity is equal to the negative value of the dissimilarity

Answer 192

You can use different minimum support thresholds for different items.

Answer 193

A high correlation means a good clustering, a low correlation means a poor clustering.

Answer 194

A point is a core point if it has more than a specified number of points (MinPts) within Eps (a specified radius)

Answer 195

The count of the itemset (then divided by the total number of transactions). E.g. the itemset {milk,diapers} has a support of 1/5=0.2 since it occurs in 20% of all transactions (1 out of 5 transactions).

Answer 196

The distance measure The use of clustering algorithm

Answer 197

Providing a compact representation of frequent itemsets. They form the smallest set of itemsets from which all frequent itemsets can be derived.

Answer 198

points that are not strongly related can be considered as outliers cluster, remove outlier, repeat requires a stopping criterion

Answer 199

trimming the exponential search space based on the support measure.

Answer 200

Robust to noise points; can handle missing values by ignoring the instance during probability estimate calculations; robust to irrelevant attributes; independence assumption may not hold for some attributes (but can use other techniques such as Bayesian belief networks instead).

Answer 201

+ easy to explain + interpretable + works with categorical variables + fast - high variance (a DT has high variance because, if you imagine a very large/deep tree, it can basically adjust its predictions to every single input because it gets very specific with the training examples)

Answer 202

Using a distance measure such as Euclidean distance, cosine similarity, correlation, etc...

Answer 203

1. Direct methods which extract rules directly from the data 2. Indirect methods which extract rules from other classification models such as decision trees.

Answer 204

The choice of initial centroids.

Answer 205

When the clusters are irregular or intertwined and when noise and outliers are present.

Answer 206

To avoid finding patterns in noise; to compare clustering algorithms; to compare two sets of clusters; to compare two clusters.

Answer 207

Initially make a pass over the data to determine the support of each item and determine the frequent 1-itemsets. Iteratively generate new candidate k-itemsets using the (k-1)-itemsets found in the previous iteration. Make an additional pass over the data set to count the support of candidates and eliminate candidates whose support is less than minsup. Terminate when there are no new frequent itemsets generated.
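
The passes described above can be sketched as follows. This is a simplified illustration, not an optimized Apriori: candidates are generated by unioning pairs of frequent (k-1)-itemsets, and support is given as an absolute count. The function name, the `minsup_count` parameter, and the five sample transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, minsup_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data to count the support of each candidate.
        return {c: sum(c <= t for t in transactions) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= minsup_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = {c: n for c, n in count(candidates).items() if n >= minsup_count}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
result = apriori(transactions, minsup_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 3, four single items and four pairs are frequent; the only 3-itemset that survives pruning, {bread, milk, diaper}, occurs just twice and is eliminated in the counting pass.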

Answer 208

An itemset whose support is greater than or equal to a minsup threshold, e.g. at least 10 occurrences together in the database.

Answer 209

1. removal 2. accommodation - keep but use accommodation methods while processing 3. explanation - why does it exist?

Answer 210

DBSCAN is a density-based algorithm used to find clusters.

Answer 211

A measure that tries to maximize both precision and recall. F1 = 2 x TP / ( 2 x TP + FP + FN )
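
As a quick check, the F1 formula above equals the harmonic mean of precision and recall. The sketch below assumes made-up confusion-matrix counts; the function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that the classifier found."""
    return tp / (tp + fn)


def f1(tp, fp, fn):
    """F1 = 2 x TP / (2 x TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts from a confusion matrix.
tp, fp, fn = 40, 10, 20
p, r = precision(tp, fp), recall(tp, fn)
harmonic_mean = 2 * p * r / (p + r)
print(p, r, f1(tp, fp, fn), harmonic_mean)
```

Because the harmonic mean is pulled toward the smaller of the two values, F1 is only high when both precision and recall are high.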

Answer 212

min. leaf size (Each leaf node represents a class) Pruning a decision tree (pre & post) max depth of the tree max nr of nodes min decrease in loss

Answer 213

a residual is a vertical line sprouting off of the diagonal line of a linear regression

Answer 214

K means has problems when clusters differ in size, densities, and are non-spherical (globular) shapes. K-means has problems when the data contains outliers.

Answer 215

No. One is not the opposite of the other, unless the range is from 0 to 1.

Answer 216

1. Reduce the number of candidates (M) using pruning 2. Reduce the number of transactions by reducing the size of N as the size of the itemset increases 3. Reduce the number of comparisons by using efficient data structures to store candidates or transactions. No need to match every candidate against every transaction.

Answer 217

Combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings; looks at closeness within a cluster and closeness between clusters.

Answer 218

A cluster where a dense region of objects is surrounded by a region of low density.

Answer 219

Execution time (possible ranges of values) Too many rules {refund = No, Income {cheat = no} {refund = No, 90 k Income {cheat = no} {refund = No, 50K \< Income \< 52K } -\> {cheat = no}

Answer 220

Noise is often created by measurement error or extreme values; it is typically uninteresting and generally meaningless. Anomalies are data points created by a different mechanism and are usually very interesting.

Answer 221

Extract all the high-confidence rules from the frequent itemsets found in the previous step (frequent itemset generation). These rules are called strong rules.
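
A minimal sketch of this step, assuming the support counts of all frequent itemsets (and therefore of all their subsets) are already available in a dictionary. The name `strong_rules` and the tuple layout of a rule are assumptions for illustration.

```python
from itertools import combinations

def strong_rules(support, minconf):
    """List (antecedent, consequent, confidence) with confidence >= minconf.

    `support` maps frozenset itemsets to their support counts.
    """
    rules = []
    for itemset, count in support.items():
        # Try every non-empty proper subset as the antecedent.
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # confidence(X -> Y) = support(X u Y) / support(X)
                confidence = count / support[antecedent]
                if confidence >= minconf:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules
```

This brute-force version checks every split of every itemset. Practical implementations prune rule candidates using the fact that confidence cannot increase when items move from the antecedent to the consequent.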

Answer 222

1. core point 2. border point 3. noise point
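
Under the usual DBSCAN definitions, a core point has at least MinPts points within Eps of it, a border point is not a core point but lies within Eps of one, and everything else is noise. The sketch below assumes Euclidean distance and counts a point as part of its own neighbourhood. The names `classify_points`, `eps` and `min_pts` are illustrative.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    # Eps-neighbourhood of every point (includes the point itself).
    neighbors = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # near a core point, but not dense itself
        else:
            labels.append("noise")
    return labels
```

This only labels the points. Forming the actual clusters additionally requires connecting core points that lie within Eps of each other.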

Answer 223

A model with high variance and low bias fits the training data closely, but is susceptible to overfitting noisy data, so it may generalize poorly to new test instances.

Answer 224

Uses a user-specified number of clusters; partition-based; complete clustering; non-deterministic (can exhibit different behaviors on different runs); noise stays in the space and contributes to SSE.

Answer 225

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Answer 226

External index Internal index Relative index

Answer 227

The most common distinction is whether the set of clusters is nested (hierarchical) or unnested.

Answer 228

A probabilistic framework for solving classification problems. It uses conditional probability.

Answer 229

highly granular (detailed)

Answer 230

3 dimensional scatter plots

Answer 231

A cost matrix assigns a cost to the TP, FN, FP, and TN instances. It's good for imbalanced classes. You can assign a high cost to instances that are classified incorrectly. The goal is to have high accuracy and low cost.
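
The effect is easiest to see by totalling the cost over a confusion matrix. In the sketch below the helper names `total_cost` and `accuracy` are hypothetical, and the dictionaries are assumed to be keyed by "TP", "FN", "FP" and "TN".

```python
def total_cost(confusion, cost):
    """Sum of count x cost over the four confusion-matrix cells."""
    return sum(confusion[k] * cost[k] for k in ("TP", "FN", "FP", "TN"))

def accuracy(confusion):
    """Fraction of instances classified correctly."""
    correct = confusion["TP"] + confusion["TN"]
    return correct / sum(confusion.values())
```

With a cost matrix that penalizes false negatives heavily, a model with slightly lower accuracy can still have a much lower total cost, which is the situation that matters for imbalanced classes.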

Answer 232

Competitive learning: for each sample (set of input values), one node will be the best match. The weights between this winner and the input nodes should be increased.

Answer 233

Refers to a similarity or dissimilarity.

Answer 234

Points in an SVM that lie on the boundaries of the maximized margin, closest to the decision boundary. SVMs represent the decision boundary using this subset of training examples, called support vectors.

Answer 235

A set of clusters

Answer 236

Items are grouped into categories, which cuts down on the number of items. There are then hierarchical levels, for example Electronics as the root, then Computers, etc.

Answer 237

A series of connected nodes that show all possible combinations of an itemset, and sometimes a subset of combinations.

Answer 238

Maximal frequent itemsets are a subset of closed frequent itemsets. Closed frequent itemsets are a subset of frequent itemsets.

Answer 239

An itemset is closed if none of its immediate supersets has the same support as the itemset.
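
The definition above, together with the definition of a maximal frequent itemset (no immediate superset is frequent), can be checked directly on a table of support counts. This is a small sketch: the function name `closed_and_maximal` is hypothetical, and `support` is assumed to hold a count for every itemset that occurs in the data.

```python
def closed_and_maximal(support, minsup):
    """Return (frequent, closed frequent, maximal frequent) itemsets."""
    frequent = {s for s, c in support.items() if c >= minsup}
    closed, maximal = set(), set()
    for s in frequent:
        # Immediate supersets: exactly one extra item.
        supersets = [t for t in support if len(t) == len(s) + 1 and s < t]
        # Closed: no immediate superset has the same support.
        if all(support[t] < support[s] for t in supersets):
            closed.add(s)
        # Maximal: no immediate superset is frequent.
        if all(t not in frequent for t in supersets):
            maximal.add(s)
    return frequent, closed, maximal
```

On any input, the maximal set should come out as a subset of the closed set, which in turn is a subset of the frequent set.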

Answer 240

itemsets involving interesting rare items (e.g. expensive products) could be missed.

Answer 241

proximity of two clusters is the average pairwise proximity between points in the two clusters
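
As a small illustration, the average pairwise proximity can be computed as below, assuming Euclidean distance is the proximity measure and clusters are lists of coordinate tuples. The name `group_average` is an assumption.

```python
import math

def group_average(cluster_a, cluster_b):
    """Average pairwise Euclidean distance between two clusters."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))
```

Every pair with one point from each cluster contributes, so the result sits between the smallest and the largest pairwise distance.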

Answer 242

Provide a minimal representation of itemsets without losing their support information. An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

Answer 243

Find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.

Answer 244

1. Holdout (part data training, part data testing) 2. Random subsampling (repeating holdout several times) 3. Cross validation (each record is used the same # of times for training and 1 time for testing) 4. Bootstrap - sample w/ replacement
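
Cross validation (method 3) can be sketched with plain index lists. The generator below is an illustrative assumption, not a library API: it assigns record i to fold i mod k without shuffling, which a real experiment would normally add.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k-fold cross validation."""
    # Record i goes to fold i mod k.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test
```

Each record lands in the test set exactly once and in the training set k - 1 times.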

Answer 245

Proposes ways of evaluating unsupervised classifiers (clustering). Evaluating unsupervised classifiers is much more difficult (you can't use measures such as accuracy, class labels, etc.).

Answer 246

Unsupervised approaches don't give us feedback about how good our model is. This is why post-processing is important.

Answer 247

The cluster centres are usually picked randomly (but you can pick them to speed things up). This increases the likelihood of the model getting stuck in a local minimum.

Answer 248

A numerical measure of how different two data objects are; lower when objects are more alike; minimum dissimilarity is often 0; upper limit varies.

Answer 249

Support threshold (lowering the threshold results in more itemsets being declared frequent). Number of items (dimensionality: more space is required to store the support counts of items). Number of transactions (Apriori makes repeated passes over the data set, so run time increases with a larger number of transactions). Average transaction width (for dense data sets, the average transaction width can be large).

Answer 250

Examines events that occur over time over customers. Events can be grouped together. Becomes computationally expensive.

Answer 251

SVMs try to find the global minimum, unlike neural networks, which employ a greedy strategy to search the hypothesis space.

Answer 252

A line along the main diagonal connecting (0,0) and (1,1).

Answer 253

An alternative method for discovering frequent itemsets. It encodes the data using a compact structure called an FP-tree and extracts frequent itemsets directly from that structure.

Answer 254

An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier.
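
The voting step can be sketched as follows, assuming each base classifier is simply a callable that maps an instance to a class label. The name `majority_vote` is hypothetical, and how the base classifiers are trained is left out.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Predict the label chosen by most base classifiers for instance x."""
    votes = [clf(x) for clf in classifiers]
    # most_common(1) returns the label with the highest vote count.
    return Counter(votes).most_common(1)[0][0]
```

With an even number of base classifiers a tie is possible. This sketch then returns the tied label that was voted first, so an odd number of classifiers is a common choice for two-class problems.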

Answer 255

Cluster cohesion, cluster separation

Answer 256

it's not a good measure for some density or contiguity (connected) based clusters

Answer 257

link the next closest point from another cluster

Answer 258

Clustering of same type or different types. For example, clusters can vary widely in size, shape, and densities.

Answer 259

Used to measure the goodness of a clustering structure without respect to external information

Answer 260

Discretize the range of values into bins. Can use a two-way split: (A \< v) or (A \>= v). Or use probability density estimation: assume the values follow a normal distribution, use the data to estimate the parameters of the distribution, and then use it to estimate the conditional probability.
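
The density-estimation option can be sketched like this, assuming a normal distribution per class and using the sample mean and sample variance (n - 1 denominator) as the estimated parameters. The function names are illustrative.

```python
import math

def gaussian_params(values):
    """Estimate mean and sample variance from one class's attribute values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

def gaussian_pdf(x, mean, var):
    """Normal density, used as the class-conditional P(A = x | class)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

The value returned by `gaussian_pdf` is a density rather than a true probability, but it can be used in place of the conditional probability when comparing classes.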

Answer 261

It assumes independence among attributes.

Answer 262

Fitting noise points; not enough representative data.

Pre-exam Flashcards

(290 cards)