Pre-exam Flashcards
(290 cards)
What is bagging with respect to data mining and ensemble methods?
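Bagging (bootstrap aggregating) trains several models on bootstrap samples of the data and combines their predictions (voting or averaging). The records left out of a sample can be used for out-of-bag estimation.
Sample repeatedly with replacement from the original data to create new training sets.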
Reduces variance and helps to avoid overfitting.
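A minimal Python sketch of bagging, assuming 1-D numeric data with 0/1 labels. The nearest-mean base learner and the names bootstrap_sample, train_nearest_mean, bagging_fit and bagging_predict are made up for illustration; any base learner could be swapped in.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) records from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_nearest_mean(sample):
    """Hypothetical base learner: predict the class whose mean is closest to x."""
    means = {}
    for label in {y for _, y in sample}:
        xs = [x for x, y in sample if y == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def bagging_fit(data, n_models=25, seed=0):
    """Train one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [train_nearest_mean(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data: (value, label) pairs
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.2, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.4, 1)]
models = bagging_fit(data)
```

Each model sees a slightly different training set, so averaging their votes smooths out the quirks of any single sample.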
What is agglomerative hierarchical clustering?
Start with the points as individual clusters.
At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
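A minimal Python sketch of the merge loop, assuming points are numeric tuples and "closest" means single link (minimum pairwise distance). Single link is only one possible choice of cluster distance, and the function names are illustrative.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, k=1):
    # Start with every point as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three_clusters = agglomerative(points, k=3)
```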
What is true positive rate?
The fraction of positive examples predicted correctly by the model.
TPR = TP / ( TP + FN )
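The formula can be checked with a short Python sketch, assuming labels are given as two equal-length lists and the positive class is 1 (the function name is illustrative).

```python
def true_positive_rate(y_true, y_pred, positive=1):
    """TPR = TP / (TP + FN). Assumes at least one positive example exists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
tpr = true_positive_rate(y_true, y_pred)  # 3 of the 4 positives are found
```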
What is the challenge of applying association analysis to non-asymmetric binary variables?
Can cause computational issues.
Continuous and categorical attributes must first be converted to binary items.
Bin (discretize) continuous attributes and map each categorical value to its own item.
When are statistical approaches to outlier detection effective?
When there is sufficient knowledge of the data and the type of test that should be applied.
Fewer options are available for multivariate data.
These statistical tests perform poorly on high-dimensional data.
What is the drawback of using confidence level for association rule analysis?
The rules can be misleading even though confidence may be high, because confidence ignores how common the consequent already is.
e.g.
Confidence(Tea -> Coffee) = P(Coffee | Tea) = 0.75
but P(Coffee) = 0.9
Although confidence is high, the rule is misleading: P(Coffee | !Tea) = 0.9375
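The numbers can be reproduced with a short Python sketch. The counts are assumed (100 transactions, 20 with tea, 90 with coffee, 15 with both); they were chosen only because they are consistent with the probabilities on the card. Lift is shown as one common measure that exposes the problem.

```python
# Assumed counts, chosen to match the probabilities on the card
n_total = 100
n_tea = 20
n_coffee = 90
n_both = 15

confidence = n_both / n_tea                                      # P(Coffee | Tea)
p_coffee = n_coffee / n_total                                    # P(Coffee)
p_coffee_given_no_tea = (n_coffee - n_both) / (n_total - n_tea)  # P(Coffee | !Tea)

# Lift compares the rule's confidence with the baseline P(Coffee).
# A value below 1 means tea drinkers are LESS likely than average to buy coffee.
lift = confidence / p_coffee
```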
What are alternative methods to the Apriori algorithm for generating frequent itemsets?
- Traversal of itemset lattice
- FP-Growth Algorithm
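What approach can be used if the decision boundary for a support vector machine is not linear?
A nonlinear transformation (often applied implicitly through a kernel function) maps the data to a new space, and a linear boundary is then fit to the transformed data.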
In this case, the data needs to be numerical.
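A toy Python illustration of the mapping idea, not an SVM itself. It assumes 1-D data where class 1 sits on both sides of class 0, a hand-picked map x -> (x, x^2), and a hand-picked threshold of 3.0 in the new space; all three are assumptions for illustration.

```python
def feature_map(x):
    """Hypothetical nonlinear map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

# Class 1 lies on both sides of class 0, so no single threshold on x separates them.
xs = [-3.0, -2.5, -0.5, 0.0, 0.7, 2.4, 3.1]
labels = [1, 1, 0, 0, 0, 1, 1]

# In the mapped space a straight line (second coordinate = 3.0) separates the classes.
predictions = [1 if feature_map(x)[1] > 3.0 else 0 for x in xs]
```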
Using a ROC curve, how is one model compared to another?
The area under the curve (AUC) summarizes how the model performed; the model with the larger area is better on average. The ideal area is 1. You want the curve to be as close to the upper left corner as possible (high TP rate and low FP rate).
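A minimal Python sketch for computing the area, assuming 0/1 labels and a numeric score per example. It uses the rank interpretation of AUC (the fraction of positive/negative pairs in which the positive example gets the higher score, ties counting half) instead of integrating the curve; the function name is illustrative.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
model_a = auc(y_true, [0.1, 0.4, 0.35, 0.8])  # one pair is ranked wrongly
model_b = auc(y_true, [0.1, 0.2, 0.8, 0.9])   # every pair is ranked correctly
```

Here model_b would be preferred because its area is larger.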
Describe the Correlation learning method for ANNs.
Correlation: similar things should be made more similar, and dissimilar things less similar.
- When different neurons fire at the same time, the weight representing the connection between them should be increased (associative learning).
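A minimal Python sketch of this kind of update, assuming the simple Hebbian rule (weight change = learning rate x pre-activation x post-activation) and bipolar activations of +1 / -1. The rule, the function name and the learning rate are assumptions for illustration.

```python
def hebbian_update(weight, pre, post, learning_rate=0.1):
    """Strengthen the connection when both neurons agree, weaken it when they disagree."""
    return weight + learning_rate * pre * post

w_both_fire = hebbian_update(0.5, 1, 1)    # both active: weight goes up
w_disagree = hebbian_update(0.5, 1, -1)    # opposite activity: weight goes down
```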
What are 2 approaches to outlier detection?
Approach depends on the format of the data
- generalized approach: build a model of the normal data
- graphical: visualize the data using various means - typically subjective
What is spatial outlier detection?
Looks for abrupt changes in behavioural attributes that violate spatial autocorrelation and heteroscedasticity.
Explain the density-based method for outlier detection.
Assumes that outliers are points in regions of low density.
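A minimal Python sketch of the idea, assuming density is estimated as the inverse of the average distance to the k nearest neighbours. This is one simple density estimate among several, and the function name is illustrative.

```python
import math

def knn_density(points, k=2):
    """Density of each point = 1 / (mean distance to its k nearest neighbours).

    Assumes no duplicate points, otherwise the mean distance can be zero.
    """
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        densities.append(1.0 / (sum(dists[:k]) / k))
    return densities

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
densities = knn_density(points)
# The point with the lowest density is the outlier candidate.
outlier_index = min(range(len(points)), key=lambda i: densities[i])
```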
What are interestingness measures?
Measures used to prune or rank the derived patterns, because association analysis tends to produce too many rules, many of which are uninteresting or redundant.
Explain the statistical framework for SSE when evaluating clusters.
Compare the SSE of the clustering to the SSE of clusterings of random data
Get SSE values
Repeat many times
look at the correlation
What is divisive hierarchical clustering?
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a single point (or there are k clusters)
How do support vector machines represent the decision boundary?
Using a subset of the training examples, known as support vectors.
What is the goal of association rule mining?
Given a set of transactions T, find rules having:
support >= minsup threshold
confidence >= minconf threshold
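Both measures can be sketched in a few lines of Python, assuming each transaction is a set of item names. The five transactions are made-up toy data and the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy transactions (hypothetical data)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# Rule {milk, diapers} -> {beer}
rule_support = support(transactions, {"milk", "diapers", "beer"})
rule_confidence = confidence(transactions, {"milk", "diapers"}, {"beer"})
```

A rule is kept only if rule_support >= minsup and rule_confidence >= minconf.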
How does the FP growth algorithm mine frequent itemsets?
It uses a recursive divide-and-conquer approach
It uses pointers to assist frequent itemset generation.
It requires preprocessing.
Describe the Apriori algorithm used for association rule mining.
Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
- Generate length (k+1) candidate itemsets from length k frequent itemsets
- Prune candidate itemsets containing subsets of length k that are infrequent
- Count the support for each candidate by scanning the DB
- Eliminate candidates that are infrequent, leaving only those that are frequent
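The steps above can be sketched in Python, assuming transactions are sets of item names and minsup is a fraction between 0 and 1. Candidates are generated here by unioning pairs of frequent k-itemsets, which is one simple join strategy; the function names and the toy transactions are illustrative.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frequent itemset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Scan the DB to count support, then drop infrequent candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}

    # k = 1: frequent itemsets of length 1
    items = {item for t in transactions for item in t}
    current = frequent({frozenset([i]) for i in items})
    result = dict(current)
    k = 1
    while current:
        # Generate length (k+1) candidates from the length k frequent itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent subset of length k.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        current = frequent(candidates)
        result.update(current)
        k += 1
    return result

# Toy transactions (hypothetical data)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(transactions, minsup=0.6)
```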
What is the issue with using Euclidean distance as a distance measure in algorithms such as k-means clustering?
K-means tends to find round / globular clusters because of the use of Euclidean distance as a distance measure.
What is the difference between a global model and a local model?
A global model is a single model (e.g. SVMs)
A local model combines smaller models (e.g. KNN and ANNs)
Recite the algorithm for k-means
Select k points as the initial centroids (usually random points)
Repeat:
- Form k clusters by assigning all points to the closest centroid
- Recompute the centroid of each cluster
Until: the centroids don’t change or a stopping criterion is met
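A minimal Python sketch of these steps, assuming points are numeric tuples, Euclidean distance, random initial centroids drawn from the data, and a max_iter cap as the stopping criterion. The function name, seed and iteration cap are illustrative choices.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Select k points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Form k clusters by assigning each point to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute the centroid of each cluster (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
```

The result depends on the initial centroids, so in practice the algorithm is often run several times with different seeds.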
What are the two main types of clustering?
Partitional
Hierarchical



