Final Exam Flashcards
(53 cards)
What is False Negative Rate?
fraction of positive examples predicted as negative class
FNR = FN / (FN + TP)
What 2 things is a ROC curve useful for?
- Picking the best operating point for a model as a specific parameter, the decision threshold (t), is repeatedly adjusted
- Comparing the relative performance among classifiers
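A minimal sketch of how the points of a ROC curve could be computed, assuming t is a decision threshold applied to a classifier score (predict positive when score >= t). The function name, scores, and labels are made up for illustration.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per threshold t."""
    pos = sum(labels)            # number of positive examples
    neg = len(labels) - pos      # number of negative examples
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical classifier scores and true labels (1 = positive)
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels, thresholds=[1.0, 0.7, 0.5, 0.0])
```

Lowering t moves the point from (0, 0) toward (1, 1); a classifier whose curve sits closer to the top-left corner is generally the better one.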
What are 2 methods to evaluate a model?
Out of bag estimation
Cross validation
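A rough sketch of k-fold cross validation index splitting (out of bag estimation is not shown). The function name is hypothetical, and the folds are built without shuffling to keep the example short.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]   # every k-th index goes to the same fold
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_splits(6, 3))
```

Each sample is used for testing exactly once and for training in the other k - 1 rounds.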
What is F-measure?
summarizes precision and recall in a single number (their harmonic mean)
2TP / ( 2TP + FP + FN)
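As a quick check, the count-based formula above should agree with the harmonic mean of precision and recall. The counts below are invented for illustration.

```python
def f_measure(tp, fp, fn):
    """F-measure directly from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

def f_from_precision_recall(tp, fp, fn):
    """Same quantity computed as the harmonic mean of precision and recall."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

tp, fp, fn = 8, 2, 4   # hypothetical counts
f_counts = f_measure(tp, fp, fn)
f_harmonic = f_from_precision_recall(tp, fp, fn)
```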
What are the 3 components of the curse of dimensionality?
- Runtime: if the algorithm's runtime grows faster than linearly with the # of attributes, it increases too quickly
- Amount of Data: the number of samples needed to cover the space with equal density grows exponentially with the number of dimensions
- Distances: distances between data points become meaningless, because the nearest and farthest points end up almost equally far away. The maximum distance between data points does not grow linearly with the # of dimensions.
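The distance component can be illustrated with a small experiment, assuming points drawn uniformly from the unit hypercube. The function name, point count, and seed are arbitrary choices for this sketch.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(2)     # few dimensions: nearest is much closer than farthest
high = relative_contrast(500)  # many dimensions: all points are about equally far
```

The contrast shrinks as the number of dimensions grows, which is why nearest-neighbour style comparisons lose meaning.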
In the Apriori frequent itemset generation algorithm, what is a hash tree used for?
Determining whether each enumerated k-itemset corresponds to an existing candidate itemset.
Describe the characteristics of the K-means algorithm
- deterministic? no (different results each time)
- susceptible to falling in a local minimum? yes
- need to specify # of clusters? yes
- is noise removed? no
- partitioned based? yes
- can handle varying densities? no (it struggles with clusters of differing densities, sizes, or non-globular shapes)
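A minimal sketch of K-means on 1-D values, assuming the usual assign-then-recompute loop with centroids initialized from randomly chosen data points. The function name and data are illustrative only.

```python
import random

def k_means_1d(values, k, seed=None, max_iter=100):
    """Cluster 1-D values into k groups; returns the sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)          # random start: not deterministic
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:                       # assign each value to nearest centroid
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]   # keep old centroid if cluster is empty
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:         # converged
            break
        centroids = new_centroids
    return sorted(centroids)

values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
centroids = k_means_1d(values, k=2, seed=0)
```

Every value is assigned to some cluster (no noise removal) and k must be given up front. On less cleanly separated data, different seeds can end in different local minima.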
What are the 3 types of data sets?
Record
- data matrix
- document data
- transaction data
Graph
- web data
- molecular structures
Ordered
- spatial
- temporal
- sequence
- sequential
What is boosting with respect to data mining and ensemble methods?
- Give more emphasis to specific examples that are difficult to classify by assigning them a higher weight, i.e. a greater probability of being selected.
- Records that are wrongly classified will have their weights increased.
- Records that are classified correctly will have their weights decreased.
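One way the weight update could look, assuming an AdaBoost-style rule (the card does not name a specific algorithm). The function name and the five-record example are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def update_weights(weights, misclassified):
    """Raise weights of misclassified records, lower the rest, then renormalize."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)   # importance of this round's classifier
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated]

weights = [0.2] * 5
misclassified = [False, True, False, False, False]
new_weights = update_weights(weights, misclassified)
```

After the update, the misclassified record is more likely to be sampled in the next round.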
What is Principal Component Analysis?
It’s a preprocessing technique that maps the data to lower dimensional space
It captures the largest variation
reduces noise and dimensions
the number of components to keep is chosen at the “knee in curve” of variance captured vs. number of components
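A small sketch of PCA restricted to 2-D data, so that the eigenvalues of the covariance matrix can be written in closed form without a linear algebra library. The function name and data points are made up; real use would rely on a numerical library.

```python
import math

def pca_2d(points):
    """Project 2-D points onto their first principal component.

    Returns (projected_values, fraction_of_variance_captured).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum(dx * dx for dx, _ in centered) / (n - 1)
    c = sum(dy * dy for _, dy in centered) / (n - 1)
    b = sum(dx * dy for dx, dy in centered) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector belonging to the largest eigenvalue
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    projected = [dx * vx + dy * vy for dx, dy in centered]
    return projected, lam1 / (lam1 + lam2)

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
projected, explained = pca_2d(points)
```

Here one component captures nearly all of the variation, so the second dimension can be dropped with little loss.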
Describe how self organizing maps work
Randomly initialize map
repeat:
compare a sample to all prototypes in the map
select the prototype that is most similar (the winner)
update the winner and its neighbourhood to be more similar to the sample
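A sketch of a single update step on a 1-D map of scalar prototypes, assuming the winner and its neighbours are all moved toward the sample with the same learning rate. The function name, learning rate, and radius are illustrative choices.

```python
def som_step(prototypes, sample, learning_rate=0.5, radius=1):
    """One SOM update: find the winner, then pull it and its neighbours toward the sample."""
    # Winner = prototype most similar (closest) to the sample
    winner = min(range(len(prototypes)), key=lambda i: abs(prototypes[i] - sample))
    updated = list(prototypes)
    for i in range(len(prototypes)):
        if abs(i - winner) <= radius:          # neighbourhood is defined on map positions
            updated[i] += learning_rate * (sample - prototypes[i])
    return winner, updated

winner, updated = som_step([0.0, 5.0, 10.0, 20.0], sample=4.0)
```

The prototype at map position 3 lies outside the neighbourhood and stays unchanged. In practice the learning rate and radius usually shrink over time.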
Why is sampling important with respect to a training and test set?
We want to reduce the amount of bias.
What are the 5 common clustering algorithms?
K-means
Hierarchical
DBSCAN
Expectation Maximization
Self Organizing Maps
What is spatial autocorrelation?
objects that are physically close tend to be similar
Describe the characteristics of expectation maximization
deterministic? no (random component at start)
susceptible to falling in a local minimum? yes
need to specify # of clusters? yes (run multiple times with different numbers of distributions and compare)
is noise removed? no
partitioned based? yes
can handle varying densities? not sure…
What are the 3 normalization schemes used in class?
min-max
z-score
decimal scaling
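The three schemes can be sketched as follows, using their standard definitions. The function names and sample values are illustrative, and the z-score here uses the population standard deviation.

```python
import statistics

def min_max(values):
    """Rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

values = [200, 400, 600, 800, 1000]
scaled_min_max = min_max(values)
scaled_z = z_score(values)
scaled_decimal = decimal_scaling(values)
```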
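What is the Mahalanobis distance?
- the distance between a point and the mean of the data (or between two points), with the difference scaled by the inverse of the covariance matrix of the attributes
- d(x, m) = sqrt( (x - m)^T S^-1 (x - m) ), where m is the mean and S is the covariance matrix
- an alternative to normalization (scaling is built in)
- takes into account the spread of the data in each direction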
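A minimal 2-D sketch of the Mahalanobis distance, with the covariance matrix passed in directly and inverted in closed form. The function name and the example covariance are assumptions for illustration.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Distance of point x from mean, scaled by the inverse 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))   # inverse of a 2x2 matrix
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    squared = (dx * (inv[0][0] * dx + inv[0][1] * dy)
               + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(squared)

# Data spread is wide along the first attribute (variance 4), narrow along the second (variance 1)
cov = ((4.0, 0.0), (0.0, 1.0))
along_wide = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov)
along_narrow = mahalanobis_2d((0.0, 2.0), (0.0, 0.0), cov)
```

Both points are 2 units from the mean in Euclidean terms, but the one lying along the narrow direction counts as farther away.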
What are the two different types of mappings?
1 of n
n of m
What is recall?
the fraction of positive examples correctly predicted by the classifier
r = TP / (TP + FN)
What are the 3 types of missing values?
Missing completely at random
- scratch random items out
- can substitute the mean, but this affects (shrinks) the variance
Missing at Random
- value missing is due to value of another variable
Non Ignorable Data
- Value missing is due to limitations of measuring device
Explain the 1 of n mapping:
create 1 new binary attribute for every distinct value of the original attribute
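A short sketch of the mapping, assuming each new attribute is a 0/1 indicator and exactly one of them is set per record. The function name and colour values are made up.

```python
def one_of_n(values):
    """Map each value to a list of 0/1 indicators, one per distinct value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

categories, encoded = one_of_n(["red", "green", "blue", "green"])
```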
What is the formula for the bias / variance tradeoff?
Mean Squared Error = bias^2 + variance
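The identity can be checked numerically for a set of estimates of a known true value. The estimates below are invented, and variance is taken as the population variance of the estimates.

```python
def decompose(estimates, true_value):
    """Return (bias, variance, mse) of a collection of estimates."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, variance, mse

bias, variance, mse = decompose([2.0, 4.0, 6.0], true_value=3.0)
```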
What is expectation maximization (EM)?
Represents data as distributions (usually normal distributions)
Randomly initialize parameters (mean and standard deviation)
- Expectation: calculate likelihood of each data point falling into each distribution based on the current parameters & assign data to distributions
- Maximization: find parameters that best represent the assignments
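A rough sketch of EM for two 1-D normal distributions. It assumes soft assignments (each point gets a responsibility for each distribution) and takes the starting parameters as arguments so the example is reproducible, whereas the card describes random initialization. All names and data are illustrative.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def em_two_gaussians(data, means, stds, n_iter=20):
    """Fit a mixture of two 1-D normal distributions; returns (means, stds, weights)."""
    means, stds = list(means), list(stds)
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # Expectation: responsibility of each distribution for each point
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # Maximization: parameters that best represent the assignments
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)   # floor avoids a zero-width distribution
            weights[k] = nk / len(data)
    return means, stds, weights

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
means, stds, weights = em_two_gaussians(data, means=[0.0, 12.0], stds=[1.0, 1.0])
```

With a poor starting point the same loop can settle in a local optimum, which is why it is commonly run several times.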
What is the difference between a training and test set?
Training set is used to build a model.
Test set is used to test a model.