Week 2 Flashcards
(15 cards)
Clustering is useful for
1) Targeted marketing, customer segmentation
2) Personalized medicine
3) Locating facilities
4) Image analysis
5) Data investigation
Infinity norm
The largest absolute value among a set of numbers (for a vector, the magnitude of its largest component)
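A minimal sketch of the definition above, using only built-in Python. The function name infinity_norm is chosen here for illustration.

```python
def infinity_norm(values):
    """Return the largest absolute value among the given numbers."""
    return max(abs(v) for v in values)


print(infinity_norm([3, -7, 2]))  # 7
```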
Data has two types of patterns
Real effect - a real relationship between attributes and response
Random effect - random variation that happens to look like a real effect
Flow chart for validation
1) Build models using the training set of data
2) Choosing between models?
3) If yes, choose the best model using the validation set of data, then estimate its quality using the test set of data. If no, estimate quality using the test set of data directly.
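The flow above could be sketched roughly as follows. Everything here is an illustrative assumption: the two toy candidate models (a constant predictor and a least-squares line), the mean squared error measure, and the tiny data sets are not from the cards.

```python
def fit_constant(train):
    """Toy model 1: always predict the mean response of the training set."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Toy model 2: least-squares line through the training points."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def select_and_evaluate(fitters, train, validation, test):
    # 1) Build every candidate model using the training set.
    models = [fit(train) for fit in fitters]
    # 2) Choosing between models? Only then is the validation set used.
    if len(models) > 1:
        best = min(models, key=lambda m: mean_squared_error(m, validation))
    else:
        best = models[0]
    # 3) Estimate the chosen model's quality using the test set.
    return best, mean_squared_error(best, test)


# Hypothetical data generated from y = 2x + 1.
train = [(0, 1), (1, 3), (2, 5), (3, 7)]
validation = [(4, 9), (5, 11)]
test = [(6, 13)]
best, test_error = select_and_evaluate([fit_constant, fit_line],
                                       train, validation, test)
print(test_error)
```

The test set is touched only once, after the choice has been made, so its error is not influenced by the selection step.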
When working with one model, what is the rule of thumb for splitting the data into two datasets?
70-90% training 10-30% test
When a validation set is also needed, what is the rule of thumb for splitting?
50-70% training
Split the rest evenly between validation and test
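One way to sketch this rule of thumb with a random shuffle. The function name random_split, the 60% default, and the fixed seed are assumptions made for the example.

```python
import random


def random_split(data, train_frac=0.6, seed=0):
    """Shuffle the data, keep train_frac for training, and split the
    remainder evenly between validation and test."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    remainder = shuffled[n_train:]
    n_validation = len(remainder) // 2
    return shuffled[:n_train], remainder[:n_validation], remainder[n_validation:]


train, validation, test = random_split(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```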
Splitting data approaches
Random: Randomly choose data points for training, then randomly choose from the remaining points for validation and test
Rotation: Take turns assigning points to each set in a fixed order
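A rotation split might look like the sketch below. The five-step pattern (three training points, then one validation point, then one test point) is an assumption chosen to give a 60/20/20 split.

```python
def rotation_split(data, pattern=("train", "train", "train", "validation", "test")):
    """Assign points to sets by cycling through a fixed pattern."""
    parts = {"train": [], "validation": [], "test": []}
    for i, point in enumerate(data):
        parts[pattern[i % len(pattern)]].append(point)
    return parts


parts = rotation_split(range(10))
print(parts["validation"])  # [3, 8]
```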
Advantage to rotation
Makes sure each part of the data is spread equally across the sets
Disadvantage to rotation
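Bias could be introduced if the rotation lines up with a repeating pattern in the data (for example, with 5-day work-week data and a 5-step rotation, every Monday lands in the same set).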
After cross-validation, we train the model again using all of the data. T/F
T
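A rough sketch of this idea, assuming k-fold cross-validation (the card does not name a specific variant). The mean-predictor model and squared-error measure are illustrative stand-ins.

```python
def fit_mean(train):
    """Toy model: always predict the mean response of the training data."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def mean_squared_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def k_fold_cross_validate(data, fit, error, k=4):
    # Split the data into k folds (every k-th point goes to the same fold).
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        # Train on every fold except fold i, then score on fold i.
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(error(fit(training), folds[i]))
    average_score = sum(scores) / k
    # After cross-validation, train the model again using all of the data.
    final_model = fit(data)
    return final_model, average_score


# Hypothetical data: responses 1..8.
data = [(x, float(x + 1)) for x in range(8)]
final_model, average_score = k_fold_cross_validate(data, fit_mean, mean_squared_error)
print(final_model(0), average_score)
```

The averaged fold score estimates quality; the returned model is the one fitted on the full data set.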
k-means algorithm
- Pick k cluster centers within range of data
- Assign each data point to nearest cluster center
- Recalculate cluster centers (centroids)
- Repeat until no changes
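The steps above can be sketched in plain Python. The choices here are assumptions for illustration: initial centers are sampled from the data points themselves, distance is squared Euclidean, an empty cluster keeps its old center, and max_iterations is a safety cap.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Pick k cluster centers within the range of the data
    # (here: k distinct data points chosen at random).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Assign each data point to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Repeat until no changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recalculate cluster centers (centroids) as the mean of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centers, assignment


# Hypothetical data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(sorted(centers))
```

Because the starting centers are random, different seeds can give different results on harder data, which is why k-means is often run several times.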
Heuristic
Fast and usually good, but not guaranteed to find the absolute best solution. k-means is an example of a heuristic.
k-means is an example of an EM algorithm. T/F
T
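Expectation step of k-means
Find cluster centers (each center is the mean, i.e. the expected value, of its assigned points)
Maximization step of k-means
Assign points to clusters (each point goes to its nearest center, which maximizes the negative distance). Note that many textbooks use the opposite labels, calling the assignment the E-step and the center update the M-step.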