Lecture 18, 19 & 20 Flashcards
(30 cards)
Why is Correlation useful?
Discover relationship/possible causality
One step towards finding Causality
What is Correlation?
A measure of whether, and how strongly, a pair (or set) of variables are related.
How can Correlations be identified visually?
Via Scatter Plot
Why is Correlation important?
Discover Relationships
Why is Correlation different to Causation?
Because two correlated variables may share a common cause rather than one causing the other, e.g. sunglasses sales and ice cream sales both rise in hot weather.
What is Euclidean distance?
The straight-line distance between two points x and y: d(x,y) = sqrt( sum_i (x_i - y_i)^2 ).
Why is Euclidean distance a poor measure of correlation?
- Objects on different scales make the numbers arbitrary
- Cannot discover similar behaviour at different scales
- Cannot discover negative correlation
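A minimal sketch of these weaknesses in plain Python. The function name `euclidean` and the sample series are illustrative assumptions, not taken from the lectures.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0, 4.0]
y_scaled = [100.0 * v for v in x]   # same behaviour as x, different scale
y_negated = [-v for v in x]         # perfectly negatively related to x

d_scaled = euclidean(x, y_scaled)
d_negated = euclidean(x, y_negated)
```

Here `d_scaled` comes out far larger than `d_negated`, even though both series are perfectly (linearly) related to `x`. The distance reflects the scale of the values, not the relationship.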
Advantages of Pearson’s correlation
- Range within [ -1 , 1 ]
- Scale Invariant: r(x,y)=r(x,Ky), K is real positive constant
- Location Invariant: r(x,y)=r(x,y+C), C is any real constant
Disadvantage of Pearson’s correlation
Cannot detect non-linear relationships
How is Pearson’s correlation calculated?
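r(x,y) = sum_i (x_i - mean(x))(y_i - mean(y)) / sqrt( sum_i (x_i - mean(x))^2 * sum_i (y_i - mean(y))^2 )
Practise computing it by hand.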
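For checking hand calculations, a minimal sketch of the standard Pearson formula in plain Python. The function name `pearson` and the sample data are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson's r: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson(x, y)                              # about 0.775
r_scaled = pearson(x, [3.0 * v for v in y])    # multiply y by a positive constant
r_shifted = pearson(x, [v - 7.0 for v in y])   # shift y by a constant
r_parabola = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2
```

`r_scaled` and `r_shifted` match `r`, which illustrates the scale and location invariance. `r_parabola` is 0 even though y is fully determined by x, which illustrates the blind spot for non-linear relationships.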
Advantages of Mutual Information?
- Range within [0, 1] when normalised (raw mutual information is non-negative but has no fixed upper bound)
- Detects non-linear relationships
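A minimal sketch of normalised mutual information on already-discretised values. It assumes the normalisation MI / min(H(X), H(Y)); other normalisations exist and the lectures may use a different one. All function names and the sample data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y), using the joint labels (x_i, y_i)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def normalised_mi(x, y):
    """Assumed normalisation: divide by the smaller of the two entropies."""
    denom = min(entropy(x), entropy(y))
    if denom == 0:
        return 0.0  # a constant variable carries no information
    return mutual_information(x, y) / denom

# y is "big" at both ends of x and "small" in the middle (a non-linear pattern).
x_bins = ["low", "low", "mid", "mid", "high", "high"]
y_bins = ["big", "big", "small", "small", "big", "big"]
nmi_dependent = normalised_mi(x_bins, y_bins)

# Every combination occurs equally often, so the variables are independent.
nmi_independent = normalised_mi(["a", "a", "b", "b"], [0, 1, 0, 1])
```

`nmi_dependent` is 1 (up to floating-point rounding) because knowing x fixes y, and `nmi_independent` is 0.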
What is Variable Discretization?
Converting from continuous to discrete values via bins
What methods of Variable Discretization are there?
Domain knowledge, equal-width bins, equal-frequency bins
What is Domain Knowledge Variable Discretization?
Manually assigning thresholds based on knowledge of the domain, e.g. speed:
- 0-40km/h Slow
- 40-70km/h Medium
- 70km/h+ Fast
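The speed thresholds above could be applied like this. The function name `speed_category` is hypothetical, and sending the boundary values 40 and 70 to the higher bin is an assumption, since the card leaves the boundaries ambiguous.

```python
def speed_category(speed_kmh):
    """Map a speed in km/h to a manually chosen category."""
    if speed_kmh < 0:
        raise ValueError("speed cannot be negative")
    if speed_kmh < 40:
        return "Slow"
    if speed_kmh < 70:   # assumption: exactly 40 counts as Medium
        return "Medium"
    return "Fast"        # assumption: exactly 70 counts as Fast

categories = [speed_category(s) for s in [12, 40, 69.9, 70, 110]]
```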
What is Equal-width bin Variable Discretization?
Where every bin spans the same width of values.
What is Equal-frequency bin Variable Discretization?
Where every bin holds (roughly) the same number of points.
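A minimal sketch contrasting the two binning methods on data with one outlier. The function names and sample values are illustrative assumptions; each function returns a bin index (0 to k-1) per value.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # all values identical: one bin
    width = (hi - lo) / k
    # The maximum would compute to index k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Rank the values and give each bin about len(values) / k of them."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n  # note: tied values may land in different bins
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = equal_width_bins(values, 3)      # [0, 0, 0, 0, 0, 0, 0, 0, 2]
freq_bins = equal_frequency_bins(values, 3)   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With equal-width bins the outlier 100 stretches the range, so eight of the nine points fall into the first bin and the middle bin is empty. Equal-frequency bins stay balanced.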
What is Entropy?
A measure of the information content (uncertainty) of a variable.
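A short worked example, assuming the usual Shannon definition with log base 2 (so entropy is in bits), where p_i is the proportion of points in bin i:

```latex
H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

% Two bins, each holding half of the points: maximum uncertainty for two bins
H(X) = -\left( \tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2} \right) = 1 \text{ bit}

% One bin holding every point: no uncertainty
H(X) = -\left( 1 \cdot \log_2 1 \right) = 0 \text{ bits}
```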
What is Classification?
Given a training data set, find a model that predicts the class attribute as a function of the values of the other attributes.
What is the goal of Classification?
To assign a class to previously unseen records as accurately as possible.
What is Regression?
Given a training data set, learn a model that predicts a continuous (numeric) target value from the other attributes.
What is required by a K Nearest Neighbour Classifier?
- Set of stored (labelled training) records
- Metric to compute distance between records
- The value of k, i.e. the number of neighbours to retrieve.
What is the methodology of a K Nearest Neighbour Classifier?
- Compute the distance from the unknown record to every training record (e.g. Euclidean distance, possibly with weights)
- Identify the k nearest neighbours
- Use the classes of those neighbours to determine the class of the unknown record (e.g. by majority vote)
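The three steps above can be sketched as follows. This is a minimal version assuming unweighted Euclidean distance and a simple majority vote; the function name `knn_classify` and the toy records are illustrative. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (features, label) pairs; query: a feature sequence."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training records")
    # Step 1: compute the distance to every training record and sort by it.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Step 2: keep the k nearest neighbours.
    neighbours = by_distance[:k]
    # Step 3: majority vote; on a tie the class seen first (nearest) wins.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]
near_a = knn_classify(train, (1.2, 1.4), k=3)   # "A"
near_b = knn_classify(train, (6.4, 6.4), k=3)   # "B"
```

Sorting every record is the simplest approach, and it makes the storage and search cost of a large training set visible: each query touches all stored points.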
Problems with K Nearest Neighbour?
- k needs to be selected carefully (too small is sensitive to noise, too large pulls in points from other classes)
- A large number of points adds storage cost and search cost
How do you calculate accuracy?
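(TP+TN)/(TP+TN+FP+FN), where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives.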
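A minimal sketch of the formula. The function name `accuracy` and the counts are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total

# 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
acc = accuracy(tp=40, tn=45, fp=5, fn=10)   # 85 correct out of 100 = 0.85
```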