Lecture 18, 19 & 20 Flashcards

1
Q

Why is Correlation useful?

A

Discover relationship/possible causality

One step towards finding Causality

2
Q

What is Correlation?

A

A measure of whether there is a relationship between a pair or set of variables.

3
Q

How can Correlations be identified visually?

A

Via Scatter Plot

4
Q

Why is Correlation important?

A

Discover Relationships

5
Q

Why is Correlation different to Causation?

A

Because two correlated variables may share a common cause, e.g. sunglasses sales and ice cream sales both rise in sunny weather.

6
Q

What is Euclidean distance?

A

The straight-line distance between two points x and y.
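
A minimal illustration (not from the slides): d(x, y) = sqrt( sum_i (x_i - y_i)^2 ), computed here in plain Python.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Example: the distance between (1, 2) and (4, 6) is sqrt(3^2 + 4^2) = 5.0
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```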

7
Q

What are the problems with Euclidean distance as a similarity measure?

A
  • Objects measured on different scales make the raw distance values arbitrary
  • Cannot discover similar behaviour at different scales
  • Cannot discover negative correlation (see the sketch below)
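
A small illustration of these problems (made-up numbers, not from the slides): two series with identical behaviour at different scales look far apart, and a perfectly negatively correlated series is not flagged at all.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

x        = [1, 2, 3, 4, 5]
scaled   = [10, 20, 30, 40, 50]   # same behaviour at 10x the scale
negative = [5, 4, 3, 2, 1]        # perfect negative relationship with x

print(euclidean_distance(x, scaled))    # ~66.7: "far apart" despite identical behaviour
print(euclidean_distance(x, negative))  # ~6.3: gives no hint of the negative correlation
```
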
8
Q

Advantages of Pearson’s correlation

A
  • Range within [-1, 1]
  • Scale Invariant: r(x, y) = r(x, K·y), where K is a real positive constant
  • Location Invariant: r(x, y) = r(x, y + C), where C is a real positive constant
9
Q

Disadvantage of Pearson’s correlation

A

Cannot detect non-linear relationships

10
Q

How is Pearson’s correlation calculated?

A

r(x, y) = sum_i (x_i - mean(x)) * (y_i - mean(y)) / sqrt( sum_i (x_i - mean(x))^2 * sum_i (y_i - mean(y))^2 ), i.e. the covariance of x and y divided by the product of their standard deviations (see the sketch below).
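
A minimal sketch of the calculation in plain Python (the example data are made up):

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov  = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  #  1.0  perfect positive linear relationship
print(pearson_r(x, [10, 8, 6, 4, 2]))  # -1.0  perfect negative linear relationship
print(pearson_r(x, [4, 1, 0, 1, 4]))   #  0.0  y = (x - 3)^2: non-linear, so undetected
```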

11
Q

Advantages of Mutual Information?

A
  • Range within [0, 1]
  • Can detect non-linear relationships (see the sketch below)
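
A minimal sketch for already-discretized (binned) variables, with made-up data. Note the raw value computed below is only guaranteed to be non-negative; the [0, 1] range on this card presumably refers to a normalized variant.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """I(X;Y) = sum_{a,b} p(a,b) * log2( p(a,b) / (p(a) * p(b)) ) for discrete x and y."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

# Binned values of two variables with a clear but non-linear relationship
x = ["low", "low", "mid", "mid", "high", "high"]
y = ["cold", "cold", "warm", "warm", "cold", "cold"]
print(mutual_information(x, y))  # ~0.92 > 0: the relationship is detected
```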

12
Q

What is Variable Discretization?

A

Converting from continuous to discrete values via bins

13
Q

What methods of Variable Discretization are there?

A

Domain knowledge, equal-width binning, and equal-frequency binning (see the sketch below)
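
An illustrative sketch of the three methods, assuming pandas is available; the speed values are made up, and the hand-picked thresholds are the slow/medium/fast example from the next card.

```python
import pandas as pd

speeds = pd.Series([12, 25, 33, 41, 47, 52, 68, 75, 90, 110])

# Domain knowledge: thresholds chosen by hand (slow / medium / fast)
domain = pd.cut(speeds, bins=[0, 40, 70, float("inf")], labels=["slow", "medium", "fast"])

# Equal-width: every bin spans the same length of the value range
equal_width = pd.cut(speeds, bins=3)

# Equal-frequency: every bin holds (roughly) the same number of points
equal_freq = pd.qcut(speeds, q=3)

print(domain.value_counts())
print(equal_width.value_counts())
print(equal_freq.value_counts())
```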

14
Q

What is Domain Knowledge Variable Discretization?

A

Manually assigning thresholds, e.g. speed:

  • 0-40 km/h: slow
  • 40-70 km/h: medium
  • 70+ km/h: fast
15
Q

What is Equal-width bin Variable Discretization?

A

Where every bin covers the same length (width) of the value range

16
Q

What is Equal Frequency Variable Discretization?

A

Where bins have the same number of points.

17
Q

What is Entropy?

A

A measure of the information content (uncertainty) of a distribution
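
A minimal sketch of the usual formula, H = -sum_j p_j * log2(p_j), with made-up class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_j p_j * log2(p_j), where p_j is the fraction of items in class j."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # 0.0   -> pure: no uncertainty
print(entropy(["a", "a", "b", "b"]))  # 1.0   -> two equally likely classes
print(entropy(["a", "a", "a", "b"]))  # ~0.81 -> in between
```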

18
Q

What is Classification?

A

Given a training data set, find a model that predicts the class attribute as a function of the values of the other attributes.

19
Q

What is the Goal of Classification?

A

To assign a class, as accurately as possible, to previously unseen records.

20
Q

What is Regression?

A

Given a training data set, learn a predictive model that maps the input attributes to a continuous (numeric) target value.

21
Q

What is required by K Nearest neighbour Classifier?

A
  • Set of records
  • Metric to compute distance between records
  • The value of k, i.e. the number of neighbours to retrieve.
22
Q

What is the methodology of K Nearest neighbour Classifier?

A
  • Compute the distance to the other training records (e.g. Euclidean distance, possibly weighted)
  • Identify the k nearest neighbours
  • Use the classes of those neighbours to determine the class of the unknown record (see the sketch below)
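
A minimal sketch of those three steps, using toy data, plain Euclidean distance, and a majority vote:

```python
import math
from collections import Counter

def knn_classify(records, labels, query, k=3):
    """Find the k training records closest to the query and take a majority vote over their classes."""
    distances = sorted((math.dist(r, query), lbl) for r, lbl in zip(records, labels))
    top_k = [lbl for _, lbl in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

records = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
labels  = ["red", "red", "blue", "blue", "blue"]
print(knn_classify(records, labels, query=(4.5, 4.5), k=3))  # blue
```
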
23
Q

Problems with K Nearest Neighbour?

A
  • K needs to be selected carefully
  • A large number of points adds storage cost and search cost

24
Q

How do you calculate accuracy?

A

Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the fraction of all predictions that are correct.
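
A worked example with made-up confusion-matrix counts:

```python
# Hypothetical counts out of 100 predictions
TP, TN, FP, FN = 40, 45, 10, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.85 -> 85% of predictions were correct
```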

25
Q

How do decision trees work?

A

Recursively split the training records using attribute test conditions: each internal node tests an attribute, each branch corresponds to an outcome of the test, and each leaf assigns a class label. A new record is classified by following the test outcomes from the root down to a leaf.
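
An illustrative sketch using scikit-learn's DecisionTreeClassifier (not necessarily the library used in the lectures; the training data are made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training set: [hours_studied, hours_slept] -> pass/fail
X = [[1, 4], [2, 5], [3, 8], [6, 7], [8, 6], [9, 8]]
y = ["fail", "fail", "fail", "pass", "pass", "pass"]

tree = DecisionTreeClassifier(criterion="entropy")  # split on entropy-based impurity
tree.fit(X, y)
print(tree.predict([[7, 7]]))  # ['pass']
```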

26
Q

Problems with decision trees?

A
  • Determining how to split the values
  • Determining when to stop splitting

27
Q

How do you specify test condition for a tree?

A

It depends on the attribute type (nominal, ordinal, or continuous) and on the number of splits (binary vs multi-way)

28
Q

How do you determine the best split?

A

Prefer splits that produce nodes with a homogeneous class distribution, i.e. a low level of impurity.

29
Q

How is Entropy used to calculate impurity?

A

Via the entropy formula Entropy(t) = -sum_j p(j|t) * log2 p(j|t); it is 0 when all records in the node belong to one class (minimum impurity) and maximal when the classes are equally likely.

30
Q

How do you determine how good a split is?

A

Compare the impurity of the parent node before the split with the weighted impurity of the child nodes after the split; the reduction is the information gain: Gain = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i). The larger the gain, the better the split (see the sketch below).
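
A minimal sketch of that comparison with made-up class labels:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Gain = Entropy(parent) - weighted average of the children's entropies."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Illustrative split of 8 records (4 "yes", 4 "no") into two child nodes
parent = ["yes"] * 4 + ["no"] * 4
good_split = [["yes", "yes", "yes", "no"], ["no", "no", "no", "yes"]]
bad_split  = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]
print(information_gain(parent, good_split))  # ~0.19: impurity is reduced
print(information_gain(parent, bad_split))   #  0.0 : the split tells us nothing
```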