Exam Flashcards

1
Q

what is statistical learning (also known as machine learning)?

A

relies on the idea that algorithms can learn from data

1
Q

supervised learning?

A

is task-driven and the data is labelled

2
Q

target variable (supervised learning)

A

a variable that we need to gain more information on, or predict the value of.

3
Q

true values of the target variable are called?

A

labels

4
Q

predictors?

A

they are used in predictive analytics to make predictions on the target

5
Q

target variables could be:

A

continuous or discrete

6
Q

discrete

A

can have two levels (binary target) or multiple levels

7
Q

classification

A

is used to predict the value of a discrete target variable, given predictor variable values.

8
Q

continuous target

A

can take any value within a range, so it has a very large (often infinite) number of possible outcomes

9
Q

regression:

A

is used to predict the value of a continuous target variable

10
Q

Cross-validation

A

technique that evaluates predictive models by partitioning the original data into a training set used to build the model and a testing set used to evaluate it

11
Q

training set

A

to build (train) the model

12
Q

testing set

A

to evaluate it
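
A minimal sketch of the training/testing partition, using only the Python standard library. The helper names (train_test_split, k_fold_splits), the 25% test fraction and the made-up data are assumptions for illustration, not part of the cards.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_splits(rows, k):
    """Yield (train, test) pairs; each row lands in a test set exactly once."""
    folds = [rows[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test

data = list(range(20))                      # made-up observations
train, test = train_test_split(data)        # single hold-out split
folds = list(k_fold_splits(data, k=4))      # repeated partitioning (k-fold)
print(len(train), len(test), len(folds))
```

The model is built on the training rows only; the testing rows are held back so the evaluation reflects data the model has not seen.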

13
Q

overfitting

A

when the algorithm fits the training data so closely that it does not generalize well to new, unseen data

14
Q

classification

A

when the value to be predicted is a categorical variable, the supervised learning is of type classification

15
Q

Regression

A

when the value to be predicted is a numerical variable the supervised learning is of type regression

16
Q

Unsupervised Learning

A

there is no target variable; the algorithm needs to come up with the assignment based on the data alone

17
Q

Clustering

A

no known classes or categories. the algorithm tries to learn similarities and discover groups of similar data points

18
Q
A
19
Q

association

A

tries to find relationships between different variables.

20
Q

parametric

A

rely on the estimation of parameters of a function, or set of functions, for the purpose of prediction.

21
Q

non-parametric

A

do not rely on parameter estimation in order to predict outcome

22
Q

hyper-parameters

A

settings that are chosen before training rather than estimated from the data. even a non-parametric model may still involve the determination of such settings

23
Q

Inherent Error

A

unavoidable. also called ‘noise’ or ‘irreducible error’

24
Q

Bias

A

due to over-simplifications.

25
Q

variance

A

due to over-complication. an overly complex model fits the noise in the training data, so it is unable to generalize and correctly predict the target variable on new data

26
Q

K-nearest Neighbors

A

algorithm assigns each data point to a class based on the classes of its k nearest points (majority vote)
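
A minimal sketch of the idea, assuming Euclidean distance and a majority vote among the k nearest training points. The function name knn_predict, the choice k=3 and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label,
    # then sort so the nearest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up data: two small groups labelled "a" and "b".
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1.1, 1.0)))  # prints a
```

Because the vote depends on distances, features are usually brought onto the same scale before using this method.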

27
Q

classification report

A

provides information on different aspects of the classifier

28
Q

precision

A

proportion of correct positive (event) predictions to all positive predictions
precision = TP / (TP + FP)

29
Q

Recall

A

recall for class x indicates the proportion of correct positive predictions to all actual positive cases.
recall = TP / (TP + FN)

30
Q

F1-score

A

the harmonic mean of the precision and recall for each class

31
Q

support

A

support indicates the number of actual occurrences of each class in our testing data

32
Q

accuracy

A

number of correct predictions over the total number of predictions
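
The classification report quantities can be worked out from the four confusion-matrix counts. This is a small sketch for a binary problem; the function name and the counts (TP=40, FP=10, FN=20, TN=30) are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f1, accuracy) for the positive class."""
    precision = tp / (tp + fp)            # correct positives / predicted positives
    recall = tp / (tp + fn)               # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / all predictions
    return precision, recall, f1, accuracy

precision, recall, f1, accuracy = classification_metrics(tp=40, fp=10, fn=20, tn=30)
support_positive = 40 + 20                # actual positive cases in the test data
print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# prints 0.8 0.667 0.727 0.7
```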

33
Q

ROC Curve (Receiver Operating Characteristic curve)

A

the ROC curve is a plot that depicts how the true positive rate changes with respect to the false positive rate as the classification threshold is varied

34
Q

in ROC FalsePositiveRate should be

A

close to 0

35
Q

in ROC TruePositiveRate should be

A

close to 1
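
A rough sketch of how the points of an ROC curve arise: each threshold on the predicted score gives one (false positive rate, true positive rate) pair. The function name roc_points, the scores and the thresholds are assumptions chosen for illustration.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold; predict positive if score >= threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, y_true))
        fp = sum(p and y == 0 for p, y in zip(predicted, y_true))
        points.append((fp / negatives, tp / positives))
    return points

y_true = [0, 0, 1, 1]                 # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]        # made-up predicted probabilities
roc = roc_points(y_true, scores, thresholds=[0.9, 0.5, 0.3, 0.0])
print(roc)  # prints [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

A good classifier has points near the top-left corner, where the false positive rate is close to 0 and the true positive rate is close to 1.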

36
Q

Scaling

A

helps us bring all features into the same scale
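
Two common ways to scale a feature, sketched with the standard library: standardization (mean 0, standard deviation 1) and min-max scaling (range 0 to 1). The card does not say which one is meant, so both are shown as assumptions; the function names and ages are made up.

```python
import statistics

def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def min_max_scale(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 50, 60]           # made-up feature values
print(min_max_scale(ages))            # prints [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardize(ages)])
```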

37
Q

Logistic Regression

A

models the probability of an event. the predicted probability is often mapped to a binary outcome

38
Q

Logistic regression falls into?

A

supervised learning of the classification type

39
Q

Probabilities need to satisfy two conditions

A

never be negative
always be between 0 and 1 (inclusive)

40
Q

odds of an event

A

is the probability of that event over the probability of its complement: odds = p / (1 - p)

41
Q

While probability of an event is always between 0 and 1

A

the odds could be any non-negative value

42
Q

b0,b1 (Logistic Regression)

A

are the estimated parameters of the model: b0 is the intercept and b1 is the coefficient (weight) of the feature. also called the weights or coefficients of the features
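
A small sketch tying probability, odds and the coefficients together for a single feature x, assuming the usual logistic form p = 1 / (1 + exp(-(b0 + b1*x))). The coefficient values below are hypothetical, not fitted to any data.

```python
import math

def predict_probability(x, b0, b1):
    """Logistic model: map the linear score b0 + b1*x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds of an event: its probability over the probability of its complement."""
    return p / (1.0 - p)

b0, b1 = -4.0, 2.0                       # hypothetical coefficients
p = predict_probability(2.0, b0, b1)     # linear score is 0, so p is 0.5
print(p, odds(p))                        # prints 0.5 1.0
print(odds(0.75))                        # prints 3.0
```

Under this form, the log of the odds equals b0 + b1*x, which is why the odds can be any non-negative value while the probability stays between 0 and 1.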

43
Q

Data profiling

A

understanding what the data entails and identifying anomalies, missing values, inconsistencies, etc.

44
Q

data cleansing

A

activities include imputing missing values, removing missing values, addressing outliers, fixing variables that have inconsistent data

45
Q

data structuring

A

bringing data into a structured form used for the analysis

46
Q

data transformation

A

data may need to be transformed, rescaled or normalized

47
Q

Data collection

A

if the data is not provided to us, we need to collect it

48
Q

Simple random sampling

A

each member of the population has the exact same probability of being selected in the sample

49
Q

Systematic sampling

A

members of the population are selected based on a system (set of rules)

50
Q

stratified sampling

A

population is divided into homogeneous slices (strata). Within each slice simple random sampling is performed and the results are combined (reduces sampling bias and improves accuracy of sampling)

51
Q

Cluster sampling

A

the population is divided into subgroups (clusters), such that each cluster is a good representative of the population. a random selection of whole clusters is then sampled.
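
A minimal sketch of three of the sampling schemes above with the standard library. The population of 100 members, the "group" field used as the stratum, and the 10% sampling rate are all made up for illustration.

```python
import random

rng = random.Random(42)
# Made-up population: 60 members in group "A", 40 in group "B".
population = [{"id": i, "group": "A" if i < 60 else "B"} for i in range(100)]

# Simple random sampling: every member has the same chance of selection.
simple = rng.sample(population, 10)

# Systematic sampling: random start, then every step-th member.
step = len(population) // 10
start = rng.randrange(step)
systematic = population[start::step]

# Stratified sampling: simple random sampling within each stratum, then combine.
stratified = []
for group in ("A", "B"):
    stratum = [m for m in population if m["group"] == group]
    stratified.extend(rng.sample(stratum, len(stratum) // 10))

print(len(simple), len(systematic), len(stratified))  # prints 10 10 10
```

The stratified sample keeps the 60/40 group proportions of the population (6 from "A", 4 from "B"), which simple random sampling only achieves on average.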

52
Q

lower fence

A

Q1 - 1.5 × IQR

53
Q

upper fence

A

Q3 + 1.5 × IQR

54
Q

IQR

A

Q3 - Q1

55
Q

data point is an outlier if

A

it is smaller than the lower fence or larger than the upper fence.
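
A short sketch of the fence rule, taking the lower fence as Q1 - 1.5 × IQR and the upper fence as Q3 + 1.5 × IQR (the standard rule). It uses statistics.quantiles with its default method; other quartile conventions give slightly different fences. The function name and data are made up.

```python
import statistics

def outlier_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 50]          # made-up data
lower, upper = outlier_fences(values)
outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # prints [50]
```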

56
Q

Dummy variables

A

variables created by one-hot encoding: one 0/1 indicator variable per category, used in place of the original categorical variable

57
Q

Label encoding

A

each category of the categorical variable is assigned a number based on some order

58
Q

Regression

A

is a mathematical relationship between the features of a problem and the target variable that is to be predicted.

59
Q

Linear regression

A

is a parametric method, requires a response variable (target) and one or multiple predictor variables (features)

60
Q

the least squares method

A

produces a line that minimizes the sum of squared errors (residuals)

61
Q

y and y hat

A

y is the actual value of the target variable

y-hat is the predicted value of the target variable

62
Q

e (the residual)

A

is the difference between y and y-hat: e = y - y-hat
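
A minimal sketch of least squares for one predictor, using the closed-form slope and intercept. The function name fit_line and the data points are made up; a real analysis would normally use a statistics library.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals of y = b0 + b1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx                  # slope
    b0 = y_mean - b1 * x_mean       # intercept
    return b0, b1

xs = [1, 2, 3, 4, 5]                # made-up predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]     # made-up target values (y)
b0, b1 = fit_line(xs, ys)
y_hat = [b0 + b1 * x for x in xs]               # predicted values (y-hat)
residuals = [y - yh for y, yh in zip(ys, y_hat)]  # e = y - y-hat
print(round(b0, 2), round(b1, 2))   # prints 0.05 1.99
```

With an intercept in the model, the residuals of a least squares fit sum to zero.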

63
Q

R^2

A

coefficient of determination

64
Q

MSE

A

Mean Squared Error

65
Q

RMSE

A

Root Mean Squared Error

66
Q

Coef of Determination

A

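is an indicator of the goodness of our model's fit to the data: the proportion of the variation in the target that the model explains. usually between 0 and 1 (it can be negative when the model fits worse than simply predicting the mean), a higher value is preferred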

67
Q

Mean Squared Error

A

measure that evaluates the average of the squared deviation between the actual values of the target and the predicted values of the target. Smaller values of MSE are preferred. a value of 0 is ideal but rarely attainable in practice

68
Q

root mean squared error

A

RMSE is the square root of MSE. it indicates the typical amount of deviation of data points from the regression line, in the same units as the target

69
Q

Adj R^2

A

explicitly accounts for the number of explanatory variables. It is common to use adjusted R^2 for model selection because it imposes a penalty for any additional explanatory variable that is included in the analysis. It only increases when a newly added variable contributes enough to the prediction to offset that penalty.
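
The regression metrics above can be computed in a few lines. This sketch assumes the usual definitions: R^2 = 1 - SSE/SST and adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name and the values are made up.

```python
import math

def regression_metrics(y, y_hat, n_predictors):
    """Return (mse, rmse, r2, adj_r2) for actual values y and predictions y_hat."""
    n = len(y)
    y_mean = sum(y) / n
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # sum of squared errors
    sst = sum((a - y_mean) ** 2 for a in y)             # total sum of squares
    mse = sse / n
    rmse = math.sqrt(mse)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return mse, rmse, r2, adj_r2

y = [3.0, 5.0, 7.0, 9.0]            # made-up actual values
y_hat = [2.5, 5.5, 6.5, 9.5]        # made-up predictions
mse, rmse, r2, adj_r2 = regression_metrics(y, y_hat, n_predictors=1)
print(mse, rmse, round(r2, 3), round(adj_r2, 3))  # prints 0.25 0.5 0.95 0.925
```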

70
Q

decision trees

A

the repeated splitting of nodes until we reach pure subsets is the building block of the classification and regression trees (CART) algorithm

71
Q

When the target variable is categorical

A

the decision tree is a classification tree

72
Q

when the target is numerical

A

the decision tree is a regression tree

73
Q

Gini Index

A

measures the degree of impurity of a set of classes in the target variable
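
A small sketch assuming the usual CART definition, Gini = 1 - sum of squared class proportions. The function name and labels are made up.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a set of class labels: 0 means the set is pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_index(["yes", "yes", "yes", "yes"]))   # prints 0.0
print(gini_index(["yes", "yes", "no", "no"]))     # prints 0.5
```

A split is attractive when the child nodes it produces have lower impurity than the parent node.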

74
Q

K-means algorithm step 1

A

randomly pick k centroids from the sample points as initial cluster centers

75
Q

K-means algorithm step 2

A

assign each sample to the nearest centroid

76
Q

K-means algorithm step 3

A

move each centroid to the center (mean) of the samples that were assigned to it

77
Q

K-means algorithm step 4

A

repeat steps 2 and 3 until the cluster assignments no longer change or the maximum number of iterations is reached
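
The four steps can be sketched in plain Python as follows. This is an illustration under assumptions, not a production implementation: the function name k_means, the seed, the Euclidean distance and the early stop when assignments no longer change are choices made here.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignments) after running the four steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random initial centroids
    assignments = None
    for _ in range(max_iter):                  # step 4: repeat steps 2 and 3
        # Step 2: assign each sample to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:     # nothing moved: stop early
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned samples.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Made-up data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, assignments = k_means(points, k=2)
print(sorted(centroids))
```

On data that is less cleanly separated, the result can depend on the random starting centroids, so the algorithm is often run several times.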

78
Q

elbow method

A

find the value of k at which the decrease in inertia starts to slow down as k increases (the "elbow" of the inertia-versus-k plot).

79
Q

inertia

A

sum of squared distances between data points in each cluster and their cluster centre
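
A short sketch of inertia and the elbow idea. The clusterings for k = 1, 2 and 4 below are picked by hand for illustration rather than produced by a clustering run; the function name and points are made up.

```python
def inertia(points, assignments, centroids):
    """Sum of squared distances from each point to its assigned cluster centre."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        center = centroids[cluster]
        total += sum((a - b) ** 2 for a, b in zip(point, center))
    return total

points = [(0, 0), (0, 2), (10, 0), (10, 2)]           # two obvious groups
inertia_k1 = inertia(points, [0, 0, 0, 0], [(5, 1)])
inertia_k2 = inertia(points, [0, 0, 1, 1], [(0, 1), (10, 1)])
inertia_k4 = inertia(points, [0, 1, 2, 3], points)    # every point is its own centre
print(inertia_k1, inertia_k2, inertia_k4)             # prints 104.0 4.0 0.0
```

Inertia drops sharply from k = 1 to k = 2 and only slightly afterwards, so the elbow suggests k = 2 for this data.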