Exam info Flashcards

1
Q

NUMERICAL VARIABLES

A

Numerical:
• Continuous (entities get a distinct score on a continuous scale), e.g. temperature, body length.
• Discrete (counts), e.g. number of defects.

2
Q

CATEGORICAL VARIABLES

A

Categorical (entities are divided into distinct categories):
• Binary variable (two outcomes), e.g. dead or alive.
• Ordinal variable (categories have a natural order), e.g. bad, intermediate, good.
• Nominal variable (order not important), e.g. whether someone is an omnivore, vegetarian or vegan.

3
Q

HYPOTHESIS TESTING

A
  1. State the null hypothesis H0 and the alternative hypothesis Ha.
  2. Collect evidence (data).
  3. Can H0 be maintained, given the evidence?
    if p-value ≤ α (here α = 0.05): reject H0
    if p-value > α: do not reject H0
  4. State the conclusion: at the α significance level (e.g. 5%), there is (not) sufficient statistical evidence to infer …
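
The decision in step 3 can be sketched as a small function. This is only an illustration: the name `decide` and the default `alpha=0.05` are assumptions chosen to match the threshold on this card.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    return "Reject H0" if p_value <= alpha else "Do not reject H0"


print(decide(0.03))  # Reject H0
print(decide(0.20))  # Do not reject H0
```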
4
Q

Types of Hypothesis errors

A
  1. Type I error (probability α): Reject H0 when H0 is true – Jury convicts an innocent person.
  2. Type II error (probability β): Do not reject H0 when H0 is false – Jury acquits a guilty person.
  3. Correct decision: Reject H0 when H0 is false – Jury convicts a guilty person.
  4. Correct decision: Do not reject H0 when H0 is true – Jury acquits an innocent person.
5
Q

Confidence interval

A

Confidence interval - an interval of numbers built around a point estimate, together with an associated confidence level. The confidence level is the proportion of such intervals, over repeated sampling, that contain the population parameter.
• Confidence intervals have the general form:
Point Estimate ± Margin of Error
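
A minimal sketch of the general form, assuming a large-sample (z-based) interval for a mean; for small samples a t multiplier would normally replace z. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Large-sample confidence interval: point estimate +/- margin of error."""
    point_estimate = mean(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin_of_error = z * stdev(sample) / sqrt(len(sample))
    return point_estimate - margin_of_error, point_estimate + margin_of_error


# Hypothetical sample: customer service calls for 30 customers
calls = [1, 0, 2, 3, 1, 1, 0, 2, 4, 1, 2, 0, 1, 3, 2,
         1, 0, 1, 2, 2, 1, 3, 0, 1, 2, 1, 1, 0, 2, 3]
low, high = mean_confidence_interval(calls)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```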

6
Q

Statistical Inference

A

Methods for estimating/predicting and testing hypotheses about population
characteristics based on information contained in a sample.

7
Q

Population

A

A population is the collection of all elements of interest for a particular study.

8
Q

Parameter

A

A parameter is a characteristic of a population (e.g. the mean number of customer service calls of all customers).

9
Q

Sample

A

A sample is a subset of the population, ideally representative of it.

10
Q

Statistic

A

A statistic is a characteristic of a sample (e.g. the mean number of customer service calls of the 5000 customers in the sample, 1.563).

11
Q

Sample Proportion

A

The sample proportion p̂ is the statistic used to estimate the unknown value of the population proportion π.

12
Q

Point estimation

A

Use of a single known value of a statistic to estimate the associated population
parameter.

13
Q

p-value

A

The probability of observing a sample statistic at least as extreme as the statistic actually observed, assuming that H0 is true.
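
As a sketch, assuming the test statistic is a standard normal Z and the alternative is two-sided, the p-value is the area in both tails beyond the observed value. The function name is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal test statistic Z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


print(round(two_sided_p_value(1.96), 3))  # 0.05
```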

14
Q

1-sample t-test

A

H0: μ = μ0
Can be used for a numerical variable.

The test statistic is t, which follows a t distribution under H0.
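
A minimal sketch of the test statistic, t = (x̄ - μ0) / (s / √n) with n - 1 degrees of freedom. The data and the function name are hypothetical; the p-value would then be read from a t table or statistical software.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return (t, degrees of freedom) for H0: mu = mu0."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


temperatures = [20.1, 19.8, 20.5, 20.9, 19.6, 20.3, 20.7, 20.2]  # hypothetical
t, df = one_sample_t(temperatures, mu0=20.0)
print(f"t = {t:.3f} with {df} degrees of freedom")
```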

15
Q

Test for Proportion

A

H0: π = π0
Can be used for a categorical variable.

The test statistic is Z, which follows a standard normal distribution under H0.
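
A sketch of the Z statistic for one proportion, assuming a two-sided alternative and a sample large enough for the normal approximation. The counts and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, pi0):
    """Return (z, two-sided p-value) for H0: pi = pi0."""
    p_hat = successes / n
    standard_error = sqrt(pi0 * (1 - pi0) / n)  # computed under H0
    z = (p_hat - pi0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = proportion_z_test(successes=60, n=100, pi0=0.5)  # hypothetical counts
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```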

16
Q

Two-sample t-test

A

H0: μ1 = μ2
Can be used for a numerical variable and a binary variable (the binary variable defines the two groups).

The test statistic is t, which follows a t distribution under H0.

17
Q

Two-sample Z-test

A

H0: π1 = π2
Can be used for two binary variables.

The test statistic is Z, which follows a standard normal distribution under H0.
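
A sketch of the two-sample Z statistic, assuming the usual pooled proportion under H0 and a two-sided alternative. The counts and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: pi1 = pi2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion assumed under H0
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


z, p_value = two_proportion_z_test(45, 100, 30, 100)  # hypothetical counts
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```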

18
Q

Chi-square test

A

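H0: π1 = π2 = π3 (homogeneity of proportions)
or
H0: π1 = ρ1; π2 = ρ2; π3 = ρ3 (goodness of fit to specified proportions ρ1, ρ2, ρ3)
Can be used for two categorical variables (with more than 2 categories).

The test statistic is χ^2, which follows a chi-square distribution under H0.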
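
A sketch of the χ^2 statistic for a contingency table, summing (observed - expected)^2 / expected over all cells, with expected counts computed from the row and column totals. The table and the function name are hypothetical.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts: rows are two groups, columns are three categories
observed_counts = [[30, 20, 10],
                   [20, 20, 20]]
chi2, df = chi_square_statistic(observed_counts)
print(f"chi-square = {chi2:.3f}, df = {df}")
```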

19
Q

Analysis of Variance (ANOVA)

A

H0: μ1 = μ2 = μ3
Can be used for a numerical variable and a categorical variable (with more than 2 categories).

The test statistic is F = MSTR/MSE, which follows an F distribution under H0.
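
A sketch of F = MSTR/MSE for one-way ANOVA, where MSTR is the between-group mean square and MSE the within-group mean square. The data and the function name are hypothetical.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA: return (F, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    mstr = sstr / (k - 1)      # mean square treatment (between groups)
    mse = sse / (n_total - k)  # mean square error (within groups)
    return mstr / mse, k - 1, n_total - k


groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]  # hypothetical measurements
f_stat, df1, df2 = anova_f(groups)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```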

20
Q

Correlation Test

A

H0: ρ = 0
Can be used for two numerical variables.

The test statistic is t, which follows a t distribution with n - 2 degrees of freedom under H0.
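
A sketch of the statistic, t = r √(n - 2) / √(1 - r^2), where r is the sample correlation. The data and the function name are hypothetical, and the formula breaks down when r is exactly ±1.

```python
from math import sqrt


def correlation_t(x, y):
    """Return (r, t, df) for H0: rho = 0. Undefined when |r| == 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return r, t, n - 2


x = [1, 2, 3, 4, 5, 6]  # hypothetical data
y = [2, 1, 4, 3, 6, 5]
r, t, df = correlation_t(x, y)
print(f"r = {r:.3f}, t = {t:.3f}, df = {df}")
```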

21
Q

k-Nearest Neighbor Algorithm

A

The k-Nearest Neighbor algorithm is an instance-based learning method in which the training
set records are first stored. A new, unclassified record is then classified
by comparing it to the k most similar records in the training set.
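
A minimal sketch, assuming Euclidean distance as the similarity measure and a simple majority vote; the card fixes neither choice. The training data and the function name are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(training, new_record, k=3):
    """Classify new_record by majority vote among its k nearest training records.

    training is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(training, key=lambda rec: dist(rec[0], new_record))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (feature vector, class label)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
            ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.8, 3.0), "B")]
print(knn_classify(training, (1.1, 1.0), k=3))  # A
print(knn_classify(training, (3.0, 3.0), k=3))  # B
```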

22
Q

Consequences of smaller k

A
  • Choosing a small value for k may lead the algorithm to overfit the data.
  • Noise or outliers may unduly affect classification.
23
Q

Consequences of larger k

A

• Larger values will tend to smooth out idiosyncratic or obscure data values in the
training set.
• If k becomes too large, locally interesting values will be overlooked

24
Q

Overfitting

A

Overfitting occurs when the model tries to fit every possible trend/structure in the
training set, including noise, so that it generalizes poorly to new data.

25
Q

ROC curve

A

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a binary classification model at all classification thresholds.

The y-axis shows the True Positive Rate, which is the same thing as Sensitivity.
The x-axis shows the False Positive Rate, which is the same thing as 1 - Specificity.
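
A sketch of how the points of the curve can be computed: each distinct score is used as a threshold, and a record is predicted positive when its score is at or above it. The labels, scores and function name are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest.

    labels are 1 (positive) or 0 (negative).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


labels = [1, 1, 0, 1, 0, 0]              # hypothetical actual classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical model scores
for fpr, tpr in roc_points(labels, scores):
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```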

26
Q

AUC

A

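The Area Under the (ROC) Curve (AUC) is often used as a measure of the quality of a classification model. It ranges from 0 to 1: 0.5 corresponds to random guessing and 1 to perfect separation of the classes.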
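
One way to compute the AUC, sketched here, uses its pairwise interpretation: the share of (positive, negative) pairs in which the positive record receives the higher score, with ties counted as half. The data and the function name are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 1, 0, 0]              # hypothetical actual classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical model scores
print(f"AUC = {auc(labels, scores):.3f}")  # AUC = 0.889
```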

27
Q

MSE

A

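Mean Squared Error (MSE) is a measurement of predictive accuracy for numerical predictions: the average of the squared differences between actual and predicted values. Lower MSE means more accurate predictions. MSE is never negative.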
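
A minimal sketch of the calculation; the values and the function name are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0]     # hypothetical target values
predicted = [2.0, 5.0, 9.0]  # hypothetical model predictions
print(f"MSE = {mean_squared_error(actual, predicted):.3f}")  # MSE = 1.667
```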

28
Q

Decision Tree

A

Decision Tree - a tree-shaped algorithm used to determine a course of action (each branch represents a possible decision).

29
Q

Nodes, root nodes, leaf nodes, decision rule

A
Node - a test that splits the records into different categories
Root node - the node at the top of the decision tree
Leaf node - an external (terminal) node of the decision tree (a.k.a. category)
Decision rule (association rule) - a path through the decision tree from root to leaf, read as IF ... AND ... THEN ...; the tree's rules are all such paths
30
Q

Entropy

A

A measure of the messiness (impurity) of a collection of data: the more mixed the class labels, the higher the entropy.
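
A sketch assuming the Shannon entropy in bits, H = Σ p_i log2(1 / p_i), which is the usual choice for decision trees; the card itself gives no formula. The function name and labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 (pure collection)
print(entropy(["yes", "yes", "no", "no"]))    # 1.0 (evenly mixed)
```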

31
Q

Information gain

A

The decrease in entropy obtained by splitting a data set on some condition.
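
A sketch assuming Shannon entropy in bits: the gain is the parent's entropy minus the size-weighted average entropy of the subsets produced by the split. The function names and labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of parent minus the weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0 (perfect split)
print(information_gain(parent, [["yes", "no"], ["yes", "no"]]))  # 0.0 (useless split)
```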

32
Q

Dangers of extrapolation

A
Extrapolation - estimating or concluding something by assuming that existing trends will continue.
1. Analysts should restrict estimates and predictions to values within the range of the x values in the dataset. Outside that range, the fitted relationship may no longer hold.
33
Q

Residual standard error

A

The residual standard error indicates the size of the “typical” prediction error.
It is never negative, and lower values generally indicate better predictions.
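
A sketch assuming the usual definition s = √(SSE / (n - p - 1)), where p is the number of predictors (so n - 2 for simple linear regression). The values and the function name are hypothetical.

```python
from math import sqrt


def residual_standard_error(actual, predicted, n_predictors=1):
    """Square root of SSE divided by the residual degrees of freedom."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(sse / (len(actual) - n_predictors - 1))


actual = [2.0, 4.0, 6.0, 8.0]     # hypothetical target values
predicted = [2.5, 3.5, 6.5, 7.5]  # hypothetical fitted values
print(f"s = {residual_standard_error(actual, predicted):.4f}")  # s = 0.7071
```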

34
Q

Confusion Matrix

A
A table that shows the number of correct and incorrect predictions made by a classification model, compared to the actual outcomes (target values) in the data.
                     Actual positive       Actual negative
Predicted positive   True Positive (TP)    False Positive (FP)
Predicted negative   False Negative (FN)   True Negative (TN)
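
A sketch of how the four cells can be counted for a binary classifier, together with the rates derived from them. The data and the function name are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


actual = [1, 1, 1, 0, 0, 0, 1, 0]     # hypothetical actual outcomes
predicted = [1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical model predictions
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")              # TP=3 FP=1 FN=1 TN=3
print(f"accuracy    = {(tp + tn) / len(actual):.2f}")  # 0.75
print(f"sensitivity = {tp / (tp + fn):.2f}")           # 0.75
print(f"specificity = {tn / (tn + fp):.2f}")           # 0.75
```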
35
Q

R-squared statistic r^2

A

Measures how closely the linear regression fits the data (ranges from 0 to 1).
- Example: an R-squared value of 0.7455 means that about 75% of the variability of the target variable (revenue) is explained by the regression model.
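
A sketch of the calculation as r^2 = 1 - SSE/SST, the share of the total variability (SST) not left in the residuals (SSE). The values and the function name are hypothetical.

```python
def r_squared(actual, predicted):
    """Proportion of the variability in actual explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total
    return 1 - sse / sst


actual = [2.0, 4.0, 6.0, 8.0]     # hypothetical target values
predicted = [2.5, 3.5, 6.5, 7.5]  # hypothetical fitted values
print(f"R-squared = {r_squared(actual, predicted):.2f}")  # R-squared = 0.95
```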

36
Q

Four assumptions of linear regression

A
Before implementing a model, the requisite model assumptions must be verified. The assumptions are:
• Linearity: the relationship between predictors and target is linear (the residuals show no systematic pattern)
• Independence of residuals
• Normal distribution of residuals
• Constant variance of residuals
37
Q

Which tests are used to validate a partition for each of these types of target variable?

  1. Continuous
  2. Flag/Binary
  3. Multinomial
A
  1. Two-sample t-test for difference in means
  2. Two-sample Z-test for difference in proportions
  3. Test for homogeneity of proportions