quiz2 Flashcards

1
Q

there are three types of data: quantitative, ordinal, and nominal. describe them

A

quantitative: numeric values with magnitude (think numbers)

ordinal: values or categories that can be ordered (think grades)

nominal: values or categories that can't be ordered (think colours)

2
Q

what is the point of inferential statistics

A

to use well chosen samples to come to a probably correct conclusion about the population

3
Q

what is a probability distribution

A

the description of probabilities for all possible outcomes

4
Q

what is covariance. when is it a + covariance and when is it a - covariance

A

it describes how two variables vary together.

+: large y tends to go with large x (they increase together)

-: large y tends to go with small x (one increases as the other decreases)
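
A minimal sketch of checking the sign of a covariance, assuming NumPy is available. The arrays are made-up illustration data, not from the course.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.0, 5.0, 4.0, 6.0])     # tends to grow with x
y_down = np.array([6.0, 4.0, 5.0, 2.0, 1.0])   # tends to shrink as x grows

# np.cov returns a 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
cov_up = np.cov(x, y_up)[0, 1]
cov_down = np.cov(x, y_down)[0, 1]

print(cov_up, cov_down)   # first is positive, second is negative
```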

5
Q

in inferential statistics we need to form hypotheses: a null hypothesis and an alternate hypothesis. What’s the difference? What do we need to remember about hypotheses?

A

Alternate hypothesis is usually what we are hoping to conclude, null hypothesis is the opposite.

these two hypotheses have to cover all possibilities

we assume the null is true and look for the data to force us to conclude that it isn’t. If the data would be very unlikely under the null, we reject it (similar in spirit to a proof by contradiction) and accept the alternate.

we can never conclude the null is true. We can only reject it or fail to reject it

6
Q

what does the T-test do and what is its null hypothesis

A

if we have two samples that are both normally distributed with equal variance, the T-test will tell us if the underlying distributions have different means

MUST BE normally-distributed and equal-variance

null hypothesis: means are equal
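
A short sketch of running the test with `scipy.stats.ttest_ind`. The two groups are simulated here (an assumption for illustration) so that they really are normal with equal variance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)   # simulated sample, mean 10
group_b = rng.normal(loc=13.0, scale=2.0, size=200)   # simulated sample, mean 13

# equal_var=True is the default, matching the equal-variance assumption
result = stats.ttest_ind(group_a, group_b)

print(result.pvalue)
if result.pvalue < 0.05:
    print("reject the null: the means look different")
else:
    print("cannot reject the null")
```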

7
Q

what does the p-value tell us

A

all inferential tests end up with a probability: the probability of seeing data at least as extreme as ours if the null hypothesis is true. If the p-value is small, we can reject the null hypothesis and accept the alternative hypothesis.

if it is smaller than 0.05 (the usual threshold) we reject the null hypothesis. If it is greater than 0.05 we do not reject it.

8
Q

what do you do if you don’t know if a distribution is normal or not

A

use stats.normaltest
where null hypothesis is that: data is normal

stats.normaltest(data).pvalue

if the p-value returned is >0.05 we cannot reject the null hypothesis, so we have no evidence against normality and can proceed as if the data is normal
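
A rough sketch of the check on simulated data (the samples and seed are assumptions for illustration). The exponential sample is strongly skewed, so it should fail the test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(size=500)        # drawn from a normal distribution
skewed_data = rng.exponential(size=500)   # clearly not normal

p_normal = stats.normaltest(normal_data).pvalue
p_skewed = stats.normaltest(skewed_data).pvalue

for name, p in [("normal_data", p_normal), ("skewed_data", p_skewed)]:
    verdict = "no evidence against normality" if p > 0.05 else "reject normality"
    print(name, p, verdict)
```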

9
Q

what do you do if you don’t know if two distributions have equal variance

A

use Levene’s test

which has the null hypothesis: the two samples have equal variance

stats.levene(data1, data2).pvalue

if p-value >0.05 we cannot reject the null hypothesis, so we have no evidence against equal variance and can proceed as if the variances are equal
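
A small sketch on simulated samples (an assumption for illustration) where the spreads really do differ, so the test should reject equal variance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data1 = rng.normal(loc=0.0, scale=1.0, size=300)   # standard deviation 1
data2 = rng.normal(loc=0.0, scale=3.0, size=300)   # standard deviation 3

p_levene = stats.levene(data1, data2).pvalue

print(p_levene)
if p_levene > 0.05:
    print("no evidence against equal variance")
else:
    print("reject equal variance")
```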

10
Q

we can transform data if it isn’t normal to make it normal enough.

Assuming all data are greater than 0, what are the 4 ways you can transform data and when would they be useful

A

e^x (if data left-skewed, longer on left)

x^2 (if data left-skewed, longer on left)

root(x) (if data right-skewed, longer on right)

log(x) (if data right-skewed, longer on right)
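
A sketch of how the right-skew transforms behave, using a simulated lognormal sample (an assumption for illustration, chosen because all values are > 0 and the right tail is long). `scipy.stats.skew` is positive for right-skewed data and near 0 for symmetric data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # all values > 0

skew_raw = stats.skew(right_skewed)
skew_sqrt = stats.skew(np.sqrt(right_skewed))   # milder correction
skew_log = stats.skew(np.log(right_skewed))     # stronger correction

print(skew_raw, skew_sqrt, skew_log)
```

For left-skewed data the same pattern applies with `np.exp(x)` or `x ** 2` in place of the transforms above.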

11
Q

what is the issue with doing t-tests on more than 2 datasets

what should we do to prevent the issue

A

if you do multiple t-tests, it increases the likelihood that at least one of them incorrectly rejects the null hypothesis

to compensate, use the Bonferroni correction, where you compare each p-value against a threshold of 0.05/(num of t-tests conducted)
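
A minimal sketch of the correction applied to pairwise t-tests. The three groups are simulated and their names are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = {
    "a": rng.normal(10.0, 2.0, 100),
    "b": rng.normal(10.0, 2.0, 100),
    "c": rng.normal(14.0, 2.0, 100),   # this group has a different mean
}

pairs = list(combinations(groups, 2))   # 3 groups -> 3 pairwise t-tests
threshold = 0.05 / len(pairs)           # Bonferroni-corrected threshold

pvalues = {}
for name1, name2 in pairs:
    pvalues[(name1, name2)] = stats.ttest_ind(groups[name1], groups[name2]).pvalue
    print(name1, name2, pvalues[(name1, name2)], pvalues[(name1, name2)] < threshold)
```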

12
Q

What is ANOVA and its purpose

A

to test if the means of any of the groups differ; it is like a t-test but for more than 2 groups

Requirements:
- observations must be independent and identically distributed
- groups must be normally distributed
- groups must have equal variance

Null hypothesis: groups have the same mean
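
A short sketch using `scipy.stats.f_oneway` (one-way ANOVA) on three simulated groups. The data is an assumption for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(10.0, 2.0, 100)
group_b = rng.normal(10.0, 2.0, 100)
group_c = rng.normal(14.0, 2.0, 100)   # different mean from the other two

anova = stats.f_oneway(group_a, group_b, group_c)

print(anova.pvalue)
if anova.pvalue < 0.05:
    print("at least one group mean differs (but we don't know which)")
```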

13
Q

What does ANOVA not tell you. What do you need to do to find out.

A

ANOVA tells you that there is a difference in the means (if there is) but not which groups have the different means.

Use Post Hoc Analysis, but only if we have an ANOVA p-value of less than 0.05.

i.e. only once we have concluded the groups do not all have the same mean

14
Q

How do we use the Post Hoc Analysis: Tukey’s HSD and what does it return

A

use pandas.melt to get the data into the format the test wants (unpivoted data: one column of values, one column of group labels), and then you can use the post hoc Tukey test

it returns a list of pairs and tells us if they have different means. If the reject column is True for a pair, their means are different
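
A sketch of the melt-then-test workflow, assuming the Tukey test comes from statsmodels (`pairwise_tukeyhsd`); the card does not name the library, so that choice is an assumption. The wide DataFrame is simulated.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(6)
wide = pd.DataFrame({
    "a": rng.normal(10.0, 2.0, 100),
    "b": rng.normal(10.0, 2.0, 100),
    "c": rng.normal(14.0, 2.0, 100),   # different mean from a and b
})

# melt unpivots: one row per observation, with columns 'variable' and 'value'
long_df = pd.melt(wide)

posthoc = pairwise_tukeyhsd(long_df["value"], long_df["variable"], alpha=0.05)
print(posthoc)           # summary table with one row per pair and a 'reject' column
print(posthoc.reject)    # boolean array, one entry per pair
```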

15
Q

give an example of where you would want to use a one sided tail test rather than two.

what is a way you can conduct this test

A

if you only want to determine if there is a difference between groups in a specific direction. (e.g. will studying get me a better grade)

conduct your two-sided test as usual and look at the p-value, but change the significance level to 0.10

a two-sided test where p < 0.10 is the same as a one-sided test where p < 0.05 (provided the observed difference is in the direction you predicted)
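
A sketch comparing the two approaches on simulated grades (hypothetical data). It assumes a SciPy version where `ttest_ind` accepts the `alternative` argument (1.6 or later).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
studied = rng.normal(75.0, 10.0, 50)       # hypothetical grades
not_studied = rng.normal(70.0, 10.0, 50)

two_sided = stats.ttest_ind(studied, not_studied)
# one-sided alternative: mean of 'studied' is greater than mean of 'not_studied'
one_sided = stats.ttest_ind(studied, not_studied, alternative="greater")

# When the t statistic is positive (difference in the predicted direction),
# the one-sided p-value is half the two-sided one.
print(two_sided.statistic, two_sided.pvalue, one_sided.pvalue)
```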

16
Q

what is a Mann-Whitney U-test used for and what is it good for

what is the null hypothesis

A

it is used when you know nothing about your distribution or you cannot transform it into a normal distribution.

it is used to check for a difference in the distributions of two independent samples

can be used on ordinal or continuous data

null: there is no significant difference in the groups' distributions
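
A minimal sketch with `scipy.stats.mannwhitneyu` on two simulated, clearly non-normal samples (the data is an assumption for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
sample_a = rng.exponential(scale=1.0, size=200)   # skewed, not normal
sample_b = rng.exponential(scale=3.0, size=200)   # same shape, larger values

p_mw = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided").pvalue

print(p_mw)
if p_mw < 0.05:
    print("reject the null: the distributions differ")
```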

17
Q

what is chi-square, what does it do, why is it used

what is the null hypothesis

what does it need to run

A

chi-square is used for categorical data with little structure

it tells you whether two categorical variables are independent

null: the categories are independent (ex. which university you attend does not affect your happiness)

to run, it needs a contingency table (counts for each combination of categories)

18
Q

what is a way to produce a contingency table for categorical data for a chi-square test

A

pandas’ crosstab function (pd.crosstab)
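
A sketch of building the table and feeding it to `scipy.stats.chi2_contingency`. The survey data and column names are hypothetical.

```python
import pandas as pd
from scipy import stats

# hypothetical survey: one row per student
survey = pd.DataFrame({
    "university": ["A"] * 50 + ["B"] * 50,
    "happy": ["yes"] * 40 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 40,
})

# rows = university, columns = happy, cells = counts
contingency = pd.crosstab(survey["university"], survey["happy"])
print(contingency)

chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(p)
if p < 0.05:
    print("reject the null: university and happiness are not independent")
```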

19
Q

when training a model on data that looks linear, what model should you use

A

Linear Regression: draws the best-fit straight line through the input/training data, and predictions are read off this line

19
Q

what do you do if you don’t have data that a linear regression can cleanly fit through

A

use polynomial regression
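
A sketch of one common way to do this in scikit-learn (an assumption, since the card names no library): a pipeline of `PolynomialFeatures` followed by `LinearRegression`. The quadratic data is simulated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(9)
X = np.linspace(-3, 3, 100).reshape(-1, 1)       # sklearn expects a 2D feature array
y = X[:, 0] ** 2 + rng.normal(0.0, 0.3, 100)     # quadratic relationship plus noise

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_score = linear.score(X, y)
poly_score = poly.score(X, y)
print(linear_score, poly_score)   # the polynomial fit scores much higher here
```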

20
Q

when validating the training what are you looking for in the score returned by

model.score(X_valid, y_valid)

A

a high(er) number = better fit, i.e. a number closer to 1 (in scikit-learn the score is R² for regressors and accuracy for classifiers)
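
A sketch of where `X_valid` and `y_valid` come from and what the score means for a scikit-learn regressor. The data is simulated; the variable names follow the card.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 1.0, 200)   # linear trend plus noise

# hold back part of the data so the score reflects unseen examples
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

valid_score = model.score(X_valid, y_valid)
# for regressors, score() is the same quantity as r2_score on the predictions
manual_r2 = r2_score(y_valid, model.predict(X_valid))
print(valid_score, manual_r2)
```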

21
Q

describe the Naive Bayes method

A

sometimes regressions don’t work because they assume a continuous value that we can map to.

sometimes we only have categories, so instead we find which category you are most likely to be a part of

There are a few ways to categorize; Naive Bayes is one of them

get the probability of the input being in each category. After, we look for the category with the highest probability and that becomes our category.

For this method we assume that the input features are independent of each other (within each category)
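
A minimal sketch using scikit-learn's `GaussianNB` (one of several Naive Bayes variants; picking it is an assumption). The two categories are simulated blobs with hypothetical labels.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(11)
X = np.vstack([
    rng.normal(0.0, 1.0, (100, 2)),   # points around (0, 0)
    rng.normal(5.0, 1.0, (100, 2)),   # points around (5, 5)
])
y = np.array(["low"] * 100 + ["high"] * 100)

model = GaussianNB().fit(X, y)

new_point = [[4.5, 5.2]]
probs = model.predict_proba(new_point)    # one probability per category
prediction = model.predict(new_point)[0]  # category with the highest probability

print(model.classes_, probs, prediction)
```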

22
Q

what are Bayesian priors used for

A

they define how likely each category is before we look at the input. We set these probabilities before we start predicting, so they give the categories weight before we even look at the data
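
A hand-worked sketch of how priors weight a prediction via Bayes' rule. The categories and every number here are hypothetical.

```python
import numpy as np

categories = ["spam", "not spam"]
priors = np.array([0.2, 0.8])        # hypothetical: how common each category is
likelihoods = np.array([0.6, 0.1])   # hypothetical: P(observed input | category)

# Bayes' rule: posterior is proportional to prior * likelihood
unnormalized = priors * likelihoods
posteriors = unnormalized / unnormalized.sum()
prediction = categories[int(np.argmax(posteriors))]

print(posteriors, prediction)
```

With priors of 0.05 and 0.95 instead, the same likelihoods would favour "not spam", which is the sense in which priors give categories weight before the data is seen.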

22
Q

what is the nearest neighbors method of classification

A

find the k training points nearest to the input and predict the category that is most common among those neighbors
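
A short sketch with scikit-learn's `KNeighborsClassifier` on simulated data. The labels and the choice of k=5 are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(12)
X = np.vstack([
    rng.normal(0.0, 1.0, (100, 2)),   # category "a" around (0, 0)
    rng.normal(5.0, 1.0, (100, 2)),   # category "b" around (5, 5)
])
y = np.array(["a"] * 100 + ["b"] * 100)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# the prediction is the majority category among the 5 closest training points
knn_prediction = knn.predict([[4.8, 5.1]])[0]
print(knn_prediction)
```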

22
Q

what is the decision trees classification technique

what are some useful things to know when making decision trees

A

build a big nested if/else structure. At the end of each branch, make a prediction.

we can limit the height of the tree, the minimum number of samples at a leaf, and the minimum number of samples needed to split, to stop us from overfitting the data
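
A sketch of those limits as `DecisionTreeClassifier` parameters in scikit-learn. The specific values are arbitrary and the data is simulated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(13)
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(4.0, 1.0, (150, 2))])
y = np.array([0] * 150 + [1] * 150)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=3,            # limit the height of the tree
    min_samples_leaf=5,     # each leaf must keep at least 5 training samples
    min_samples_split=10,   # a node needs at least 10 samples to be split
    random_state=0,
)
tree.fit(X_train, y_train)

tree_score = tree.score(X_valid, y_valid)
print(tree.get_depth(), tree_score)
```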

23
Q

what is the point of an ensemble in decision trees

what is random forest in decision trees

what is boosting in decision trees

A

ensemble: combine multiple models to improve overall performance, since basic decision trees can overfit the data

random forests: build multiple decision trees, each using a random subset of the data. At the end combine the trees' predictions (e.g. by majority vote or averaging), which increases robustness

boosting: build decision trees sequentially, each tree correcting the errors of the previous ones. The final prediction is the weighted sum of the individual tree predictions
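
A sketch of both ensemble styles using scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier` (gradient boosting is one boosting variant; choosing it is an assumption). Data and parameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(14)
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(4.0, 1.0, (150, 2))])
y = np.array([0] * 150 + [1] * 150)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# random forest: many trees on random subsets, predictions combined by voting
forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X_train, y_train)

# boosting: trees built one after another, each correcting earlier errors
boosted = GradientBoostingClassifier(n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)

forest_score = forest.score(X_valid, y_valid)
boosted_score = boosted.score(X_valid, y_valid)
print(forest_score, boosted_score)
```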

24
Q

what is the point of PCA (principal component analysis)

A

it decreases the number of dimensions in data but keeps the information that matters
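
A sketch with scikit-learn's `PCA`, on simulated data built so that 5 observed columns are really driven by 2 underlying dimensions (the mixing matrix is made up for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(15)
latent = rng.normal(size=(200, 2))                  # the 2 dimensions that matter
mixing = np.array([[1.0, 2.0, 0.0, 1.0, -1.0],
                   [0.0, 1.0, 3.0, -1.0, 2.0]])     # hypothetical mixing matrix
X = latent @ mixing + rng.normal(0.0, 0.05, (200, 5))   # 5 observed columns

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# fraction of the original variance kept by the 2 components
kept = pca.explained_variance_ratio_.sum()
print(X.shape, X_reduced.shape, kept)
```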

25
Q

what is the difference between supervised and unsupervised learning

A

supervised: train with examples where we know what the output should be

unsupervised: there is no known right answer beforehand

26
Q

what should you do when you want to do a regression but the data doesn’t fit a linear model but you want to predict a number

A

k-nearest neighbors regressor: take the neighbors' values and find the mean; this is our prediction

random forest regressor: instead of each decision being a category, put a number value at each leaf

neural network regressor: take inputs, weight them somehow, and pass the result through an activation function. Do some magic to learn the weights from training data. Add extra layers of computation for more complex decisions
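
A sketch comparing a straight-line fit with two of these regressors on simulated curved data. The sine-shaped data and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(16)
X = rng.uniform(0.0, 6.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, 300)   # curved, not linear
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),          # mean of 5 nearest values
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {
    name: m.fit(X_train, y_train).score(X_valid, y_valid)
    for name, m in models.items()
}
print(scores)
```

scikit-learn's `MLPRegressor` (a neural network regressor) follows the same fit/score pattern; it is left out here to keep the sketch quick to run.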

27
Q

what is clustering in ML techniques

A

it is an unsupervised training where you find observations that are similar and group them together into a cluster

differing clustering algorithms will give different results and take different parameters
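
A minimal sketch using `KMeans`, one of several clustering algorithms in scikit-learn (choosing it is an assumption). Note that no labels are passed to `fit_predict`. The data is simulated.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(17)
X = np.vstack([
    rng.normal(0.0, 1.0, (100, 2)),   # one blob around (0, 0)
    rng.normal(8.0, 1.0, (100, 2)),   # another blob around (8, 8)
])

# n_clusters is a parameter this algorithm needs; others take different ones
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # a cluster number for each observation

print(labels[:5], labels[-5:])
```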

28
Q

what is anomaly detection in ML techniques

A

it is another unsupervised technique where you find unusual observations. you use what you learn about the unusual observations to try and detect more outliers
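
A sketch using scikit-learn's `IsolationForest`, one possible anomaly detector (the card names no algorithm, so this choice is an assumption). The data is simulated with one deliberately far-away point.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(18)
typical = rng.normal(0.0, 1.0, (200, 2))
X = np.vstack([typical, [[10.0, 10.0]]])   # last row is an obvious outlier

# contamination is the fraction of points we expect to be unusual (a guess here)
detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(X)   # 1 = looks typical, -1 = flagged as unusual

print(flags[-1], int((flags == -1).sum()))
```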

29
Q

what is a neural network

A

neural network: a method in AI that teaches the computer to process data in a way that is inspired by the human brain, using interconnected nodes arranged in a layered structure

30
Q

you have a dataframe called searches and a column called uid, you want to create a new column called ‘whichUI’ based off if the uid is even or odd

write it

A

searches['whichUI'] = np.where(searches['uid'] % 2 == 0, 'even', 'odd')
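
A self-contained sketch of the same idea, with the imports and a tiny made-up `searches` DataFrame. The labels 'even' and 'odd' are assumed.

```python
import numpy as np
import pandas as pd

searches = pd.DataFrame({"uid": [1, 2, 3, 4]})   # hypothetical sample data

# np.where(condition, value_if_true, value_if_false), applied element-wise
searches["whichUI"] = np.where(searches["uid"] % 2 == 0, "even", "odd")

print(searches)
```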