quiz2 Flashcards
(33 cards)
there are three types of data: quantitative, ordinal, and nominal. describe them
quantitative: numeric values with magnitude (think numbers)
ordinal: values or categories that can be ordered (think grades)
nominal: values or categories that can't be ordered (think colours)
what is the point of inferential statistics
to use well chosen samples to come to a probably correct conclusion about the population
what is a probability distribution
the description of probabilities for all possible outcomes
what is covariance? when is it positive and when is it negative
it describes how two variables are related.
positive: large y tends to occur with large x
negative: large y tends to occur with small x
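A minimal sketch of the sign of covariance, using made-up paired samples (the arrays here are hypothetical, not from the deck):

```python
import numpy as np

# hypothetical paired samples: y tends to be large when x is large
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.cov returns the 2x2 covariance matrix; the off-diagonal
# entry [0, 1] is the covariance between x and y
cov_xy = np.cov(x, y)[0, 1]
# cov_xy is positive here, since large y accompanies large x
```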
in inferential statistics we need to form a hypothesis: we need a null hypothesis and an alternate hypothesis. what's the difference? what do we need to remember about hypotheses
the alternate hypothesis is usually what we are hoping to conclude; the null hypothesis is the opposite.
these two hypotheses have to cover all possibilities
we assume the null is true and look for the data to force us to conclude that it isn't. if the data make the null implausible, we have something like a proof by contradiction and we can accept the alternate.
we can never conclude the null is true. we can only falsify it
what does the T-test do and what is its null hypothesis
if we have two samples which are both normally distributed and have equal variance, the T-test tells us whether the distributions have different means
MUST BE normally distributed and equal variance
null hypothesis: the means are equal
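A minimal sketch of the T-test on simulated data (the samples and means below are made up; with two normal, equal-variance samples whose true means differ, the p-value comes out small):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# two hypothetical normal, equal-variance samples with different true means
a = rng.normal(loc=5.0, scale=1.0, size=100)
b = rng.normal(loc=6.0, scale=1.0, size=100)

# null hypothesis: the two means are equal
p = stats.ttest_ind(a, b).pvalue
# p is tiny here, so we reject the null: the means differ
```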
what does the p-value tell us
all inferential tests end up with a probability: the probability of seeing data like ours if the null hypothesis is true.
if it is smaller than 0.05 we reject the null hypothesis and accept the alternative. if it is greater than 0.05 we do not reject the null.
what do you do if you don’t know if a distribution is normal or not
use stats.normaltest
where null hypothesis is that: data is normal
stats.normaltest(data).pvalue
if the pvalue returned is >0.05 we can treat the data as normal, since we could not falsify the null hypothesis
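A minimal sketch of the normality check on a simulated sample (the data here are made up; for a sample actually drawn from a normal distribution the p-value is usually large):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=200)  # hypothetical sample

# null hypothesis: the data come from a normal distribution
p = stats.normaltest(data).pvalue
# if p > 0.05 we cannot falsify normality, so we treat the data as normal
```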
what do you do if you don’t know if two distributions have equal variance
use Levene's test
whose null hypothesis is: the two samples have equal variance
stats.levene(data1, data2).pvalue
if p-value >0.05 we can assume they do have equal variance because we cannot falsify the null hypothesis
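A minimal sketch of the equal-variance check on simulated data (samples and parameters are made up; here both samples have the same spread, so the p-value is usually large):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data1 = rng.normal(loc=0.0, scale=1.0, size=100)  # hypothetical sample
data2 = rng.normal(loc=0.5, scale=1.0, size=100)  # same spread, shifted mean

# null hypothesis: the two samples have equal variance
p = stats.levene(data1, data2).pvalue
# if p > 0.05 we cannot falsify equal variance, so a t-test is reasonable
```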
we can transform data if it isn’t normal to make it normal enough.
Assuming all data are greater than 0, what are the 4 ways you can transform data and when would they be useful
e^x (if data left-skewed, longer on left)
x^2 (if data left-skewed, longer on left)
root(x) (if data right-skewed, longer on right)
log(x) (if data right-skewed, longer on right)
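A minimal sketch of transforming right-skewed data (the lognormal sample below is made up to have a long right tail; a log transform pulls the tail in, which we can check with the skewness statistic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# hypothetical right-skewed data (long tail on the right), all values > 0
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_t = np.log(skewed)    # log strongly corrects right skew
sqrt_t = np.sqrt(skewed)  # sqrt is a milder correction for right skew
# (x**2 or np.exp(x) would be used on LEFT-skewed data instead)

# the transform should move the skewness toward 0
before = stats.skew(skewed)
after = stats.skew(log_t)
```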
what is the issue with doing t-tests on 2+ datasets
what should we do to prevent the issue
if you do multiple t-tests, it increases the likelihood of an incorrect rejection of the null hypothesis (a false positive)
instead you should use the Bonferroni correction, where you choose a threshold of 0.05/(num of t-tests conducted)
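A minimal sketch of the corrected threshold (the number of tests here is a made-up example):

```python
# hypothetical example: 10 pairwise t-tests were conducted
num_tests = 10
# Bonferroni correction: divide the usual 0.05 threshold by the number
# of tests so the overall false-positive rate stays near 0.05
threshold = 0.05 / num_tests
# a pair is significant only if its p-value is below this stricter cutoff
```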
What is ANOVA and its purpose
to test if the means of any groups differ; it is like a t-test but for 2+ groups
requirements:
- observations must be independent and identically distributed
- normally distributed
- equal variance
Null hypothesis: groups have the same mean
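A minimal sketch of a one-way ANOVA on simulated groups (the three groups below are made up; one has a different true mean, so the p-value comes out small):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# three hypothetical groups: independent, normal, equal variance
g1 = rng.normal(10.0, 2.0, size=50)
g2 = rng.normal(10.0, 2.0, size=50)
g3 = rng.normal(13.0, 2.0, size=50)  # this group's true mean differs

# null hypothesis: all groups share the same mean
p = stats.f_oneway(g1, g2, g3).pvalue
# p is small here, so at least one group's mean differs
```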
What does ANOVA not tell you. What do you need to do to find out.
ANOVA tells you that there is a difference in the means (if there is), but not which groups have the different mean.
use post hoc analysis, but only if the ANOVA p-value is less than 0.05,
i.e. the groups do not all have the same mean
How do we use the post hoc analysis (Tukey's HSD) and what does it return
use pandas.melt to get the data in the format you want (unpivoted data), and then you can use the post hoc Tukey test
it returns a list of pairs and tells us if they have different means. the reject column tells us: if True, they are different
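A minimal sketch of melt + Tukey's HSD on made-up wide data (the group names and values are hypothetical; `pairwise_tukeyhsd` is statsmodels' implementation):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(5)
# hypothetical wide data: one column per group
wide = pd.DataFrame({
    "a": rng.normal(10.0, 1.0, size=40),
    "b": rng.normal(10.0, 1.0, size=40),
    "c": rng.normal(13.0, 1.0, size=40),  # this group's mean differs
})

# melt unpivots to one (group, value) pair per row
long = wide.melt(var_name="group", value_name="value")

# Tukey's HSD compares every pair of groups;
# the reject column flags pairs whose means differ
result = pairwise_tukeyhsd(long["value"], long["group"], alpha=0.05)
print(result.summary())
```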
give an example of where you would want to use a one sided tail test rather than two.
what is a way you can conduct this test
if you only want to determine if there is a difference between groups in a specific direction (e.g. will studying get me a better grade)
conduct your two-sided test and look at the p-value, but change the significance level to 0.10
a two-sided test with p < 0.10 is equivalent to a one-sided test with p < 0.05 (when the observed difference is in the hypothesized direction)
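A minimal sketch of the one-sided study example with made-up grades (scipy 1.6+ also supports the one-sided test directly via `alternative=`, which we can compare against the halved two-sided p-value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
studied = rng.normal(75.0, 10.0, size=60)      # hypothetical grades
not_studied = rng.normal(68.0, 10.0, size=60)

# one-sided question: do students who studied score HIGHER?
p_one = stats.ttest_ind(studied, not_studied, alternative="greater").pvalue

# equivalent via a two-sided test: halve its p-value
# (valid because the observed difference is in the hypothesized direction)
p_two = stats.ttest_ind(studied, not_studied).pvalue
```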
when is a Mann-Whitney U-test used and what is it good for
what is the null hypothesis
it is used when you know nothing about your distribution or you cannot transform it into a normal distribution.
it checks for a difference in the distributions of two independent samples
it can be used on ordinal or continuous data
null: there is no significant difference in the groups' distributions
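A minimal sketch on simulated non-normal data (the exponential samples are made up; their distributions genuinely differ, so the p-value comes out small):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# two hypothetical non-normal (exponential) independent samples
a = rng.exponential(scale=1.0, size=80)
b = rng.exponential(scale=2.0, size=80)

# null hypothesis: no difference between the two distributions
p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
# p is small here, so the distributions differ
```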
what is chi-square, what does it do, why is it used
what is the null hypothesis
what does it need to run
chi-square is used for categorical data with little structure
it tells you if the categories are independent
null: the categories are independent (ex. university does not affect your happiness)
a contingency table
what is a way to produce a contingency table for categorical data for a chi-square test
pandas' crosstab function
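A minimal sketch combining the two cards above, on made-up survey data (the universities and answers are hypothetical):

```python
import pandas as pd
from scipy import stats

# hypothetical categorical survey data
df = pd.DataFrame({
    "university": ["A", "A", "B", "B", "A", "B", "A", "B"],
    "happy":      ["yes", "yes", "no", "no", "yes", "no", "no", "yes"],
})

# crosstab builds the contingency table of counts
table = pd.crosstab(df["university"], df["happy"])

# null hypothesis: the two categories are independent
chi2, p, dof, expected = stats.chi2_contingency(table)
# if p < 0.05 we would conclude university and happiness are related
```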
when training a model on data that looks linear, what model should you use
linear regression: it draws the straight line through the input/training data that best fits, and estimates are made on this line
what do you do if you don’t have data that a linear regression can cleanly fit through
use polynomial regression
when validating the model, what are you looking for in the score returned by
model.score(X_valid, y_valid)
a high(er) number = a better fit, i.e. a number closer to 1
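A minimal sketch of fit-then-score with sklearn on made-up linear data (the split sizes and the underlying line y ≈ 3x + 2 are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
# hypothetical data that looks linear: y ≈ 3x + 2 plus noise
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=200)

# hold out some rows for validation
X_train, X_valid = X[:150], X[150:]
y_train, y_valid = y[:150], y[150:]

model = LinearRegression()
model.fit(X_train, y_train)

# score is R^2 on the validation set: closer to 1 = better fit
score = model.score(X_valid, y_valid)
```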
describe the Naive Bayes method
sometimes regressions don't work because regression assumes a continuous value that we can map to.
sometimes we only have categories, so instead we ask: which category are you most likely to be a part of?
there are a few ways to categorize; Naive Bayes is one of them
get the probability of the input being in each category, then pick the category with the highest probability as our prediction.
for this method we assume that the input features are independent of each other
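A minimal sketch using sklearn's GaussianNB on made-up two-category data (the class means and the test point are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(9)
# hypothetical 2-feature data from two well-separated categories
X0 = rng.normal(0.0, 1.0, size=(100, 2))
X1 = rng.normal(4.0, 1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# GaussianNB treats the features as independent within each class,
# computes the probability of each class, and predicts the most likely one
model = GaussianNB()
model.fit(X, y)

pred = model.predict([[4.0, 4.0]])  # a point near class 1's centre
```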
what are Bayesian priors used for
they define the probability of finding each category. we define the likelihood of each category before we start predicting, which gives the categories weight before we even look at the data
what is the nearest neighbors method of classification
look at the k nearest neighbors of a point to make a prediction about which category it would likely fit into
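A minimal sketch with sklearn's KNeighborsClassifier on made-up data (the class centres, k=5, and the query point are hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(10)
# hypothetical 2-feature data from two well-separated categories
X0 = rng.normal(0.0, 1.0, size=(50, 2))
X1 = rng.normal(5.0, 1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# predict by majority vote among the k closest training points
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

pred = model.predict([[5.0, 5.0]])  # a point near class 1's centre
```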