Parametric Tests and Assumptions Flashcards
(103 cards)
What do parametric tests assess?
What is required to run them?
- Parametric tests look at group means
- Require data to follow a normal distribution
- Can deal with unequal variances across groups (e.g., via a Welch correction)
- Generally are more powerful
- Still produce reliable results with continuous data that is not normally distributed, provided sample size requirements are met (central limit theorem)
If data does not meet parametric assumptions what non parametric tests would you use?
- Correlation tests have non parametric versions: for example, a Spearman's correlation test instead of a Pearson's.
- Non parametric tests assess group MEDIANS rather than means, and they don't require a normal distribution.
What is the loophole with parametric tests when continuous data is not normally distributed (when, according to the assumptions, you should perhaps choose a non parametric test)?
- The loophole is that if sample size requirements are met, the central limit theorem means the sampling distribution of the mean is approximately normal. In these cases a parametric test can still produce reliable results.
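A small numpy sketch of this loophole (the exponential distribution and all numbers are invented for illustration): individual values are heavily skewed, yet the means of repeated samples of 50 pile up around the true mean, which is why tests on means can still behave well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed population (exponential): individual values are
# clearly not normally distributed; the true mean is 2.0.
population = rng.exponential(scale=2.0, size=100_000)

# Take 2,000 samples of n = 50 and record each sample's mean.
# By the central limit theorem, the sampling distribution of the
# mean is approximately normal and centred on the population mean.
sample_means = np.array([rng.choice(population, size=50).mean()
                         for _ in range(2_000)])
```

Plotting `sample_means` as a histogram would show the familiar bell shape even though the raw data is anything but normal.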
What do non parametric tests assess?
How is this different to parametric tests?
- Group MEDIANS
- Don’t require data be normally distributed
- Can handle small sample sizes
Because parametric tests assess group means, they require a larger sample size.
What is one easy question to ask ourselves when figuring out whether to choose parametric or non parametric?
What sample size are we working with?
Non parametric tests can deal with small sample sizes; parametric tests, not so much.
What are the four parametric test assumptions?
- Additivity and linearity
- Normality
- Homogeneity of variance
- Independence of observations
What is this equation?
y(i) = b(0) + b(1)X(1) + e(i)
This is the standard linear model (the equation of a straight line), and we see it when looking at additivity and linearity.
What does the Y, B(0) and B(1) and E(i) stand for in the below?
y(i) = b(0) + b(1)X(1) + e(i)
Y(i) = the ith person's score on the outcome variable
B(0) = the Y intercept: the value of Y when X = 0
B(1) = the regression coefficient for the first predictor: the gradient (slope) of the regression line and the strength of the relationship
E(i) = the difference between the actual and predicted value of Y for the ith person
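A tiny Python sketch with made-up numbers, just to show how the pieces fit together for one person:

```python
# Hypothetical values, for illustration only.
b0 = 2.0      # b(0): the Y intercept, the value of Y when X = 0
b1 = 0.5      # b(1): the regression coefficient (slope) for the predictor

x_i = 10.0    # person i's score on the predictor X(1)
y_i = 8.0     # person i's actual observed score on the outcome

y_hat = b0 + b1 * x_i   # predicted value of Y on the line: 7.0
e_i = y_i - y_hat       # e(i): actual minus predicted, the residual: 1.0
```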
What does the standard linear model equation describe?
Both the direction and the strength of the ASSOCIATION between the X and Y variables. There is always an error term at the end.
What does the E at the end of the standard regression equation represent
The difference between the actual observed data point and the LINE we drew through the data points. That is each data point's (or person's) residual, or error.
In parametric tests are we adding terms together or multiplying? If so, why?
- We ADD terms together, because the predictors do not DEPEND on the values of other variables.
- The data are additive: the predictors and their effects, added together, lead to an outcome that is a linear function of the predictors (x1 + x2).
- Basically, linear and additive data say that x1 and x2 predict Y.
Basically, what does linear and additive allude to?
That x1 and x2 predict y
Why are variables not multiplied in linear equations?
Because we are looking at linear relationships, which involve adding terms together, not multiplying them. Adding the predictors together says that the outcome (the DV) is a linear function of the predictors AND their effects:
y(i) = b(0) + b(1)X(1) + e(i)
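As a minimal sketch (coefficients and scores invented for the example), additivity means the predicted outcome is simply each predictor's contribution summed:

```python
# Hypothetical coefficients and predictor scores.
b0, b1, b2 = 1.0, 2.0, 3.0
x1, x2 = 4.0, 5.0

# Additive model: each predictor's contribution is ADDED,
# never multiplied with another predictor.
y_hat = b0 + b1 * x1 + b2 * x2   # 1 + 8 + 15 = 24.0
```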
How do we deal with assumptions for ANOVA?
- Independence of observations: if violated (e.g., the same participants are measured more than once), use a repeated measures design
- Normality – transform the data or use Kruskal-Wallis
- Homogeneity of variances – test with Levene's test; if violated, use Brown-Forsythe or Welch's F
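A hedged scipy sketch of that decision path on simulated data (the 0.05 cut-off and the Kruskal-Wallis fallback are just one convention; Welch's F would be another option):

```python
import numpy as np
from scipy.stats import levene, f_oneway, kruskal

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, 40)   # three simulated groups with
g2 = rng.normal(0.5, 1.0, 40)   # different means but equal
g3 = rng.normal(1.0, 1.0, 40)   # variances

# Levene's test: H0 = equal variances across groups, so a LARGE
# p-value means homogeneity of variance is plausible.
_, p_levene = levene(g1, g2, g3)

if p_levene > 0.05:
    stat, p = f_oneway(g1, g2, g3)   # standard one-way ANOVA
else:
    stat, p = kruskal(g1, g2, g3)    # fall back (or use Welch's F)
```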
How do we deal with assumptions for correlations?
- Normality – Use Spearman correlation
- Linearity: if the relationship is monotonic, use Spearman; otherwise transform
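For instance, with a perfectly monotonic but non-linear relationship (y = x³), Spearman's rho is still exactly 1 while Pearson's r is not. A scipy sketch:

```python
from scipy.stats import pearsonr, spearmanr

# Monotonic but non-linear: y always increases with x, but not in a
# straight line, so linearity is violated while monotonicity holds.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]

r, _ = pearsonr(x, y)       # below 1: the relationship is not linear
rho, _ = spearmanr(x, y)    # 1.0: the RANKS agree perfectly
```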
How do we deal with assumptions for regression?
• Continuous outcome (otherwise use nonlinear methods)
• Non-zero variance in predictors
• Independence of observations: violated when there are repeated measures on the same cases
• Linearity – check with partial regression plots, try transforming
• Independent errors: For any pair of observations, the error terms should be
uncorrelated
• Normally-distributed errors: The errors (i.e., residuals) should be random and
normally distributed with a mean of 0
• Homoscedasticity: For each value of the predictors, the variance of the error term
should be constant
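A small numpy/scipy sketch of checking two of these on simulated data, fitting by least squares (Shapiro-Wilk is one common normality check for residuals; the simulated data is invented for the example):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, 100)  # linear + normal errors

# Fit y = b0 + b1*x by ordinary least squares.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Normally-distributed errors: Shapiro-Wilk on the residuals
# (H0 = normality, so here a LARGE p-value is reassuring).
_, p_norm = shapiro(residuals)

# With an intercept in the model, least squares forces the
# residuals to have mean 0 by construction.
mean_resid = residuals.mean()
```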
How do we deal with assumptions for multiple regression?
Refer to Multiple Regression lecture slides #19-32.
The regression assumptions above, and also multicollinearity – check for it, then delete or combine collinear predictors.
How do we deal with assumptions for moderation?
- One IV must be continuous (if both X and M are categorical, use factorial ANOVA)
- Each IV and Y, and interaction term and Y, should be linear – try transforming
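A numpy sketch of a moderation model on simulated data (all true coefficients invented for the example): the interaction term x*m is just another additive predictor in the design matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)   # continuous IV
m = rng.normal(size=n)   # continuous moderator

# Simulate data where the effect of x on y DEPENDS on m
# (true interaction coefficient = 0.8).
y = 1.0 + 0.5 * x + 0.3 * m + 0.8 * (x * m) + rng.normal(scale=0.5, size=n)

# Fit y = b0 + b1*x + b2*m + b3*(x*m) by least squares.
X = np.column_stack([np.ones(n), x, m, x * m])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b3 = beta[3]   # estimated interaction effect, close to 0.8
```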
Why would the best central tendency measure for your data sometimes be a median, and other times be a mean?
Generally the mean is best, but the median is the preferred measure of central tendency when there are a few extreme scores in the distribution of the data (a single outlier can have a great effect on the mean).
Or, perhaps, when there are some missing values in the data.
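A one-line illustration with made-up scores:

```python
import numpy as np

scores = [4, 5, 5, 6, 6, 7, 95]   # one extreme outlier (95)

mean = np.mean(scores)      # ~18.3: dragged far up by the single outlier
median = np.median(scores)  # 6.0: stays with the bulk of the scores
```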
What does the Gaussian distribution or bell curve mean?
Normal distribution.
What are the four assumptions for parametric tests?
Additivity and linearity
Normality
Homogeneity of variance
Independence of observations
y(i) = b(0) + b(1)X(1) + e(i)
What is this equation telling us? Which parametric test assumption is it associated with?
THE STANDARD LINEAR MODEL for additivity and linearity
Y(i) = the ith person's score on the outcome variable
B(0) = the Y intercept: the value of Y when X = 0
B(1) = the regression coefficient for the first predictor: the gradient (slope) of the regression line and the strength of the relationship
E(i) = the difference between the actual and predicted value of Y for the ith person
With the standard linear model, how many X variables can be added to an equation for a straight line?
As many as you like!
What is Y in the standard linear model equation?
The outcome variable