STA2300 Flashcards

(75 cards)

1
Q

M1-4: What are quantitative, categorical and ordinal variables?

A

Quantitative: Take on numerical values, so an average is meaningful (e.g. height, heart rate).

Categorical: Fall into definite categories (e.g. male or female). Averaging does not make sense, even when the categories are coded as numbers in SPSS.

Ordinal: Categorical data with a natural order (e.g. survey responses: disagree, neutral, agree).

2
Q

M1-4: What graphs should we use for quantitative variables?

A

Stem and leaf plot & histogram

3
Q

M1-4: What graphs should we use for Categorical variables?

A

Bar chart & pie chart

4
Q

M1-4: What 3 features do we look at in graphs of quantitative variables (stem and leaf, boxplot & histogram)?

A

i) Shape - number of modes / peaks, symmetry, deviations, etc.
ii) Centre - a typical approximate value
iii) Spread - the range of values the data can take.

5
Q

M1-4: What is the 5 number summary?

A

Minimum, Quartile 1, Median, Quartile 3, Maximum

6
Q

M1-4: What characterises the Normal model?

A

Its two parameters, the mean (mu) and the SD (sigma), together with its symmetric bell shape.

7
Q

M1-4: A z-score is the number of standard deviations an observation lies above (or, if negative, below) the mean. Converting to a z-score is a process called ________. What is the formula for this?

A

standardising

z = (y-μ) / σ

8
Q

M1-4: Converting z-scores to y is a process called _______? What is the formula for this? ** (not on formula sheet) **

A

unstandardising

y = μ + z σ
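
A minimal Python sketch of both directions (standardising and unstandardising). The Normal model used here (mean 170, SD 10) and the function names are made up for illustration only:

```python
def standardise(y: float, mu: float, sigma: float) -> float:
    """Return the z-score: how many SDs y lies above (+) or below (-) the mean."""
    return (y - mu) / sigma


def unstandardise(z: float, mu: float, sigma: float) -> float:
    """Return the original value y that corresponds to a z-score."""
    return mu + z * sigma


# Hypothetical model: mean 170, SD 10 (not from the course material).
z = standardise(185.0, mu=170.0, sigma=10.0)
y = unstandardise(z, mu=170.0, sigma=10.0)
print(z, y)
```

Unstandardising a z-score that was just standardised should return the value you started with, which is a handy self-check.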

9
Q

M1-4: What is correlation?

A

Correlation measures the direction and strength of the linear relationship between two quantitative variables. It is summarised by the coefficient r, which is only meaningful when the relationship is linear.

10
Q

M1-4: R^2 measures what?

A

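The proportion of the variation in the response variable that is explained by its linear relationship with the explanatory variable. It reflects strength only (not direction) and is normally expressed as a percentage.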

11
Q

M1-4: What is the general form of a regression line? What do the components represent?

A

ŷ = b0 + b1x

ŷ denotes predicted value of y
b0 is the intercept
b1 is the slope

12
Q

M?? - What are the 5 guidelines to supporting P-values and conclusions?

A

> 10%: Insufficient evidence to support Ha (re-state Ha)
5-10%: Slight evidence to support Ha (re-state Ha)
1-5%: Moderate evidence to support Ha (re-state Ha)
0.1 - 1%: Strong evidence to support Ha (re-state Ha)
< 0.1%: Very strong evidence to support Ha (re-state Ha)
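
The guideline bands above can be sketched as a small Python lookup. The function name is hypothetical, and the handling of values that fall exactly on a boundary (e.g. exactly 5%) is an assumption, since the guidelines do not say which band they belong to:

```python
def evidence_for_ha(p_value: float) -> str:
    """Map a P-value (as a proportion, e.g. 0.03 for 3%) to a guideline wording."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("A P-value must lie between 0 and 1")
    if p_value > 0.10:
        return "Insufficient evidence to support Ha"
    if p_value > 0.05:
        return "Slight evidence to support Ha"
    if p_value > 0.01:
        return "Moderate evidence to support Ha"
    if p_value > 0.001:
        return "Strong evidence to support Ha"
    return "Very strong evidence to support Ha"


print(evidence_for_ha(0.03))
```

In a written conclusion you would still re-state Ha in words after the phrase the function returns.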

13
Q

M??? - There are rows on the formula sheet main page. What does each row provide the formulas and characters for, for both hypothesis testing and Confidence Intervals?

A
  • The first row is for proportions
  • The second row is for one-sample mean
  • The third row is for two-sample means
  • The last row is for paired means
14
Q

M1-4: What are response and explanatory variables? What axis do they go on?

A

A response (dependent) variable is a particular quantity that we ask a question about in our study. We put it on the Y-AXIS.

An explanatory (independent) variable is any factor that can influence the response variable. We put it on the X-AXIS.

15
Q

M1-4: What are formulas for mean and standard deviations of a binomial?

A

The mean µ of a binomial is np.

The SD σ of a binomial is √(npq), where q = 1 - p.
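
A quick Python check of both formulas, using made-up values n = 100 and p = 0.5 (the function name is illustrative only):

```python
import math


def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Return (mean, SD) of a binomial with n trials and success probability p."""
    q = 1 - p  # probability of failure
    return n * p, math.sqrt(n * p * q)


# Hypothetical example: 100 trials, each with a 50% chance of success.
mean, sd = binomial_mean_sd(100, 0.5)
print(mean, sd)
```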

16
Q

M7: What is p-hat?

A

p̂ is the sample proportion, a statistic. It varies from sample to sample, so it has a distribution. As the sample size grows, the mean of that distribution stays the same, its spread gets smaller and its shape looks more Normal.

It is calculated as X / n, where n is the sample size and X is the number of occurrences of the event of interest in the sample.

17
Q

M7: What is SD(y bar)?

A

SD(y bar) = sigma / square root of n.

It is the standard deviation of the sampling distribution of the sample mean, not the standard deviation s of a single sample.

Used in questions like: “The mean annual household income in Brisbane is known to be $72000 with a standard deviation of $12000. If we randomly select 80 incomes from this population, what is the probability that the average income in the sample is more than $75000?”
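
A sketch of how the Brisbane income question could be worked in Python. It assumes the sample mean is approximately Normal (reasonable for n = 80) and uses the standard library's `statistics.NormalDist` in place of a Normal table:

```python
import math
from statistics import NormalDist

mu = 72000      # population mean income ($)
sigma = 12000   # population SD ($)
n = 80          # sample size

sd_ybar = sigma / math.sqrt(n)       # SD of the sample mean
z = (75000 - mu) / sd_ybar           # standardise the sample mean of interest
p_more = 1 - NormalDist().cdf(z)     # area to the right of z

print(round(z, 2), round(p_more, 4))
```

Note that the z-score divides by sigma / √n, not by sigma, because the question is about a sample mean rather than a single income.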

18
Q

M7: The law of large numbers describes what happens as the size of a sample drawn from a population with mean µ increases. What happens to the sample mean ȳ of the observed values?

A

It gets closer and closer to the population mean μ.

19
Q

M7: What is a standard error?

A

The estimated SD of a statistic's sampling distribution. For a sample proportion it is the square root of (p hat x q hat / n).

For example, take the question:
Suppose that 20% of a random sample of n = 64 Data Analysis students receive an A for the subject. What is the standard error of the sample proportion?

We get square root of ((0.2 x 0.8) / 64) = 0.05

20
Q

M7: How would you describe the distribution of sample proportions?

A

The distribution of sample proportions is approximately normal with mean=p and standard error = square root of (pq / n).

21
Q

M7: What is a sample proportion and how can it be identified?

A

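It is the proportion of cases in a sample that have some characteristic. You can identify a proportion question because it gives a proportion or percentage (p or p-hat) rather than a mean. p and p-hat are not used for means (µ and y bar are).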

22
Q

M7: What is x bar in statistics?

A

x-bar is used to represent the sample mean, a statistic, which is used to estimate the true population parameter, μ.

23
Q

M8: The statement “there is a 95% probability that the population mean is between 350 and 400” may also mean what?

A

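The 95% confidence interval for the population mean is (350, 400). Strictly speaking, the 95% describes the method: about 95% of intervals built this way from repeated samples would contain the population mean.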

24
Q

M8: Does increasing the sample size increase or decrease the confidence interval width, and why?

A

It decreases it, because a larger n decreases the STANDARD ERROR (n sits in the denominator of the SE formula), and the width of the interval is proportional to the SE.

25
* M8: What does statistical inference refer to?
Drawing conclusions about population parameters based on sample statistics.
26
M8: What is the Standard Error of a sample proportion?
SE(p-hat) = square root of ((p-hat x q-hat) / n)
27
* M8: To halve the margin of error at the same level of confidence, what do you need to do?
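Quadruple the sample size. ME = critical value x SE(statistic), and SE(statistic) has √n in its denominator. With the critical value unchanged, making n four times larger halves the SE and therefore halves the ME.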
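Because the SE has √n in its denominator, a sample four times as large gives half the ME. A quick numerical check in Python, assuming a proportion question with p-hat = 0.2, n = 64 and the usual 95% critical value z* = 1.96 (all chosen for illustration):

```python
import math


def margin_of_error(p_hat: float, n: int, critical_value: float) -> float:
    """ME = critical value x SE(p-hat), with SE(p-hat) = sqrt(p-hat x q-hat / n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return critical_value * se


z_star = 1.96                                    # 95% confidence
me_n = margin_of_error(0.2, 64, z_star)          # original sample size
me_4n = margin_of_error(0.2, 4 * 64, z_star)     # four times the sample size

print(round(me_n, 4), round(me_4n, 4))
```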
28
M8: How do you find the ME?
ME can be found from critical value x SE(statistic).
29
M9: As the sample size increases, the Margin of Error ______ ?
Decreases. A larger sample carries more information, so the estimate is more precise and the ME is smaller. As the sample size grows very large, the ME approaches zero.
30
M??: What is the Centre of a distribution?
i) Look at a graph, or a list of the numbers, and see if the centre is obvious.
ii) Find the mean, the “average” of the data set.
iii) Find the median, the middle number.
31
M9: How do you find t*?
Use Table T with df = n - 1 and the confidence level (e.g. 80, 90, 95, 98 or 99%).
32
M9: For a question such as: "A remote controlled car runs on two AA batteries. To estimate the average battery life, 50 cars are tested, and the batteries are found to have a mean life of 60.1 hours and standard deviation of 4.3 hours. What is a 99% confidence interval for the true battery life in hours?" What row of the formula sheet do we use, and what type of problem is it? What are the statistic and the parameter?
It is a confidence interval problem and a one-sample mean problem. We use the 2nd row of the sheet. The statistic is y-bar (sample mean) and the parameter is µ (population mean).
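A sketch of the calculation for this question in Python. The critical value t* is not computed here: the value 2.680 is an assumption, read from a t-table for df = 49 at 99% confidence (a printed table without a df = 49 row would give a slightly different value from its nearest row):

```python
import math

n = 50          # sample size
y_bar = 60.1    # sample mean battery life (hours)
s = 4.3         # sample standard deviation (hours)

t_star = 2.680  # ASSUMED table value for df = n - 1 = 49, 99% confidence

se = s / math.sqrt(n)   # SE(y-bar) = s / sqrt(n)
me = t_star * se        # margin of error
ci = (y_bar - me, y_bar + me)

print(round(ci[0], 2), round(ci[1], 2))
```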
33
M9: A lower confidence level gives a confidence interval with what?
The same centre and a narrower width. The centre is the sample statistic, so it does not change. The width is set by the critical value, and lowering the confidence level lowers the critical value, so the interval cannot get wider.
34
M9: What does the term Confidence Interval imply?
If many such confidence intervals were calculated by repeated sampling, X% of such intervals would contain the true population mean.
35
M10: What do you do when a question involves μd, but no value for μd is given?
We assume H0 is true and set μd = 0.
36
M10: What is a matched pairs test?
A matched pair is when an observation in the first sample is matched to an observation in the second. The scores are paired from each formulation: there are two scores for each subject. We must compute the differences and treat them like a single sample.
37
M10: What are two-independent samples tests?
Tests that compare the means of two separate, unrelated groups. Examples include:
- Comparing on-campus and off-campus students
- Comparing yields after using two fertilisers
The test statistic is t = ((ȳA - ȳB) - (µA - µB)) / SE(ȳA - ȳB)
38
M10: The mean amount of weekly overtime worked by administration workers in Australia has increased in recent years on the basis of surveys taken in 2005 and 2015. The increase has a P value of 0.001. Which of the following can we conclude? Why is the correct answer: We cannot say by exactly how much the mean amount of overtime has increased, only that the observed increase is unlikely to have arisen by chance alone?
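Because the P-value only measures the strength of evidence against H0: it is the probability of seeing an increase at least as large as the one observed if there were really no increase. A P-value of 0.001 makes chance alone an unlikely explanation, but it says nothing about the size of the increase.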
39
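M10: In a two-independent-samples question, where the two samples have n = 16 and n = 15, what is the df?
15 - 1 = 14. Remember we always use the smaller sample size. (In a matched pairs question both samples have the same n, and the df is the number of pairs minus 1.)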
40
M10: What is a hypothesis test for one sample mean?
For a test with one mean, SD(statistic) = σ/(√n). However, sigma is almost always unknown, so we use SE(statistic) = s/(√n).
We then have: t = (ȳ-µ)/(SE(ȳ)), where SE(ȳ) is as shown above.
41
M10: When writing the hypotheses for a two-independent-samples test, should we: a) Use d to indicate the difference in the hypothesis? b) Define the difference using μ in the hypothesis?
b) Define the difference using μ in the hypothesis.
42
M11: What are the two hypotheses of the chi-square test?
H0 is that A and B are not associated. Ha is that A and B are associated. Always describe the variables A and B though.
43
M11: What 4 steps do we use to perform a X^2 (Chi-Square) test of independence?
i) State the hypotheses
ii) Compute the test statistic
iii) Compute the P-value
iv) Make a conclusion
44
M11: When computing the test statistic in a Chi-Square test, what do we do?
We first compute the expected count for each cell:
Expected count = (Row total x Column total) / (Table total)
We then compute the test statistic:
X^2 = Σ((Observed count - Expected count)^2 / (Expected count))
Sigma here tells us to add up this quantity over EVERY CELL.
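The two steps can be sketched in Python as follows. The 2 x 2 table of counts is made up for illustration, and the df line uses the standard rule df = (rows - 1) x (columns - 1), which this card does not state:

```python
# Hypothetical observed counts: 2 rows x 2 columns.
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count for each cell = (row total x column total) / table total
expected = [
    [r * c / table_total for c in col_totals]
    for r in row_totals
]

# X^2 = sum over EVERY cell of (observed - expected)^2 / expected
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, chi_sq, df)
```

With these counts every expected count is 25, each cell contributes 1 to the sum, and X^2 = 4 with df = 1.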
45
M11: What is a Chi-Square test used for?
A chi-square (X^2) statistic is used to investigate whether distributions of categorical variables differ from one another.
46
M11: What is the general rule about tails of Chi-Square tests?
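They are ALL non-directional (two-sided): Ha only says that an association exists, never its direction. The P-value itself is always read from the upper tail of the chi-square distribution, because only large X^2 values count as evidence against H0.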
47
M11: In a Chi-Square table, if the X^2 value of 5.208 and the df = 1, giving a P-value of between 2.5% and 1%, is there a significant association at the 1% level of significance? Why / why not?
No, there isn't at the 1% level, because the P-value is higher than alpha (LoS). However, at the 5% and 10% levels there is, because the P-value is lower. Alpha (LoS) is the cut-off point: we reject H0 in favour of Ha only when the P-value falls below it.
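The table bounds can be checked numerically. This sketch only works for df = 1, where X^2 is the square of a standard Normal variable, so the upper-tail area equals erfc(√(x/2)); the function name is hypothetical:

```python
import math


def chi_sq_p_value_df1(x: float) -> float:
    """Upper-tail P-value for a chi-square statistic x with df = 1 ONLY."""
    return math.erfc(math.sqrt(x / 2))


p_value = chi_sq_p_value_df1(5.208)

for alpha in (0.10, 0.05, 0.01):
    decision = "significant" if p_value < alpha else "not significant"
    print(f"alpha = {alpha}: {decision}")
```

The P-value comes out at roughly 2.2%, which sits between 1% and 2.5% as the table indicates.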
48
M1: What is a discrete variable?
A discrete variable is able to be counted in a finite amount of time (eg money in pocket, grains of sand on a beach (although that may take a while), etc.)
49
M1: What is a continuous variable?
A continuous variable can take any value in a range, so it is measured rather than counted; listing every possible value would literally take forever. Age and time are examples. You can, however, turn age into a discrete variable by recording it in whole units (e.g. a human's age in years).
50
M1: If a contingency table question says "how many supporters were female"?, and there are yes and no columns, what do you need to count?
You need to count the females who support (usually with a yes) the question outlined. This is a trick question, made to look easy.
51
M1: Complete this: In a contingency table, _____ totals produce a marginal distribution. In a contingency table, _____ totals also produce a marginal distribution. Each column produces a _____ distribution. Each row also produces a _____ distribution.
In a contingency table, column totals produce a marginal distribution. In a contingency table, row totals also produce a marginal distribution. Each column produces a conditional distribution. Each row also produces a conditional distribution.
52
M11: What is the formula for standardised residual?
(observed-expected)/(√expected). Standardised residuals tell us how far each observed count is from its expected count, and in which direction.
53
M1: Assume x = $20. If you add x to every value in a dataset with a mean of $411 and a standard deviation of $115.6, what would the new mean and SD be?
New mean: $431
New SD: $115.6
The mean increases by the amount added, while the standard deviation stays the same.
54
M1: If you add 20% on to all values in a dataset, what happens to the mean and standard deviation?
They BOTH increase by 20%, because every value is multiplied by 1.2.
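Both rules (adding a constant and increasing by a percentage) can be checked in Python with the standard `statistics` module. The five data values are made up for illustration:

```python
from statistics import mean, stdev

# Hypothetical dataset of dollar amounts.
data = [300.0, 350.0, 411.0, 480.0, 514.0]

shifted = [v + 20 for v in data]    # add $20 to every value
scaled = [v * 1.2 for v in data]    # add 20% to every value

# Adding a constant moves the mean but leaves the SD alone.
print(mean(shifted) - mean(data), stdev(shifted) - stdev(data))

# Multiplying by 1.2 multiplies both the mean and the SD by 1.2.
print(mean(scaled) / mean(data), stdev(scaled) / stdev(data))
```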
55
M11: After calculating the standardised residual for, say, Greens voters aged 50+, you get a value of -2.53. What does this mean?
We observe fewer Greens voters in the 50+ age category than we would expect if voting preference and age were not associated.
56
M??: How do I know when a mean is y bar, and when it is mu?
Mu is the population mean, whilst y bar is the sample mean. If given a dataset of a sample and asked to calculate the mean, then it would be y bar. Same with s and sigma - if asked to calculate the standard deviation from a sample, then it is s.
57
M2: What do we use to measure both the centre and the spread in: i) Approximately symmetric shapes ii) Asymmetric shapes
i) Symmetric: mean for measuring centre, standard deviation for measuring spread
ii) Asymmetric: median for measuring centre, IQR for measuring spread
58
M2: If data has fractions (ie. 11.32%), is it discrete or continuous?
Continuous
59
M2: What are the best graphs for i) Distribution of one quantitative variable if n is small? ii) Displaying distribution of one quantitative variable if n is large? iii) Comparing distribution of quantitative variable across 2 or more groups? iv) Distribution of two or more quantitative variables v) Distribution of categorical variable vi) Distribution of two categorical variables
i) Stem and leaf plot
ii) Histogram
iii) Side-by-side boxplot
iv) Scatterplot
v) Bar graph
vi) Contingency table
Remember:
Quantitative: Take on numerical values. Can find the average (e.g. height, heart rate, etc.)
Categorical: Definite categories (e.g. male or female). Doesn't make sense to average. May be coded on SPSS.
60
M3: What is unstandardising?
When you are given the mean and SD of the population plus a probability (or z-score) and asked to work backwards to the value y. This is different from standardising, which is when you're given the mean and SD of the population plus a value y and asked to find a probability. Both come under the Normal Curves topic.
61
M5: What is stratified sampling?
Stratified sampling is where we split the population into known strata (groups of similar cases, e.g. males and females, grades, on-campus / off-campus students) and sample from each stratum. If there are 100 students studying a class, and 70% are off-campus, then when we choose a stratified sample of 10 we'd want to ensure 7 of the 10 are off-campus students too.
62
M5: What is cluster sampling?
Cluster sampling is where we first select groups (clusters) of cases. Each cluster is considered representative of the whole population, i.e. the clusters are similar to one another. It requires a census to be performed within each selected cluster (e.g. a school, suburb, etc.). We might randomly select 10 schools within Brisbane, and perform a census within each school.
63
M5: What is a Simple Random Sample?
An SRS is where every sample of the same size has the same chance of being selected. Therefore, each case has equal chance of being selected.
64
M5: What is the difference between an observational study and an experimental study?
i) Observational studies - researcher observes cases; no intervention. An example is observing where someone might give birth.
ii) Experimental studies - impose treatments to observe changes. An example is plant growth trials.
ONLY EXPERIMENTS CAN ESTABLISH CAUSE AND EFFECT
65
M5: What is the difference between a blind and double blind study?
A blind experiment is when the person receiving the treatment doesn't know which treatment they're receiving, but the experimenter does. A double blind experiment is where neither the subject nor the experimenter knows who is receiving which treatment.
66
M6: To find the mean and SD using a table, what do we need to do?
```
Mean = SIGMA(x times probability)
SD = square root of SIGMA[(x - mean)^2 times probability]
```
Remember, SIGMA refers to "the sum of": we work out the expression for every row of the table and add the results. E.g. for SD, for each value x (usually the quantitative column) we subtract the mean, square the result and multiply by its probability p. We then add these up (this gives the variance) and take the square root.
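A minimal Python sketch of the table calculation, using a made-up probability table (values 0, 1, 2 with probabilities 0.25, 0.5, 0.25). Note the square root in the last step, which turns the variance into the SD:

```python
import math

# Hypothetical probability table: each value x with its probability.
values = [0, 1, 2]
probs = [0.25, 0.5, 0.25]

# Mean = sum of (x times probability)
mean = sum(x * p for x, p in zip(values, probs))

# Variance = sum of ((x - mean)^2 times probability); SD is its square root
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
sd = math.sqrt(variance)

print(mean, variance, round(sd, 4))
```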
67
M6: Should we use the binomial model for discrete or continuous data?
Discrete.
68
M7: A ______ is a value calculated from sample, while ______ are from population.
A statistic is a value calculated from sample, while parameters are from population.
69
M7: For sampling distribution of the mean: Shape = _____? Mean = _____ ? Standard deviation = _____ ?
Shape = Approximately normal
Mean = µ
Standard deviation = σ/√n
70
M7: For sampling distribution of proportions: Shape = _____? Mean = _____ ? Standard deviation = _____ ?
Shape = Approximately normal
Mean = p
Standard deviation = √(pq/n)
71
M7: What is the difference, in terms of formulas, between finding a probability for a single individual and finding one for the mean of a sample?
Single individual: z = (y - µ) / σ
Sample of size n: z = (y-bar - µ) / (σ/√n)
This is because a sample mean varies less than a single observation, so the second formula divides by the SD of the sample mean, σ/√n. For a single individual there is no n involved, so we divide by the population standard deviation σ itself.
72
M4: What are the 4 components of a scatterplot we need to describe?
i) Form - Whether it is approximately linear or curved.
ii) Direction - Positive? (high values of one accompany high values of another) or negative? (high values of one accompany low values of another).
iii) Scatter - Small, moderate or large, depending on how close the points are.
iv) Outliers - Any point that doesn't fit the pattern of the scatterplot.
73
M4: What is the form for a line of regression? PUT ON CHEAT SHEET
y hat = b0 + b1x, where y hat indicates a prediction, b0 is the intercept and b1 is the slope.
```
The slope (b1) tells us how much the predicted y changes when x increases by 1 unit
The intercept (b0) tells us the predicted value of y when x is zero
```
For SPSS output, b0 is the first row in Coefficients and b1 is the 2nd row.
ALWAYS DEFINE y (dependent / response) and x (independent / explanatory)
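To see where b0 and b1 come from outside SPSS, here is a sketch in Python using the standard least-squares formulas (slope = Sxy / Sxx, intercept = y-bar - slope x x-bar). These formulas are not stated on this card, and the four data points are made up so that they lie exactly on y = 1 + 2x:

```python
# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]   # explanatory variable
ys = [3.0, 5.0, 7.0, 9.0]   # response variable

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Least-squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar


def predict(x: float) -> float:
    """Return y hat = b0 + b1 * x."""
    return b0 + b1 * x


print(b0, b1, predict(5.0))
```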
74
M??: When doing means of two-independent samples, how do you find the statistic and SD?
The statistic is y bar 1 - y bar 2, the difference between the two sample means. Each sample has its own SD, s1 and s2, and these are combined into the standard error of the difference: SE(y bar 1 - y bar 2) = √(s1²/n1 + s2²/n2).
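A sketch of the two-sample calculation in Python. All summary statistics are made up for illustration, H0 is assumed to be µ1 - µ2 = 0, and the standard error uses the usual unpooled formula √(s1²/n1 + s2²/n2):

```python
import math

# Hypothetical summary statistics for two independent samples.
n1, y_bar1, s1 = 16, 52.0, 4.0
n2, y_bar2, s2 = 9, 48.0, 3.0

statistic = y_bar1 - y_bar2                    # difference in sample means
se = math.sqrt(s1**2 / n1 + s2**2 / n2)        # SE of the difference
t = (statistic - 0) / se                       # H0: mu1 - mu2 = 0 (assumed)
df = min(n1, n2) - 1                           # conservative df: smaller n minus 1

print(statistic, round(se, 4), round(t, 3), df)
```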
75
M10: When doing matched pairs data, what do d bar and sd refer to?
d bar is the mean of the differences.
sd is the SD of the differences (found by entering the differences of all pairs into the calculator and finding the SD).
The rest of the equation is the same as for one-sample data.
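A short Python sketch of the matched pairs calculation. The before/after scores are made up, and H0 is assumed to be µd = 0:

```python
import math
from statistics import mean, stdev

# Hypothetical paired scores for the same four subjects.
before = [10.0, 12.0, 9.0, 11.0]
after = [12.0, 15.0, 10.0, 12.0]

# Compute the differences and treat them as a single sample.
diffs = [a - b for a, b in zip(after, before)]

d_bar = mean(diffs)                # mean of the differences
s_d = stdev(diffs)                 # sample SD of the differences
se = s_d / math.sqrt(len(diffs))   # SE(d bar) = s_d / sqrt(n)
t = (d_bar - 0) / se               # H0: mu_d = 0 (assumed)
df = len(diffs) - 1                # number of pairs minus 1

print(d_bar, round(s_d, 4), round(t, 3), df)
```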