Statistics Flashcards

(196 cards)

1
Q

All variables can be categorised into two types

A

Numerical data
Categorical data

2
Q

Numerical data can be broken down into two types

A

Continuous
Discrete

3
Q

Categorical data can be broken down into two types

A

Regular categorical
Ordinal

4
Q

Numerical

A

Numbers

5
Q

Numerical

A

Numbers

6
Q

What are two types of variables

A

Numerical
Categorical

7
Q

What are the two types of numerical variables

A

Continuous
Discrete

8
Q

What are the two types of categorical data

A

Regular categorical
Ordinal

9
Q

Numerical

A

Numbers

10
Q

Categorical data

A

Data that is sorted into categories, such as favourite colour

11
Q

Continuous data (numerical)

A

Data that can take any value within a range, including fractions and decimals, e.g. a length of 24 cm

12
Q

Discrete

A
13
Q

Discrete data (numerical)

A

Data that can take only specific, countable values (usually whole numbers), e.g. a class size between 12 and 30

14
Q

Regular categorical data

A

Data divided into groups that have no natural order, e.g. race

15
Q

Ordinal data (categorical)

A

Non-numerical information with an implied order, e.g. mild, medium, hot

16
Q

Explanatory variable (independent variable)

A

Used to predict or explain changes in another variable (the dependent, or response, variable).
Like cause and effect

17
Q

Observational studies

A

Studies where researchers observe subjects without assigning or manipulating treatments

18
Q

Experiment

A

A study where researchers deliberately manipulate one or more variables to observe the effect on a response variable

19
Q

Variables that aren’t associated are called

A

Independent variables

20
Q

Variables that show some sort of connection with one another

A

Dependent variables

21
Q

Association does not imply

A

Causation

22
Q

Census

A

Data collected from every member of the population, rather than from a sample

23
Q

Problems with census

A
  • some individuals are hard to locate or measure
  • the population changes constantly, so an accurate census is hard to obtain
  • taking a census is more complicated than sampling
24
Q

Exploratory analysis

A

Process of analysing data to summarise its main features, often using visual methods, before applying formal models or testing hypotheses

25
Inference
Process of drawing conclusions about a population based on data from a sample
26
Anecdotal evidence
Informal evidence based on personal stories, individual cases rather than scientific data or systematic research
27
The sample needs to be representative of the entire population
For inference to be valid
28
Non response
Occurs when individuals selected for a survey or study do not respond, leading to missing data and potential bias in results
29
Voluntary response
Occurs when sample consists of people who volunteer to respond because they have strong opinions on the issue
30
Convenience sample
Non-random sample made up of individuals who are easy to reach. This can lead to biased results
31
What are the three random sampling techniques?
Simple random, stratified, and cluster sampling
32
Stratified sampling
The population is divided into groups (strata), and a random sample is taken from each group.
33
Simple random sampling
Every individual in the population has an equal chance of being selected
34
Cluster sampling
The population is divided into groups (clusters), some clusters are randomly selected, and all individuals within chosen clusters are surveyed.
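The three random sampling techniques in cards 31 to 34 can be sketched with Python's standard `random` module. This is only an illustration: the schools, student labels, and helper names below are hypothetical and do not come from the cards.

```python
import random

def simple_random_sample(population, k, rng):
    """Every individual has the same chance of being chosen."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Take a random sample from every stratum (group)."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep everyone inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 3 schools with 10 students each.
schools = {f"school_{s}": [f"s{s}_{i}" for i in range(10)] for s in "ABC"}
everyone = [student for group in schools.values() for student in group]

srs = simple_random_sample(everyone, 6, rng)   # 6 students from anywhere
strat = stratified_sample(schools, 2, rng)     # 2 students from each school
clus = cluster_sample(schools, 1, rng)         # all students of 1 school
```

Note the difference in what is randomised: stratified sampling draws individuals from every group, while cluster sampling draws whole groups.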
35
Placebo
Fake treatment given to the control group in medical studies
36
Placebo effect
A change in participants’ behaviour or outcomes caused by the belief that they are receiving treatment, not by the treatment itself
37
Blinding
When experimental units don’t know whether they are in the control or treatment group
38
Double blind
When neither the experimental units nor the researchers who interact with patients know who is in the control group or who is in the treatment group
39
Steps in experiment
1. Control: control for the potential effect of other variables 2. Randomise: randomly assign subjects to treatments (and take a random sample from the population) 3. Replicate: collect a large enough sample, or repeat the entire study 4. Block: if a variable is known to affect the response, group subjects into blocks by that variable and randomise within each block
40
Scatterplots
Type of graph that displays data on a two-dimensional plane to show the relationship between two numerical variables
41
What is the correlation between children per woman and fertility rate?
Negative correlation
42
What is correlation?
A statistical measure that describes the strength and direction of the relationship between two variables.
43
What is a positive correlation?
As one variable increases, the other also increases.
44
What is a negative correlation?
As one variable increases, the other decreases.
45
What is no correlation?
No predictable relationship between the variables.
46
What does a correlation of 1 mean?
Perfect positive correlation.
47
What does a correlation of -1 mean?
Perfect negative correlation
48
What does a correlation of 0 mean?
No linear correlation
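As a rough sketch of cards 42 to 48, the correlation coefficient can be computed by hand. The function name `pearson_r` and the small data sets are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: strength and direction of a linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # y rises with x: r = 1
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # y falls as x rises: r = -1
r_zero = pearson_r(x, [1, 4, 5, 4, 1])   # symmetric hump: r = 0
```

The last case is worth noticing: the points follow a clear curve, yet r is 0, because correlation only measures linear association.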
49
Dot plot
Graph that displays individual data points as dots along a number line
50
Darker areas represent where there are more
Observations
51
Sample mean formula and definition
Average of the values in a sample drawn from a population: x̄ = (x₁ + x₂ + ... + xₙ) / n
52
Population mean formula
Average of all values in an entire population: μ = (x₁ + x₂ + ... + x_N) / N
53
Why is the sample mean a sample statistic?
Because it’s calculated from a sample
54
Why does sample mean serve as a point estimate of population mean?
It provides a single best estimate of the population’s true average
55
Stacked dot plot
displays data points as dots stacked vertically for identical values, making it easy to see the frequency of each value in the dataset
56
Histogram
is a bar chart that displays the frequency of data within specified intervals (bins), used to visualize the distribution of continuous data
57
There are three shapes of distributions
Unimodal, bimodal/multimodal, and uniform
58
Unimodal distribution
Is data with a single peak or mode
59
Bimodal distribution
Is data with two peaks
60
Multimodal distribution
Is data with multiple peaks
61
Uniform distribution
Evenly spread data where all outcomes are equally likely
62
The two types of skewness are
Right skewed and left skewed
63
Right skewness ( positive skew)
Majority of data is on the left, right tail (larger values) is longer
64
Left skewness ( negative skew)
Left tail ( smaller values) is longer and majority of data is on the right
65
Sample variance
Measure of the spread of data points in a sample: the sum of squared deviations from the sample mean, divided by n − 1
66
We square deviations to
Get rid of negatives, so observations equally distant from the mean are weighted equally
67
Absolute value
Drops the sign of a number and keeps only its magnitude, making it non-negative
68
Sample standard deviation
Roughly the typical distance of data points from the sample mean. Square root of the sample variance
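A short worked example of cards 51 and 65 to 68, assuming a made-up sample of eight values. It computes the sample variance step by step (squared deviations divided by n − 1) and takes the square root to get the standard deviation; the standard `statistics` module gives the same results.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

mean = statistics.mean(sample)                  # 40 / 8 = 5
deviations = [x - mean for x in sample]         # these always sum to 0
squared = [d ** 2 for d in deviations]          # squaring removes the signs
variance = sum(squared) / (len(sample) - 1)     # 32 / 7
sd = variance ** 0.5                            # square root of the variance
```

Dividing by n − 1 rather than n is what distinguishes the sample variance from the population variance.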
69
Median
The middle value of the data set when ordered in ascending order
70
Median is midpoint of data
Known as 50th percentile
71
Quartile 1
25th percentile
72
Quartile 3
75th percentile
73
Interquartile range (IQR)
Difference between the third quartile and the first quartile (Q3 − Q1). Spans the central 50% of the data
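Cards 69 to 73 can be checked on a small made-up data set. The sketch below uses `statistics.quantiles` with the "inclusive" method; textbooks and software use slightly different quartile conventions, so other tools may give slightly different Q1 and Q3 values.

```python
import statistics

data = sorted([7, 1, 9, 3, 5, 2, 8, 4, 6])   # hypothetical data: 1 to 9

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1   # width of the central 50% of the data
```

Here Q1 = 3, the median = 5, Q3 = 7, and the IQR is 4.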
74
Box plot
Graphical summary of a data set that shows the median, quartiles and outliers
75
Outliers
Data points that are significantly different from rest of data in the data set
76
Whiskers
Lines extending from the box that show the range of the data, excluding outliers
77
Benefits of examining outliers
- helps identify extreme skew in the distribution - identifies data collection and entry errors - provides insight into interesting features of the data
78
Random processes
Sequence of outcomes that evolve over time with uncertainty where each result is determined by chance
79
Frequentist probability
Proportion of times an event occurs in repeated experiments
80
Bayesian interpretation
Probability as a measure of belief based on prior knowledge and updated with new evidence
81
Law of large numbers
States that as the sample size increases, the sample mean gets closer to the population mean
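The law of large numbers can be illustrated with a small simulation of fair coin flips. This is a sketch, not a proof; the function name, seed, and sample sizes are arbitrary choices.

```python
import random

def proportion_heads(n_flips, rng):
    """Proportion of heads in n_flips simulated fair coin flips."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

rng = random.Random(42)  # fixed seed so the run is repeatable

small = proportion_heads(10, rng)        # can land far from 0.5
large = proportion_heads(100_000, rng)   # should land very close to 0.5
```

The law says nothing about individual flips "balancing out", which is exactly the mistake described by the gambler's fallacy.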
82
The Gambler’s Fallacy
Mistaken belief that past random events affect the probabilities of future random events
83
Mutually exclusive (disjoint) outcomes
Are outcomes that cannot happen at the same time.
84
Example of a mutually exclusive outcome
Flipping heads and tails at the same time
85
Non- disjoint outcomes
Outcomes that can happen at the same time. They might influence each other’s likelihood
86
Example of a non disjoint outcome
Drawing a card that is both a queen and red from the deck
87
Rules of probability distributions
1. The events listed must be disjoint 2. Each probability must be between 0 and 1 3. The probabilities must total 1
88
Complementary events
are two events where one event represents all outcomes that are not in the other event.
89
Sample space
The complete set of all possible outcomes in a probability experiment.
90
Two outcomes can be DEPENDENT
When the outcome of one affects the probability of the other
91
Two outcomes can be INDEPENDENT
When the outcome of one doesn’t change the probability of the other
92
What does a difference in conditional probabilities suggest?
It may indicate dependence between two variables.
93
What is the next step after noticing a difference in conditional probabilities?
Conduct a hypothesis test to check if the difference is due to chance
94
What does a large difference in conditional probabilities indicate?
Stronger evidence that the variables are dependent
95
Why is sample size important in determining dependence?
A large sample size means even small differences can provide strong evidence of dependence
96
P(B|A)
Probability of B given A has already happened
97
Conditional probability
Probability of an event occurring given that another event has already happened. Eg: P(B|A)
98
Product rule is used
To find the probability of both events happening: P(A and B) = P(A) × P(B|A)
99
Addition rule is used
To find the probability of either event happening: P(A or B) = P(A) + P(B) − P(A and B)
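The queen-and-red example from card 86 makes a convenient check of conditional probability, the product rule, and the addition rule. The sketch below counts cards in a standard 52-card deck using exact fractions; the helper names are illustrative only.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
deck = list(product(ranks, suits))   # 52 (rank, suit) pairs

def prob(event):
    """Probability of an event = matching cards / all cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_queen(card):
    return card[0] == "Q"

def is_red(card):
    return suits[card[1]] == "red"

p_queen = prob(is_queen)                                  # 4/52
p_red = prob(is_red)                                      # 26/52
p_both = prob(lambda c: is_queen(c) and is_red(c))        # 2/52
p_red_given_queen = p_both / p_queen                      # P(red | queen)
p_either = p_queen + p_red - p_both                       # addition rule
```

Since P(red | queen) equals P(red), knowing the card is a queen does not change the chance that it is red: the two events are non-disjoint but independent.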
100
Normal distribution
Unimodal and symmetric and bell shaped
101
Percentile
Percentage of observations that fall below a given data point
102
Z-score
How many standard deviations an observation is from the mean
103
Z-scores can be computed for any distribution, but
Only for a normal distribution can they be used to calculate percentiles
104
Observations |z| > 2
Are typically considered unusual
105
Z score formula shows
How many standard deviations x is from the mean
106
Empirical Rule
For a normal distribution, about 68% of data lies within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations
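The z-score and the empirical rule can be checked numerically with `statistics.NormalDist` from the standard library. The test score of 130 with mean 100 and standard deviation 15 is a hypothetical example.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / sd

standard_normal = NormalDist(0, 1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

z = z_score(130, 100, 15)              # 2 standard deviations above the mean
percentile = standard_normal.cdf(z)    # area under the curve to the left of z
```

`share_within(1)`, `share_within(2)` and `share_within(3)` come out at roughly 0.683, 0.954 and 0.997, matching the 68-95-99.7 rule.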
107
Percentile is the area below
The probability distribution curve to the left of the observation
108
Six sigma
Processes that stay within 6 standard deviations from the mean, ensuring high quality
109
Geometric distribution
Gives the probability that the first success happens on trial k
110
Geometric probability
Gives the chance that the first success happens on the k-th trial: P(X = k) = (1 − p)^(k − 1) × p
111
Binomial Distribution
describes the probability of getting exactly k successes in n independent Bernoulli trials, where each trial has the same probability p of success
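A minimal sketch of the geometric and binomial probabilities from cards 109 to 111, using the standard formulas. The function names and the coin-flip inputs are illustrative.

```python
from math import comb

def geometric_pmf(k, p):
    """P(first success on trial k): k - 1 failures, then one success."""
    return (1 - p) ** (k - 1) * p

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

first_head_on_third_flip = geometric_pmf(3, 0.5)     # 0.5 * 0.5 * 0.5
two_heads_in_four_flips = binomial_pmf(2, 4, 0.5)    # 6 * 0.5**4
```

The `comb(n, k)` factor counts the different orders in which the k successes can occur, which is what separates the binomial from the geometric case.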
112
Bernoulli Trial
An experiment with exactly two possible outcomes (success or failure)
113
Central tendency
Describes the center or typical value in a dataset (e.g., mean, median, mode)
114
Variation
Measures how spread out the values are (e.g., range, variance, standard deviation, IQR)
115
If distribution is symmetric
The mean and median are approximately equal, and either can be used to describe the centre
116
If data is extremely skewed
A transformation, such as the log transformation, can make the data easier to model
117
Skewed data will be easier to model after transformation
Because the outliers become less prominent
118
Intensity Map
A map where darker colors represent higher values or frequencies of data
119
Contingency Table
Type of table used to display the frequency distribution of two categorical variables. It organises the variables into rows and columns
120
Bar plot
displays categorical data with bars representing the frequency or value of each category
121
A relative frequency bar plot
shows proportions (percentages) instead of raw counts for each category. It helps compare categories on a relative scale.
122
The difference between bar plots and histograms
Bar plots display categorical data; histograms display numerical data
123
Stacked bar plot
Bars stacked to show counts from different groups
124
Side-by-side bar plot
Bars placed next to each other for comparison
125
Standardized bar plot
Bars show proportions (percentages), not counts
126
Null hypothesis (H₀)
Statement that assumes no effect or no difference
127
Alternative hypothesis (H₁ or Ha)
Statement that claims there is an effect or a difference
128
Point estimate
A single value used to estimate an unknown population parameter
129
Example of point estimate
If you take a sample and calculate the sample mean, that value is a point estimate of the population mean.
130
Margin of error
The amount added to and subtracted from a point estimate to give the range within which the true population parameter is expected to lie, e.g. 41% ± 2.9%, so the true proportion is plausibly between 38.1% and 43.9%
131
Confidence level
How often intervals built this way would capture the true population parameter across repeated samples, e.g. 95% confidence
132
error in estimate
refers to the difference between the estimated value (from a sample) and the actual true value (of the population)
133
Bias
systematic error that leads to incorrect or skewed results.
134
Sampling error
natural variation that occurs when you use a sample instead of the whole population
135
Sampling error
natural variation that occurs when you use a sample instead of the whole population. It decreases with larger sample sizes
136
Sampling distribution
how a statistic (e.g., sample mean) would vary if you took many random samples from the population
137
Central Limit Theorem (CLT)
sampling distribution of the sample mean will be approximately normal (bell-shaped) regardless of the population’s original distribution as long as the sample size is large enough
138
The mean of sampling distribution
Equal to population mean
139
Standard error
is the standard deviation of the sampling distribution of a statistic (e.g. the sample mean)
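Cards 136 to 139 can be illustrated by simulation. The sketch below assumes a right-skewed population (exponential with mean 1 and standard deviation 1), draws many samples of size 50, and looks at how the sample means behave. The population, seed, and sizes are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
n = 50                   # size of each sample
n_samples = 2000         # number of repeated samples

# Each entry is the mean of one random sample from the skewed population.
sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

centre = statistics.mean(sample_means)    # close to the population mean, 1
spread = statistics.stdev(sample_means)   # close to sigma / sqrt(n)
theoretical_se = 1 / n ** 0.5             # standard error, about 0.141
```

Even though the population is strongly skewed, a histogram of `sample_means` would look roughly bell-shaped, which is the point of the Central Limit Theorem.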
140
CLT conditions
1) Independence: the sample should be random and, when sampling without replacement, n should be less than 10% of the population 2) Sample size: the sample should be large. For proportions, at least 10 successes and 10 failures; for means, n ≥ 30
141
When you don’t know the true population proportion, use the
Sample proportion as the best estimate
142
If sample size is small it’s likely
The sampling distribution may be skewed, so the normal approximation may not work.
143
Confidence Interval (CI)
A range of values used to estimate a population parameter (like a mean or proportion) with a certain level of confidence (e.g., 95%)
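A confidence interval for a proportion can be sketched as point estimate ± z* × SE. The sample proportion of 0.41 echoes the margin-of-error card, but the sample size of 1100 is a hypothetical value chosen for illustration, not a figure from the cards.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    se = sqrt(p_hat * (1 - p_hat) / n)                        # standard error
    margin = z_star * se                                      # margin of error
    return p_hat - margin, p_hat + margin, margin

# Hypothetical survey: 41% of n = 1100 respondents agree.
low, high, margin = proportion_ci(0.41, 1100)
```

With these inputs the margin of error is about 2.9 percentage points, giving an interval of roughly 38.1% to 43.9%. The sketch assumes the success-failure condition holds, so that the normal approximation is reasonable.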
144
145
If many samples were taken and a 95% confidence interval was built from each
About 95% of those intervals would contain the true population value
146
Hypothesis testing
We examine two competing hypotheses, the null and the alternative, and based on the data we decide which hypothesis is more plausible
147
There are two types of errors
Type 1 error and Type 2 error
148
Type 1 error
You reject the null hypothesis when it’s actually true. You conclude there is an effect when there isn’t
149
Type 2 error
You fail to reject the null hypothesis when it’s actually false. You conclude there isn’t an effect when there is
150
There could be errors to rejection
- lack of information - incorrectly rejecting it
151
The Type 1 error rate (significance level, α) is typically
Set to 0.05
152
Conditions for hypothesis testing
1. Respondents should be independent of each other 2. Sampling should be random 3. The sample size should be less than 10% of the total population 4. There should be at least 30 respondents 5. There should be at least 10 expected successes and 10 expected failures
153
Test statistic
Tells you how far your sample proportion is from the null hypothesis proportion, measured in standard errors
154
P-value
is the probability of getting results at least as extreme as the observed results, assuming the null hypothesis is true
155
Small p-value
means the observed data is unlikely under the null hypothesis, so you reject H0
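The test statistic and p-value from cards 153 to 155 can be sketched for a one-proportion z-test. The numbers below (56% observed in a sample of 100, against a null value of 50%) are hypothetical, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(p_hat, p_null, n):
    """Two-sided z-test for a single proportion."""
    se = sqrt(p_null * (1 - p_null) / n)   # standard error under H0
    z = (p_hat - p_null) / se              # distance from H0 in standard errors
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # area in both tails
    return z, p_value

z, p_value = one_proportion_z_test(0.56, 0.50, 100)
reject_null = p_value < 0.05   # compare with the significance level
```

Here z is about 1.2 and the p-value about 0.23, so the null hypothesis is not rejected at the 0.05 level. Note that the standard error uses the null proportion, since the p-value is computed assuming the null hypothesis is true.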
156
Two sided hypothesis test
Checking for population parameter difference in both directions
157
One sided hypothesis test
Checking for an increase or decrease in population parameter
158
Point estimate
Single value from sample data used to approximate an unknown population parameter
159
Parameter of interest
Specific population characteristic being estimated/ tested
160
Margin of error
z* × SE (critical value times standard error)
161
Goodness-of-fit test (chi-squared test)
Tests whether the observed frequencies in categories match the expected frequencies based on a specific theoretical distribution.
162
Chi square test has one parameter
The degrees of freedom
163
Degrees of freedom
represent the number of independent values that can vary in a statistical calculation while still satisfying a given constraint.
164
Degrees of freedom influence
Shape, centre and spread of distribution
165
P-value (chi-square test)
Area of the upper tail under the chi-square distribution, beyond the test statistic
166
Chi square conditions
1. Independence: each case that contributes a count to the table must be independent of all other cases in the table 2. Sample size: each scenario must have at least 5 expected cases 3. df > 1: degrees of freedom must be greater than 1
167
Not checking the condition
Can affect the test’s error rates
168
One sample
Looking for situations where you’re analysing data from a single group
169
Two sample
Comparisons between two separate groups
170
Expected counts
are the frequencies you would expect in each category if the null hypothesis were true.
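Expected counts and the chi-square statistic can be illustrated with a made-up example: 60 rolls of a die, tested against the null hypothesis that the die is fair. The observed counts are hypothetical.

```python
# Hypothetical observed counts for faces 1 to 6 over 60 rolls.
observed = [8, 12, 10, 9, 11, 10]

total = sum(observed)
expected = [total / len(observed)] * len(observed)   # fair die: 10 per face

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1   # degrees of freedom for a goodness-of-fit test
```

The statistic here is 1.0 with 5 degrees of freedom. The p-value would be the upper-tail area of the chi-square distribution beyond 1.0, which needs a chi-square table or a statistics library and is left out of this sketch.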
171
What was the objective of the Friday the 13th Traffic Study?
To compare traffic flow on Friday the 13th versus the previous Friday (6th) at two UK locations.
172
What years did the Friday the 13th Traffic Study cover?
1990 to 1992
173
What was the consistent pattern found in the traffic study?
Traffic volume was consistently lower on Friday the 13th compared to Friday the 6th.
174
What was the largest difference in traffic volume observed?
4382 fewer cars in March 1992 at location 2
175
What statistical distribution was used to analyze the traffic data?
The t distribution, due to small sample size and unknown population standard deviation.
176
What is the null hypothesis (H₀) in the traffic study?
Mean traffic volume on the 6th equals mean traffic volume on the 13th
177
What key assumption must hold for the t-test in the traffic study?
Independence of traffic flow data between the two dates and locations
178
Why use the t distribution instead of normal distribution in small samples?
It accounts for extra uncertainty in the standard error by having thicker tails
179
Does the study prove that people believe Friday the 13th is unlucky?
No, it only shows a difference in traffic volume, not beliefs
180
What does a confidence interval excluding 0 imply about a hypothesis test?
It supports rejecting the null hypothesis (significant difference).
181
What is a correct interpretation of a p-value?
The probability of observing the data (or more extreme) assuming the null hypothesis is true
182
What is an incorrect interpretation of a p-value?
That it gives the probability the null or alternative hypothesis is true
183
Are reading and writing scores from the same students independent?
No, they are paired since the same students took both tests
184
What is the parameter of interest for paired data?
The average difference in scores between reading and writing for all students
185
What conclusion if p-value > 0.05 in paired test?
Fail to reject H₀; no convincing evidence of difference in average scores
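For paired data, the analysis is done on the differences within each pair. The sketch below uses made-up reading and writing scores for five students (not data from the study on the cards) and computes the t statistic for the mean difference.

```python
import statistics
from math import sqrt

# Hypothetical paired scores: same five students, two tests.
reading = [57, 60, 52, 70, 65]
writing = [55, 62, 50, 66, 63]

# One difference per student turns two samples into one.
diffs = [r - w for r, w in zip(reading, writing)]

mean_diff = statistics.mean(diffs)               # point estimate of the average difference
se = statistics.stdev(diffs) / sqrt(len(diffs))  # standard error of the mean difference
t_stat = (mean_diff - 0) / se                    # H0: average difference is 0
df = len(diffs) - 1                              # degrees of freedom
```

With these numbers the t statistic is about 1.63 on 4 degrees of freedom. Turning it into a p-value requires the t distribution (a table or a statistics library), which is omitted here.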
186
What is tested in the diamond price example?
Whether 1-carat diamonds have higher price per point than 0.99-carat diamonds
187
What was the conclusion from the diamond price test?
1-carat diamonds have significantly higher prices per point; 0.99-carat diamonds may be better value
188
What does linear regression model?
Relationship between one response variable (y) and one explanatory variable (x)
189
What is a residual in regression?
Difference between observed and predicted values (residual = observed − predicted)
190
What does the correlation coefficient (r) measure?
The strength and direction of the linear relationship between two variables (-1 to +1)
191
What is the formula for a regression line?
ŷ = b₀ + b₁x, where b₀ is the intercept and b₁ is the slope
192
What are key conditions for linear regression?
Linearity, nearly normal residuals, constant variability (homoscedasticity).
193
How is slope interpreted in context?
The expected change in y for a one-unit increase in x
194
What does R² represent?
Proportion of variability in y explained by the model (e.g., 0.56 means 56%)
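Cards 188 to 194 fit together in one small worked example. The sketch below fits a least-squares line to five made-up points, then computes the residuals and R². The function name `fit_line` and the data are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical responses

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]   # observed - predicted

mean_y = sum(y) / len(y)
sse = sum(e ** 2 for e in residuals)          # variability left unexplained
sst = sum((yi - mean_y) ** 2 for yi in y)     # total variability in y
r_squared = 1 - sse / sst
```

For this data the line is ŷ = 2.2 + 0.6x, so each one-unit increase in x goes with an expected increase of 0.6 in y, and R² = 0.6 means the line explains 60% of the variability in y.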
195
What is the difference between prediction and extrapolation?
Prediction uses x-values within observed range; extrapolation predicts outside the range (less reliable)
196
What is multiple linear regression?
A model predicting y from multiple explanatory variables (x₁, x₂, ...).