Statistics Flashcards

1
Q

Probability

A

A mathematical tool to study randomness, dealing with the likelihood of an event occurring.

2
Q

Statistics

A

The science that deals with the collection, analysis, interpretation, and presentation of data.

3
Q

Descriptive Statistics

A

Organizing and summarizing data.

4
Q

Inferential Statistics

A

Drawing conclusions from data using formal methods to determine our confidence level of those conclusions.

5
Q

Population

A

A collection of persons, things, or objects under study.

6
Q

Sample

A

A subset of a population that is studied directly to gain information about the larger population.

7
Q

Statistic

A

A number that represents a property of a sample

8
Q

Parameter

A

A numerical characteristic of a population that can be estimated by a statistic.

9
Q

Representative Sample

A

A sample that accurately represents the parameters of the whole population

10
Q

Variable

A

A characteristic or measurement that can be determined for each member of the population.

Typically denoted as X or Y

11
Q

Numerical Variable

A

A variable that takes numeric values with equal units, such as weight in pounds or time in hours.

12
Q

Categorical Variables

A

Variables that identify a category that the object is in.

13
Q

Data

A

The values of a variable

14
Q

Datum

A

A single value.

15
Q

Qualitative Data

A

The result of categorizing or describing attributes of a population. AKA categorical data.

16
Q

Quantitative Data

A

Numbers. The result of counting or measuring attributes of a population.

17
Q

Quantitative Discrete Data

A

Data that is measured on a scale that has a finite number of values within a finite interval.

18
Q

Quantitative Continuous Data

A

Data measured on a scale that has an infinite number of values within a finite interval

19
Q

Pie Chart

A

A graph in which categories of data are represented by wedges of a disk, each proportional in size to the percent of individuals in that category.

20
Q

Bar Graph

A

A graph in which the length of the bar is proportional to the number of individuals in each category

21
Q

Pareto Chart

A

A bar graph in which bars are ordered from largest to smallest

22
Q

Random Sampling

A

A sampling method in which each individual has an equal chance of being selected for the sample.

23
Q

Simple Random Sample

A

A random sampling method in which any group of n individuals is equally likely to be chosen as any other group of n individuals

24
Q

Stratified Sample

A

A sample obtained by dividing the population into groups called strata and then taking a proportionate number from each stratum.

25
Cluster Sample
A sampling method in which one divides the population into clusters or groups and then randomly selects some of the clusters.
26
Systematic Sample
A sampling method in which a starting point is chosen at random and then every nth piece of data from the population is added to the sample
27
Convenience Sampling
A non-random method of sampling that involves taking the data that is readily available.
28
Sampling with Replacement
A chosen member is returned to the population before the next selection. This allows for the possibility of being chosen more than once.
29
Sampling without Replacement
When a member of a population can only be chosen once.
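The difference between the two replacement schemes can be sketched with Python's standard `random` module. The population of ten labeled members and the fixed seed are assumptions made purely for illustration.

```python
import random

population = list(range(1, 11))  # hypothetical population of 10 labeled members
rng = random.Random(0)           # fixed seed so the sketch is repeatable

# With replacement: each draw is made from the full population,
# so the same member may appear more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: a chosen member is not available again,
# so every member appears at most once.
without_replacement = rng.sample(population, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

Only the second list is guaranteed to contain five distinct members.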
30
Sampling Errors
Errors in data resulting from the sampling process such as too small of a sample size
31
Nonsampling Errors
Errors in data not resulting from the sampling process such as a defective counter.
32
Sampling Bias
Created when some members of a population are more likely to be chosen than other members.
33
Level of Measurement
The way a set of data is measured
34
Nominal Scale
Used to measure qualitative data. The categories are not ordered in any way.
35
Ordinal Scale
Similar to the nominal scale, it categorizes. But unlike the nominal scale, it is able to order the data.
36
Interval Scale
A measuring scale that has a definite ordering and allows differences between data points to be measured and calculated, but does not have a true zero starting point.
37
Ratio Scale
A quantitative measuring scale in which there is a starting point (0), and ratios can be calculated between data points
38
Frequency
The number of times a value of the data occurs
39
Relative Frequency
The ratio of the frequency of a particular data point to the total number of outcomes.
40
Cumulative Relative Frequency
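The accumulation of relative frequencies: the sum of the relative frequencies of all previous values plus the relative frequency of the current value.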
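A minimal sketch of frequency, relative frequency, and cumulative relative frequency, using a small made-up data set (the values are assumptions, not taken from the cards):

```python
from collections import Counter
from itertools import accumulate

data = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5]  # made-up data values

freq = Counter(data)                 # frequency: how often each value occurs
values = sorted(freq)                # distinct values in increasing order
n = len(data)                        # total number of outcomes

rel_freq = [freq[v] / n for v in values]    # frequency / total
cum_rel_freq = list(accumulate(rel_freq))   # running total of relative frequencies

for v, rf, crf in zip(values, rel_freq, cum_rel_freq):
    print(v, freq[v], round(rf, 2), round(crf, 2))
```

The last cumulative relative frequency is always 1, up to floating-point rounding.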
41
Explanatory Variable
The variable that causes a change in another. AKA independent variable.
42
Response Variable
A variable that changes as a result of a change in the explanatory variable. AKA dependent variable.
43
Treatments
The different values of the explanatory variable
44
Experimental Unit
A single object or individual to be measured
45
Lurking Variables
Additional variables that can cloud a study
46
Random Assignment
Refers to randomly assigning the experimental units to the treatment groups.
47
Control Group
A group that is given a placebo treatment, that is, a treatment that cannot influence the response variable.
48
Blinding
When a person involved in a research study does not know who is receiving the active treatments and who is receiving the placebo
49
Double Blind Experiment
A research study in which both the researchers and the subjects are blinded
50
Descriptive Statistics
An area of statistics concerned with displaying data through numerical and graphical ways.
51
Stem-and-Leaf Graph or Stemplot
A two-column table, ['stem', 'leaf'], with the leaf being the data point's final significant digit and the stem being the rest of the digits. The rows are ordered from least to greatest.
52
Outlier
An observation of data that does not fit the rest of the data. Sometimes called an extreme value.
53
Line Graph
A graph that uses the x-axis to plot one variable and the y-axis to plot another variable. Line segments are used to connect each point.
54
Bar Graphs
A graph that uses bars to display the magnitude of the data.
55
Histogram
A graph that consists of adjoining boxes. The horizontal axis is labeled with the data it represents, while the vertical axis is labeled with either the frequency or the relative frequency.
56
Frequency Polygon
A line graph with the data on the x axis and the frequency on the y axis
57
Time Series Graph
A graph with time on the horizontal axis and the data on the vertical axis
58
Quartiles
Measures of location on the horizontal axis. Q1 (25%), Q2 (50% or median), Q3 (75%). Divides ordered data into quarters.
59
Percentiles
Divides ordered data into hundredths.
60
Median
The center of the ordered data. If the number N of data points is even, the median is the average of the two middle values, at positions N/2 and (N/2)+1. If N is odd, it is the value of the data point at position ((N-1)/2)+1.
61
Interquartile Range (IQR)
The spread between the first and third quartile. IQR = Q3-Q1
62
Box Plots or Box-Whisker Plots
Gives a good image of the concentration of data. Constructed with the minimum value, Q1, the median (Q2), Q3, and the maximum value. The min and max are the endpoints of the axis, Q1 marks the edge of the box closest to the min, and Q3 marks the edge of the box closest to the max. |--------|=========|====|-----------------------| (marks from left to right: min, Q1, median, Q3, max)
63
Mean
The sum of N data points divided by N
64
Median
For a set of N ordered data points with values V[1], ..., V[N]: the value V[(N+1)/2] if N is odd, or (V[N/2] + V[N/2+1])/2 if N is even
65
Mode
The most frequent value
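The three measures of center can be compared side by side with Python's standard `statistics` module. The data sets below are made up for illustration.

```python
import statistics

data = [1, 2, 2, 3, 4, 7, 9]   # made-up data, N = 7 (odd)

mean = statistics.mean(data)      # sum of the data divided by N: 28 / 7
median = statistics.median(data)  # middle value, position (N + 1) / 2 = 4
mode = statistics.mode(data)      # most frequent value

# With an even N, the median averages the two middle values.
even_median = statistics.median([1, 2, 3, 4])

print(mean, median, mode, even_median)
```

Here the mean (4) sits above the median (3) because the large values 7 and 9 pull it upward.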
66
The Law of Large Numbers
As the sample size increases, the sample mean tends to get closer and closer to the population mean.
67
Sampling Distribution
The distribution of the values that a statistic (such as the sample mean) takes across the different samples of a given size that could be drawn from a population
68
Symmetrical Distribution
Occurs if a vertical line can be drawn at some point for which the image of the left side will mirror the image to the right side
69
Skewed to the Left
When a distribution has a longer tail on the left side, so most of the data is concentrated toward the right.
70
Skewed to the Right
When a distribution has a longer tail on the right side, so most of the data is concentrated toward the left.
71
Standard Deviation
A widely used measure of variation. A number that measures how far data is from the mean. value = mean + (#ofSTDEV)(standard deviation)
72
Deviation
If x is a measured value of a data point in a data set with a mean M, then deviation = x-M
73
Variance
The average of the squares of the deviations: (x - x̄) for a sample, (x - μ) for a population
74
Standard Deviation Formula
Sample: s = √(∑(x - x̄)²/(n-1)). Population: σ = √(∑(x - μ)²/N)
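As a rough check of the two formulas, the sketch below computes both by hand and compares them with `statistics.stdev` and `statistics.pstdev`. The data values are made up for illustration.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values
n = len(data)
mean = sum(data) / n               # 40 / 8 = 5.0

# Sum of squared deviations from the mean.
squared_devs = sum((x - mean) ** 2 for x in data)   # 32.0

sample_sd = math.sqrt(squared_devs / (n - 1))   # divide by n - 1 for a sample
population_sd = math.sqrt(squared_devs / n)     # divide by N for a population

print(sample_sd, statistics.stdev(data))
print(population_sd, statistics.pstdev(data))
```

The sample version is slightly larger because it divides by n - 1 rather than N.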
75
Standard Error of Mean
The standard deviation of the sampling distribution of the mean: σ ∕ √n, where σ is the standard deviation of the population and n is the sample size
76
Sampling Variability of a Statistic
A measure of how much a statistic varies from one sample to another
77
Z-Score
The number of standard deviations from the mean
78
Chebyshev's Rule
For any data set: at least 75% of data is within 2 standard deviations of the mean; at least 89% of data is within 3 standard deviations of the mean; at least 95% of data is within 4.5 standard deviations of the mean
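The three percentages come from the general Chebyshev bound, which says at least 1 - 1/k² of the data lies within k standard deviations of the mean for any k > 1. A small sketch (the function name is hypothetical):

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3, 4.5):
    print(k, round(chebyshev_lower_bound(k), 4))
```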
79
Empirical Rule
For data with a bell-shaped, symmetric distribution: ~68% of data is within 1 standard deviation of the mean; ~95% of data is within 2 standard deviations of the mean; ~99.7% of data is within 3 standard deviations of the mean
80
Probability
A measure associated with how certain we are of outcomes of a particular experiment or event
81
Experiment
Planned operation carried out under controlled conditions
82
Chance Experiment
An experiment in which the result is not predetermined
83
Event
Any combination of outcomes. Represented by uppercase letters like A or B. The probability of an event A is written P(A).
84
Outcome
The result of an experiment
85
Sample Space
The set of all possible outcomes of an experiment
86
Equally Likely
Each outcome of an experiment occurs with equal probability
87
Law of Large Numbers
As the number of repetitions of an experiment is increased, the relative frequency (# times of a particular outcome/# of total outcomes or repetitions) obtained in the experiment tends to become closer and closer to the theoretical probability
88
Empirical
Often used in place of the word observed. Observed result = empirical result
89
OR Event
The outcome is in the event A OR B if the outcome is in A or is in B or is in A and B
90
AND Event
An outcome is in A AND B if and only if the outcome is in both A and B
91
Complement
All the outcomes that are not in an event A, denoted as A' (A prime).
92
Conditional Probability
The conditional probability of A given B, written P(A|B), is the probability that an event A occurs given that an event B has already occurred. P(A|B) = P(A AND B) / P(B)
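A worked example of the formula on one roll of a fair six-sided die. The events A (even number) and B (greater than 3) are assumptions chosen for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # hypothetical event: roll is even
B = {4, 5, 6}                       # hypothetical event: roll is greater than 3

def prob(event: set) -> Fraction:
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_a_and_b = prob(A & B)             # outcomes in both A and B: {4, 6}
p_a_given_b = p_a_and_b / prob(B)   # P(A|B) = P(A AND B) / P(B)

print(p_a_given_b)
```

Knowing the roll is greater than 3 raises the chance of an even number from 1/2 to 2/3.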
93
Independent Events
Two events are independent if the occurrence of one event does not affect the chance that the other occurs. A and B are independent if any one of the following is true (each implies the others):
- P(A | B) = P(A)
- P(B | A) = P(B)
- P(A and B) = P(A)P(B)
94
Mutually Exclusive Events
Events that cannot occur at the same time. P(A AND B) = 0
95
Multiplication Rule
If A and B are two events defined on a sample space: P(A AND B) = P(B)P(A | B)
96
Addition Rule
If A and B are defined on a sample space: P(A OR B) = P(A) + P(B) - P(A and B)
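Both the multiplication rule and the addition rule can be checked by counting outcomes directly. The sketch below uses one roll of a fair die, with events A (even) and B (greater than 3) assumed for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # hypothetical event: roll is even
B = {4, 5, 6}                       # hypothetical event: roll is greater than 3

def prob(event: set) -> Fraction:
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# Addition rule: P(A OR B) = P(A) + P(B) - P(A AND B)
addition_rule = prob(A) + prob(B) - prob(A & B)
counted_union = prob(A | B)

# Multiplication rule: P(A AND B) = P(B) * P(A | B),
# where P(A | B) is counted inside the reduced sample space B.
p_a_given_b = Fraction(len(A & B), len(B))
multiplication_rule = prob(B) * p_a_given_b

print(addition_rule, counted_union)
print(multiplication_rule, prob(A & B))
```

Subtracting P(A AND B) in the addition rule avoids counting the shared outcomes 4 and 6 twice.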
97
Contingency Table
A table consisting of at least two rows and two columns that shows the observed frequencies of two variables.
98
Tree Diagram
Consists of nodes and branches. Each node represents an event or outcome, and each branch is labeled with a probability (or frequency).
99
Venn Diagram
A box that represents the sample space, and circles/ellipses that represent the individual events. The overlap of circles represents a common outcome between two events.
100
Random Variable
A variable that describes the outcomes of a statistical experiment in words. Denoted by upper case letters like X or Y. Lower case letters like x or y denote the value of the variable. Example: X = the number of heads you get when you toss three fair coins; x = 0, 1, 2, 3
101
Discrete Probability Distribution Function (Discrete PDF)
Has two characteristics: 1. Each probability is between zero and one, inclusive. 2. The sum of the probabilities is one.
102
Expected Value
AKA long-term average or mean. The average value that is expected when an experiment is repeated over the long term. Denoted by μ. μ = Σ (x · P(x))
103
Standard Deviation of a Probability Distribution
Denoted by σ. σ = SQRT( Σ[ (x - μ)^2 · P(x) ] ). The square root of the sum of the squared deviations from the mean, each multiplied by its probability.
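A short sketch that applies both formulas to X = the number of heads in three tosses of a fair coin. The probabilities 1/8, 3/8, 3/8, 1/8 follow from listing the eight equally likely outcomes.

```python
import math

# PDF of X = number of heads in three fair coin tosses.
pdf = {0: 1 / 8, 1: 3 / 8, 2: 3 / 8, 3: 1 / 8}

# Expected value: mu = sum of x * P(x)
mu = sum(x * p for x, p in pdf.items())

# Variance and standard deviation: sigma = sqrt(sum of (x - mu)^2 * P(x))
variance = sum((x - mu) ** 2 * p for x, p in pdf.items())
sigma = math.sqrt(variance)

print(mu, variance, round(sigma, 4))
```

The probabilities sum to one and each lies between zero and one, so the table is a valid discrete PDF.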
104
Binomial Experiment
1. Fixed number of trials, denoted by n.
2. Only two possible outcomes for each trial: "success" and "failure". The letter p denotes the probability of success on one trial and q represents the probability of failure on one trial. p + q = 1
3. The n trials are independent and are repeated using identical conditions.
105
Binomial Probability Distribution
The probability distribution of the outcomes of a binomial experiment. The random variable X = the number of successes in the n independent trials. The mean μ = np. The variance σ² = npq. The standard deviation σ = √(npq).
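The card gives the mean and variance but not the probability function itself. A minimal sketch, assuming the standard binomial formula P(X = x) = C(n, x) · p^x · q^(n - x) and a hypothetical experiment of 10 fair coin tosses:

```python
from math import comb, sqrt

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ B(n, p), using C(n, x) * p^x * q^(n - x)."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)

n, p = 10, 0.5        # hypothetical: 10 tosses of a fair coin
q = 1 - p

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean_from_pmf = sum(x * px for x, px in enumerate(pmf))

print(sum(pmf))               # probabilities should sum to about 1
print(mean_from_pmf, n * p)   # both should be about np
print(sqrt(n * p * q))        # standard deviation, sqrt(npq)
```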
106
Bernoulli Trial
A binomial experiment in which n=1.
107
X~B(n,p)
X is a random variable with a binomial distribution. B = binomial probability Distribution Function with parameter n = number of trials and p = probability of success on each trial
108
Geometric Experiment
1. One or more Bernoulli trials with all failures except the last one. In other words, trials are repeated until a success.
2. The number of trials must be greater than 0 but has no limit.
3. Each trial has the same p and q.
X = # of trials until the first success.
109
Geometric Probability Distribution Function
X~G(p). X is a random variable with a geometric distribution. p = the probability of a success for each trial
110
Hypergeometric Experiment
1. Take samples from two groups.
2. Concerned with a group of interest, "the first group".
3. Sample without replacement.
4. Each pick is not independent, given the sampling is done without replacement.
5. These are not Bernoulli trials.
X = # of items from the group of interest
111
Poisson Experiment
1. The Poisson probability distribution gives the probability of a number of events occurring in a fixed interval of time or space if these events happen with a known average rate and independently of the time since the last event.
2. The Poisson distribution may be used to approximate the binomial if the probability of success is "small" (such as .01) and the number of trials is "large" (such as 1,000).
X = the number of occurrences in the interval of interest
112
Poisson Probability Distribution Function
X~P(μ). X is a random variable with a Poisson distribution. The parameter μ (or λ) = the mean for the interval of interest. The standard deviation of the Poisson distribution with mean μ is σ = √μ
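The card does not state the probability function, so the sketch below assumes the standard Poisson formula P(X = k) = e^(-μ) · μ^k / k! and a hypothetical mean of 3 events per interval. It checks numerically that the mean is μ and the standard deviation is √μ.

```python
from math import exp, factorial, sqrt

def poisson_pmf(k: int, mu: float) -> float:
    """P(X = k) for X ~ P(mu), using e^(-mu) * mu^k / k!."""
    return exp(-mu) * mu**k / factorial(k)

mu = 3.0   # hypothetical average number of events per interval

# Truncate the infinite sum; terms beyond k = 60 are negligible for mu = 3.
pmf = [poisson_pmf(k, mu) for k in range(60)]

total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))
sd = sqrt(sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf)))

print(total, mean, sd, sqrt(mu))
```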
113
Probability Density Function (pdf)
The function f(x) that represents a continuous probability distribution.
114
Cumulative Distribution Function (cdf)
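Measures the area under the curve: the cdf gives P(X ≤ x), the area under the probability density function to the left of x. For a continuous probability distribution:
1. Outcomes are measured, not counted.
2. The total area under the curve is equal to 1.
3. Probabilities are found for intervals of x rather than individual values of x.
4. P(c < x < d) is the probability that X lies between c and d, which is the area under the curve between c and d.
5. P(x = c) = 0
6. P(c < x < d) = P(c <= x <= d)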
115
Exponential Distribution
Often concerned with the amount of time until an event occurs. Typically has greater numbers of small values and lesser numbers of large values.
116
Memoryless Property
P(X > r + t | X > r) = P(X > t) for all r, t >= 0. Given that an amount of time r has already passed without the event, the probability of waiting more than an additional time t is the same as the probability of waiting more than t from the start, regardless of how much time has passed.
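A numeric check of the memoryless property for the exponential distribution. It assumes the standard survival function P(X > x) = e^(-mx), where m is the decay parameter; the values of m, r, and t are made up for illustration.

```python
from math import exp

def survival(x: float, m: float) -> float:
    """P(X > x) for an exponential random variable with decay parameter m."""
    return exp(-m * x)

m = 0.25   # hypothetical decay parameter (mean waiting time 1/m = 4)
r = 4.0    # time already waited
t = 2.0    # additional waiting time

# P(X > r + t | X > r) = P(X > r + t) / P(X > r)
conditional = survival(r + t, m) / survival(r, m)
unconditional = survival(t, m)

print(conditional, unconditional)
```

The two printed values agree up to floating-point rounding, whatever r is chosen.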
117
Conditional Probability
The likelihood that an event will occur given that another event has already occurred.
118
Decay Parameter
Describes the rate at which probabilities decay to zero for increasing values of x. It is the value m in the probability density function f(x) = me^(-mx), where m = 1/μ and μ = the mean of the random variable
119
Uniform Distribution
A continuous random variable that has equally likely outcomes over the domain, a < x < b
120
Standard Normal Distribution
A normal distribution of standardized values called z-scores, with mean 0 and standard deviation 1. Z~N(0,1). A general normal distribution is written X~N(μ,σ).
121
Z-Scores
Measured in units of standard deviation. z = (x - μ) / σ
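A small sketch of converting a value to a z-score and back again. The function names and the mean and standard deviation used here are assumptions for illustration.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def value_from_z(z: float, mu: float, sigma: float) -> float:
    """Inverse conversion: value = mean + z * standard deviation."""
    return mu + z * sigma

mu, sigma = 5.0, 2.0          # hypothetical mean and standard deviation

z = z_score(9.0, mu, sigma)   # 9 is two standard deviations above the mean
x = value_from_z(-1.5, mu, sigma)

print(z, x)
```

A positive z-score means the value lies above the mean, and a negative one means it lies below.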
122
Normal Distribution PDF
f(x) = (1 / (σ·√(2π))) · e^(-(1/2)·((x-μ)/σ)²)
123
Central Limit Theorem
If samples of a large enough size n are drawn repeatedly and each sample's mean is calculated, the histogram created from those means will approximate a normal bell shape.
124
The Central Limit Theorem for Sums
If one repeatedly draws samples of a given size and calculates the sum of each sample, the sums tend to follow a normal distribution as the sample size increases. That normal distribution has a mean equal to the original mean multiplied by the sample size, and a standard deviation equal to the original standard deviation multiplied by the square root of the sample size.
125
Inferential Statistics
Part of statistics concerned with using sample data to make generalizations about an unknown population.
126
Point Estimates
Single values used to estimate a parameter within a population.
127
Confidence Interval
An interval estimate for an unknown population parameter, constructed so that it captures the parameter with a stated level of confidence. (point estimate - margin of error, point estimate + margin of error)
128
Hypothesis Test
Collecting data from a sample and evaluating the data.
1. Set up two contradictory hypotheses.
2. Collect sample data.
3. Determine the correct distribution to perform the hypothesis test.
4. Analyze sample data by performing calculations that will ultimately allow you to reject or decline to reject the null hypothesis.
5. Make a decision and write a meaningful conclusion.
129
Null Hypothesis
Denoted as H₀. A statement of no difference between the variables: they are not related.
130
Alternative Hypothesis
Denoted as Hₐ. A claim that contradicts the null hypothesis. The hypothesis that the researcher is trying to find evidence for.
131
Hypothesis Testing Outcomes
1. Do not reject null / null is true (correct decision)
2. Reject the null / null is true (Type I error)
3. Do not reject null / null is false (Type II error)
4. Reject the null / null is false (correct decision)
132
Type I Error
Rejecting the null when it is actually true
133
Type II Error
Failing to reject the null when it is actually false
134
P(Type I Error)
Probability of a type I error. Denoted as α
135
P(Type II Error)
Probability of a type II error Denoted as β
136
Power of the Test
The probability of rejecting the null when it is false
137
P-Value
The probability, calculated assuming the null hypothesis is true, of getting a test result at least as extreme as the one observed
138
Rejecting or Not Rejecting Null Hypothesis
1. If α > p-value, reject the null.
2. If α ≤ p-value, do not reject the null.
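The decision rule is simple enough to write as a function. The function name and the default significance level of 0.05 are assumptions for illustration; the comparison itself follows the rule as stated.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule: reject the null only when alpha > p-value."""
    if alpha > p_value:
        return "reject null"
    return "do not reject null"

print(decide(0.03))              # alpha 0.05 > 0.03
print(decide(0.20))              # alpha 0.05 <= 0.20
print(decide(0.03, alpha=0.01))  # a stricter alpha changes the decision
```

Note that a p-value exactly equal to α falls under "do not reject".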
139
Independent Groups
Sample groups that are independent from each other
140
Matched Groups
Two samples that are dependent on each other
141
Standard Error
An estimated standard deviation for a hypothesis test of the difference in sample means: sqrt( (s_1)^2/n_1 + (s_2)^2/n_2 )
142
Chi-Square Distribution
X~χ²_df, where df is the degrees of freedom. μ = df. σ = sqrt( 2(df) )
143
Student's t-distribution
- the graph is similar to the standard normal curve - the mean is zero and the distribution is symmetric about zero - has more probability in its tails than a standard normal distribution
144
Degrees of Freedom
The number of independent pieces of information needed to calculate a statistic
145
Goodness-of-Fit Test
Test statistic: ∑ₖ(O - E)² / E, where O = observed values, E = expected values, and k = number of different data cells or categories
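A sketch of the test statistic for a made-up example: 100 observations spread over five categories that are expected to be equally likely. The observed counts are assumptions, and the degrees of freedom are taken as k - 1, the usual choice for this test.

```python
observed = [18, 22, 20, 25, 15]   # made-up observed counts, k = 5 categories
expected = [20, 20, 20, 20, 20]   # equal expected counts for 100 observations

# Sum of (O - E)^2 / E over all k categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

k = len(observed)
df = k - 1                        # degrees of freedom, assuming df = k - 1

print(chi_square, df)
```

A larger statistic means the observed counts sit further from the expected counts.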
146
Multivariate
Data containing multiple variables
147
Linear Regression
The process of finding the straight line that best fits a set of paired data
148
Error or Residual
The difference between the y value of a data point at x and the regression line at x.
149
Sum of Squared Errors (SSE)
SSE = ∑ε², the sum of the squared residuals ε over all data points
150
Correlation Coefficient r
r = (nΣ(xy) − (Σx)(Σy)) / √([nΣx² − (Σx)²][nΣy² − (Σy)²]), where -1 <= r <= 1. Values of r closer to -1 or 1 indicate a stronger linear relationship between x and y
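The formula translates directly into code. The function name and the paired data below are made up for illustration.

```python
from math import sqrt

def correlation(xs: list, ys: list) -> float:
    """Correlation coefficient r computed from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

x = [1, 2, 3, 4, 5]   # made-up paired data
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 4), round(r**2, 4))   # r and the coefficient of determination
```

Points that fall exactly on a rising line, such as (1, 2), (2, 4), (3, 6), give r = 1.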
151
Coefficient of Determination
r², the square of the correlation coefficient. Represents the percent of variation in the dependent variable y that can be explained by variation in the independent variable x using the regression line.
152
Significance of the Correlation Coefficient
A hypothesis test to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population
153
Population vs Sample Correlation Coefficient
ρ = population correlation coefficient; ρ is unknown. r = sample correlation coefficient; r is known, calculated from the sample data
154
Significance Level
The probability of rejecting a null hypothesis when it is true
155
Outliers
Observed data points that are far from the least squares line. Usually identified as being further than two standard deviations from the best-fit line. s = sqrt( SSE / (n-2) ), where s is the standard deviation of the residuals, SSE = sum of squared errors, and n = number of data points
156
Analysis of Variance (ANOVA)
Used to determine the existence of a statistically significant difference among several group means. Assumptions:
1. Each population from which a sample is taken is normal.
2. All samples are randomly selected and independent.
3. The populations are assumed to have equal standard deviations (or variances).
4. The factor is a categorical variable.
5. The response is a numerical variable.