Final Of Everything Flashcards

1
Q

Data Matrix

A

A convenient way to store data (e.g. a spreadsheet or table). Each row is a unique case (observational unit). Each column corresponds to a variable.

2
Q

The two types of variables

A

Numerical or Categorical

3
Q

Numerical Variables

A

Can be discrete or continuous

4
Q

Categorical Variables

A

Can be ordered or nominal

5
Q

What type of variable is “Number of Siblings”?

A

Numerical (discrete)

6
Q

What type of variable is “Student Height”?

A

Numerical (continuous)

7
Q

What type of variable is “Previous Stats Courses Taken”?

A

Categorical (nominal)

8
Q

Explanatory variables might affect

A

Response variable

9
Q

Two types of data collection

A

Observational Studies and Experiments

10
Q

Researchers collect data passively; they merely observe

A

Observational studies

11
Q

Researchers actively control the data collection trying to establish causation

A

Experiments

12
Q

Sampling principles and strategies

A

1st step: Identify topics and questions to be investigated
2nd: Clearly laid out research questions are important for identifying the relevant subjects/cases and which variables are important
3rd: Consider how data are collected

13
Q

Example: suppose we want to estimate household size, where a household is defined as people living together in the same dwelling and sharing living accommodation. If we selected students at random at an elementary school and asked them what their family size is, will this be a good measure of household size?

A
  • Average will be biased
  • Only measuring households with children, not single people or people without children.
  • Would likely estimate a higher number than the true number.
14
Q

Relationship between Sample and Population

A

Sample is a subset of population:
Population- people
Sample- a group of selected people

15
Q

Three sampling methods

A

1) simple random sample
2) stratified sample
3) cluster sample

16
Q

Simple random sample

A

Randomly selected from population

17
Q

What type of sample is cars passing through intersections in Kelowna

A

Simple random sample

18
Q

Stratified sample

A

Cases grouped into strata, then simple random sampling

19
Q

Cluster sample

A

Divide the population into clusters, randomly select some clusters, and sample all cases in the selected clusters

20
Q

Multistage sampling

A

Clusters are sampled randomly, then cases are randomly sampled within each selected cluster
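
The sampling methods on these cards can be compared side by side. This is a minimal Python sketch, assuming the usual textbook definitions (cluster: keep every case in randomly chosen clusters; multistage: sample again inside the chosen clusters). The village data and all function names are hypothetical.

```python
import random

def simple_random_sample(population, n, rng):
    """Every case has the same chance of being selected."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample inside every stratum."""
    return [case for group in strata.values()
            for case in rng.sample(group, n_per_stratum)]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick clusters, then keep every case in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case for name in chosen for case in clusters[name]]

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then randomly sample within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case for name in chosen
            for case in rng.sample(clusters[name], n_per_cluster)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30 villages of 40 people each.
villages = {f"village_{v}": [f"v{v}_p{i}" for i in range(40)] for v in range(30)}
everyone = [person for people in villages.values() for person in people]

srs = simple_random_sample(everyone, 150, rng)
strat = stratified_sample(villages, 5, rng)        # 30 strata x 5 people
clus = cluster_sample(villages, 3, rng)            # 3 whole villages
multi = multistage_sample(villages, 10, 15, rng)   # 10 villages x 15 people
```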

21
Q

Scatterplot

A

A way to provide case by case view of data. Can visualize relationship between two numerical variables.

22
Q

Dot plot

A

Visualize one numerical variable

23
Q

Sample mean (sample average formula)

A

x̄ = (x1 + x2 + x3 +… +xn)/n

24
Q

What is the unit of sample mean

A

The same as the units of the data

25
Symbol for population mean
μ
26
Histograms
Provides a view of the data density (ie the data distribution)
27
Unimodal histogram distribution
A single prominent peak
28
Bimodal/ multimodal histogram distribution
Several prominent peaks
29
Uniform histogram distribution
No apparent peaks
30
Types of skewness
Right skewed (tail on right), left skewed (tail on left) or symmetric
31
Deviation
Distance from the mean
32
Sample variance
s^2 = ((x1-x̄)^2 + (x2-x̄)^2 +…+ (xn-x̄)^2)/(n-1)
33
What are the units of sample variance?
The square of the units of the data
34
Sample standard deviation formula
s = sqrt(s^2)
35
Population variance formula
σ^2 = ((x1-μ)^2 +…+ (xN-μ)^2)/N, where μ is the population mean and N is the population size
36
Population standard deviation
σ = sqrt (σ^2)
37
Main components of a box plot
- Median (Q2)
- First quartile Q1 (median of the lower half)
- Third quartile Q3 (median of the upper half)
- IQR is Q3-Q1
- Max and min whisker limits: Q3 + 1.5*IQR and Q1 - 1.5*IQR
38
IQR formula
Q3-Q1
39
Steps to draw a box plot
1) Draw a thick line for the median (Q2)
2) Draw a rectangle with bounds Q1 and Q3
3) Draw a dotted line for Q1 - 1.5*IQR and Q3 + 1.5*IQR
4) Label outliers and draw T-shaped upper/lower whiskers (they only go as far as the highest or lowest data points inside those limits)
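
The numbers needed for these steps can be computed before drawing anything. The sketch below assumes the "median of each half" quartile convention, with the middle value left out of both halves when n is odd; other textbooks and libraries use slightly different quartile rules. The function name and data are hypothetical.

```python
from statistics import median

def box_plot_summary(data):
    """Return the quantities a box plot is drawn from."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]              # values below the median position
    upper = xs[half + n % 2:]      # skips the middle value when n is odd
    q1, q2, q3 = median(lower), median(xs), median(upper)
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    inside = [x for x in xs if low_limit <= x <= high_limit]
    return {
        "Q1": q1, "median": q2, "Q3": q3, "IQR": iqr,
        "limits": (low_limit, high_limit),
        # Whiskers stop at the most extreme data points inside the limits.
        "whiskers": (min(inside), max(inside)),
        "outliers": [x for x in xs if x < low_limit or x > high_limit],
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

With this data the upper limit is 15.5, so the upper whisker stops at 9 and the value 30 is labeled as an outlier.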
40
Robust Statistics
Median and IQR are more robust than mean and standard deviation (less affected by outlier behavior)
41
Common practices
- Symmetric distributions -> mean and SD
- Skewed distributions -> median and IQR
42
What type of plot would be most useful for visualizing the data density
Histogram
43
Suppose a data set only has two values. What can you say about the relationship between mean and median?
Mean= median
44
Consider a population of [1,2,3,4,10]. What are the mean and variance (Var)?
Mean =4 Var=10
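
The arithmetic on this card can be checked with Python's standard `statistics` module. `pvariance` divides by n (population variance), while `variance` divides by n-1 (sample variance), which is why the two results differ.

```python
from statistics import mean, pvariance, variance

population = [1, 2, 3, 4, 10]

mu = mean(population)              # (1 + 2 + 3 + 4 + 10) / 5
pop_var = pvariance(population)    # squared deviations sum to 50, divided by 5
samp_var = variance(population)    # same sum of squares, divided by 4

print(mu, pop_var, samp_var)
```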
46
A company records the commute distances of all 42 of its employees. By mistake, the smallest commute was recorded as 1 mile instead of 10. Compare the recorded median to the actual median.
The recorded median will be the same as the actual median
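
A quick way to see why the median is unaffected: the mistake only moves the smallest value further down, so the middle of the sorted list does not change, while the mean does. The 42 commute distances below are made up for illustration.

```python
from statistics import mean, median

# Hypothetical data: 42 commutes of 10, 11, ..., 51 miles (smallest is 10).
actual = list(range(10, 52))
recorded = [1] + actual[1:]   # smallest value mistakenly recorded as 1

print(median(actual), median(recorded))   # identical medians
print(mean(actual), mean(recorded))       # the mean is pulled down
```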
47
Suppose we are interested in estimating the malaria rate in a dense tropical portion of a southeastern country. We learn there are 30 villages, each more or less similar to the next. Our goal is to test 150 individuals. What sampling method should be used?
Cluster sampling
48
What is the probability of rolling a 1 with a fair die?
1/6
49
Probability Definition
The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times
50
Mutually exclusive or disjoint
Have no outcomes in common
51
Outcome
Random result from an experiment
52
Event
A set of outcomes that has a probability assigned to it
53
Sample space
All possible outcomes
54
Complement
The event that the original event does not occur; P(not A) = 1 - P(A)
55
There are 18 balls in a box. Five are white, thirteen are black. Choose two balls at random, one after another (without replacement). Find the probability that both chosen balls are white.
20/306
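
This answer can be verified two ways: by listing every ordered pair of draws, and by the multiplication rule P(first white) * P(second white | first white). The sketch below assumes the balls are drawn without replacement, which is what the answer 20/306 implies.

```python
from fractions import Fraction
from itertools import permutations

balls = ["W"] * 5 + ["B"] * 13

# Every ordered way to draw two different balls (18 * 17 = 306 draws).
draws = list(permutations(range(len(balls)), 2))
both_white = sum(1 for i, j in draws if balls[i] == "W" and balls[j] == "W")

p_enumerated = Fraction(both_white, len(draws))
p_rule = Fraction(5, 18) * Fraction(4, 17)

print(both_white, len(draws), p_enumerated, p_rule)
```

`Fraction` reduces 20/306 to 10/153 automatically; the two forms are the same number.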
56
A fair coin is flipped twice. What is the probability that at least one flip is tails?
3/4
57
Twenty students including Miriam and Rachel are to be placed in four classes of equal size at random. What is the probability they end up in different classes?
15/19
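
One way to check this answer is to count seat assignments. The sketch assumes the 20 places are numbered 0 to 19, that places 0-4 form the first class, 5-9 the second, and so on, and that every ordered pair of distinct places is equally likely for the two students.

```python
from fractions import Fraction
from itertools import permutations

CLASS_SIZE = 5  # 20 students split into 4 equal classes

# (Miriam's place, Rachel's place) for every possible pair of distinct places.
pairs = list(permutations(range(20), 2))
different = sum(1 for m, r in pairs if m // CLASS_SIZE != r // CLASS_SIZE)

p_different = Fraction(different, len(pairs))
print(p_different)
```

The same result follows directly: once Miriam is placed, 15 of the 19 remaining places are in other classes.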
58
If two events are independent then P(A|B) = P(B|A)
No. Independence means P(A|B) = P(A) and P(B|A) = P(B); these are equal only when P(A) = P(B).
59
Random Variable
An assignment of numbers to outcomes in some sample space
60
Dataset
Mean and variance
61
Random variable
Expected value (similar to mean) and variance.
62
Expected value equation
E(x) = x1 P(x=x1) + x2 P(x=x2) +…+ xn P(x=xn)
63
Expected value symbol
E(x) or μ
64
Variance of Random Variables (RV)
Var(x) = (x1-μ)^2 P(x=x1) +…+ (xn-μ)^2 P(x=xn)
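
As an illustration of the expected value and variance formulas, here is a small sketch for a fair six-sided die. The distribution is stored as a dictionary mapping each value to its probability; the function names are hypothetical.

```python
from fractions import Fraction

def expected_value(dist):
    """Sum of value * probability over all values."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Sum of squared deviation from the mean * probability."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expected_value(die))   # 7/2
print(variance(die))         # 35/12
```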
65
Variance of X symbols
Var(x) or σ^2
66
Standard deviation notation
SD(x) or σ
67
E(ax)=
aE(x)
68
E(ax+b)
aE(x) +b
69
SD(ax) =
|a| SD(x)
70
SD(ax+b) =
|a| SD(x)
71
Var(ax) =
a^2 Var(x)
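
The rules for E, SD and Var under the transformation ax + b can be checked numerically. This sketch reuses a fair die as the random variable and picks a = -3, b = 10 arbitrarily; a negative a shows why the SD rule needs |a|.

```python
from fractions import Fraction
from math import isclose, sqrt

def expected_value(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def transform(dist, a, b):
    """Distribution of a*X + b (assumes a != 0 so values stay distinct)."""
    return {a * x + b: p for x, p in dist.items()}

die = {face: Fraction(1, 6) for face in range(1, 7)}
a, b = -3, 10
y = transform(die, a, b)

# E(aX + b) = a E(X) + b
print(expected_value(y), a * expected_value(die) + b)
# Var(aX + b) = a^2 Var(X); the shift b has no effect on spread
print(variance(y), a ** 2 * variance(die))
# SD(aX + b) = |a| SD(X)
print(sqrt(variance(y)), abs(a) * sqrt(variance(die)))
```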
72
Dependent Events probability notation
P(A ∩ B) = P(B) P(A|B)
73
Independent events probability notation
P(A ∩ B) = P(B) * P(A)
74
Area under the gaussian curve
Area = 1
75
Normal distribution Parameter notation
N( μ, σ) Mean- μ Standard deviation- σ
76
What is a Z score
A z score converts a value to the standard normal scale N(0,1). It describes how far a value is from the mean of a group of values, measured in standard deviations.
77
Z score formula
z = (x- μ)/σ
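
A short sketch of the z score in use, with made-up numbers (x = 130, μ = 100, σ = 15). It uses `statistics.NormalDist` from the Python standard library in place of a printed z table to get the area to the left of z.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations x lies above the mean."""
    return (x - mu) / sigma

# Hypothetical values for illustration.
z = z_score(130, 100, 15)

# Area to the left of z under N(0, 1), i.e. what a z table would give.
area_below = NormalDist().cdf(z)

# The same area computed directly on the original N(100, 15) scale.
area_direct = NormalDist(mu=100, sigma=15).cdf(130)

print(z, round(area_below, 4))
```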
78
Quantile
Quantiles divide a probability distribution into groups of equal probability, e.g. quartiles (4 groups), percentiles (100 groups)
79
Q-Q plot of symmetric distribution
Straight line following y=x
80
Q-Q plot of T shaped distribution
Starts lower than the line y=x then meets the line at the origin then slowly goes above the line.
81
Q-Q plot of a right skew distribution
Concave up curve, curve points right
82
Q-Q plot of left skew data
Concave down curve, pointing left
83
Geometric distributions
- goes until something happens (ie successful outcome) - a series of independent trials with two outcomes
84
Binomial distribution:
- number of successes in a set number of trials - two outcomes per trial: success or failure
85
4 conditions of binomial distribution
1) Trials are independent
2) The number of trials, n, is fixed
3) Each trial is a success or a failure
4) The probability of success, p, is the same for each trial
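
When the four conditions hold, the probability of exactly k successes in n trials follows the standard binomial formula C(n, k) p^k (1-p)^(n-k). The formula itself is not on these cards; the sketch below is added as an illustration, with a hypothetical function name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Example: exactly 2 heads in 4 flips of a fair coin.
print(binomial_pmf(2, 4, 0.5))   # 6 of the 16 equally likely sequences

# The probabilities over k = 0..n add up to 1.
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
```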
86
Confidence Intervals
A confidence interval is a range of values, computed from the sample, that we are a certain percentage confident (e.g. 95%) contains the actual population parameter (e.g. the population mean).
87
Point estimate
A point estimate is the calculation of a single value which is the best guess as to the population parameter which is unknown (eg mean, proportion in support of a statement)
88
Population proportion notation
p
89
Sample proportion notation
p̂
90
Central limit theorem
- When many sample means are taken, the distribution of these sample means looks like a normal distribution (particularly for larger sample sizes)
- The population's distribution (even when skewed) does not actually change this normal appearance of the sample means.
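
A small simulation makes this concrete. The sketch draws repeated samples from a right-skewed population (an exponential distribution with mean 1 and SD 1, chosen here only as an example) and looks at the sample means. The seed, sample size and repeat count are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 30                   # size of each sample
repeats = 2000           # number of samples drawn

# Each entry is the mean of one sample of n skewed values.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

center = mean(sample_means)    # should be close to the population mean, 1
spread = stdev(sample_means)   # should be close to sigma / sqrt(n)
print(round(center, 3), round(spread, 3), round(1 / sqrt(n), 3))
```

Plotting `sample_means` as a histogram would show a roughly bell-shaped curve even though the individual values are strongly skewed.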
91
How large is large enough when it comes to sample size?
Generally n >= 30
92
Success failure condition
np>= 10 and n(1-p) >=10
93
95% confidence interval for the mean
Point estimate +- 1.96 *SE
94
Standard Error SE
SE = σ/sqrt(n)
σ: population standard deviation
n: sample size
95
The 95% confidence interval means:
Roughly 95% of the time, the interval sample mean +- 1.96 σ/sqrt(n) will contain the population mean
96
99% confidence interval
Point estimate +- 2.58 σ/sqrt(n)
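
The 95% and 99% intervals differ only in the multiplier. A minimal sketch, assuming the population standard deviation σ is known and using the multipliers 1.96 and 2.58 from these cards; the sample values are made up.

```python
from math import sqrt

def z_confidence_interval(x_bar, sigma, n, z_star=1.96):
    """Point estimate +- z_star * SE, where SE = sigma / sqrt(n)."""
    se = sigma / sqrt(n)
    return (x_bar - z_star * se, x_bar + z_star * se)

# Hypothetical sample: mean 50, known sigma 10, n = 100, so SE = 1.
ci_95 = z_confidence_interval(50, 10, 100)               # about (48.04, 51.96)
ci_99 = z_confidence_interval(50, 10, 100, z_star=2.58)  # about (47.42, 52.58)
print(ci_95, ci_99)
```

The 99% interval is wider: more confidence requires a larger range.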
97
Consider the case for finding confidence intervals without population standard deviation
1. Use the sample SD instead of the population SD
2. Use t-tables instead of the z table
98
t formula
t = (x̄- μ)/(s/sqrt(n))
99
Proof by contradiction
If the probability is very small, we should reject the claim and accept our conjecture. Either you are observing a rare event or something is wrong with the original claim.
100
Four steps of proof by contradiction
1) State the hypotheses:
   - null hypothesis Ho: μ =
   - alternative hypothesis Ha: μ …
2) Compute the z score from the sample mean
3) Find the p-value: area to the right of the z score
4) Make the decision: reject the null hypothesis and accept the alternative, or fail to reject the null hypothesis, based on the alpha value
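
The four steps can be sketched for an upper-tail test. This assumes the population SD is known (so a z score applies), an alternative of the form μ > μ0, and α = 0.05; all numbers are invented for illustration, and `statistics.NormalDist` stands in for the z table.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z_test(x_bar, mu_0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject_null) for Ho: mu = mu_0 vs Ha: mu > mu_0."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))   # step 2: z score of the sample mean
    p_value = 1 - NormalDist().cdf(z)        # step 3: area to the right of z
    return z, p_value, p_value < alpha       # step 4: compare with alpha

# Hypothetical claim: mu = 50. Sample of n = 100 with mean 52, sigma = 10.
z, p_value, reject = upper_tail_z_test(52, 50, 10, 100)
print(z, round(p_value, 4), reject)

# A sample mean of 51 is not unusual enough to reject the claim.
z2, p2, reject2 = upper_tail_z_test(51, 50, 10, 100)
```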
101
When do you use z tables
Population SD is given and you are trying to estimate population mean
102
When do you use t tables
Population SD is not given and you are trying to estimate population mean
103
When do you use chi squared tables (X^2)
Population SD is not given and you are trying to estimate population variance.
104
Using chi squared tables
- Examine the row for the distribution's degrees of freedom
- Identify a range for the area (e.g. 0.025 to 0.05)
- The chi squared table provides upper tail values, which is different from the z and t distribution tables
105
Population variance confidence interval
[ (n-1)s^2/χ²_2 , (n-1)s^2/χ²_1 ], where χ²_1 < χ²_2 are the lower and upper chi squared table values
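
A worked instance of this interval, as a sketch. The chi squared critical values are passed in as arguments because they come from a table; the example uses the standard table entries for 9 degrees of freedom at 95% confidence (2.700 and 19.023). The sample size and sample variance are hypothetical.

```python
def variance_confidence_interval(n, s2, chi2_lower, chi2_upper):
    """Interval for the population variance from a sample variance s2.

    chi2_lower and chi2_upper are table values for n - 1 degrees of freedom.
    Dividing by the larger value gives the lower end of the interval.
    """
    df = n - 1
    return (df * s2 / chi2_upper, df * s2 / chi2_lower)

# Hypothetical sample: n = 10 (9 degrees of freedom), s^2 = 4.0.
low, high = variance_confidence_interval(10, 4.0, chi2_lower=2.700, chi2_upper=19.023)
print(round(low, 3), round(high, 3))
```

The interval is not symmetric around s^2 because the chi squared distribution is right skewed.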
106
What is instrumentation?
Term to describe the instruments used to measure physical quantities, e.g. pressure, temperature, voltage
107
Active instruments
-Have external power - expensive (complicated) - resolution can be very small
108
Passive instruments
-Do not have external power - inexpensive (simple) -resolution is limited
109
Null type instruments
- No display
- Null-type pressure gauges have weights added or removed to measure pressure (cumbersome); the weights are balanced until a reference mark is reached.
110
Deflection type instruments
- Has a display
- Pressure gauges of this type conveniently have a pointer moving against a scale
111
Analog instruments
- Output varies continuously. Resolution is determined by what your eye can distinguish
112
Digital instruments
-Has discrete steps in resolution - requires analog to digital converter (A/D) -Expensive - Slow, not good for fast processes
113
Smart instruments
Has a microprocessor
114
Non-smart instruments
Does not have a microprocessor
115
Inaccuracy
The extent to which a reading might be wrong; it is often quoted as a percentage of the full-scale (f.s.) reading (max value) of an instrument.
116
Tolerance
Describes the maximum deviation of a manufactured component from some specified value.
117
Range or span
Defines the min and max values of quantity that instruments can measure
118
Threshold
The input has to reach a certain level before the change in the instrument's output is large enough to be detectable
119
Resolution
The lower limit on the magnitude of change in the input measured quantity that produces an observable change in the instrument output.
120
Nonlinearity
Maximum deviation of any of the output readings marked x from the straight line.
121
Linear-regression line
In z-score units, the estimate of y equals R times the z-score of the given x
122
Linear regression line:
(ŷ-ȳ)/sy = R (x-x̄)/sx
ȳ: average of the y data points
ŷ: the estimate of y on the line of best fit
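
Rearranging the z-score form gives a slope of R*sy/sx and a line through the point (x̄, ȳ). The sketch below computes R, the slope and the intercept from a small made-up data set; the function name and data are hypothetical.

```python
from statistics import mean, stdev

def regression_line(xs, ys):
    """Return (r, slope, intercept) using slope = r * s_y / s_x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation: average product of z-scores, using n - 1 in the divisor.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (
        (len(xs) - 1) * s_x * s_y
    )
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    return r, slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, slope, intercept = regression_line(xs, ys)

# Estimate of y at x = 3 (the mean of x), which should equal the mean of y.
y_hat = intercept + slope * 3
print(round(r, 4), round(slope, 4), round(intercept, 4), round(y_hat, 4))
```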
123
Sensitivity
-Slope - the measure of change in instrument output that occurs when the quantity measured changes by a given amount -scale deflection/ value of measurement producing deflection
124
Zero drift
Bias: the zero reading of the instrument is modified by a change in ambient conditions
125
Sensitivity drift
- slope (ie sensitivity) drifts because of change in ambient conditions -eg modulus of elasticity in spring changing as a function of temperature
126
Sensitivity drift coefficient
= sens drift/ change in environment
127
Zero drift coefficient
= Zero drift/ change in environment
128
Reasons for incorrect or inaccurate measurements
- Behavior will gradually diverge from the stated specifications - effects of dust, dirt, fumes and chemicals in the environment
129
Several factors impacting rate of divergence
1. type of instrument 2. the frequency of usage 3. severity of the operating conditions
130
Systematic errors
Mean is wrong (accuracy)
131
Random errors
Standard deviation is large (precision)
132
What are static characteristics
1) linearity 2) tolerance 3) sensitivity
133
What measures the strength of a linear fit model?
The R squared value
134
How many outcomes are there in a Bernoulli trial?
There can only be 2 outcomes
135
Expected number of trials and SD for a geometric distribution with success probability p
Expected value: 1/p
Standard deviation: sqrt((1-p)/p^2)
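
These two formulas describe the number of trials needed to get the first success. A minimal sketch, assuming independent trials with the same success probability p; the pmf P(X = k) = (1-p)^(k-1) p is the standard geometric formula and is added here for illustration.

```python
from math import sqrt

def geometric_mean_sd(p):
    """Expected number of trials until the first success, and its SD."""
    return 1 / p, sqrt((1 - p) / p ** 2)

def geometric_pmf(k, p):
    """P(first success happens on trial k): k - 1 failures, then a success."""
    return (1 - p) ** (k - 1) * p

# Example: p = 0.25 gives an expected wait of 4 trials.
expected, sd = geometric_mean_sd(0.25)
print(expected, round(sd, 4))

# Cross-check the expected value by summing k * P(X = k) over many k.
approx_mean = sum(k * geometric_pmf(k, 0.25) for k in range(1, 500))
```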