Midterm Flashcards

(65 cards)

1
Q

Categorical Variables

A

These variables represent categories or groups and cannot be used in mathematical operations like addition or subtraction to derive meaningful results. They are qualitative in nature.

2
Q

Ordinal Variables

A

These have a natural order or ranking, but the differences between categories are not quantifiable or uniform. For example:
Shirt Sizes: Small, Medium, Large. While there is an order, the difference between Small and Medium may not be the same as between Medium and Large.
Tax Brackets: Low, Medium, High. The brackets are ordered, but the difference between Low and Medium is not necessarily the same as between Medium and High.
Importance: Ordinal variables are useful in surveys and rankings where relative positioning matters, but exact differences do not.

3
Q

Nominal Variables

A

These represent categories without any inherent order. They are essentially labels or names.
Yes or No: Binary responses like “Yes” or “No” are nominal.
Colors: Red, Blue, Green, etc. These are different but cannot be ordered.
Types of Animals: Dog, Cat, Bird. These are distinct categories without any natural order.
Importance: Nominal variables are crucial in classification tasks where the goal is to group data into distinct categories without implying any hierarchy.

4
Q

Numerical Variables

A

These variables represent quantities and can be used in mathematical operations. They are quantitative in nature.

5
Q

Discrete Variables

A

These take on specific, separate values, typically integers. They do not vary continuously.
Number of Times Done: For example, the number of times a person has visited a doctor. This is a count and can only be a whole number.
Importance: Discrete variables are essential in counting and frequency analysis, where the focus is on the number of occurrences.

6
Q

Continuous Variables

A

These can take on any value within a continuous interval, including fractions and decimals.
Height: A person’s height can be measured to any degree of precision.
Concentration: The concentration of a chemical in a solution can be measured with high precision.
Importance: Continuous variables are vital in measurements and modeling where precision is required, such as in scientific experiments and engineering.

7
Q

Independent Variable (Predictor)

A

This is the variable that you believe may influence or cause changes in the response variable. It is the “input” in an experiment or study.
Importance: Identifying the independent variable is crucial for designing experiments and understanding causal relationships.
“INPUT”

8
Q

Dependent Variable (Response)

A

This is the variable that you are trying to explain or predict. It is the “output” in an experiment or study.
Importance: The dependent variable is the focus of analysis in many studies, as it represents the outcome of interest.
“OUTPUT”

9
Q

Causal Relationship

A

This is a relationship where changing the independent variable directly affects the dependent variable.

10
Q

Observational Studies

A

These cannot establish causal relationships because they do not involve controlled manipulation of variables.

11
Q

Experimental Studies

A

These can establish causal relationships by manipulating the independent variable and observing the effect on the dependent variable.

12
Q

Correlation vs. Causation

A

It is important to remember that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other.

13
Q

Sampling

A

The process of selecting a subset of individuals from a larger population to estimate characteristics of the whole population.

14
Q

Population

A

The entire set of elements you are interested in studying.

15
Q

Sample

A

A subset of the population that you collect data from. The goal is for the sample to be representative of the population.

16
Q

Parameters

A

Numerical values that describe some characteristic of a population.

17
Q

Statistics

A

Numerical values that describe some characteristic of a sample.

Note: The goal is to use statistics to estimate parameters.

Importance: Sampling is essential because it is often impractical or impossible to study an entire population. A well-chosen sample can provide accurate estimates of population parameters.

18
Q

Random Sampling

A

Each member of the population has an equal chance of being selected. This helps avoid bias and makes it more likely that the sample is representative.
Importance: Random sampling is crucial for generalizing results from the sample to the population.
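As a rough illustration of "equal chance of being selected," here is a minimal Python sketch. The population of 1,000 numbered member IDs and the sample size of 50 are hypothetical values chosen only for the example.

```python
import random

# Hypothetical population: 1,000 member IDs (an assumption for illustration).
population = list(range(1, 1001))

# A seeded generator keeps the example reproducible.
rng = random.Random(3)

# Simple random sample without replacement: every member has the same
# chance of being picked, and no member can be picked twice.
sample = rng.sample(population, k=50)

print(sorted(sample)[:10])
```

A convenience sample, by contrast, would be something like `population[:50]` (the first 50 people who were easy to reach), which gives most members no chance of selection at all.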

19
Q

Avoiding Selection Bias

A

Ensuring that no group is systematically overrepresented or underrepresented in the sample. This includes eliminating convenience bias, where samples are chosen based on ease of access.
Importance: Selection bias can lead to incorrect conclusions, so it is vital to ensure that the sample is representative.

20
Q

Sample Size

A

The sample should be large enough to minimize sampling error but not so large that the marginal gain in accuracy is negligible.
Importance: An appropriately sized sample ensures that the results are reliable and valid.

21
Q

Coverage of Key Demographics or Features

A

Ensuring that all relevant subgroups are included in the sample proportionally.
Importance: This ensures that the sample accurately reflects the diversity of the population.

22
Q

Randomized Allocation

A

Randomly assigning subjects to different groups to ensure that each group is comparable. This is crucial for establishing causal relationships.

Importance: Randomized allocation helps balance confounding variables across groups, so that observed effects can be attributed to the independent variable.

23
Q

Confronting Variance/Confounding Variable

A

A confounding variable influences both the predictor and the response variables.

Variance can arise from differences in subjects’ backgrounds, environments, etc. Methods to control variance include:

Matching: Pairing subjects based on similar characteristics to control for differences.
Replication: Using a larger sample size to increase the accuracy of the results.
Blocking: Grouping subjects into blocks based on certain characteristics and then randomizing within each block.
Importance: Controlling variance is essential for ensuring that the results of an experiment are valid and reliable.

24
Q

Hypothesis Testing

A

A statistical method used to determine whether there is enough evidence to reject a null hypothesis.

25
Null Hypothesis (H₀)
The hypothesis that there is no effect or no difference. It is assumed to be true until evidence suggests otherwise.
26
Alternative Hypothesis (H₁, also written Hₐ)
The hypothesis that there is an effect or a difference.
27
p-value
The probability, assuming H₀ is true, of observing data that favor Hₐ at least as strongly as the data actually observed (the observed result is the "weakest evidence" that is counted). A low p-value (typically < 0.05) suggests that the null hypothesis should be rejected.
28
Type I Error (False Positive)
Rejecting the null hypothesis in favor of Hₐ when H₀ is actually true. The probability of this error is set by the significance level (α). Importance: Controlling Type I errors is crucial to avoid false conclusions.
29
Type II Error (False Negative)
Failing to reject the null hypothesis when it is actually false. This is influenced by the sample size and the effect size. Importance: Minimizing Type II errors is important to ensure that real effects are not missed.
30
Analytical Hypothesis Testing
Analytical methods are used when the data meets certain conditions, such as being normally distributed or having a sufficiently large sample size. Importance: Analytical methods provide precise calculations of probabilities and are essential when the conditions are met. However, if the conditions are not met, simulation methods may be more appropriate.
31
What is the order for Simulation Hypothesis Testing?
1) Assume H₀ initially and simulate what chance alone would produce. This is done by shuffling the predictor variable to mimic natural variation (hold one column and shuffle the other).
2) Simulate multiple trials to get different values of the test statistic (e.g., proportion1 - proportion2 or mean1 - mean2).
3) Plot the number of trials vs. the simulated test statistic.
4) Find the p-value: the probability, assuming H₀ is true, of finding a value >= the observed test statistic.
5) If the p-value < the preset threshold (alpha), reject H₀ in favor of Hₐ.
6) Alpha is known as the significance level.
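The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the course's own code: the `treatment` and `control` measurements, the function name `permutation_test`, and the number of trials are all hypothetical.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_trials=5000, seed=0):
    """One-sided permutation test for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)  # observed test statistic
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_trials):
        # Under H0 the group labels do not matter. Shuffling the pooled
        # responses and re-splitting them is equivalent to shuffling the
        # predictor (label) column while holding the response column fixed.
        rng.shuffle(pooled)
        simulated = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if simulated >= observed:
            at_least_as_extreme += 1
    # p-value: share of simulated statistics >= the observed statistic.
    return observed, at_least_as_extreme / n_trials

# Hypothetical measurements, for illustration only.
treatment = [12.1, 14.3, 13.8, 15.0, 14.1, 13.3]
control = [11.0, 12.2, 11.7, 12.9, 11.4, 12.0]

observed, p_value = permutation_test(treatment, control)
alpha = 0.05  # significance level, chosen before running the test
reject_h0 = p_value < alpha
print(f"observed difference = {observed:.2f}, p-value = {p_value:.4f}")
```

The shuffle is done without replacement, which is what distinguishes this randomization step from bootstrapping.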
32
What does the significance level represent?
The preset threshold (alpha, commonly 0.05) that we compare the p-value to. It represents how rare our test statistic (the difference in the proportions or means) must be under H₀ before we reject H₀ in favor of Hₐ.
33
What is the Central Limit Theorem
States that the distribution of sample means will approximate a normal (Gaussian) distribution as the sample size increases, regardless of the original population's distribution.
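A small simulation can make the theorem concrete. The sketch below assumes a deliberately skewed population (exponential with mean 1, chosen only for illustration) and tracks how the sample means behave as the sample size n grows; the `skewness` helper is a simple hand-rolled measure of asymmetry, not a library function.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

def sample_means(n, n_samples=2000):
    """Means of n_samples samples of size n from an exponential population."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(n_samples)]

def skewness(values):
    """Average cubed z-score: near 0 for a symmetric (bell-shaped) curve."""
    m = mean(values)
    s = stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

center, spread, skew = {}, {}, {}
for n in (2, 30, 200):
    means = sample_means(n)
    center[n] = mean(means)    # stays near the population mean (1.0)
    spread[n] = stdev(means)   # shrinks as n grows
    skew[n] = skewness(means)  # moves toward 0 as the curve becomes Gaussian
    print(f"n={n:3d}  mean={center[n]:.3f}  sd={spread[n]:.3f}  skew={skew[n]:.2f}")
```

Even though individual values from this population are strongly right-skewed, the distribution of the sample means becomes narrower and more symmetric as n increases.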
34
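What is the Gaussian curve?
The normal distribution (bell curve): symmetric about the mean, with "common" standard deviation zones (about 68% of values within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3).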
35
What is the Poisson distribution
A discrete (not continuous) probability distribution that models the number of events (independent of one another) occurring in a fixed interval, given a constant rate of occurrence.
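For reference, the Poisson probability of seeing exactly k events when the average rate is λ per interval is P(X = k) = λ^k e^(−λ) / k!. A minimal sketch of that formula follows; the rate of 3 events per interval is a hypothetical value.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events in one interval at the given average rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

rate = 3.0  # hypothetical average number of events per interval
probabilities = [poisson_pmf(k, rate) for k in range(30)]

# Only whole-number counts are possible, which is what makes this discrete.
for k in range(6):
    print(k, round(probabilities[k], 4))

# The mean of the distribution equals the rate.
expected_count = sum(k * p for k, p in enumerate(probabilities))
```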
36
How does the Poisson Distribution differ from the normal distribution?
The Poisson distribution is discrete and counts the number of events, whereas the normal distribution is continuous and describes values over a continuous range (such as mean values).
37
What is a z-test?
A z-test compares a sample mean (or proportion) to a population parameter under the assumption that the population standard deviation (σ) is known (or the sample is large enough to estimate σ very accurately).
Standard deviation is known, so the result is accurate.
Used for the difference in two proportions.
Narrow, tall curve.
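As a rough sketch of a z-test on the difference in two proportions, the example below uses the common pooled-proportion form of the test statistic and a two-sided p-value from the standard normal curve. The counts (60 of 100 vs. 45 of 100) and the function name are hypothetical; whether the course uses the pooled form is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_1, n_1, successes_2, n_2):
    """Two-sided z-test for the difference between two proportions."""
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    # Under H0 both groups share one proportion, so pool the data.
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / standard_error
    # Probability of a z-score at least this far from 0 in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(60, 100, 45, 100)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```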
38
What is a t-test?
A t-test compares sample means when the population standard deviation (σ) is unknown, and it uses the sample’s standard deviation (s) as an estimate.
Standard deviation of the population is unknown, so an estimate is used.
Used for the difference in two means.
Flatter curve, which gets pointier as the sample size increases.
39
When do I use a z-test vs. a t-test?
z-test if: the population standard deviation is known, or the sample is large (sample size n ≥ 30), so the sampling distribution is approximately Gaussian and you trust the normal approximation.
t-test if: the population standard deviation is unknown, or the sample is small but you can assume (approximately) normal data.
40
What are the two types of interval estimates?
The 95% confidence interval and the standard error.
41
What is bootstrapping?
A resampling technique that draws random samples with replacement from the original data set (so the same observation can be sampled multiple times) to estimate the distribution of a statistic. The proportions simulated from the resampled data are used to calculate confidence intervals that don't rely on assumptions about the base population.
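A minimal sketch of bootstrapping a proportion is shown below. The data set (36 "yes" responses out of 100), the number of resamples, and the simple percentile indexing are assumptions made for illustration.

```python
import random
from statistics import stdev

def bootstrap_proportions(data, n_resamples=5000, seed=1):
    """Proportions computed from resamples drawn WITH replacement."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices samples with replacement, so one observation can
    # appear several times in the same resample.
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)]

# Hypothetical data: 1 = yes, 0 = no.
data = [1] * 36 + [0] * 64
original_proportion = sum(data) / len(data)

boot = sorted(bootstrap_proportions(data))

# Standard error: the standard deviation of the bootstrapped proportions.
standard_error = stdev(boot)

# 95% confidence interval: the 2.5th to 97.5th percentile of the resamples.
ci_low = boot[int(0.025 * len(boot))]
ci_high = boot[int(0.975 * len(boot)) - 1]

print(f"{original_proportion:.2f} ± {standard_error:.3f} (standard error)")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

The 95% interval comes out wider than the ± one-standard-error range, which matches the usual ordering of these interval estimates.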
42
What is the significance of the original proportion in the confidence interval and standard error?
It acts as the point estimate around which the bootstrapped data produce a range/interval.
43
What are the key characteristics of Standard Error?
1 standard deviation: ~68% chance of capturing the true population parameter
Narrower range
Represented by ±
More commonly used
44
What are the key characteristics of 95% Confidence Intervals?
95% chance of capturing the true population parameter
Wider range, and thus typically shown as a range
Less commonly used
45
What is the propagation of uncertainty via bootstrapping?
Bootstrapping is used to estimate the standard error (uncertainty) for each measurement by resampling the data. These uncertainties are then propagated through the function or model that calculates the desired 'output,' resulting in a range of possible values for the output that reflects the combined uncertainty of all inputs.
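One way to picture this is a sketch in which the "output" is a ratio of two measured quantities. Everything specific here is hypothetical: the `mass` and `volume` measurements, the choice of a ratio as the function, and the number of resamples.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)

# Hypothetical repeated measurements of two inputs.
mass = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7]    # grams
volume = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]       # millilitres

def resampled_mean(values):
    """Mean of one bootstrap resample (drawn with replacement)."""
    return mean(rng.choices(values, k=len(values)))

# Push each pair of resampled inputs through the function (here a ratio),
# so the spread of the outputs reflects the uncertainty of both inputs.
outputs = sorted(resampled_mean(mass) / resampled_mean(volume)
                 for _ in range(4000))

estimate = mean(mass) / mean(volume)
output_standard_error = stdev(outputs)
ci_low = outputs[int(0.025 * len(outputs))]
ci_high = outputs[int(0.975 * len(outputs)) - 1]

print(f"output = {estimate:.3f} ± {output_standard_error:.3f}")
print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}")
```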
46
What is matching?
Matching: Pairing subjects based on similar characteristics to control for differences. Done in observational studies. Note: matching can also act as a control.
47
What is replication?
Replication: Using a larger sample size to increase the accuracy of the results.
48
What is blocking?
Blocking: Grouping subjects into blocks based on certain characteristics and then randomizing within each block. Done in experimental studies.
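A minimal sketch of blocking followed by randomization within each block is given below. The subjects, the age-group blocking characteristic, and the function name `blocked_allocation` are hypothetical.

```python
import random
from collections import defaultdict

def blocked_allocation(subjects, block_of, seed=0):
    """Group subjects into blocks, then randomize treatment within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)          # random order inside the block
        half = len(members) // 2
        for subject in members[:half]:
            assignment[subject] = "treatment"
        for subject in members[half:]:
            assignment[subject] = "control"
    return assignment

# Hypothetical subjects: (id, age group).
subjects = [("s1", "young"), ("s2", "young"), ("s3", "young"), ("s4", "young"),
            ("s5", "old"), ("s6", "old"), ("s7", "old"), ("s8", "old")]

assignment = blocked_allocation(subjects, block_of=lambda s: s[1])
for subject, group in sorted(assignment.items()):
    print(subject, group)
```

Because the split happens inside each block, both groups end up with the same mix of young and old subjects, so age cannot act as a confounding variable.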
49
Hypothesis testing can only be used to compare means or proportions. (True or False)
False. We can also compare medians, because hypothesis testing is flexible.
Explanation: Hypothesis testing is a versatile tool in statistics and is not limited to comparing means or proportions. It can also be used to:
Compare variances (e.g., F-test).
Test for independence in contingency tables (e.g., Chi-square test).
Assess goodness-of-fit (e.g., Chi-square goodness-of-fit test).
Evaluate correlation (e.g., testing if a correlation coefficient is significantly different from zero).
Test regression coefficients in linear models.
And much more.
50
Hypothesis testing can only be used to compare two groups. (True or False)
False (although it is true for this class's examples).
Explanation: Hypothesis testing is not limited to comparing two groups. It can be used to compare:
More than two groups: For example, ANOVA (Analysis of Variance) is used to compare means across three or more groups.
Single groups: For example, a one-sample t-test compares the mean of a single group to a known value.
Relationships between variables: For example, regression analysis tests the relationship between a dependent variable and one or more independent variables.
51
Hypothesis testing can only lead to binary decisions. (True or False)
False (although it is true for this class).
Explanation: While hypothesis testing often results in a binary decision (reject or fail to reject the null hypothesis), it is not limited to binary outcomes. Hypothesis testing also provides:
p-values: A measure of the strength of evidence against the null hypothesis.
Confidence intervals: A range of plausible values for the parameter being tested.
Effect sizes: A measure of the magnitude of the observed effect.
These additional outputs provide more nuanced insights beyond a simple "yes or no" decision. For example, a p-value of 0.06 might not lead to rejecting the null hypothesis at the 0.05 significance level, but it still suggests some evidence against the null hypothesis.
52
What are Type I errors controlled by?
The significance level (alpha)
53
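What can you do to minimize Type I errors?
Lower alpha. Increasing the sample size is often listed as well, but for a fixed alpha it mainly reduces Type II errors rather than the Type I error rate.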
54
How can you reduce Type II errors?
Increase the sample size or increase alpha (at the cost of more Type I errors).
55
What happens when you increase d in a t-test?
The curve becomes more "pointy" as it approaches the Gaussian curve.
56
What statements must be true in order for an analytical approach to be valid?
The samples must be drawn from a population whose values are distributed along a Gaussian curve.
If the above condition is not satisfied, then we must have a sufficiently large number of samples so that the sample means fit the Gaussian profile.
57
What is the equation for the Population Density?
Density = (number of trials in bin / total number of trials in sample) / bin width
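The density formula can be checked with a short sketch. The `values` and `bin_edges` below are hypothetical, and each bin is treated as including its left edge and excluding its right edge.

```python
def histogram_density(values, bin_edges):
    """Density per bin: (count in bin / total count) / bin width."""
    total = len(values)
    densities = []
    for left, right in zip(bin_edges[:-1], bin_edges[1:]):
        count = sum(1 for v in values if left <= v < right)
        densities.append((count / total) / (right - left))
    return densities

# Hypothetical simulated test statistics and bin edges.
values = [0.1, 0.4, 0.6, 0.7, 1.2, 1.5, 1.6, 1.9]
bin_edges = [0.0, 0.5, 1.0, 1.5, 2.0]

densities = histogram_density(values, bin_edges)

# Dividing by the bin width makes the bar areas add up to 1.
total_area = sum(d * (right - left)
                 for d, left, right in zip(densities, bin_edges[:-1], bin_edges[1:]))
print(densities, total_area)
```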
58
How are simulated p-values slightly better than those found analytically?
Simulated methods can use Monte Carlo simulations, which are insensitive to the shape of the population distribution and do not need a Gaussian fit. They are also more flexible, allowing hypothesis testing on other kinds of numerical summaries of sample data, such as the median.
59
What is the range of the 95% confidence interval?
The 2.5th to the 97.5th percentile.
60
Does randomization include replacement?
No, it does not; replacement is for bootstrapping only.
61
What do simulations randomize?
The predictor variable
62
What is recall bias?
When participants cannot accurately recall past events as needed for a study.
63
What is convenience bias?
When participants are selected by ease of access instead of by random sampling.
64
What words is the p-value tied to?
Rare event
65
Rank the interval estimates by width.
95% CI > 90% CI > SE