flashcard 4

(50 cards)

1
Q

What is the main purpose of statistics in research?

A

To collect, analyze, interpret, and present numerical data in order to distinguish real patterns from random variation and make informed decisions.

2
Q

Name the four primary tasks involved in statistical analysis.

A

1) Designing experiments and collecting data, 2) Describing and summarizing data, 3) Testing hypotheses (inferential statistics), and 4) Building and evaluating predictive or explanatory models.

3
Q

Why is it important to design experiments carefully before collecting data?

A

Because a well-designed experiment ensures that data are representative, unbiased, and appropriate for answering the research question, making subsequent analysis valid.

4
Q

How does descriptive statistics differ from inferential statistics?

A

Descriptive statistics summarize and visualize the features of a dataset (e.g., mean, median, charts), whereas inferential statistics draw conclusions or make predictions about a population based on sample data.

5
Q

Explain what a random variable is in probability theory.

A

A random variable is a function that assigns a numerical value to each outcome of a random experiment; a probability is attached to each value (or range of values) it can take.

6
Q

What distinguishes a discrete random variable from a continuous random variable?

A

Discrete variables take on countable values (e.g., number of successes), while continuous variables can take any value within an interval (e.g., height or weight).

7
Q

How can probability be interpreted through the lens of long-run frequencies?

A

As the proportion of times an event occurs out of many repetitions of the same experiment, approaching a stable value as trials increase.

8
Q

Define sensitivity in the context of diagnostic testing.

A

Sensitivity measures the ability of a test to correctly identify individuals who have the condition (true positives) out of all actual positives.

9
Q

Define specificity in diagnostic testing.

A

Specificity measures the ability of a test to correctly identify individuals who do not have the condition (true negatives) out of all actual negatives.

10
Q

What do positive predictive value (PPV) and negative predictive value (NPV) convey about a diagnostic test?

A

PPV indicates the probability that someone with a positive test truly has the condition; NPV indicates the probability that someone with a negative test truly does not have the condition.

11
Q

Why does the prevalence of a disease in a population affect PPV and NPV?

A

Because PPV and NPV depend on the proportion of true cases versus non-cases; in low-prevalence settings, even tests with high sensitivity and specificity can yield a low PPV.

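To make the prevalence effect concrete, here is a minimal Python sketch using Bayes' theorem; the 99%/99% test characteristics and the prevalence values are hypothetical numbers chosen for illustration.

```python
# Sketch: how prevalence drives PPV, for a hypothetical test with
# 99% sensitivity and 99% specificity (illustrative numbers only).

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test, very different PPV at different prevalences:
print(round(ppv(0.99, 0.99, 0.50), 3))   # common disease: PPV = 0.99
print(round(ppv(0.99, 0.99, 0.001), 3))  # rare disease: PPV drops to about 0.09
```

With identical test characteristics, the probability that a positive result is a true case falls from near certainty to under 10% as prevalence drops.
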
12
Q

Describe the key differences between randomized controlled trials (RCTs) and cohort studies.

A

In RCTs, participants are randomly assigned to treatment or control, controlling for confounders. In cohort studies, participants are followed over time based on exposure status without random assignment.

13
Q

What is the distinction between prospective and retrospective study designs?

A

Prospective studies collect data moving forward from a defined point, while retrospective studies analyze existing data that were collected in the past.

14
Q

Compare cross-sectional and longitudinal studies.

A

Cross-sectional studies measure variables at a single point in time, providing a “snapshot,” whereas longitudinal studies follow the same subjects over multiple time points to observe changes.

15
Q

What defines quantitative data versus qualitative (categorical) data?

A

Quantitative data are numerical measurements (either discrete or continuous), while qualitative data describe categories or attributes (nominal or ordinal).

16
Q

Give examples of nominal and ordinal qualitative variables.

A

Nominal examples: blood type, type of cuisine. Ordinal examples: pain severity scale, education level (high school, bachelor’s, master’s).

17
Q

What is a probability distribution, and why is it useful?

A

A probability distribution describes how probabilities are assigned to different values of a random variable, helping to understand the variable’s behavior and make inferences.

18
Q

Name three common probability distributions and their typical applications.

A

Normal distribution for continuous traits in populations; Binomial distribution for counts of successes in fixed trials; Poisson distribution for rare event counts over a fixed interval.

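As a sketch of how the binomial and Poisson cards translate into formulas, both probabilities can be evaluated directly with the standard library; the parameters below are made up for illustration.

```python
import math

# Sketch: binomial and Poisson probabilities computed from their
# definitions (illustrative parameters, stdlib only).

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success prob p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k rare events, given mean rate lam per interval)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

print(round(binomial_pmf(2, 10, 0.1), 4))  # 2 successes in 10 trials, p = 0.1
print(round(poisson_pmf(0, 2.0), 4))       # zero events when the mean count is 2
```
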
19
Q

Explain the concept of a normal distribution and its key characteristics.

A

A normal distribution is symmetric, bell-shaped, and defined by its mean (center) and standard deviation (spread); most values cluster around the mean.

20
Q

What does the empirical rule (“68–95–99.7 rule”) describe?

A

In a normal distribution, approximately 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three.

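The 68–95–99.7 rule can be checked by simulation; this sketch draws from a normal distribution with arbitrary illustrative parameters (mean 100, sd 15) and counts how many draws land within k standard deviations of the mean.

```python
import random

# Sketch: verifying the empirical rule by simulation (seeded for
# reproducibility; the fractions are sample estimates, not exact values).
random.seed(42)
mu, sigma, n = 100.0, 15.0, 100_000
draws = [random.gauss(mu, sigma) for _ in range(n)]

for k in (1, 2, 3):
    within = sum(1 for x in draws if abs(x - mu) <= k * sigma) / n
    print(f"within {k} sd: {within:.3f}")  # close to 0.683, 0.954, 0.997
```
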
21
Q

Why might one choose the median over the mean as a measure of central tendency?

A

Because the median is less sensitive to extreme values or skewed data, providing a better “central” value when outliers are present.

22
Q

How do you interpret the standard deviation of a dataset?

A

Standard deviation quantifies the typical distance of data points from the mean (the square root of the average squared deviation), reflecting the spread or variability in the dataset.

23
Q

What is the standard error of the mean (SEM), and how does it differ from standard deviation?

A

SEM measures the variability of sample means if the same experiment were repeated, whereas standard deviation measures variability of individual observations around the sample mean.
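
The SD-versus-SEM distinction is easy to see numerically; this sketch uses a small made-up sample, with SEM computed as SD divided by the square root of the sample size.

```python
import math
import statistics

# Sketch: SD describes the spread of individual observations, while
# SEM = SD / sqrt(n) describes the uncertainty of the sample mean
# (toy data, illustrative only).
data = [4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 5.2, 4.7]
n = len(data)

sd = statistics.stdev(data)   # sample standard deviation
sem = sd / math.sqrt(n)       # standard error of the mean

print(f"SD  = {sd:.3f}")   # spread of individual values
print(f"SEM = {sem:.3f}")  # always smaller; shrinks as n grows
```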

24
Q

Define skewness and explain what positive or negative skew indicates about a distribution’s shape.

A

Skewness measures asymmetry of a distribution. Positive skew (right-tailed) indicates a long tail on the higher side; negative skew (left-tailed) indicates a long tail on the lower side.

25
Q

What does kurtosis describe in a distribution?

A

Kurtosis measures the “tailedness” of a distribution relative to a normal distribution, i.e., its propensity to produce outliers.

26
Q

Why is it important to assess both location and dispersion when summarizing data?

A

Because measures of central tendency (location) alone don’t convey spread or variability; dispersion measures (e.g., range, interquartile range, standard deviation) describe how the data are distributed around the center.

27
Q

What is a contingency table, and when is it used?

A

A contingency table (cross-tabulation) displays the frequencies of two categorical variables simultaneously, facilitating analysis of associations or interactions.

28
Q

Define the Pearson correlation coefficient (r).

A

A statistic that quantifies the strength and direction of a linear relationship between two continuous variables, ranging from –1 (perfect negative) to +1 (perfect positive).

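As a sketch, Pearson r can be computed straight from its definition (covariance divided by the product of the standard deviations); the data points here are toy values for illustration.

```python
import math

# Sketch: Pearson correlation from first principles (toy data).
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 3))   # 1.0, perfect positive
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 3))   # -1.0, perfect negative
```
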
29
Q

Under what circumstances might Pearson correlation be misleading?

A

When the relationship is non-linear, when outliers heavily influence the data, or when the variables are not measured on an interval/ratio scale.

30
Q

What is Spearman rank correlation, and why is it used instead of Pearson correlation in some cases?

A

Spearman correlation assesses the strength of a monotonic relationship using ranked data; it is robust to outliers and appropriate for ordinal data or non-linear (but monotonic) associations.

31
Q

Describe when you would use a histogram versus a boxplot to explore a dataset.

A

Use a histogram to visualize the detailed shape of a distribution (e.g., modality, skewness). Use a boxplot to summarize a distribution in terms of its median and quartiles and to identify outliers.

32
Q

In a boxplot, what do the box, whiskers, and outliers represent?

A

The box spans the interquartile range (Q1 to Q3) with a line at the median; the whiskers extend to the most extreme values within 1.5×IQR of the box; points beyond the whiskers are plotted as outliers.

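The quantities a boxplot summarizes can be computed by hand; this sketch uses a toy dataset with one deliberate outlier and the standard library's `statistics.quantiles` (with `n=4` it returns the three quartiles).

```python
import statistics

# Sketch: the numbers behind a boxplot (toy data; 40 is a planted outlier).
data = [12, 13, 14, 15, 15, 16, 17, 18, 19, 40]

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3)   # box edges and centre line
print(outliers)         # points plotted beyond the whiskers -> [40]
```
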
33
Q

How do sample size and variability affect the reliability of statistical estimates?

A

Larger sample sizes reduce the standard error, making estimates more precise. Greater variability in the data increases uncertainty around estimates, requiring larger samples to achieve the same precision.

34
Q

Explain why random sampling is important in statistical studies.

A

Random sampling minimizes selection bias and ensures that the sample represents the broader population, allowing valid inferences and generalizations.

35
Q

What are type I and type II errors in hypothesis testing?

A

A type I error is incorrectly rejecting a true null hypothesis (false positive). A type II error is failing to reject a false null hypothesis (false negative).

36
Q

How does significance level (α) relate to type I error?

A

The significance level (α) is the probability threshold for rejecting the null hypothesis; setting α = 0.05 means accepting a 5% risk of rejecting a null hypothesis that is actually true (a type I error).

37
Q

What is statistical power, and what factors influence it?

A

Statistical power is the probability of correctly rejecting a false null hypothesis (1 − type II error rate). It increases with larger sample size, larger effect size, and lower data variability.

38
Q

Define p-value in the context of hypothesis testing.

A

The p-value is the probability of observing data as extreme or more extreme than what was collected, assuming the null hypothesis is true.

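One way to see this definition in action is a permutation test: the p-value is estimated as the fraction of random label shuffles that produce a mean difference at least as extreme as the observed one. The two groups below are made-up toy data, and the run is seeded for reproducibility.

```python
import random

# Sketch: estimating a p-value by permutation (toy data, seeded).
random.seed(0)
group_a = [12.1, 11.8, 12.4, 12.9, 12.3]
group_b = [11.2, 11.5, 11.0, 11.7, 11.4]

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

observed = abs(mean_diff(group_a, group_b))
pooled = group_a + group_b
n_a = len(group_a)

extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)  # reassign labels at random under H0
    if abs(mean_diff(pooled[:n_a], pooled[n_a:])) >= observed:
        extreme += 1

p_value = extreme / trials
print(f"p = {p_value:.4f}")  # small: such a difference is rare under H0
```
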
39
Q

Why is it important to distinguish between statistical significance and practical significance?

A

A result can be statistically significant (unlikely due to chance) but have a trivial or clinically irrelevant effect size, so one must assess real-world impact.

40
Q

Describe the difference between one-sided and two-sided hypothesis tests.

A

A one-sided test assesses deviation in a specific direction (e.g., greater than); a two-sided test assesses deviations in both directions (greater or less than).

41
Q

What is a confidence interval, and how is it interpreted?

A

A confidence interval gives a range of plausible values for a population parameter (e.g., a mean) at a specified confidence level (e.g., 95%); if the study were repeated many times, about 95% of the intervals constructed this way would contain the true parameter.

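As a sketch, an approximate 95% CI for a mean can be built from the sample mean and the standard error using the normal critical value 1.96 (a t-based interval would be slightly wider for small samples); the data are toy values for illustration.

```python
import math
import statistics

# Sketch: approximate 95% CI for a mean, mean +/- 1.96 * SEM (toy data).
data = [5.1, 4.9, 5.4, 5.0, 5.3, 4.8, 5.2, 5.1, 5.0, 5.2]
n = len(data)

mean = statistics.mean(data)
sem = statistics.stdev(data) / math.sqrt(n)
low, high = mean - 1.96 * sem, mean + 1.96 * sem

print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```
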
42
Q

When might nonparametric methods be preferable to parametric methods?

A

When data do not meet parametric assumptions (e.g., normality, equal variances), have small sample sizes, or include ordinal/categorical variables.

43
Q

Give an example of a nonparametric test and its basic application.

A

The Wilcoxon rank-sum test (Mann–Whitney U) compares two independent groups based on ranks, and is appropriate when data are skewed or ordinal.

44
Q

Explain the purpose of analysis of variance (ANOVA).

A

ANOVA tests whether there are statistically significant differences among the means of three or more independent groups by comparing between-group variability to within-group variability.

45
Q

What is a chi-squared test used for?

A

To assess whether observed frequencies in categorical data differ from expected frequencies under a null hypothesis of no association or no difference.

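The statistic itself is a sum of (observed − expected)² / expected over the cells of the table, with expected counts from row and column totals; this sketch uses made-up counts in a 2×2 table for illustration.

```python
# Sketch: chi-squared statistic for a 2x2 table (made-up counts).
observed = [[30, 10],   # e.g. exposed:   outcome yes / no
            [20, 40]]   #      unexposed: outcome yes / no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - expected) ** 2 / expected

print(f"chi-squared = {chi2:.2f}")  # compare to a chi-squared(df=1) distribution
```
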
46
Q

How does logistic regression differ from linear regression?

A

Logistic regression models the probability of a binary outcome using a logit link, whereas linear regression predicts a continuous outcome using a linear relationship.

47
Q

Why is it important to assess residuals after fitting a regression model?

A

Checking residuals helps verify model assumptions—such as normality, homoscedasticity, and independence—and identify potential outliers or influential observations.

48
Q

Define overfitting in the context of model building.

A

Overfitting occurs when a model captures noise or random fluctuations in the training data, performing well on that data but poorly on new, unseen data.

49
Q

What strategies can help prevent overfitting?

A

Use simpler models, cross-validation, regularization techniques (e.g., Lasso, Ridge), and ensure adequate sample size relative to the number of predictors.

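Of those strategies, cross-validation is the most mechanical; its core is just splitting the data into k folds so every point is held out exactly once. This sketch generates the index splits only (no modeling library assumed); the function name is hypothetical.

```python
# Sketch: plain k-fold index splitting, the core of cross-validation
# used to detect overfitting (no ML library assumed).
def k_fold_indices(n: int, k: int):
    """Yield (train_idx, test_idx) pairs so each point is held out once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds))   # 5 folds
print(folds[0])     # first (train, test) split
```
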
50
Q

Why should descriptive plots always accompany numerical summaries in data analysis?

A

Because visualizations (e.g., histograms, scatterplots) reveal patterns, outliers, and distribution shapes that numerical summaries alone might obscure.