Statistics Flashcards
(36 cards)
What is a p-value?
The probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true.
What does a 95% confidence interval mean?
If the study were repeated many times, about 95% of the intervals constructed this way would contain the true population parameter.
When should you use a t-test?
When comparing the means of two groups with approximately normally distributed data, especially when samples are small and the population variance is unknown.
What is ANOVA and when is it used?
Analysis of variance; used to compare means across ≥3 groups.
What is multicollinearity?
When two or more predictors in a regression model are highly correlated.
Define statistical power.
The probability of correctly rejecting the null hypothesis when it is false.
What is Bonferroni correction?
Dividing the significance threshold (α) by the number of comparisons to control the family-wise error rate, i.e., the chance of at least one false positive.
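A minimal sketch of the arithmetic on this card, using the standard library only. The function names and the p-values are made up for illustration.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-test significance threshold: alpha divided by the number of comparisons."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four comparisons (illustrative only).
example_p_values = [0.001, 0.012, 0.03, 0.2]
flags = bonferroni_significant(example_p_values)
```

With four comparisons and α = 0.05, the per-test threshold is 0.0125, so only the first two p-values in this example remain significant.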
What is FDR?
False Discovery Rate: the expected proportion of false positives among the results declared significant.
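The card does not name a procedure. One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure, sketched below as an assumption rather than as the deck's intended method. The function name and p-values are illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q.

    Step-up rule: find the largest rank k (1-based, ascending p-values) with
    p_(k) <= k * q / m, then reject every hypothesis ranked k or lower.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p

    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values (illustrative only).
example = benjamini_hochberg([0.001, 0.012, 0.03, 0.2])
```

In this example the first three p-values fall below their rank-based thresholds (0.0125, 0.025, 0.0375), so three results are declared significant.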
What is overfitting?
When a model learns noise rather than the true pattern, leading to poor generalization.
What is cross-validation?
A method to assess model generalizability by splitting data into training and testing subsets.
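As a rough illustration of the splitting step, here is a k-fold scheme, one common variant that the card itself does not specify. The helper name is hypothetical, and the sketch only produces index splits; fitting and scoring a model is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each sample lands in exactly one test fold; the remaining folds form
    the training set for that round.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # interleaved, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten hypothetical samples split into five folds.
splits = list(k_fold_indices(10, 5))
```

Each of the five rounds trains on eight samples and tests on the two held out, so every sample is used for testing exactly once.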
Why is scaling important in metabolomics?
Metabolite concentrations vary widely; scaling ensures each variable contributes equally to the analysis.
Difference between autoscaling and pareto scaling?
Autoscaling: mean-centering + divide by SD. Pareto: mean-centering + divide by √SD. Pareto shrinks large variances less aggressively, so the scaled data stay closer to the original measurements.
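A small sketch of the two scalings applied to a single metabolite, assuming the sample standard deviation (n - 1 denominator) is used. The function names and intensities are made up for illustration.

```python
import math
import statistics


def autoscale(values):
    """Mean-center, then divide by the standard deviation (unit variance)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1)
    return [(v - mean) / sd for v in values]


def pareto_scale(values):
    """Mean-center, then divide by the square root of the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1)
    return [(v - mean) / math.sqrt(sd) for v in values]


# Hypothetical intensities for one metabolite across three samples.
intensities = [10.0, 20.0, 30.0]
auto = autoscale(intensities)
pareto = pareto_scale(intensities)
```

After autoscaling the variable has SD 1; after Pareto scaling its SD is √SD of the original, so high-variance metabolites still carry somewhat more weight than low-variance ones.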
What is normalization in metabolomics?
Adjusting data to correct for differences in sample amount or dilution, instrument drift, or batch effects (e.g., total area, internal standards).
What is VIP score in PLS-DA?
Variable Importance in Projection; indicates the contribution of each variable to the model.
How is model validity assessed in PLS-DA?
Using cross-validation, permutation tests, and metrics like R², Q².
What is R² vs Q²?
R²: explained variance. Q²: predictive ability (cross-validated). High Q² indicates a robust model; a Q² far below R² suggests overfitting.
Why are univariate and multivariate analyses both used?
Univariate pinpoints individual features; multivariate captures patterns and correlations among metabolites.
What is a volcano plot?
A scatterplot of –log10(p-value) vs log2(fold change) to visualize significant metabolite differences.
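The two axes can be computed as follows. This is a sketch only: the function names are hypothetical, and the 2-fold and p < 0.05 cutoffs are common conventions assumed here, not values from the card.

```python
import math


def volcano_point(fold_change, p_value):
    """Return (x, y) = (log2 fold change, -log10 p-value) for one metabolite."""
    return math.log2(fold_change), -math.log10(p_value)


def is_hit(fold_change, p_value, fc_cutoff=2.0, p_cutoff=0.05):
    """True if the point lies in an upper-left or upper-right region of the plot.

    Cutoffs are illustrative assumptions.
    """
    x, y = volcano_point(fold_change, p_value)
    return abs(x) >= math.log2(fc_cutoff) and y >= -math.log10(p_cutoff)


# Hypothetical metabolite: 4-fold increase with p = 0.01.
x, y = volcano_point(4.0, 0.01)
```

Taking the absolute value of the log2 fold change treats increases and decreases symmetrically, which is why a 4-fold drop (fold change 0.25) sits as far from the center as a 4-fold rise.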
What is hierarchical clustering?
Unsupervised method grouping samples or variables based on similarity (distance metrics).
What is k-means clustering?
Partitions data into k clusters by minimizing within-cluster variance.
What is a heatmap used for in metabolomics?
To visualize patterns across samples and features; often clustered by similarity.
What is metabolite identification confidence level?
Level 1 (confirmed with standards) to Level 4 (unknown), as per MSI (Metabolomics Standards Initiative).
What is ROC analysis used for?
Evaluating diagnostic accuracy (AUC, sensitivity, specificity).
What is bootstrapping?
A resampling method to estimate the variability of a statistic.
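A minimal sketch of the idea, estimating the standard error of a statistic by resampling with replacement. The function name, the default of 2000 resamples, and the data are assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_se(sample, statistic=statistics.mean, n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by bootstrap resampling.

    Each resample draws len(sample) values from the sample with replacement;
    the spread of the recomputed statistic approximates its variability.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)


# Hypothetical measurements (illustrative only).
se_of_mean = bootstrap_se([1.0, 2.0, 3.0, 4.0, 5.0])
```

Passing a different function as `statistic` (for example `statistics.median`) gives the variability of that statistic instead, which is useful when no simple formula for the standard error exists.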