HC 7 - Metabolomics Data Analysis 2: Biomarkers & Validation Flashcards
Lecture 7 (40 cards)
PCA scores and loadings
Scores: describe X
Loadings: the variables that are important for the differences between samples
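As a compact reminder of how scores and loadings fit together, here is the standard PCA decomposition. The symbols T (scores, one row per sample), P (loadings, one row per variable) and E (residuals) are notation assumed for this sketch and are not taken from the lecture.

```latex
X = T P^{\top} + E
```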
DPLS scores and regression coefficients
PLS scores: describe X and predict Y
PLS regression coefficients: the variables that are important for discrimination
After a biomarker pattern is found, what is done next?
> Statistical validation, repeating the steps from clean data to biomarker pattern
Biological and/or external validation
Biological validation
New experiments
> PLS calculates scores and regression coefficients, but is the analysis correct?
PLS model: multivariate model
Uses the coherence (correlation) between the variables to build a model that discriminates between classes
> You need a PLS method that describes the coherence between the variables
> You want a situation with fewer variables than groups
> The variables are combined into scores
Optimistic bias in statistical models
All statistical models are too optimistic because their parameters are estimated from the same data that was used to build the model
Validation of model principle
Can the classification model predict the health status of new individuals?
Statistical multivariate model validation approaches
-Average prediction error
-Statistical significance (p-value) of multivariate model
Main points: average prediction error
-Confidence intervals for new samples
-Separate blinded test set
-For a small number of samples: cross-validation. Keep one individual out, build the model on the rest, and use it to predict that individual.
Main points: statistical significance of the multivariate model
-H0 distribution: is the difference larger than expected for equal groups?
-Permutations: assign each individual to a class at random; does the prediction model differ from this situation?
Average prediction error: test set validation
Use the b-PLS coefficients to predict the class of a test set whose classes are known, then measure the prediction error on that test set.
(e.g. y-pred is 0.3 with a threshold of 0.5, but the individual is healthy and should be 1)
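A minimal sketch of test set validation, assuming a linear prediction from the coefficients plus an offset. The coefficient values, the offset and the data are made up for illustration; in practice the coefficients come from the model fitted on the training set.

```python
# Hypothetical values: in practice these come from the PLS model fitted on the training set.
b_pls = [0.8, -0.5]      # one regression coefficient per variable (illustrative)
offset = 0.4             # illustrative intercept
threshold = 0.5          # y-pred at or above the threshold is classified as 1

# Test set with known classes (1 = healthy, 0 = patient), also illustrative.
X_test = [[0.6, 0.3], [0.1, 0.9], [0.5, 0.2], [0.0, 0.2]]
y_true = [1, 0, 1, 1]

def predict(x):
    """Linear prediction: offset plus the sum of coefficient * variable."""
    return offset + sum(b * v for b, v in zip(b_pls, x))

y_pred = [predict(x) for x in X_test]
y_class = [1 if p >= threshold else 0 for p in y_pred]

# The last individual gets y-pred of about 0.3, so it is classified as 0 although it is healthy.
n_errors = sum(c != t for c, t in zip(y_class, y_true))
error_rate = n_errors / len(y_true)
print(y_class, n_errors, error_rate)
```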
Summarize prediction errors in a confusion table. How?
The columns give the true condition (positive and negative, e.g. patient and healthy) and the rows give the prediction (positive and negative).
> So the cell with a negative true condition and a positive prediction holds the false positives, and so on
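The layout described above can be sketched as a small counting function. The labels are illustrative (1 = positive, 0 = negative) and the function name is hypothetical.

```python
def confusion_table(y_true, y_pred):
    """Count the four cells of a binary confusion table (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            counts["TP"] += 1   # predicted positive, truly positive
        elif p == 1 and t == 0:
            counts["FP"] += 1   # predicted positive, truly negative
        elif p == 0 and t == 1:
            counts["FN"] += 1   # predicted negative, truly positive
        else:
            counts["TN"] += 1   # predicted negative, truly negative
    return counts

# Illustrative labels, not real data.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
table = confusion_table(y_true, y_pred)

# Rows = prediction, columns = true condition, as on the card.
print("               true +   true -")
print(f"predicted +   {table['TP']:>6}   {table['FP']:>6}")
print(f"predicted -   {table['FN']:>6}   {table['TN']:>6}")
```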
Sensitivity
True Positives / (True Positives + False Negatives)
Specificity
True Negatives / (True Negatives + False Positives)
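Both formulas as a short worked example. The counts are invented for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of truly positive individuals that are predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of truly negative individuals that are predicted negative."""
    return tn / (tn + fp)

# Illustrative counts from a confusion table.
tp, fn, tn, fp = 18, 2, 45, 5
sens = sensitivity(tp, fn)   # 18 / 20
spec = specificity(tn, fp)   # 45 / 50
print(sens, spec)
```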
Cross validation
For a small sample size, divide the samples into Xtest and Xtrain
> Build an optimized model M with Xtrain (M = bPLS coefficients)
> Use M to predict Y for Xtest
> Repeat with different Xtrain and Xtest until each sample has been in a test set
> Count the number of misclassifications
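The loop above can be sketched as leave-one-out cross-validation. To keep the example short and dependency-free, the model M here is a nearest-class-mean rule on a single variable, which is a stand-in and NOT a PLS model; the data and function names are illustrative.

```python
def fit_class_means(x_train, y_train):
    """Stand-in for model M: the mean of each class (not a PLS model)."""
    means = {}
    for label in set(y_train):
        values = [x for x, y in zip(x_train, y_train) if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict(model, x):
    """Assign the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def leave_one_out_errors(x, y):
    """Each sample is the test set once; the model is rebuilt on the rest."""
    errors = 0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit_class_means(x_train, y_train)
        if predict(model, x[i]) != y[i]:
            errors += 1
    return errors

# Illustrative data: the last sample has class 1 but looks like class 0.
x = [1.0, 1.2, 0.9, 1.1, 2.9, 3.1, 3.0, 1.3]
y = [0, 0, 0, 0, 1, 1, 1, 1]
n_misclassified = leave_one_out_errors(x, y)
print(n_misclassified)
```

The important design point is that the left-out sample never contributes to the model that predicts it.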
When using your training set as a test set, then …
there is a biased prediction (too optimistic for new data)
Statistical significance validation
-Make permutations of class labels (or y-values)
-Build new models between X and the permuted y; these represent situations where there is no link between X and y
-Compare the original prediction error with the prediction errors of many models built on permuted data
-Calculate the p-value
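The permutation steps can be sketched as follows. As a simplification, the model is a nearest-class-mean rule evaluated with leave-one-out cross-validation (a stand-in, not PLS); the data, the seed and the number of permutations are arbitrary choices for illustration.

```python
import random

def loo_errors(x, y):
    """Leave-one-out misclassifications of a nearest-class-mean stand-in model."""
    errors = 0
    for i in range(len(x)):
        train = [(xv, yv) for j, (xv, yv) in enumerate(zip(x, y)) if j != i]
        means = {}
        for label in (0, 1):
            values = [xv for xv, yv in train if yv == label]
            means[label] = sum(values) / len(values)
        predicted = min(means, key=lambda label: abs(x[i] - means[label]))
        errors += predicted != y[i]
    return errors

# Illustrative data with a clear class difference.
x = [1.0, 1.2, 0.9, 1.1, 1.4, 2.9, 3.1, 3.0, 2.7, 3.2]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

original_errors = loo_errors(x, y)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
permuted_errors = []
for _ in range(200):
    y_perm = y[:]
    rng.shuffle(y_perm)         # break the link between X and y
    permuted_errors.append(loo_errors(x, y_perm))

# Assumption: a tie with the original counts as "at least as good".
n_as_good = sum(e <= original_errors for e in permuted_errors)
print(original_errors, n_as_good)
```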
How to calculate the p-value for the statistical test that validates the model
p < (1 + number of permutation models better than the original) / number of permutations
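A worked example of this formula, with invented error counts. Two assumptions of this sketch: a permutation model that ties with the original is counted as "better", and the denominator is the number of permutations as on the card (a commonly used variant divides by the number of permutations plus 1).

```python
def permutation_p_value(original_error, permuted_errors):
    """(1 + number of permutation models at least as good as the original) / number of permutations."""
    n_as_good = sum(e <= original_error for e in permuted_errors)
    return (1 + n_as_good) / len(permuted_errors)

# Illustrative misclassification counts: 20 permutation models, one as good as the original.
original_error = 2
permuted_errors = [9, 11, 10, 2, 12, 8, 10, 9, 11, 10,
                   9, 10, 12, 8, 11, 10, 9, 13, 10, 11]
p = permutation_p_value(original_error, permuted_errors)
print(p)
```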
PLS-DA validation with cross model validation and permutation tests
> Measure misclassifications with cross-validation
> Calculate the p-value with permutation models
The p-value for permutation validation is significant for metabolites which …
are the best indicators for the coefficients and the classes
What is the H0 for permutations?
During permutations, bPLS coefficients are calculated for the meaningless permutation models. The H0 distribution is built from the numbers of misclassifications of these models.
What are the expected bPLS values for the permutation models?
There is no relationship between X and y, so no variable has an effect and the coefficients are expected to vary around 0
If the variable (metabolite/biomarker) is important, then the bPLS coefficients from the model should be …. than the bPLS coefficients from permutation
Larger or smaller (further from 0 in either direction).
Picture a plot with the variables, ranked by their permuted coefficients, on the x-axis and the coefficients on the y-axis. Black lines form a sideways parabola around 0, a spiky red curve runs between them, and red circles mark where the red line crosses a black line. What could this mean?
-The black lines indicate the 95% confidence interval of H0 from the permutation tests (H0: PLS reg coeff = 0)
-The red line: PLS Reg Coeff from the original model, per variable
-Red circles: points where the original model falls outside the 95% CI of the permutation model PLS Reg Coeff and is therefore significantly different from the bPLS Reg Coeffs of the permutation test. These variables are good biomarkers
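A minimal sketch of the idea behind the plot: build a 95% interval per variable from permuted models and flag the variables whose original coefficient falls outside it. As a stand-in for the bPLS coefficients, this uses the sum of centered x times centered y per variable, which is proportional to the first PLS weight vector; it is NOT a full PLS fit. The data, seed and number of permutations are illustrative.

```python
import random

def covariance_weights(X, y):
    """Per variable: sum of centered x times centered y (stand-in for bPLS coefficients)."""
    n = len(X)
    y_mean = sum(y) / n
    weights = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        x_mean = sum(column) / n
        weights.append(sum((v - x_mean) * (t - y_mean) for v, t in zip(column, y)))
    return weights

# Toy data: variable 0 separates the classes, variable 1 has the same values in both classes.
X = [[(1.0 if i < 10 else 5.0) + 0.1 * (i % 10), float((i * 7) % 5)] for i in range(20)]
y = [0] * 10 + [1] * 10

original = covariance_weights(X, y)

rng = random.Random(1)          # fixed seed so the sketch is repeatable
n_perm = 500
permuted = []
for _ in range(n_perm):
    y_perm = y[:]
    rng.shuffle(y_perm)         # H0: no link between X and y
    permuted.append(covariance_weights(X, y_perm))

significant = []
for j, coefficient in enumerate(original):
    null = sorted(p[j] for p in permuted)
    lower = null[int(0.025 * n_perm)]   # lower black line
    upper = null[int(0.975 * n_perm)]   # upper black line
    significant.append(coefficient < lower or coefficient > upper)  # red circle

print(significant)
```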