HC 7 - Metabolomics Data Analysis 2: Biomarkers & Validation Flashcards by Tobias H

PCA scores and loadings

Scores: describe X
Loadings: the variables that are important for differences between samples

DPLS scores and regression coefficients

PLS scores: describe X and predict Y
PLS regression coefficients: important variables for discrimination

After a biomarker pattern is found, what is done next?

> Statistical validation: repeat the steps from clean data to biomarker pattern
> Biological and/or external validation

Biological validation

New experiments
> PLS calculates scores and regression coefficients, but is the analysis correct?

PLS model: multivariate model

Use coherence to make a model for discrimination of classes
> you need a PLS method which describes the coherence between the variables
> you want a situation with fewer variables than samples
> the variables are combined into scores

Optimistic bias in statistical models

All statistical models are too optimistic because their parameters are estimated from the same data that were used to create the model

Principle of model validation

Can the classification model predict the health status of new individuals?

Statistical multivariate model validation approaches

-Average prediction error
-Statistical significance (p-value) of multivariate model

Main points: average prediction error

-Confidence intervals for new samples
-Separate blinded test set
-For a small number of samples: cross-validation (leave one individual out and use the model built on the rest to predict it)

Main points: statistical significance of multivariate model

-H0 distribution: is the difference larger than expected for equal groups?
-Permutations: assign each individual to a class at random; does the prediction model differ from this situation?

Average prediction error: test set validation

Use the bPLS coefficients to predict the class of a test set for which the class is known. Measure the prediction error on the test set.
(e.g. the y-pred is 0.3 for a healthy individual who should be 1; with a threshold of 0.5 this sample is misclassified)
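
The test-set check can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the function names, the `intercept` argument and the example numbers are hypothetical; only the 0.5 threshold and the healthy = 1 coding come from the card.

```python
def predict_y(X_test, b_pls, intercept=0.0):
    """Predicted y per test sample: intercept + sum_j x_j * b_j."""
    return [intercept + sum(x * b for x, b in zip(row, b_pls)) for row in X_test]


def classify(y_pred, threshold=0.5):
    """Assign class 1 when the predicted y reaches the threshold, else class 0."""
    return [1 if value >= threshold else 0 for value in y_pred]


def prediction_error(y_true, y_pred, threshold=0.5):
    """Fraction of test samples whose predicted class differs from the known class."""
    predicted = classify(y_pred, threshold)
    wrong = sum(p != t for p, t in zip(predicted, y_true))
    return wrong / len(y_true)


# Hypothetical test set: known classes and predicted y values.
y_true = [1, 1, 0, 0]
y_pred = [0.3, 0.8, 0.1, 0.6]  # first sample: y-pred 0.3 but true class 1
print(prediction_error(y_true, y_pred))
```

In practice `y_pred` would come from `predict_y` applied to the measured test-set data with the coefficients of the trained model.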

Summarize prediction errors in a confusion table. How?

The columns give the true condition (positive and negative, e.g. patient and healthy) and the rows give the prediction (positive and negative).
> So the cell with a negative true condition and a positive prediction holds the false positives, and so on

Sensitivity

True Positives / (True Positives + False Negatives)

Specificity

True Negatives / (True negatives + False positives)
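
As a small worked example, the confusion table counts and the sensitivity and specificity formulas can be computed as below. The function names and the example labels are assumptions made for illustration (1 = positive, 0 = negative).

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN from true and predicted class labels (1 or 0)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["FN" if truth == 1 else "TN"] += 1
    return counts


def sensitivity(counts):
    """True Positives / (True Positives + False Negatives)."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


def specificity(counts):
    """True Negatives / (True Negatives + False Positives)."""
    return counts["TN"] / (counts["TN"] + counts["FP"])


# Hypothetical example: 4 true positives-class samples, 6 true negatives-class samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
table = confusion_counts(y_true, y_pred)
print(table, sensitivity(table), specificity(table))
```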

Cross validation

For a small sample size, divide the samples into Xtest and Xtrain
> make an optimized model M with Xtrain (M = bPLS coefficients)
> Use M to predict Y for Xtest
> Repeat with different Xtrain and Xtest until each sample has been in a test set
> Measure the number of misclassifications

When using your training set as a test set, then …

there is a biased prediction (too optimistic for new data)

Statistical significance validation

-Make permutations of the class labels (or y-values)
-Make new models between X and the permuted y; these represent situations where there is no link between X and y
-Compare the original prediction error with the prediction errors of many models of permuted data
-Calculate the p-value
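
The permutation steps above can be sketched as follows. This is a hedged illustration, not the course's code: `loo_errors` uses a nearest-class-mean classifier as a simple stand-in for the PLS-DA model, and all names, the seed and the data are assumptions. The p-value follows this deck's estimate, (1 + number of permutation models better than the original) / number of permutations.

```python
import random


def loo_errors(X, y):
    """Leave-one-out misclassifications of a nearest-class-mean classifier
    (a stand-in for the PLS-DA model, chosen only to keep the sketch short)."""
    errors = 0
    for i in range(len(X)):
        train = [(x, c) for j, (x, c) in enumerate(zip(X, y)) if j != i]
        means = {}
        for label in {c for _, c in train}:
            rows = [x for x, c in train if c == label]
            means[label] = [sum(col) / len(rows) for col in zip(*rows)]
        nearest = min(
            means,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(X[i], means[k])),
        )
        errors += nearest != y[i]
    return errors


def permutation_p_value(X, y, error_fn, n_permutations=200, seed=0):
    """Estimate p as (1 + permutation models better than original) / permutations."""
    rng = random.Random(seed)
    original_error = error_fn(X, y)
    n_better = 0
    for _ in range(n_permutations):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break any link between X and y
        if error_fn(X, y_perm) < original_error:
            n_better += 1
    return (1 + n_better) / n_permutations


# Hypothetical data: one metabolite, two clearly separated classes.
X = [[1.0], [1.2], [0.9], [1.1], [1.3], [3.0], [3.1], [2.8], [2.9], [3.2]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(permutation_p_value(X, y, loo_errors))
```

Counting only strictly better permutation models follows the wording of the card; counting ties as well gives a more conservative p-value.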

How to calculate the p-value for the statistical test used to validate the model

p < (1 + number of permutation models that predict better than the original model) / (total number of permutations)

PLS-DA validation with cross model validation and permutation tests

> Measure the number of misclassifications with cross-validation
> Calculate the p-value with permutation models

The p-value for permutation validation is significant for metabolites which …

are the best indicators for the coefficients and the classes

What is the H0 for permutations?

For each permutation, bPLS coefficients are estimated for a meaningless model. The H0 distribution is built from the numbers of misclassifications of these permutation models

What are the expected bPLS values for the permutation models?

In the permutation models there is no relationship between X and y, so no variable has an effect and the bPLS coefficients are expected to vary around 0

If the variable (metabolite/biomarker) is important, then the bPLS coefficients from the model should be ... than the bPLS coefficients from permutation

Larger or smaller (more extreme, i.e. further from 0).

Visualize a plot with the variables, ranked by permuted coefficients, on the x-axis and the coefficients on the y-axis. Black lines form a sideways parabola around 0, with a spiky red curve in between and red circles where the red line crosses the black lines. What could this mean?

-The black lines indicate the 95% confidence interval of H0 from the permutation tests (H0: PLS regression coefficient = 0)
-The red line: PLS regression coefficient from the original model, per variable
-Red circles: points where the original model falls outside the 95% CI of the permutation-model PLS regression coefficients and therefore differs significantly from them. These variables are good biomarkers

Explain the following concepts in one sentence: Cross validation, training set, test set, permutation test

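-Cross validation: with a small number of samples, each sample is left out in turn and predicted by a model built on the rest
-Training set: used to create models (estimate loadings/coefficients)
-Test set: new data used to test the PLS regression coefficients by predicting the class of new samples
-Permutation test: random reordering of the class labels to obtain the H0 situation with no link between X and y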

Do you make a new model every time when you take individuals from the data for prediction in cross validation?

Yes, with cross-validation you create a new model every time, and the training set is slightly different each time because different samples are left out. You take out one or more samples as the test set and use the rest as the training set. Rather than cross-validation, a good independent test cohort is preferred, in which the model is validated on individuals other than those in the training set.
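
A minimal leave-one-out sketch makes the "new model every time" point concrete. It assumes a hand-written single-component PLS regression as a stand-in for the optimized model (a real analysis would tune the number of components inside the training set); the function names, the 0.5 threshold on a 0/1 coded y and the data are all illustrative assumptions.

```python
import math


def fit_pls1(X, y):
    """Fit a one-component PLS regression; returns (x_mean, y_mean, b_pls)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    yc = [v - y_mean for v in y]                                 # centered y
    # Weight vector: direction in X with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    w = [v / norm for v in w]
    # Scores t = Xc w, then regress y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t)
    q = sum(t[i] * yc[i] for i in range(n)) / tt if tt else 0.0
    return x_mean, y_mean, [wj * q for wj in w]


def predict(model, x):
    """Predicted y for one sample using the fitted means and coefficients."""
    x_mean, y_mean, b = model
    return y_mean + sum((xj - mj) * bj for xj, mj, bj in zip(x, x_mean, b))


def loo_misclassifications(X, y, threshold=0.5):
    """Leave each sample out once, refit on the rest, and count wrong classes."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit_pls1(X_train, y_train)  # a new model for every left-out sample
        predicted_class = 1 if predict(model, X[i]) >= threshold else 0
        errors += predicted_class != y[i]
    return errors


# Hypothetical data: two metabolites, three samples per class.
X = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [3.0, 0.2], [3.1, 0.1], [2.8, 0.3]]
y = [0, 0, 0, 1, 1, 1]
print(loo_misclassifications(X, y))
```

The centering and the coefficients are recomputed from the training samples only, so the left-out sample never influences the model that predicts it.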

PCA loadings (P) give information about

The most important variables to describe variation between samples in the data

DPLS: bPLS coefficients give most information on

Important variables for discrimination

Data interpretation, what does it mean if a certain group of metabolites is more abundant in healthy or disease condition?

Then they may be part of a network, or pathway > pathway analysis > network visualization > Comprehensive metabolite databases

Let's say you have a cumulative predictive accuracy list, and the top metabolite is maltose. What happens if we remove maltose from the prediction model?

The predictive accuracy becomes a lot smaller

To identify groups of metabolites which are very different between the groups and therefore predictive, what can be done with the cumulative prediction accuracy list?

Give metabolites color codes for different metabolite groups, like glucose metabolism for maltose

How do you recognize a specific over-represented group of metabolites in the cumulative accuracy list?

Many metabolites of the same group appear at the top of the list

For biological interpretation of the group of biomarkers, it is important to change interpretation from ... to ...

single metabolites to groups or pathways like biochemical pathways

Over-representation analysis input and output

Use a list of significant metabolites (not ordered) > metabolite set enrichment > over-representation analysis > result: biological processes which are different between the groups

Metabolite set enrichment

Give weights to metabolites based on relevance (fold change or importance)

Single Sample Profiling

-Check whether the metabolite concentrations fall in the *normal* range; if not, compare the deviation pattern to *known* causes

Over-representation analysis uses a .... distribution

hypergeometric

Binomial coefficient

n over k > in how many ways can I select k objects from a group of n? (n choose k) = n! / (k! (n-k)!)

Over-representation analysis using the hypergeometric distribution. Take 52 metabolites divided into 4 pathway groups (A-D) of 13 metabolites each. If you select 12 metabolites, what is the probability that 10 are from pathway A (the number that was actually measured)?

Probability using binomial coefficients = ( (13 choose 10) * (39 choose 2) ) / (52 choose 12)
-(13 choose 10): 10 metabolites from the 13 possible in A
-(39 choose 2): 2 metabolites from the 39 of B-D
-(52 choose 12): all possibilities of selecting 12 from 52
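
The calculation can be checked numerically with Python's `math.comb`. The function names below are assumptions made for this sketch. The second function sums the probabilities for 10 or more hits, which is the one-sided tail an over-representation test commonly reports; the card itself only asks for the probability of exactly 10.

```python
from math import comb


def hypergeom_pmf(k, group_size, total, n_selected):
    """Probability of exactly k hits from a group when selecting without replacement."""
    return (
        comb(group_size, k)
        * comb(total - group_size, n_selected - k)
        / comb(total, n_selected)
    )


def hypergeom_tail(k, group_size, total, n_selected):
    """Probability of k or more hits from the group."""
    upper = min(group_size, n_selected)
    return sum(
        hypergeom_pmf(i, group_size, total, n_selected) for i in range(k, upper + 1)
    )


# (13 choose 10) * (39 choose 2) / (52 choose 12)
p_exact = hypergeom_pmf(10, 13, 52, 12)
p_tail = hypergeom_tail(10, 13, 52, 12)
print(p_exact, p_tail)
```

Both values are on the order of 10^-6, so drawing 10 of 12 metabolites from pathway A by chance is very unlikely.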

If the probability from the over-representation is very small, then ...

there is a systematic effect between the two groups, which is important. The probability is too low to attribute the result to chance, so the disease (or class) is related to this metabolite group > pathway A differs between the two groups

Lecture 7 (40 cards)