Metabolomics 5 - Basic statistics Flashcards

1
Q

Omics Data Analysis

A

DATA PROCESSING & QC
Omics data
- NMR
- Mass spectrometry

STATISTICAL ANALYSIS AND VISUALIZATION
- Comparison
- Clustering
- Classification

FUNCTIONAL INTERPRETATION
Omics-specific
- enrichment analysis
- pathway analysis

UNIQUE FUNCTIONS
Field specific
- dose response
- biomarker analysis

2
Q

Types of Metabolomics Data

A

Raw data (fingerprinting)
* No information on metabolites
* Use raw NMR spectra or MS data
* Long-time standard in NMR
* Goal: Derive classes and identify markers
* STOCSY for correlation tests

Metabolite concentrations (lists of compounds with concentration values)
* MS and NMR analyses now produce lists of metabolite concentrations
* Concentrations can be used for univariate tests
* Concentrations can also be used in very specific profiling
* Correlations and covariances are commonly used

3
Q

Common Terms

A

Dimension
* The number of variables (metabolites, peaks)

Univariate:
* Analysing one variable per subject

Multivariate
- Analysing many variables per subject
- Omics data are usually high-dimensional data

4
Q

Basic statistical terms

A

Mean
- the sum of all values divided by the number of values
- synonyms: average

Median
- the value that one-half of the data lies above and below
- synonyms: 50th percentile

Variance
- the sum of squared deviations from the mean divided by n-1 where n is the number of data values
- synonyms: mean squared error

Order statistics
* Metrics based on the data values sorted from smallest to biggest.
* Synonyms: ranks

Percentile
* The value such that P percent of the values take on this value or less and (100–P) percent take on this value or more.
* Synonyms: quantile

Interquartile range
- The difference between the 75th percentile and the 25th percentile.
- Synonyms: IQR

5
Q

Variance

A

Why square?
- Eliminates negatives
- Parabolic behaviour: increasing contribution further from the mean

Standard deviation:
- stddev = sqrt(variance)
- Shows variation about the mean, in the same units as the original data

The standard deviation is much easier to interpret than the variance since it is on the same scale as the original data. Still, with its more complicated and less intuitive formula, it might seem peculiar that the standard deviation is preferred in statistics over the mean absolute deviation. It owes its preeminence to statistical theory: mathematically, working with squared values is much more convenient.

Why divide by (n-1), not n?
- If you knew the sample mean, only n-1 of the data values could vary freely; the last one is fixed
=> n-1 degrees of freedom
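For illustration, a minimal MATLAB sketch of these formulas (the data vector is made up; note that the built-ins var and std already use the n-1 denominator):

x = [4 8 6 5 3 7];                     % made-up data values
n = numel(x);
v = sum((x - mean(x)).^2) / (n - 1);   % sample variance: squared deviations, divided by n-1
s = sqrt(v);                           % standard deviation: same units as the data
% var(x) and std(x) return the same values, as both default to the n-1 denominator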

6
Q

Box-and-whisker plot

A
  • The 1st quartile Q1 is the value for which 25% of the observations are smaller and 75% are larger
  • Q2 is the same as median (50% are smaller and 50% larger)
  • Q3 only 25% of the observations are larger
  • Inter Quartile Range (IQR) is Q3-Q1. It covers 50% of the observations
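A short MATLAB illustration (quantile, iqr and boxplot are in the Statistics and Machine Learning Toolbox; the sample is made up):

x = randn(100, 1);                  % made-up sample
q = quantile(x, [0.25 0.50 0.75]);  % Q1, Q2 (median), Q3
w = q(3) - q(1);                    % IQR, the same value iqr(x) returns
boxplot(x)                          % draws the box-and-whisker plot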
7
Q

Percentiles

A

In general the nth percentile is a value such that n% of the observations fall at or below it

Q1 = 25th percentile
Median (Q2) = 50th percentile
Q3 = 75th percentile

8
Q

Other common distributions

A
  • unimodal
  • bimodal
  • skewed
9
Q

Mean vs median - which is best?

A
  • Mean is best for symmetric distributions without outliers
  • Median is useful for skewed distributions or data with outliers
10
Q

From samples to populations

A

So how do we know whether the effect observed in our sample was genuine?
- We don’t
Instead we use p values to indicate our level of uncertainty that our results represent a genuine effect present in the whole population

11
Q

p-Values

A
  • P-value = the probability of obtaining the observed result (or a more extreme one) by chance, i.e. assuming the null hypothesis H0 is true
  • If that probability (p-value) is small, it suggests the observed result cannot easily be explained by chance
  • Because the p-value is computed assuming the null hypothesis is correct, a small value lets you reject the null hypothesis in favour of the alternative hypothesis
  • A large p-value typically means that the measured data align with the null hypothesis, making it the more likely explanation
12
Q

Hypothesis Testing

A

The null hypothesis H0
- States that there is no statistically significant relationship between an observed result and the data set to which it belongs
- There is no difference between the case and control groups
- H0: μ1-μ2 = 0

The alternative hypothesis
- Opposite of the null hypothesis
- Hypothesis with statistical significance
- Generally the hypothesis that is believed by the researcher
- HA: μ1-μ2 ≠ 0

13
Q

p Values and level of significance

A
  • Level of significance, ɑ: specified to define the rejection region
  • Rejection region: all values for which H0 will be rejected (outside the red lines in the plot)
  • P-value: probability of an observed result (or a more extreme one) assuming that the null hypothesis is true
  • Between the lines lies 95% of the distribution: in other words, there is a 95% probability that each time we measure a Brazilian woman, her height will be between 142 and 169 cm
  • How to calculate p-values: add up the areas under the curve in the tails (outside the lines) and divide by the total area under the curve
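For a Gaussian null distribution this tail-area calculation is a one-liner in MATLAB (normcdf is in the Statistics and Machine Learning Toolbox; mu and sigma are hypothetical values chosen so the 95% interval matches the 142-169 cm example):

mu = 155.5; sigma = 6.9;                 % hypothetical Gaussian fit to the heights
x  = 142;                                % observed value
p  = 2 * normcdf(-abs(x - mu) / sigma);  % two-sided p-value: area of both tails (~0.05)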
14
Q

Empirical p-values

A
  • Parametric: p-values are based on well-defined models, e.g. Gaussian or Poisson distributions
  • What if we don’t know the distribution?
    -> The only thing we know is that the data does not follow a normal distribution
  • We can find out the null distribution from the data itself, then calculate the p-value
    -> Also known as empirical p-values
15
Q

One sample t-test

A
  • The one-sample t-test is used to compare the mean m of one sample to a known standard (or theoretical/hypothetical) mean (μ)
  • t = (m - μ) / (s / sqrt(n)), where m = sample mean, n = sample size, μ = theoretical mean, s = standard deviation

Research question:
- whether the mean (m) of the sample is equal to the theoretical mean (μ)?
- whether the mean (m) of the sample is less or greater than the theoretical mean (μ)?
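A small MATLAB sketch of the hand calculation next to the built-in test (ttest is in the Statistics and Machine Learning Toolbox; the data and μ are invented):

x  = [5.1 4.9 5.3 5.0 5.2 4.8];                  % made-up sample
mu = 5;                                          % theoretical mean
t  = (mean(x) - mu) / (std(x) / sqrt(numel(x))); % t-statistic by hand
[h, p] = ttest(x, mu);                           % built-in two-sided one-sample t-test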

16
Q

Two samples t-test -> unpaired t-test

A
  • The unpaired two-samples t-test is used to compare the mean of two independent groups.
  • Example: Measured weight of 100 individuals: 50 women (group A) and 50 men (group B). We want to know if the mean weight of women (mA) is significantly different from that of men (mB).
    -> Two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t- test to evaluate whether the means are different.
  • Research question:
    -> whether the mean of group A (mA) is equal to the mean of group B (mB)?
    -> whether the mean of group A (mA) is less or greater than the mean of group B (mB)?
  • Classical t-test:
    -> If the variance of the two groups are equivalent (homoscedasticity)
17
Q

Two samples t-test-> Classical t-test

A
  • If the variances of the two groups are equivalent (homoscedasticity):
    t = (mA - mB) / sqrt(S2/nA + S2/nB)
  • mA and mB represent the mean values of groups A and B, respectively. nA and nB represent the sizes of groups A and B, respectively.
  • S2 is an estimator of the pooled variance of the two groups:
    S2 = [Σ(x - mA)² + Σ(x - mB)²] / (nA + nB - 2)
18
Q

Two samples t-test -> Welch t-statistics

A
  • If the variances of the two groups being compared are different (heteroscedasticity), it's possible to use the Welch t-test, an adaptation of the Student t-test (no pooled variance S): t = (mA - mB) / sqrt(sA²/nA + sB²/nB)
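Both two-sample variants in one MATLAB sketch (ttest2 is in the Statistics and Machine Learning Toolbox; the weight data are invented):

a = 62 + 8*randn(50, 1);                        % made-up weights, group A (women)
b = 78 + 10*randn(50, 1);                       % made-up weights, group B (men)
[h, p]   = ttest2(a, b);                        % classical t-test, pooled variance
[hW, pW] = ttest2(a, b, 'Vartype', 'unequal');  % Welch t-test, no pooled variance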
19
Q

ANOVA test

A
  • The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of the independent two-sample t-test for comparing means in a situation where there are more than two groups.
  • In one-way ANOVA, the data is organised into several groups based on one single grouping variable (also called factor variable).
  • ANOVA test hypotheses:
    -> Null hypothesis: the means of the different groups are the same
    -> Alternative hypothesis: At least one sample mean is not equal to the others.
20
Q

ANOVA test -> What is calculated?

A

Assume that there are 3 groups (A, B, C) to compare:
- Compute the common variance, which is called the variance within samples (S2within) or residual variance
- Compute the mean of each group
- Compute the variance between sample means (S2between)
- Produce the F-statistic as the ratio S2between / S2within
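A MATLAB sketch of the one-way ANOVA (anova1 is in the Statistics and Machine Learning Toolbox; the three groups are invented):

y = [5.0 5.5 6.0  8.0 8.5 9.0  11.5 12.0 12.5]'; % made-up measurements
g = [1 1 1  2 2 2  3 3 3]';                      % group labels for A, B, C
p = anova1(y, g, 'off');                         % p-value of the F-test (figure suppressed)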

21
Q

Correlation

A

Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship.

Strength of relationship:
- the value of the correlation coefficient varies between +1 and -1.
- A value of ± 1 indicates a perfect degree of association between the two variables.
- As the correlation coefficient value goes towards 0, the relationship between the two variables will be weaker.
- The direction of the relationship is indicated by the sign of the coefficient;
- a + sign indicates a positive relationship and
- a – sign indicates a negative relationship.

22
Q

Covariance: The maths

A
  • Data matrix X (n samples, p data points)
  • Covariance matrix C of the mean-centred data: C = X^TX / (n-1)
    … remember matrix multiplication
23
Q

Correlation: The maths

A
  • Data matrix X (n samples, p data points)
  • Correlation matrix C: the same construction as the covariance matrix, but on data that are mean-centred and additionally scaled to unit variance (standardized)
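A hand-rolled MATLAB sketch of both matrices next to the built-ins (the data matrix is invented; the elementwise expansion X - mean(X) needs MATLAB R2016b or newer):

X  = randn(30, 4);            % made-up data: n = 30 samples, p = 4 variables
n  = size(X, 1);
Xc = X - mean(X);             % mean-centre each column
C  = (Xc' * Xc) / (n - 1);    % covariance matrix, same as cov(X)
Z  = Xc ./ std(X);            % additionally scale each column to unit variance
R  = (Z' * Z) / (n - 1);      % correlation matrix, same as corrcoef(X)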
24
Q

Pearson Correlation -> Assumption

A
  • each observation should have a pair of values
  • each variable should be continuous
  • each variable should be normally distributed
  • there should be an absence of outliers
  • it assumes linearity and homoscedasticity
25
Q

Spearman rank correlation
-> Data values are replaced by rank
-> Assumptions

A
  • pairs of observations are independent
  • two variables should be measured on an ordinal, interval or ratio scale
  • it assumes that there is a monotonic relationship between the two variables
26
Q

Correlation with hypothesis testing

A

rho = corr(X) returns a matrix of the pairwise linear correlation coefficient between each pair of columns in the input matrix X.
rho = corr(X,Y) returns a matrix of the pairwise correlation coefficient between each pair of columns in the input matrices X and Y.
[rho,pval] = corr(X,Y) also returns pval, a matrix of p-values for testing the hypothesis of no correlation against the alternative hypothesis of a nonzero correlation.
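Usage sketch (corr is in the Statistics and Machine Learning Toolbox; X is any samples-by-variables matrix, invented here):

X = randn(50, 10);                           % made-up data matrix
[rho, pval]   = corr(X);                     % pairwise Pearson correlations and p-values
[rhoS, pvalS] = corr(X, 'Type', 'Spearman'); % rank-based alternative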

27
Q

Pearson correlation vs Spearman and Kendall correlations

A
  • Non-parametric correlations are less powerful because they use less information in their calculations.
  • Pearson correlation uses information about the mean and the deviations from the mean, while non-parametric correlations use only the ordinal information (ranks) of pairs.
  • For non-parametric correlation, the X and Y values can be continuous or ordinal, and approximate normal distributions for X and Y are not required.
  • Pearson correlation, in contrast, assumes that the distributions of X and Y are normal and that both variables are continuous.
  • Correlation coefficients only measure linear (Pearson) or monotonic (Spearman and Kendall) relationships.

Kendall uses a different ordering algorithm than Spearman
* In the normal case, Kendall correlation is more robust and efficient than Spearman correlation. It means that Kendall correlation is preferred when there are small samples or some outliers.
* Kendall correlation has an O(n^2) computational complexity, compared with O(n log n) for Spearman correlation, where n is the sample size.
* Spearman’s rho usually is larger than Kendall’s tau.
* The interpretation of Kendall’s tau in terms of the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs is very direct.

28
Q

STOCSY

A
  • STOCSY (Statistical Total Correlation Spectroscopy) is commonly used to detect correlated and anti-correlated changes
  • Correlated changes: a second metabolite goes up together with the first
  • Anti-correlated changes: if metabolite 1 goes up, metabolite 2 goes down (e.g. glucose consumption and lactate production in glycolysis)
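A minimal sketch of the core STOCSY idea, assuming X is an n-spectra x p-points matrix and idx a driver-peak position picked by the analyst (both hypothetical; real STOCSY implementations add scaling and colour-coded plotting):

X   = randn(40, 5000);    % stand-in for a matrix of 40 NMR spectra
idx = 1234;               % hypothetical driver-peak column
r   = corr(X(:, idx), X); % correlation of the driver with every spectral point
plot(r)                   % values near +1: correlated; near -1: anti-correlated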
29
Q

Principal component analysis on… Covariance Matrix

A
  • Variables must be in same units
  • Emphasizes variables with most variance
30
Q

Principal component analysis on… Correlation Matrix

A
  • Variables are standardized (mean 0.0, SD 1.0)
  • Variables can be in different units
  • All variables have same impact on analysis
31
Q

Principal component analysis

A
  • PCA finds the largest correlations between different variables within a data set
    -> PCA is closely linked to covariance matrices
  • PCA finds combinations of variables, or factors, that describe major trends in the data
  • Linear transformation of variables so that:
    -> Data is described with a minimal set of new variables (measure for relevance: max. variance)
  • Principal Component Analysis (PCA) is equivalent to fitting an n-dimensional ellipsoid to the data, where the eigenvectors of the covariance matrix of the data set are the axes of the ellipsoid.
  • The eigenvalues represent the distribution of the variance among the eigenvectors.
  • Let's start with an n=2 dimensional example to perform a PCA without the use of the MATLAB function pca, but with the function eig for the calculation of eigenvectors and eigenvalues.
  • Purpose:
    -> explorative data analysis: discover correlations in 2D plots
    -> modelling (e.g. regression) with new data
    -> minimize danger of artifacts
    -> data reduction
    -> eliminate irrelevant information (noise)
32
Q

History of Principal Component Analysis

A

1901 Karl Pearson, Mathematician, Philosophical Magazine
1933 Harold Hotelling, Statistician
Louis Leon Thurstone, Director of the Psychometric Laboratory of North Carolina, USA
“Multiple Factor Analysis”
1960 Edmund Malinowski, Bruce Kowalski
Introduction of PCA into chemistry as “factor analysis”
Also called Karhunen-Loeve-Transformation (KLT)

33
Q

Principal component analysis -> coordinate system

A
  • Rotation of the coordinate system (multiplication with an orthogonal matrix)
  • Information content of higher principal components decreases rapidly
    => can be omitted without loss
  • Measure for information content is the amount of total variance covered
  • “the maximum variance” and “the minimum error” are reached at the same time
34
Q

Principal component analysis: 2D Example

A

Matlab code: pca_2d_example.m
* Find the unit vector pointing into the direction with the largest variance within the bivariate data set data.
* The solution of this problem is to calculate the largest eigenvalue D of the covariance matrix C and the corresponding eigenvector V
𝐶 ∗ 𝑉=𝜆 ∗ 𝑉
which is equivalent to
(𝐶−𝐷∗𝐸) ∗𝑉=0
where 𝐸 is the identity matrix

35
Q

Principal component analysis: 2D Example

A
  • The rotation creates new variables which are uncorrelated, i.e. the covariance is zero for all pairs of the new variables.
  • The decorrelation is achieved by diagonalizing the covariance matrix C. The eigenvectors V belonging to the diagonalized covariance matrix are a linear combination of the old base vectors, thus expressing the correlation between the old and the new time series.
  • The eigenvalues D of the covariance matrix give the variance within the new coordinate axes, i.e. the principal components.
  • The mathematical procedure involves calculating the determinant of C: det(C - D*E) = 0
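A hedged sketch of what such a 2D example could look like (this is not the actual pca_2d_example.m, which is not reproduced here; the data are invented):

data = randn(200, 2) * [2 1; 0 0.5];     % made-up correlated bivariate data
Xc   = data - mean(data);                % mean-centre
C    = (Xc' * Xc) / (size(Xc, 1) - 1);   % covariance matrix
[V, D] = eig(C);                         % eigenvectors V, eigenvalues on diag(D)
[lam, order] = sort(diag(D), 'descend'); % sort by variance explained
V = V(:, order);                         % principal directions (new axes)
scores = Xc * V;                         % data in the rotated coordinate system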
36
Q

PCA in multiple dimensions

A
  • Matrix X of rank r is expressed as a sum of matrices M of rank 1
  • For a p x n matrix with p > n the rank is r ≤ n
  • Rank: order of the largest quadratic submatrix with determinant ≠ 0 (a submatrix is formed from a matrix by eliminating rows or columns)
  • Rank = number of independent variables (when a row or column can be expressed as a linear combination of other rows or columns it is not independent)
  • A quadratic matrix with determinant = 0 is singular and cannot be inverted

Vectors t and p are chosen so that
  • p vectors are pairwise orthonormal
  • t vectors are orthogonal
  • Each t vector (scores, new coordinate) contains the maximum of the residual variance
37
Q

The maths behind PCA

A

X is a matrix with our data, n x p sized, with p data points in each spectrum (variables) and n the number of spectra (samples)
Let's assume it is mean centred, i.e. X = Xorig - Xmean
The covariance matrix is defined by C = X^TX / (n-1)
It is a p x p matrix which can be diagonalised: C = V L V^T
where V is an eigenvector matrix and L is a diagonal matrix with eigenvalues λi

38
Q

What is SVD (Singular Value Decomposition)

A

X = USV^T (singular value decomposition)
Eigenvalues of XX^T (and of X^TX) are Λ = S²
U = eigenvectors of XX^T
V = eigenvectors of X^TX
i.e. transposition of X interchanges loadings and scores.

39
Q

PCA and SVD

A

C = X^TX / (n-1) where X is mean centred
C = V L V^T
where V is an eigenvector matrix and L is a diagonal matrix with eigenvalues λi
The eigenvectors are called principal axes or principal directions of the data
Projections of the data onto the principal axes (calculated as XV) are called principal components or PC scores
The trick is not to calculate eigenvectors of C directly, but to calculate a singular value decomposition of X:
X = USV^T (singular value decomposition)
This is just a matrix decomposition that reproduces the X matrix;
U and V are orthonormal: U^TU = I and V^TV = I
S is a diagonal matrix with singular values si ordered by size, λi = si²/(n-1)
C = X^TX / (n-1) = VSU^T USV^T / (n-1) = V S²/(n-1) V^T = V L V^T, so the singular vectors in V are the principal directions
Principal components are: XV = USV^TV = US, i.e. the columns of US are the scores

40
Q

Summary PCA and SVD

A

C = X^TX / (n-1) where X is mean centred
X = USV^T (singular value decomposition)
S is a diagonal matrix with singular values si ordered by size, λi = si²/(n-1)
C = X^TX / (n-1) = VSU^T USV^T / (n-1) = V S²/(n-1) V^T = V L V^T, so the singular vectors in V are the principal directions
Principal components are: XV = USV^TV = US
PCs: columns of US
Scores: sqrt(n-1) U
Loadings: columns of VS/sqrt(n-1) or C*V
Multiplying the first k PCs by the corresponding principal axes Vk^T yields Xk = Uk Sk Vk^T, a matrix that has the original n x p size but is of lower rank (rank k). This matrix Xk provides a reconstruction of the original data from the first k PCs.
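The same route in MATLAB (svd is built in; the data matrix is invented):

X  = randn(100, 20);           % made-up data: n = 100 spectra, p = 20 variables
n  = size(X, 1);
Xc = X - mean(X);              % mean-centre
[U, S, V] = svd(Xc, 'econ');   % economy-size singular value decomposition
scores = U * S;                % principal components, identical to Xc*V
lambda = diag(S).^2 / (n - 1); % eigenvalues of the covariance matrix C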

41
Q

Supervised Learning vs Unsupervised Learning

A

Supervised learning -> Input and Output data -> Predictions

Unsupervised learning -> Input data -> pattern or structure discovery

42
Q

PLS-DA

A
  • When the experimental effects are subtle or moderate, PCA will not show good separation patterns
  • PLS-DA is a supervised method, it is calculated by maximizing the co-variance between the data matrix (X) and the class labels (Y)
  • PLS-DA always produces certain separation patterns with regard to the conditions

The loadings plot shows the variable influence on the separation.
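A hedged MATLAB sketch of a two-class PLS-DA via plsregress (Statistics and Machine Learning Toolbox), with class membership coded as a 0/1 dummy variable Y; the data are invented:

X = [randn(20, 50) + 0.5; randn(20, 50) - 0.5];  % made-up two-group data, 50 variables
Y = [ones(20, 1); zeros(20, 1)];                 % class membership as 0/1
ncomp = 2;
[XL, YL, XS, YS, beta, pctVar] = plsregress(X, Y, ncomp);
scatter(XS(:, 1), XS(:, 2), 30, Y, 'filled')     % scores plot coloured by class
xlabel('Component 1'), ylabel('Component 2')     % the loadings are in XL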

43
Q

Use PLS-DA with caution

A
  • PLS-DA is susceptible to over-fitting by producing patterns of separation even for data randomly drawn from the same population
  • Need cross validation
  • Need permutation tests

-> there is a chance for underfitting and overfitting

44
Q

Cross validation

A

Goal: test whether your model can predict class labels for new samples

45
Q

Common cross-validation methods

A
  • leave-one-out (n-fold cross-validation)
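A hedged leave-one-out sketch for the PLS-DA model above (plsregress is in the Statistics and Machine Learning Toolbox; X, Y and ncomp are the hypothetical variables from the PLS-DA sketch):

n = size(X, 1);
pred = zeros(n, 1);
for i = 1:n
    train = true(n, 1); train(i) = false;   % hold out sample i
    [~, ~, ~, ~, b] = plsregress(X(train, :), Y(train), ncomp);
    pred(i) = [1, X(i, :)] * b;             % b includes an intercept as its first row
end
accuracy = mean((pred > 0.5) == Y)          % fraction of held-out labels predicted correctly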
46
Q

PLS Error Estimation

A
  • Cross validation is used to determine the optimal number of components needed to build the PLS-DA model.
  • Three common performance measures:
    -> Sum of squares captured by the model (R2)
    -> Cross-validated R2 (also known as Q2)
    -> Prediction accuracy
47
Q

Permutation Tests

A

Goal: to test whether your model is significantly different from the null models

  1. Randomly shuffle the class labels (y) and build the (null) model between the new y and x;
  2. Test whether a similar distance of separation still occurs;
  3. We can compute empirical p-values
    -> If the real result is similar to the permuted results (i.e. the null models), then we can NOT say y and x are significantly correlated
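A minimal permutation sketch for an empirical p-value; the test statistic here is simply the absolute difference of group means, standing in for any model-based separation measure:

stat = @(x, y) abs(mean(x(y == 1)) - mean(x(y == 0))); % separation measure (assumption)
x = [randn(20, 1) + 0.8; randn(20, 1)];                % made-up data
y = [ones(20, 1); zeros(20, 1)];                       % real class labels
obs   = stat(x, y);                                    % observed separation
nperm = 1000;
null0 = zeros(nperm, 1);
for k = 1:nperm
    null0(k) = stat(x, y(randperm(numel(y))));         % shuffled labels -> null model
end
p = (sum(null0 >= obs) + 1) / (nperm + 1);             % empirical p-value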
48
Q

VIP Scores

A
  • Variable Importance in Projection (VIP) scores estimate the importance of each variable in a PLS-DA model
    -> Weighted sum of the squared correlations between the PLS-DA components and the original variable
    -> Weights correspond to the percentage variation explained by the PLS-DA component in the model
  • VIP >1 can be considered important
  • VIP <1 is less important and might be a good candidate for exclusion from the model
49
Q

Evaluating PLS Performance

A
  • Basic concepts
    -> True positives (TP)
    -> True negatives (TN)
    -> False positives (FP)
    -> False negatives (FN).
    -> Sensitivity (Sn)
    -> Specificity (Sp)
  • Sn (sensitivity) = true positive rate = TP / (TP + FN)
  • Sp (specificity) = true negative rate = TN / (TN + FP)
50
Q

ROC Curves

A
  • ROC = Receiver Operating Characteristic
    -> A historic name from radar studies
    -> Very popular in biomedical applications
      -> To assess performance of classifiers
      -> To compare different biomarker models
  • A graphical plot of the true positive rate (TPR) vs. false positive rate (FPR), for a binary classifier (i.e. positive/negative) as its cutoff point is varied
51
Q

Area Under ROC Curve (AUC)

A
  • Overall measure of test performance
  • Comparisons between two tests based on differences between (estimated) AUC
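A closing MATLAB sketch (perfcurve is in the Statistics and Machine Learning Toolbox; labels and scores are invented and could instead be the cross-validated PLS-DA predictions from the earlier sketch):

labels = [ones(20, 1); zeros(20, 1)];              % true classes (made up)
scores = [randn(20, 1) + 1; randn(20, 1)];         % classifier output (made up)
[fpr, tpr, ~, auc] = perfcurve(labels, scores, 1); % ROC points and AUC, positive class = 1
plot(fpr, tpr)
xlabel('False positive rate'), ylabel('True positive rate')
title(sprintf('ROC curve, AUC = %.2f', auc))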