Lecture 11 Flashcards

1
Q

What is multidimensional scaling?

A
  • visualization technique for multidimensional data
  • requires distance matrix, no original variable information
    can be retrieved later
  • idea: visualize all pairwise object dissimilarities as well as possible in a lower-dimensional space (often 2D)
  • example of road-distances between European countries
    visualized in a 2D plot
  • the mismatch between the visual representation and the distance
    matrix is called stress
  • usually you visualise the data as a distance map computed from a distance table; if you don’t have a distance table, create one with dist() (function in R), as sketched below
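
A minimal sketch of classical MDS in base R, using the built-in eurodist road-distance table (road distances between European cities) as a stand-in for the example above:

  # classical (metric) MDS: embed the distance table into 2D
  fit <- cmdscale(eurodist, k = 2)
  plot(fit[, 1], -fit[, 2], type = "n", xlab = "Dim 1", ylab = "Dim 2")
  text(fit[, 1], -fit[, 2], labels = rownames(fit), cex = 0.7)  # axis 2 flipped so north is up
  # if you only have raw data, build the distance table first with dist()
  d <- dist(USArrests)
  fit2 <- cmdscale(d, k = 2)
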
2
Q

Similarity and distance measures

A
  • evaluate similarity/distance between two objects
  • put similar objects together, dissimilar ones apart
  • similarity or distance measure required
  • Conventions:
    – Sij = 1 : i and j are identical in their properties
    – Sij = 0 : i and j are different in all properties
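
A tiny R sketch of these conventions, using the simple matching coefficient (fraction of identical properties) on toy binary vectors:

  i <- c(1, 0, 1, 1)
  j <- c(1, 0, 1, 1)
  k <- c(0, 1, 0, 0)
  mean(i == j)   # Sij = 1: identical in all properties
  mean(i == k)   # Sik = 0: different in all properties
  # a similarity s is often turned into a distance via d = 1 - s
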
3
Q

Continuous Data Distance

A

1 dimension
Euclidean = Manhattan (in one dimension both reduce to |x − y|)

2 dimensions
- Euclidean (direct path between the two points)
- Manhattan (first along Y, then along X)

n dimensions
- Minkowski distance (generalisation with a free exponent; exponent 2 gives Euclidean, exponent 1 gives Manhattan)
- Canberra distance (weighted version of Manhattan)
- Pearson correlation distance

–> All distance measures should give similar results; always check two (e.g. Euclidean and correlation distance)

All formulas are on the cheat sheet.
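
A short sketch of these measures in base R; dist() supports all of them except the correlation distance, which can be built from cor():

  x <- matrix(rnorm(40), nrow = 4)          # 4 objects, 10 variables
  dist(x, method = "euclidean")
  dist(x, method = "manhattan")
  dist(x, method = "minkowski", p = 3)      # exponent p generalises both
  dist(x, method = "canberra")
  as.dist(1 - cor(t(x)))                    # Pearson correlation distance
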

4
Q

Numeric data, categorical data, and distance methods

A
  • numerical data
    – Euclidean
    – Manhattan
    – Correlation (scale invariant)
  • binary data
    – Jaccard
    – matching coefficient (Manhattan)
  • sequence data
    – edit distances (Levenshtein, Needleman-Wunsch)
    – 1 − (percent similarity / 100)
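
A short base-R sketch on toy data: dist(method = "binary") implements the Jaccard distance for binary data, and adist() computes the Levenshtein edit distance for strings:

  b <- matrix(c(1, 0, 1, 1,
                1, 1, 1, 0,
                0, 0, 1, 1), nrow = 3, byrow = TRUE)
  dist(b, method = "binary")                 # Jaccard distance
  adist(c("GATTACA", "GATTTACA", "CATTACA")) # edit (Levenshtein) distance
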
5
Q

What is PCA?

A

PCA (Principal Component Analysis) in bioinformatics is a statistical method used to reduce the dimensionality of high-dimensional biological data while retaining its important patterns and variability. It helps researchers analyze and visualize large datasets and identify significant features in biological samples.
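A minimal prcomp() sketch in base R, using the built-in USArrests data as a stand-in for a biological data matrix:

  pca <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE standardises the variables
  summary(pca)                             # proportion of variance per principal component
  plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")  # samples in PC space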

6
Q

What does imputing data mean?

A
  • imputing basically means using e.g. regression to predict the
    missing values from the observed ones
  • imputation works quite well as long as there are not too many NAs (less than 5-10%)
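
A minimal sketch of the simplest variant, mean imputation, on a hypothetical toy matrix (regression- or nearest-neighbour-based imputation follows the same replace-the-NAs pattern):

  m <- matrix(c(1, 2, NA, 4,
                2, NA, 6, 8,
                3, 6, 9, 12), nrow = 3, byrow = TRUE)
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    m[miss, j] <- mean(m[, j], na.rm = TRUE)  # replace NAs by the column mean
  }
  m                                           # complete matrix, ready for e.g. PCA
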
7
Q

What’s the difference between the geometric and the mathematical view of PCA?

A
  • Geometrically
    – rotation of the coordinate space to maximise the variance
      captured by the first few coordinates
    – PCA is used to visualise data in a lower-dimensional space
  • Mathematically
    – covariance matrix and its eigenvectors
    – the eigenvector with the largest eigenvalue is the first principal component (PC)
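
The two views can be checked against each other in base R: the eigenvectors of the covariance matrix reproduce the principal components that prcomp() finds (up to sign):

  X <- scale(USArrests, scale = FALSE)        # centre the data
  e <- eigen(cov(X))                          # maths view
  p <- prcomp(X)                              # PCA view
  e$values                                    # eigenvalues, largest first -> PC1
  max(abs(abs(e$vectors) - abs(p$rotation)))  # ~0: same axes up to sign
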
8
Q

What’s a covariance matrix?

A
  • variables a, b, c, d
  • square matrix of all pairwise covariances
  • the diagonal holds the variances
    – cov(a,a) == var(a)
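
A quick base-R check of this structure with four toy variables a, b, c, d:

  df <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20), d = rnorm(20))
  C <- cov(df)                                         # 4 x 4 square, symmetric matrix
  all.equal(unname(diag(C)), unname(sapply(df, var)))  # TRUE: diagonal = variances
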
9
Q

Explain Eigenvector and Eigenvalue

A
  • square matrix × eigenvector = eigenvalue × eigenvector (A·v = λ·v)
  • there is more than one eigenvector solution
  • the eigenvectors of a square matrix are the non-zero vectors which, after being multiplied by the matrix, remain proportional to the original vector; for each eigenvector, the corresponding eigenvalue is the factor by which the eigenvector is scaled when multiplied by the matrix
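
A short check of the definition in base R, with a hypothetical 2 × 2 matrix:

  A <- matrix(c(2, 1,
                1, 2), nrow = 2)
  e <- eigen(A)
  v <- e$vectors[, 1]        # first eigenvector
  lambda <- e$values[1]      # its eigenvalue (here 3)
  A %*% v                    # still proportional to v ...
  lambda * v                 # ... scaled exactly by lambda
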
10
Q

Loading and Rotation in PCA?

A

Loading: Loadings represent the correlation between the original variables (features) and the principal components. After PCA, each principal component is a linear combination of the original variables. Loadings indicate the contribution of each variable to the construction of the principal component. Higher absolute loading values imply that the corresponding variable has a stronger influence on that specific principal component.

Rotation: PCA often generates uncorrelated principal components, which are orthogonal to each other. However, the resulting principal components may not have a clear interpretability, as they are linear combinations of the original variables. Rotation is an optional step performed after the initial PCA, known as “orthogonal rotation” or “varimax rotation.” The goal of rotation is to transform the original principal components to make them more interpretable and easier to understand.

In summary, loading and rotation in PCA help in understanding the relationship between the original variables and the principal components and can improve the interpretability of the results.
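
In base R, prcomp() stores the loadings in an element that is itself called rotation, and varimax() (stats package) performs the varimax rotation described above; a minimal sketch:

  pca <- prcomp(USArrests, scale. = TRUE)
  pca$rotation                  # loadings: columns = PCs, entries = variable weights
  varimax(pca$rotation[, 1:2])  # rotate the first two PCs for interpretability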

11
Q

When to use PCA?

A
  • tool for exploratory data analysis
  • recognize patterns, outliers, trends, groups
  • for gene-expression studies it finds the genes that
    contribute most to the difference between the groups
  • reduces dimensionality
  • variables (genes) each contribute with a certain weight to the principal
    components

–> For PCA you need all values; no NA values are allowed, so you have to guess (impute) the missing values first, e.g. by using the mean of the remaining values or the nearest-neighbour method.

You can always start and end an analysis with PCA.
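
A short sketch of the "which variables contribute most" use case in base R, again with the built-in USArrests data standing in for a gene-expression matrix:

  pca <- prcomp(USArrests, scale. = TRUE)
  sort(abs(pca$rotation[, 1]), decreasing = TRUE)  # variables ranked by |loading| on PC1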
