Lecture 11 Flashcards

1
Q

What is multidimensional scaling?

A
  • visualization technique for multidimensional data
  • requires distance matrix, no original variable information
    can be retrieved later
  • idea: visualize all pairwise object dissimilarities as well as possible in a lower-dimensional space (often 2D)
  • example of road-distances between European countries
    visualized in a 2D plot
  • the mismatch between the visual representation and the distance
    matrix is called stress
  • usually you visualise the data as a distance map computed from a distance table; if you don’t have a distance table, create one with dist() (function in R), as sketched below
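
A minimal sketch of classical MDS in base R, using the built-in eurodist road-distance table (road distances between European cities) as a stand-in for the example above:

  # classical (metric) MDS: embed the distance table into 2D
  fit <- cmdscale(eurodist, k = 2)
  plot(fit[, 1], -fit[, 2], type = "n", xlab = "Dim 1", ylab = "Dim 2")
  text(fit[, 1], -fit[, 2], labels = rownames(fit), cex = 0.7)  # axis 2 flipped so north is up
  # if you only have raw data, build the distance table first with dist()
  d <- dist(USArrests)
  fit2 <- cmdscale(d, k = 2)
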
2
Q

Similarity and distance measures

A
  • evaluate similarity/distance between two objects
  • put similar objects together, dissimilar ones apart
  • similarity or distance measure required
  • Conventions:
    – Sij = 1 : i and j are identical in their properties
    – Sij = 0 : i and j are different in all properties
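
A tiny R sketch of these conventions, using the simple matching coefficient (fraction of identical properties) on toy binary vectors:

  i <- c(1, 0, 1, 1)
  j <- c(1, 0, 1, 1)
  k <- c(0, 1, 0, 0)
  mean(i == j)   # Sij = 1: identical in all properties
  mean(i == k)   # Sik = 0: different in all properties
  # a similarity s is often turned into a distance via d = 1 - s
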
3
Q

Continuous Data Distance

A

1 dimension
Euclidean = Manhattan (in one dimension both reduce to |x − y|)

2 dimensions
- Euclidean (direct path between the two points)
- Manhattan (first along Y, then along X)

n dimensions
- Minkowski distance (generalisation with a free exponent; exponent 2 gives Euclidean, exponent 1 gives Manhattan)
- Canberra distance (weighted version of Manhattan)
- Pearson correlation distance

–> All distance measures should give similar results; always check two (e.g. Euclidean and correlation distance)

All formulas are on the cheat sheet.
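
A short sketch of these measures in base R; dist() supports all of them except the correlation distance, which can be built from cor():

  x <- matrix(rnorm(40), nrow = 4)          # 4 objects, 10 variables
  dist(x, method = "euclidean")
  dist(x, method = "manhattan")
  dist(x, method = "minkowski", p = 3)      # exponent p generalises both
  dist(x, method = "canberra")
  as.dist(1 - cor(t(x)))                    # Pearson correlation distance
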

4
Q

Numeric data, categorical data, and distance methods

A
  • numerical data
    – Euclidean
    – Manhattan
    – Correlation (scale invariant)
  • binary data
    – Jaccard
    – matching coefficient (Manhattan)
  • sequence data
    – edit distances (Levenshtein, Needleman-Wunsch)
    – 1 − (percent similarity / 100)
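
A short base-R sketch on toy data: dist(method = "binary") implements the Jaccard distance for binary data, and adist() computes the Levenshtein edit distance for strings:

  b <- matrix(c(1, 0, 1, 1,
                1, 1, 1, 0,
                0, 0, 1, 1), nrow = 3, byrow = TRUE)
  dist(b, method = "binary")                 # Jaccard distance
  adist(c("GATTACA", "GATTTACA", "CATTACA")) # edit (Levenshtein) distance
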
5
Q

What is PCA?

A

PCA (Principal Component Analysis) in bioinformatics is a statistical method used to reduce the dimensionality of high-dimensional biological data while retaining its important patterns and variability. It helps researchers analyze and visualize large datasets and identify significant features in biological samples.
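A minimal prcomp() sketch in base R, using the built-in USArrests data as a stand-in for a biological data matrix:

  pca <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE standardises the variables
  summary(pca)                             # proportion of variance per principal component
  plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")  # samples in PC space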

6
Q

What does imputing data mean?

A
  • imputing basically means using e.g. regression to predict the
    missing values from the observed ones
  • imputation works quite well as long as there are not too many NAs (less than 5-10%)
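
A minimal sketch of the simplest variant, mean imputation, on a hypothetical toy matrix (regression- or nearest-neighbour-based imputation follows the same replace-the-NAs pattern):

  m <- matrix(c(1, 2, NA, 4,
                2, NA, 6, 8,
                3, 6, 9, 12), nrow = 3, byrow = TRUE)
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    m[miss, j] <- mean(m[, j], na.rm = TRUE)  # replace NAs by the column mean
  }
  m                                           # complete matrix, ready for e.g. PCA
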
7
Q

What’s the difference between the geometric and the mathematical view of PCA?

A
  • Geometrically
    – rotation of the coordinate space to maximise the variance
      captured by the first few coordinates
    – PCA is used to visualise data in a lower-dimensional space
  • Mathematically
    – covariance matrix and its eigenvectors
    – the eigenvector with the largest eigenvalue is the first principal component (PC)
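
The two views can be checked against each other in base R: the eigenvectors of the covariance matrix reproduce the principal components that prcomp() finds (up to sign):

  X <- scale(USArrests, scale = FALSE)        # centre the data
  e <- eigen(cov(X))                          # maths view
  p <- prcomp(X)                              # PCA view
  e$values                                    # eigenvalues, largest first -> PC1
  max(abs(abs(e$vectors) - abs(p$rotation)))  # ~0: same axes up to sign
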
8
Q

What’s a covariance matrix?

A
  • variables a, b, c, d
  • square matrix of all pairwise covariances
  • the diagonal holds the variances
    – cov(a,a) == var(a)
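
A quick base-R check of this structure with four toy variables a, b, c, d:

  df <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20), d = rnorm(20))
  C <- cov(df)                                         # 4 x 4 square, symmetric matrix
  all.equal(unname(diag(C)), unname(sapply(df, var)))  # TRUE: diagonal = variances
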
9
Q

Explain Eigenvector and Eigenvalue

A
  • square matrix × eigenvector = eigenvalue × eigenvector (A·v = λ·v)
  • there is more than one eigenvector solution
  • the eigenvectors of a square matrix are the non-zero vectors which, after being multiplied by the matrix, remain proportional to the original vector; for each eigenvector, the corresponding eigenvalue is the factor by which the eigenvector is scaled when multiplied by the matrix
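
A short check of the definition in base R, with a hypothetical 2 × 2 matrix:

  A <- matrix(c(2, 1,
                1, 2), nrow = 2)
  e <- eigen(A)
  v <- e$vectors[, 1]        # first eigenvector
  lambda <- e$values[1]      # its eigenvalue (here 3)
  A %*% v                    # still proportional to v ...
  lambda * v                 # ... scaled exactly by lambda
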
10
Q

Loading and Rotation in PCA?

A

Loading: Loadings represent the correlation between the original variables (features) and the principal components. After PCA, each principal component is a linear combination of the original variables. Loadings indicate the contribution of each variable to the construction of the principal component. Higher absolute loading values imply that the corresponding variable has a stronger influence on that specific principal component.

Rotation: PCA often generates uncorrelated principal components, which are orthogonal to each other. However, the resulting principal components may not have a clear interpretability, as they are linear combinations of the original variables. Rotation is an optional step performed after the initial PCA, known as “orthogonal rotation” or “varimax rotation.” The goal of rotation is to transform the original principal components to make them more interpretable and easier to understand.

In summary, loading and rotation in PCA help in understanding the relationship between the original variables and the principal components and can improve the interpretability of the results.
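
In base R, prcomp() stores the loadings in an element that is itself called rotation, and varimax() (stats package) performs the varimax rotation described above; a minimal sketch:

  pca <- prcomp(USArrests, scale. = TRUE)
  pca$rotation                  # loadings: columns = PCs, entries = variable weights
  varimax(pca$rotation[, 1:2])  # rotate the first two PCs for interpretability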

11
Q

When to use PCA?

A
  • tool for exploratory data analysis
  • recognize patterns, outliers, trends, groups
  • for gene-expression studies it finds the genes that
    contribute most to the difference between the groups
  • reduces dimensionality
  • variables (genes) each contribute with a certain weight to the principal
    components

–> For PCA you need all values; no NA values are allowed, so you have to guess (impute) the missing values first, e.g. by using the mean of the remaining values or the nearest-neighbour method.

You can always start and end an analysis with PCA.
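
A short sketch of the "which variables contribute most" use case in base R, again with the built-in USArrests data standing in for a gene-expression matrix:

  pca <- prcomp(USArrests, scale. = TRUE)
  sort(abs(pca$rotation[, 1]), decreasing = TRUE)  # variables ranked by |loading| on PC1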
