Multivariate Statistik Flashcards

Flashcards in Multivariate Statistik Deck (19):
1

correlation vs regression

Correlation
• description of an undirected relationship between two or more variables
• measures how strong the relationship is
• the direction is unknown, non-existent, or simply not of interest
• example: telephones per household and infant deaths

Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• examples: smoking and cancer; weight and height
• model to describe the relationship
• model to predict one variable
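
A minimal R sketch of the distinction (base R only, made-up numbers): cor() quantifies the undirected relationship, lm() fits a directed model that can also predict.

x <- c(1.2, 2.1, 2.9, 4.2, 5.1)   # hypothetical predictor values
y <- c(2.0, 3.9, 6.2, 8.1, 9.8)   # hypothetical outcome values

cor(x, y)            # correlation: strength of the undirected relationship
fit <- lm(y ~ x)     # regression: directed model "x influences y"
coef(fit)            # intercept and slope of the fitted model
predict(fit, newdata = data.frame(x = 3.5))   # use the model to predict y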

2

The Coefficients

• How many variables should be included?
• Akaike's "An Information Criterion" (AIC)
• variable inclusion stops when the AIC can no longer be decreased or R^2 can no longer be increased
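
A sketch of AIC-guided variable inclusion with base R's step(), using the built-in mtcars data purely as an illustration; forward selection stops exactly when no further addition lowers the AIC.

null_model <- lm(mpg ~ 1, data = mtcars)   # start with no predictors
best <- step(null_model,
             scope = ~ wt + hp + disp + qsec,   # candidate variables
             direction = "forward")             # add while AIC decreases
AIC(best)                 # AIC of the selected model
summary(best)$r.squared   # its R^2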

3

Classification / Regression Trees

• for an intermediate number of variables
• classification tree: predicts categories (classes)
• regression tree: predicts numerical values
• such trees are also called decision trees
• any type of predictor variables can be used
• no linear relationships required
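
A minimal sketch with the rpart package (a common R implementation of such decision trees), using the built-in iris data; a regression tree works the same way with a numeric response (method = "anova").

library(rpart)

tree <- rpart(Species ~ ., data = iris, method = "class")   # classification tree
print(tree)                                # the learned decision rules
predict(tree, iris[1, ], type = "class")   # predict the class of one observation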

4

Good Cluster

A high-quality clustering has
- high intra-class similarity
- low inter-class similarity

Quality depends on the distance measure and the clustering method:
Good: small circles, long lines (compact, well-separated clusters)
Bad: big circles, short lines
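
One common way to put a number on "high intra-class, low inter-class similarity" is the silhouette width from the cluster package; this sketch on made-up 2-d data is an illustration, not part of the original card.

library(cluster)

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # one compact group
           matrix(rnorm(40, mean = 4), ncol = 2))   # a second, distant group
km  <- kmeans(x, centers = 2)
sil <- silhouette(km$cluster, dist(x))
summary(sil)   # widths near 1 = small circles, long lines; near 0 = the opposite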

5

Similarity and distance: Variable: binary

Simple matching coefficient:
S_ij = (a + d) / (a + b + c + d)
where a = matches in 1, d = matches in 0, and b, c = mismatches

6

Similarity and distance: Variable: categorical

Jaccard coefficient (similarity):
S_ij = a / (a + b + c)
where a = shared presences and b, c = mismatches; the Jaccard distance is 1 - S_ij
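
For 0/1 data, base R computes exactly this measure: dist(..., method = "binary") returns the Jaccard distance 1 - a/(a+b+c). A tiny check:

x <- rbind(i = c(1, 1, 0, 1, 0),
           j = c(1, 0, 0, 1, 1))
dist(x, method = "binary")   # a = 2, b = 1, c = 1  ->  1 - 2/4 = 0.5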

7

Types of clustering: hierarchical vs partitional

Hierarchical clustering: a set of nested clusters organized as a hierarchical tree --> we get a dendrogram and a cluster id by cutting the dendrogram

Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset --> we only get a cluster id

8

steps at hierarchical clustering

• no need to specify the number of clusters k before clustering starts
• the algorithm constructs a tree-like hierarchy (dendrogram) which (implicitly) contains all values of k
• at one end of the tree there are n clusters, each holding a single object; at the other end there is one cluster containing all n objects
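
A sketch in R with the built-in USArrests data: hclust() needs no k up front, and cutree() extracts any k from the finished dendrogram.

d  <- dist(USArrests)                 # pairwise (Euclidean) distances
hc <- hclust(d, method = "average")   # build the tree

plot(hc)            # one end: 50 singletons; other end: one all-inclusive cluster
cutree(hc, k = 4)   # choose k only afterwards, by cutting the tree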

9

Hard vs. soft (fuzzy) clustering

• Hard clustering algorithms:
- assign each pattern to a single cluster during operation and output
- hclust, diana, kmeans
• Fuzzy clustering algorithms:
- assign degrees of membership in several groups
- fanny
- fanny membership sub-object: soft clustering results
- fanny clustering sub-object: hard clustering results
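
A sketch of the two result views in one fanny fit (cluster package, built-in USArrests data, scaled here as an assumption to put the variables on one footing):

library(cluster)

fz <- fanny(scale(USArrests), k = 3)
head(fz$membership)   # soft result: degree of membership in every cluster
head(fz$clustering)   # hard result: the single best cluster per object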

10

K-Means

- the K cluster centres are chosen at random (decide beforehand how many cluster centres you want)
- each element (data point) is assigned to the nearest cluster centre
- each cluster centre is re-set to the mean of its whole cluster
o if this changed the assignment of individual points → reassign them
➔ repeat the whole procedure until nothing changes any more
- problem: because the points are set at random, the resulting partition can depend strongly on the starting point
➔ the algorithm has to be run many times, and K has to be estimated well
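
A k-means sketch in base R on the built-in USArrests data; nstart re-runs the algorithm from several random starting centres and keeps the best solution, which is exactly the remedy for the start-dependence described above.

set.seed(42)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)

km$centers        # the final cluster centres
km$cluster        # cluster id assigned to every object
km$tot.withinss   # within-cluster sum of squares (lower = tighter clusters)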

11

there are three ways to define the distance to a cluster (or within a cluster) - which ones?

average linkage, single linkage, complete linkage

12

average linkage:

- use the average of all pairwise distance values between the two clusters
--> average linkage merges the closest rows

13

complete linkage

o use the distance to the farthest point of the other cluster, i.e. the largest pairwise distance between the two clusters
o problem: large clusters rarely take in new members

14

single linkage

- use the smallest pairwise distance value between the two clusters
--> single linkage merges the closest rows
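
In R all three rules are just the method argument of hclust(); same data, same distances, possibly very different trees:

d <- dist(USArrests)

hc_avg    <- hclust(d, method = "average")    # mean of all pairwise distances
hc_single <- hclust(d, method = "single")     # smallest pairwise distance
hc_compl  <- hclust(d, method = "complete")   # largest pairwise distance

par(mfrow = c(1, 3))
plot(hc_avg); plot(hc_single); plot(hc_compl)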

15

PCA (general)

• Mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables (PCs).
• The first PC accounts for as much of the variability in the data as possible; each succeeding component accounts for as much of the remaining variability as possible.
• PCA is performed on a covariance or a correlation matrix of your data.
• Use the correlation matrix if the variances of your variables differ largely (scale=TRUE).
• Principal components (PCs) are linear combinations of the original variables, weighted by their contribution to explaining the variance along each orthogonal dimension.
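
A PCA sketch with base R's prcomp on the built-in USArrests data; scale. = TRUE performs the PCA on the correlation matrix, as recommended above when variances differ largely.

pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)   # proportion of variance explained per PC
pca$rotation   # loadings: weight of each original variable in each PC
head(pca$x)    # the data expressed in PC coordinates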

16

PCA's usage

• tool for exploratory data analysis
• recognize patterns, outliers, trends, groups
• for gene expression studies it finds the genes that contribute most to the difference between the groups
• reduces dimensionality
• the variables (genes) contribute, each with a certain weight, to the principal components

17

understanding PCA geometrically and via the covariance matrix/eigenvectors

• Geometrically
- Rotation of space to maximize variance for fewer coordinates
• Covariance matrix and eigenvector
- Eigenvector with largest eigenvalue is first principal component (PC)
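
The eigenvector view can be checked numerically in R: the eigenvectors of the covariance matrix of the (here: scaled) data are the PCs, and the largest eigenvalue belongs to PC1.

x  <- scale(USArrests)   # centre and scale the variables
ev <- eigen(cov(x))      # cov(x) is the correlation matrix after scaling

ev$values                 # variances along the PCs, sorted decreasingly
ev$vectors[, 1]           # direction of the first PC ...
prcomp(x)$rotation[, 1]   # ... matches prcomp's PC1 (up to sign)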

18

simple linear regression

- the line through the data must be fitted so that the scatter of the surrounding data points (the residuals) is minimal
➔ fitting
- only sensible for evenly distributed, linear data (putting a straight line through a curve makes no sense…)
- follows the maximum-likelihood principle: the probability that my model generated my data should be maximal

y = ax + b
(dependent variable = regression coefficient * independent variable + intercept)
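
A minimal R sketch with hypothetical height/weight numbers: lm() fits y = ax + b by least squares, which is the maximum-likelihood fit under normally distributed errors.

x   <- c(150, 160, 170, 180, 190)   # hypothetical heights
y   <- c(55, 62, 70, 78, 85)        # hypothetical weights
fit <- lm(y ~ x)

coef(fit)        # intercept b and regression coefficient a
residuals(fit)   # the scatter that the fit minimises (as a squared sum)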

19

multiple lineare regression

• multiple regression (multiple predictor variables P, Q, R) but one outcome
• multiple coefficients:
Y = a + bP + cQ + dR
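
A sketch with simulated data (the column names P, Q, R, Y are hypothetical, chosen to match the formula above):

set.seed(7)
df   <- data.frame(P = rnorm(30), Q = rnorm(30), R = rnorm(30))
df$Y <- 1 + 2 * df$P - 0.5 * df$Q + 0.3 * df$R + rnorm(30, sd = 0.1)

fit <- lm(Y ~ P + Q + R, data = df)
coef(fit)   # a (intercept), b, c, d - one coefficient per predictor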