Multivariate Statistik Flashcards Preview


Flashcards in Multivariate Statistik Deck (19):

correlation vs regression

Correlation:
• description of an undirected relationship between two or more variables
• describes how strong the relationship is
• the direction is unknown, non-existent, or simply not of interest
• example: phones per household and infant deaths

Regression:
• description of a directed relationship between two or more variables
• one variable influences the other
• examples: smoking and cancer; weight and height
• model to describe the relationship
• model to predict one variable from the other(s)
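
The contrast above can be sketched in code. This is a hedged illustration with made-up toy data: correlation is symmetric in its two arguments, while a regression slope depends on which variable is treated as the outcome.

```python
# Illustration (toy data, not from the deck): correlation is undirected,
# regression is directed.
def mean(v):
    return sum(v) / len(v)

def pearson_r(x, y):
    # Pearson correlation coefficient: symmetric in x and y
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def slope(x, y):
    # least-squares slope of the regression of y on x: NOT symmetric
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

print(pearson_r(x, y) == pearson_r(y, x))  # True: correlation is undirected
print(slope(x, y), slope(y, x))            # different values: regression is directed
```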


The Coefficients

• How many variables to include?
• Akaike's "An Information Criterion" (AIC)
• we stop adding variables when the AIC can no longer be decreased or R² can no longer be increased
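
A rough sketch of AIC-guided model selection, assuming the Gaussian least-squares form AIC = n·ln(RSS/n) + 2k (up to an additive constant) and hypothetical residual sums of squares for three nested models:

```python
import math

# AIC for a Gaussian linear model with k estimated parameters and
# residual sum of squares RSS (constant terms dropped).
def aic(n, rss, k):
    return n * math.log(rss / n) + 2 * k

n = 100
# hypothetical RSS values for nested models with 2, 3, 4 parameters
candidates = {2: 50.0, 3: 30.0, 4: 29.5}
scores = {k: aic(n, rss, k) for k, rss in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # adding the 4th parameter barely reduces RSS, so AIC prefers k=3
```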


Classification / Regression Trees

• for an intermediate number of variables
• classification tree: predicts categories (classes)
• regression tree: predicts numerical values
• such trees are also called decision trees
• any type of predictor variable can be used
• no linear relationships required
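
A minimal sketch of the splitting idea behind such trees: real implementations (e.g. R's rpart) split recursively, but a single split ("decision stump") chosen by misclassification error already shows the principle. Data and function names here are hypothetical.

```python
# One-level classification tree: find the threshold split that
# minimizes misclassification error on toy data.
def best_stump(xs, labels):
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            preds = [left if x <= t else right for x in xs]
            err = sum(p != y for p, y in zip(preds, labels))
            if best is None or err < best[0]:
                best = (err, t, left, right)
    return best

xs = [1.0, 1.5, 2.0, 5.0, 5.5, 6.0]
labels = [0, 0, 0, 1, 1, 1]
err, threshold, left, right = best_stump(xs, labels)
print(err, threshold)  # a perfect split: predict 0 if x <= 2.0, else 1
```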


Good Cluster

A high-quality cluster has
- high intra-class similarity
- low inter-class similarity

Depends on the distance measure and the clustering method:
Good: small circles, long lines
Bad: big circles, short lines


Similarity and distance: Variable: binary

Matching coefficient (counts agreements in both 1/1 and 0/0 attributes)


Similarity and distance: Variable: categorical

Jaccard coefficient:
S_ij = a / (a + b + c)
(a = attributes present in both objects, b and c = attributes present in only one; the Jaccard distance is 1 - S_ij)
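
The formula above can be sketched directly; this assumes the usual convention that a counts 1/1 matches, b and c count mismatches, and 0/0 pairs are ignored:

```python
# Jaccard similarity S_ij = a / (a + b + c) for two binary vectors.
def jaccard_similarity(u, v):
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # present in both
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)  # only in u
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)  # only in v
    return a / (a + b + c)                                 # 0/0 pairs ignored

u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1]
print(jaccard_similarity(u, v))  # a=2, b=1, c=1 -> 0.5
```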


Types of clustering: hierarchical vs partitional

Hierarchical Clustering: A set of nested clusters
organized as a hierarchical tree --> we will get a dendrogram and a cluster id by dendrogram cutting

Partitional Clustering: A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset --> we will only get a cluster id


steps at hierarchical clustering

• no need to specify the number of clusters k before clustering starts
• the algorithm constructs a tree-like hierarchy (dendrogram) which (implicitly) contains all values of k
• at one end of the tree there are n clusters, each containing a single object; at the other end there is one cluster containing all n objects
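
The agglomerative process can be sketched on toy 1-D data (R's hclust builds the same kind of hierarchy); this naive version uses single linkage and records every merge, from n singleton clusters down to one cluster:

```python
# Naive agglomerative hierarchical clustering with single linkage.
def agglomerate(points):
    clusters = [[p] for p in points]        # start: one cluster per object
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i] + clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = agglomerate([1.0, 1.2, 5.0, 5.3, 9.0])
print(merges[-1])  # the final merge contains all n objects
```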


Hard vs. soft (fuzzy) clustering

• Hard clustering algorithms:
- assign each pattern to a single cluster during operation and output
- hclust, diana, kmeans
• Fuzzy clustering algorithms:
- assign degrees of membership in several groups
- fanny
- fanny membership sub-object: soft clustering results
- fanny clustering sub-object: hard clustering results



k-means clustering (steps)

- cluster centers (k) are chosen at random (decide beforehand how many cluster centers you want)
- each element (data point) is assigned to the nearest cluster center
- the cluster centers (mean of the whole cluster) are recomputed
o if this changes the assignment of individual points → reassign them
➔ repeat the whole procedure until nothing changes any more
- problem: since the starting points are chosen at random, the result can depend strongly on the starting configuration
➔ the algorithm has to be run many times, and k has to be estimated well
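
The steps above can be sketched in Python (the deck's R function is kmeans); the 1-D toy data, k, and the seed are assumptions for illustration:

```python
import random

# Lloyd's k-means algorithm on 1-D toy data.
def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # 1. pick k random centers
    while True:
        # 2. assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # 3. recompute each center as the mean of its cluster
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:                     # 4. stop when nothing changes
            return centers, clusters
        centers = new

points = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))  # two centers, near the means of the two groups
```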


there are three ways to define the distance to a cluster (or within a cluster) - which ones?

average linkage, single linkage, complete linkage


average linkage:

- use the average distance value (mean of all pairwise distances between the two clusters)
--> average linkage to merge closest rows


complete linkage

- use the largest distance value: the distance to the farthest point of the other cluster
- problem: large clusters rarely take in new members


single linkage

- use the smallest distance value
--> single linkage to merge closest rows
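
The three definitions above differ only in how the pairwise distances between two clusters are aggregated, which a short sketch on two toy 1-D clusters makes concrete:

```python
# Compare single, complete and average linkage on two toy clusters.
cluster_a = [1.0, 2.0]
cluster_b = [5.0, 9.0]

# all pairwise distances between the two clusters
dists = [abs(a - b) for a in cluster_a for b in cluster_b]

single = min(dists)                # smallest pairwise distance
complete = max(dists)              # largest pairwise distance
average = sum(dists) / len(dists)  # mean of all pairwise distances

print(single, complete, average)  # 3.0 8.0 5.5
```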


PCA (general)

• Mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables (PCs).
• The first PC accounts for as much of the variability in the data as possible; each succeeding component accounts for as much of the remaining variability as possible.
• PCA is performed on a covariance or a correlation matrix of your data.
• Use the correlation matrix if the variances of your variables differ largely (scale=TRUE).
• Principal components (PCs) are linear combinations of the original variables, weighted by their contribution to explaining the variance along orthogonal dimensions.


PCA's usage

• tool for exploratory data analysis
• recognize patterns, outliers, trends, groups
• for gene expression studies it finds the genes that contribute most to the difference between the groups
• reduces dimensionality
• variables (genes) contribute with certain weights to the principal components


understanding PCA geometrically and covariance matrix/eigenvector

• Geometrically
- Rotation of space to maximize variance for fewer coordinates
• Covariance matrix and eigenvector
- Eigenvector with largest eigenvalue is first principal component (PC)
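
The eigenvector view can be sketched with NumPy (R's prcomp does the analogous computation via SVD); the strongly correlated toy data are an assumption, chosen so that almost all variance falls on PC1:

```python
import numpy as np

# PCA as eigen-decomposition of the covariance matrix.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
# second variable nearly proportional to the first -> strong correlation
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort descending by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                         # first principal component
scores = centered @ eigvecs                 # data in the rotated coordinates
print(eigvals[0] / eigvals.sum())           # fraction of variance on PC1
```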


simple linear regression

- the line through the data must be fitted so that the scatter of the surrounding data points (the residuals) is minimal
➔ fitting
- only valid for linear data (fitting a straight line through a curve makes no sense…)
- follows the maximum-likelihood principle: the probability that my model generated my data should be maximal

(dependent variable = regression coefficient * independent variable + intercept)
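
The fitting step can be sketched with the closed-form least-squares solution (which, for Gaussian errors, is also the maximum-likelihood fit); the toy data are an assumption and lie exactly on a line:

```python
# Ordinary least squares for a simple linear regression y = intercept + slope * x.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) \
          / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return intercept, slope

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]          # exactly y = 1 + 2x
intercept, slope = fit_line(x, y)
print(intercept, slope)  # 1.0 2.0
```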


multiple linear regression

• multiple regression (multiple predictor variables P, Q, R), but one outcome
• multiple coefficients:
Y = a + bP + cQ + dR
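
The model Y = a + bP + cQ + dR can be fitted by least squares over a design matrix; this sketch uses NumPy and noise-free toy data (the coefficient values are assumptions), so the true coefficients are recovered exactly:

```python
import numpy as np

# Multiple linear regression Y = a + bP + cQ + dR via least squares.
rng = np.random.default_rng(1)
P, Q, R = rng.normal(size=(3, 50))
Y = 1.0 + 2.0 * P - 0.5 * Q + 3.0 * R        # noise-free for illustration

# design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones_like(P), P, Q, R])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(coef, 6))  # recovers a=1, b=2, c=-0.5, d=3
```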