Multivariate Statistik Flashcards

Flashcards in Multivariate Statistik Deck (19):
1

correlation vs regression

Correlation
• description of an undirected relationship between two or more variables
• measures how strong the relationship is
• the direction is unknown, non-existent, or simply not of interest
• example: telephones per household and infant deaths

Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• examples: smoking and cancer; weight and height
• model to describe the relationship
• model to predict one variable
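
A minimal R sketch of the distinction (base R only, made-up numbers): cor() quantifies the undirected relationship, lm() fits a directed model that can also predict.

x <- c(1.2, 2.1, 2.9, 4.2, 5.1)   # hypothetical predictor values
y <- c(2.0, 3.9, 6.2, 8.1, 9.8)   # hypothetical outcome values

cor(x, y)            # correlation: strength of the undirected relationship
fit <- lm(y ~ x)     # regression: directed model "x influences y"
coef(fit)            # intercept and slope of the fitted model
predict(fit, newdata = data.frame(x = 3.5))   # use the model to predict y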

2

The Coefficients

• How many variables should be included?
• Akaike's "An Information Criterion" (AIC)
• variable inclusion stops when the AIC can no longer be decreased or R^2 can no longer be increased
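
A sketch of AIC-guided variable inclusion with base R's step(), using the built-in mtcars data purely as an illustration; forward selection stops exactly when no further addition lowers the AIC.

null_model <- lm(mpg ~ 1, data = mtcars)   # start with no predictors
best <- step(null_model,
             scope = ~ wt + hp + disp + qsec,   # candidate variables
             direction = "forward")             # add while AIC decreases
AIC(best)                 # AIC of the selected model
summary(best)$r.squared   # its R^2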

3

Classification / Regression Trees

• for an intermediate number of variables
• classification tree: predicts categories (classes)
• regression tree: predicts numerical values
• such trees are also called decision trees
• any type of predictor variables can be used
• no linear relationships required
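
A minimal sketch with the rpart package (a common R implementation of such decision trees), using the built-in iris data; a regression tree works the same way with a numeric response (method = "anova").

library(rpart)

tree <- rpart(Species ~ ., data = iris, method = "class")   # classification tree
print(tree)                                # the learned decision rules
predict(tree, iris[1, ], type = "class")   # predict the class of one observation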

4

Good Cluster

A high-quality clustering has
- high intra-class similarity
- low inter-class similarity

Quality depends on the distance measure and the clustering method:
Good: small circles, long lines (compact, well-separated clusters)
Bad: big circles, short lines
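
One common way to put a number on "high intra-class, low inter-class similarity" is the silhouette width from the cluster package; this sketch on made-up 2-d data is an illustration, not part of the original card.

library(cluster)

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # one compact group
           matrix(rnorm(40, mean = 4), ncol = 2))   # a second, distant group
km  <- kmeans(x, centers = 2)
sil <- silhouette(km$cluster, dist(x))
summary(sil)   # widths near 1 = small circles, long lines; near 0 = the opposite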

5

Similarity and distance: Variable: binary

Simple matching coefficient:
S_ij = (a + d) / (a + b + c + d)
where a = matches in 1, d = matches in 0, and b, c = mismatches

6

Similarity and distance: Variable: categorical

Jaccard coefficient (similarity):
S_ij = a / (a + b + c)
where a = shared presences and b, c = mismatches; the Jaccard distance is 1 - S_ij
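
For 0/1 data, base R computes exactly this measure: dist(..., method = "binary") returns the Jaccard distance 1 - a/(a+b+c). A tiny check:

x <- rbind(i = c(1, 1, 0, 1, 0),
           j = c(1, 0, 0, 1, 1))
dist(x, method = "binary")   # a = 2, b = 1, c = 1  ->  1 - 2/4 = 0.5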

7

Types of clustering: hierarchical vs partitional

Hierarchical clustering: a set of nested clusters organized as a hierarchical tree --> we get a dendrogram and a cluster id by cutting the dendrogram

Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset --> we only get a cluster id

8

steps at hierarchical clustering

• no need to specify the number of clusters k before clustering starts
• the algorithm constructs a tree-like hierarchy (dendrogram) which (implicitly) contains all values of k
• at one end of the tree there are n clusters, each holding a single object; at the other end there is one cluster containing all n objects
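
A sketch in R with the built-in USArrests data: hclust() needs no k up front, and cutree() extracts any k from the finished dendrogram.

d  <- dist(USArrests)                 # pairwise (Euclidean) distances
hc <- hclust(d, method = "average")   # build the tree

plot(hc)            # one end: 50 singletons; other end: one all-inclusive cluster
cutree(hc, k = 4)   # choose k only afterwards, by cutting the tree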

9

Hard vs. soft (fuzzy) clustering

• Hard clustering algorithms:
- assign each pattern to a single cluster during operation and output
- hclust, diana, kmeans
• Fuzzy clustering algorithms:
- assign degrees of membership in several groups
- fanny
- fanny membership sub-object: soft clustering results
- fanny clustering sub-object: hard clustering results
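
A sketch of the two result views in one fanny fit (cluster package, built-in USArrests data, scaled here as an assumption to put the variables on one footing):

library(cluster)

fz <- fanny(scale(USArrests), k = 3)
head(fz$membership)   # soft result: degree of membership in every cluster
head(fz$clustering)   # hard result: the single best cluster per object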

10

K-Means

- the K cluster centres are chosen at random (decide beforehand how many cluster centres you want)
- each element (data point) is assigned to the nearest cluster centre
- each cluster centre is re-set to the mean of its whole cluster
o if this changed the assignment of individual points → reassign them
➔ repeat the whole procedure until nothing changes any more
- problem: because the points are set at random, the resulting partition can depend strongly on the starting point
➔ the algorithm has to be run many times, and K has to be estimated well
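
A k-means sketch in base R on the built-in USArrests data; nstart re-runs the algorithm from several random starting centres and keeps the best solution, which is exactly the remedy for the start-dependence described above.

set.seed(42)
km <- kmeans(scale(USArrests), centers = 3, nstart = 25)

km$centers        # the final cluster centres
km$cluster        # cluster id assigned to every object
km$tot.withinss   # within-cluster sum of squares (lower = tighter clusters)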

11

there are three ways to define the distance to a cluster (or within a cluster) - which ones?

average linkage, single linkage, complete linkage

12

average linkage:

- use the average of all pairwise distance values between the two clusters
--> average linkage merges the closest rows

13

complete linkage

o use the distance to the farthest point of the other cluster, i.e. the largest pairwise distance between the two clusters
o problem: large clusters rarely take in new members

14

single linkage

- use the smallest pairwise distance value between the two clusters
--> single linkage merges the closest rows
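
In R all three rules are just the method argument of hclust(); same data, same distances, possibly very different trees:

d <- dist(USArrests)

hc_avg    <- hclust(d, method = "average")    # mean of all pairwise distances
hc_single <- hclust(d, method = "single")     # smallest pairwise distance
hc_compl  <- hclust(d, method = "complete")   # largest pairwise distance

par(mfrow = c(1, 3))
plot(hc_avg); plot(hc_single); plot(hc_compl)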

15

PCA (general)

• Mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables (PCs).
• The first PC accounts for as much of the variability in the data as possible; each succeeding component accounts for as much of the remaining variability as possible.
• PCA is performed on a covariance or a correlation matrix of your data.
• Use the correlation matrix if the variances of your variables differ largely (scale=TRUE).
• Principal components (PCs) are linear combinations of the original variables, weighted by their contribution to explaining the variance along each orthogonal dimension.
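
A PCA sketch with base R's prcomp on the built-in USArrests data; scale. = TRUE performs the PCA on the correlation matrix, as recommended above when variances differ largely.

pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)   # proportion of variance explained per PC
pca$rotation   # loadings: weight of each original variable in each PC
head(pca$x)    # the data expressed in PC coordinates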

16

PCA's usage

• tool for exploratory data analysis
• recognize patterns, outliers, trends, groups
• for gene expression studies it finds the genes that contribute most to the difference between the groups
• reduces dimensionality
• the variables (genes) contribute, each with a certain weight, to the principal components

17

understanding PCA geometrically and via the covariance matrix/eigenvectors

• Geometrically
- Rotation of space to maximize variance for fewer coordinates
• Covariance matrix and eigenvector
- Eigenvector with largest eigenvalue is first principal component (PC)
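
The eigenvector view can be checked numerically in R: the eigenvectors of the covariance matrix of the (here: scaled) data are the PCs, and the largest eigenvalue belongs to PC1.

x  <- scale(USArrests)   # centre and scale the variables
ev <- eigen(cov(x))      # cov(x) is the correlation matrix after scaling

ev$values                 # variances along the PCs, sorted decreasingly
ev$vectors[, 1]           # direction of the first PC ...
prcomp(x)$rotation[, 1]   # ... matches prcomp's PC1 (up to sign)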

18

simple linear regression

- the line through the data must be fitted so that the scatter of the surrounding data points (the residuals) is minimal
➔ fitting
- only sensible for evenly distributed, linear data (putting a straight line through a curve makes no sense…)
- follows the maximum-likelihood principle: the probability that my model generated my data should be maximal

y = ax + b
(dependent variable = regression coefficient * independent variable + intercept)
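
A minimal R sketch with hypothetical height/weight numbers: lm() fits y = ax + b by least squares, which is the maximum-likelihood fit under normally distributed errors.

x   <- c(150, 160, 170, 180, 190)   # hypothetical heights
y   <- c(55, 62, 70, 78, 85)        # hypothetical weights
fit <- lm(y ~ x)

coef(fit)        # intercept b and regression coefficient a
residuals(fit)   # the scatter that the fit minimises (as a squared sum)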

19

multiple lineare regression

• multiple regression (multiple predictor variables P, Q, R) but one outcome
• multiple coefficients:
Y = a + bP + cQ + dR
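
A sketch with simulated data (the column names P, Q, R, Y are hypothetical, chosen to match the formula above):

set.seed(7)
df   <- data.frame(P = rnorm(30), Q = rnorm(30), R = rnorm(30))
df$Y <- 1 + 2 * df$P - 0.5 * df$Q + 0.3 * df$R + rnorm(30, sd = 0.1)

fit <- lm(Y ~ P + Q + R, data = df)
coef(fit)   # a (intercept), b, c, d - one coefficient per predictor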