Model Selection: Summarizing and Visualizing Training Samples Flashcards

(19 cards)

1
Q

Why do we need to characterize the training samples?

A
  • for feature selection and construction
  • for choosing a proper model class and its complexity
  • for preprocessing the samples, e.g. standardization (each component has zero mean and unit variance) or whitening (the covariance structure is removed); see the sketch below
  • for detecting redundancy in the samples
  • to indicate how many prototypes or typical cases are present
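A minimal R sketch of both preprocessing steps, on a made-up data matrix X (ZCA is one of several possible whitening transforms):

    # made-up data: 100 observations, 2 correlated features
    X <- matrix(rnorm(200), ncol = 2)
    X[, 2] <- X[, 2] + X[, 1]

    # standardization: each column gets zero mean and unit variance
    Xs <- scale(X)

    # whitening (ZCA): remove the covariance structure entirely
    E  <- eigen(cov(Xs))
    Xw <- Xs %*% E$vectors %*% diag(1 / sqrt(E$values)) %*% t(E$vectors)
    round(cov(Xw), 10)   # approximately the identity matrix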
2
Q

What is univariate data?

A

Univariate data is simply a set of numbers, that is, a set of scalars. Each number corresponds to one observation.

3
Q

What are measures of center when it comes to univariate data?

A

Measures of the center (all four are computed in the sketch below):
mean: the arithmetic average; sum all values and divide by how many there are.
median: the middle value in the sorted list of numbers.
mode: the value that occurs most often.
mid-range: the midpoint between the minimum and the maximum, (max + min) / 2.

Mathematical characteristics:
the mean minimizes the average squared deviation (L2 norm),
the median minimizes the average absolute deviation (L1 norm),
the mid-range minimizes the maximum absolute deviation (L∞ norm).
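A quick base-R illustration on a made-up sample (R has no built-in mode function, so a table lookup stands in for it):

    x <- c(2, 3, 3, 5, 9, 14)
    mean(x)                                   # arithmetic average: 6
    median(x)                                 # middle of the sorted values: 4
    as.numeric(names(which.max(table(x))))    # mode: 3, the most frequent value
    (min(x) + max(x)) / 2                     # mid-range: 8, midpoint of min and max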

4
Q

What are the three types of distribution?

A

symmetric distributions: mean = median.
Gaussian distribution: this common value is best estimated by the empirical mean.
Laplace distribution: this common value is best estimated by the empirical median; see the simulation below.
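A small simulation (made-up sample sizes) that illustrates this: the estimator with the smaller variance over many repetitions is the better one.

    set.seed(1)
    est <- replicate(2000, {
      g <- rnorm(100)                                         # Gaussian sample
      l <- rexp(100) * sample(c(-1, 1), 100, replace = TRUE)  # Laplace sample
      c(g_mean = mean(g), g_median = median(g),
        l_mean = mean(l), l_median = median(l))
    })
    apply(est, 1, var)   # mean wins for Gaussian, median wins for Laplace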

5
Q

How can we measure variability?

A

range: largest sample minus smallest sample.
variance: the average squared deviation from the sample mean; its square root is the standard deviation.
quantiles: the value below (or above) which a given percentage of the samples falls. Quartiles use 25%, 50% (the median), and 75%; see the sketch below.
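All three measures in base R, on a made-up sample:

    x <- rnorm(200)
    diff(range(x))                   # range: largest minus smallest
    var(x)                           # variance (with the usual n - 1 denominator)
    sd(x)                            # standard deviation: square root of the variance
    quantile(x, c(0.25, 0.5, 0.75))  # the three quartiles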

6
Q

The default statistics shown for a boxplot are:

A
  • the median as a horizontal bar
  • the box ranging from the lower to the upper quartile
  • whiskers ranging to the most extreme observations that are not outliers
  • outliers as points; outliers are observations that deviate from the upper or lower quartile by more than fact times the interquartile range. In R the default is fact = 1.5 (see the sketch below).
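A minimal sketch in R; the boxplot argument 'range' plays the role of fact:

    x <- c(rnorm(100), 6)     # made-up sample plus an artificial outlier
    boxplot(x, range = 1.5)   # 1.5 is R's default, as stated above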
7
Q

Explain Histograms

A

Histogram: a graphical representation of the data distribution that shows tabulated frequencies as adjacent rectangles erected over discrete intervals (bins).
* area of a rectangle: equal to the frequency of the observations in its interval
* equidistant bins: heights of the rectangles are proportional to the frequency of the observations

Histograms help to assess:
* spread or variation
* general shape
* peaks
* low-density regions
* outliers

An informative overview of the observations is obtained with the R command hist(), as sketched below.
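A minimal hist() sketch on made-up data:

    x <- rnorm(500)
    hist(x, breaks = 20)   # 20 equidistant bins, heights show frequencies
    hist(x, freq = FALSE)  # density scale: total rectangle area is 1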

8
Q

What happens when smoothing the surface of a histogram?

A

Smoothing the surface of a histogram leads to a probability density function, in the limit where the number of samples goes to infinity and the bin widths go to zero. In general, probability density functions are obtained by kernel density estimation (KDE), a non-parametric method (except for the bandwidth) also called the Parzen-Rosenblatt window method; a sketch follows.
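A minimal sketch of this transition in R, overlaying a KDE (R's density()) on a histogram of made-up data:

    x <- rnorm(500)
    hist(x, freq = FALSE)        # histogram on the density scale
    lines(density(x), lwd = 2)   # KDE smooths it into a density curve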

9
Q

What is the trickiest part of KDE?

A

The trickiest part of KDE is bandwidth selection:
* too small: many peaks and a wiggly estimate (overfitting)
* too large: peaks vanish and details are lost (underfitting)

For Gaussian kernels there is a rule of thumb for the bandwidth (Silverman's rule). The closer the true density is to a Gaussian, the better the estimate; see the sketch below.
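A minimal sketch of the bandwidth effect on made-up bimodal data (R's default bandwidth bw.nrd0 implements Silverman's rule of thumb):

    x <- c(rnorm(100), rnorm(100, mean = 4))    # made-up bimodal sample
    plot(density(x))                            # default: rule-of-thumb bandwidth
    lines(density(x, bw = 0.05), col = "red")   # too small: wiggly (overfitting)
    lines(density(x, bw = 2), col = "blue")     # too large: peaks merge (underfitting)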

10
Q

What is a violin plot?

A

A violin plot combines a boxplot with a density estimate: a rotated kernel density is drawn on each side of the boxplot (see the sketch below).
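A minimal sketch, assuming the vioplot package is installed:

    # install.packages("vioplot") if needed
    library(vioplot)
    x <- c(rnorm(100), rnorm(50, mean = 3))   # made-up sample
    vioplot(x)   # boxplot with a mirrored kernel density on each side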

11
Q

What is bivariate data?

A

bivariate data: two scalar variables, i.e. pairs of data points
y: response, dependent variable, target, output, label
x: explanatory variable, independent variable, regressor, feature

response is caused by the explanatory variable -> causality;
however, statistical or machine learning methods cannot determine causality

12
Q

How does a scatter plot work?

A

A scatter plot shows each observation as a point, where the x-coordinate is the value of the first variable and the y-coordinate is the value of the second variable.
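A minimal sketch on made-up data:

    x <- rnorm(100)           # explanatory variable
    y <- 2 * x + rnorm(100)   # response
    plot(x, y)                # one point per observation (x_i, y_i)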

13
Q

How does linear dependence work in scatter plots?

A
  • two variables perfectly linearly dependent: the points lie on a line
  • two variables linearly dependent to some degree: the points lie near a line
  • the closer the points are to a line, the higher the linear dependence
14
Q

For bivariate data, what is the measure of the linear correlation between two variables?

A

For the bivariate data (x1,y1), (x2,y2), …, (xN,yN), Pearson's sample correlation coefficient r is a measure of the linear correlation (dependence) between the two variables x and y:

r = Σi (xi − mx)(yi − my) / √( Σi (xi − mx)² · Σi (yi − my)² ), where mx and my are the sample means.

For a perfect linear dependency the correlation coefficient is r = 1 or r = −1.
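A minimal sketch on made-up data, computing r both with cor() and directly from the definition:

    x <- rnorm(100)
    y <- 2 * x + rnorm(100)
    cor(x, y)   # Pearson's sample correlation coefficient r
    # the same value from the definition:
    sum((x - mean(x)) * (y - mean(y))) /
      sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))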

15
Q

For bivariate data, what is the test of independence between two variables?

A

A test for a correlation coefficient of ρ = 0: a t-test with the null hypothesis H0 that ρ = 0. The test is only valid if both variables are drawn from a normal distribution.
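In R this test is available as cor.test(); a minimal sketch on made-up data:

    x <- rnorm(100)
    y <- 2 * x + rnorm(100)
    cor.test(x, y)   # t-test of H0: rho = 0, with t, df, p-value, and CI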

16
Q

What does linear regression do?

A

Linear regression: fit a line to bivariate data.
It extracts information about the relation of the two variables y and x via the functional relationship y = a + b·x (see the sketch below).
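A minimal sketch with R's lm(), on made-up data with true intercept 1 and slope 2:

    x <- rnorm(100)
    y <- 1 + 2 * x + rnorm(100)
    fit <- lm(y ~ x)          # least-squares fit of y = a + b*x
    coef(fit)                 # estimated intercept a and slope b
    plot(x, y); abline(fit)   # data with the fitted line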

17
Q

What are some unsupervised methods to summarize multivariate samples?

A
  • principal component analysis
  • independent component analysis
  • factor analysis
  • projection pursuit
  • k-means clustering (see the sketch after this list)
  • hierarchical clustering
  • mixture models: Gaussian mixtures
  • self-organizing maps
  • kernel density estimation
  • hidden Markov models
  • Markov networks (Markov random fields)
  • restricted Boltzmann machines
  • neural networks: auto-associators (autoencoders), unsupervised deep nets
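One method from this list, k-means, in a minimal sketch on made-up two-cluster data:

    X <- rbind(matrix(rnorm(100), ncol = 2),
               matrix(rnorm(100, mean = 3), ncol = 2))
    km <- kmeans(X, centers = 2)           # cluster into k = 2 prototypes
    plot(X, col = km$cluster)              # points colored by cluster
    points(km$centers, pch = 8, cex = 2)   # cluster centers (prototypes)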
18
Q

What are some descriptive methods to summarize multivariate data?

A

Projection methods:
* new representation of the objects
* down-projection into a lower-dimensional space that keeps the neighborhoods
* finding structure in the data

Examples: principal component analysis (sketched below), multidimensional scaling

descriptive:
* map to a lower-dimensional space
* compact and non-redundant data storage or transmission
* data visualization
* feature selection
* preprocessing for subsequent data analysis
* descriptive model with a unique inverse -> generative framework which selects the inverse model, e.g. density estimation
* descriptive model without an inverse -> principal curves, multidimensional scaling
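A minimal PCA down-projection with R's prcomp(), on made-up 3-D data:

    X <- matrix(rnorm(300), ncol = 3)   # made-up 3-D data
    pca <- prcomp(X, scale. = TRUE)     # principal component analysis
    Z <- pca$x[, 1:2]                   # down-projection to 2-D
    plot(Z)                             # visualization in the projected space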

19
Q

What are some generative methods to summarize multivariate data?

A

Generative models:
* build a model of the observed data
* match the observed data density

Examples: density estimation, factor analysis, independent component analysis, generative topographic mapping

generative:
* model or simulate the real world
* the model samples from the same distribution as the real-world observations
* describe the data generation process

Advantages of generative models (a minimal sketch follows this list):
* determining model parameters, e.g. calcium concentration, reaction rate, distribution of channels
* generating new simulated observations
* simulating in unknown regimes, e.g. new parameters
* assessing the noise and the signal in the data
* supplying distributions and error bars for latent variables
* detecting outliers as very unlikely observations
* detecting and correcting noise in the observations
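A minimal generative sketch, assuming a single Gaussian as the model (all numbers are made up):

    x <- rnorm(200, mean = 5, sd = 2)   # observed data
    mu <- mean(x); s <- sd(x)           # determine the model parameters
    xnew <- rnorm(50, mu, s)            # generate new simulated observations
    lik <- dnorm(x, mu, s)              # likelihood of each observation
    x[lik < quantile(lik, 0.01)]        # flag very unlikely points as outliers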