Data Science Practice Flashcards

(31 cards)

1
Q

What is tidy data?

A

There is one row per observation and one column per variable

2
Q

What are the basic tidyr (tidyverse) tidying functions?

A

Gather, spread, separate and unite

(In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider.)
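A minimal sketch of the four verbs, assuming tidyr is installed (the data frames here are made up for illustration):

```r
library(tidyr)

# A small untidy data set: one variable (year) spread across columns
untidy <- data.frame(country = c("A", "B"), `2019` = c(1, 2), `2020` = c(3, 4),
                     check.names = FALSE)

# gather: collapse the year columns into key/value pairs (one row per observation)
tidy <- gather(untidy, key = "year", value = "cases", `2019`, `2020`)

# spread: the inverse operation
back <- spread(tidy, key = "year", value = "cases")

# separate / unite: split or join values within a cell
df  <- data.frame(rate = c("3/10", "5/20"))
sep <- separate(df, rate, into = c("cases", "population"), sep = "/")
uni <- unite(sep, "rate", cases, population, sep = "/")
```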

3
Q

What is ‘gather’ used for?

A

When one variable is spread across columns

4
Q

What is ‘spread’ used for?

A

When one observation is split across rows

5
Q

What is ‘separate’ used for?

A

When one cell contains more than one value

6
Q

What is ‘unite’ used for?

A

When a value is split over multiple cells

7
Q

What does the ‘summarise’ function do?

A

It computes aggregate statistics over columns, typically producing one summary row per group when combined with group_by
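For example, a sketch assuming dplyr is installed, using the built-in iris data:

```r
library(dplyr)

# One summary row per species: mean sepal length and group size
by_species <- iris %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal.Length), n = n())

by_species
```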

8
Q

How are missing values in a join handled?

A

They are represented as NA

9
Q

How does mapping in R work?

A

map() always returns a list; typed variants such as map_dbl() return atomic vectors of the corresponding type.

Pass na.rm = TRUE (to summary functions such as mean) to skip NA values.
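A small sketch, assuming purrr is installed:

```r
library(purrr)

x <- list(a = c(1, 2, NA), b = c(3, 4))

# map() returns a list
map(x, mean, na.rm = TRUE)        # list of two numbers

# map_dbl() returns a named numeric vector
map_dbl(x, mean, na.rm = TRUE)    # c(a = 1.5, b = 3.5)
```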

10
Q

How can linear regression be intuitively constructed?

A

By taking the average gradient between every pair of points, weighted by their squared horizontal (x) distance.

This is called “ordinary least squares regression”
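This intuition can be checked in base R (on made-up data): the weighted average of all pairwise gradients, with weights equal to the squared x-distance, reproduces the slope from lm():

```r
set.seed(1)
x <- runif(20)
y <- 2 * x + rnorm(20, sd = 0.1)

# Every pair of points (i < j): gradient and squared-x-distance weight
pairs <- combn(20, 2)
i <- pairs[1, ]; j <- pairs[2, ]
grad   <- (y[i] - y[j]) / (x[i] - x[j])
weight <- (x[i] - x[j])^2

# Weighted average of pairwise gradients equals the OLS slope
slope_pairs <- sum(weight * grad) / sum(weight)
slope_ols   <- coef(lm(y ~ x))[["x"]]
all.equal(slope_pairs, slope_ols)   # TRUE
```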

11
Q

What are the assumptions for ordinary least squares regression?

A
  • Training data is representative (as n approaches infinity)
  • Does NOT assume that alpha and beta are normally distributed
  • Interval predictions additionally require that Y | X = x is normally distributed
  • Equal error variance (homoscedasticity)
12
Q

How can the equality of variance assumption be side-stepped?

A

Re-sampling: take samples from the data set with replacement (bootstrap the standard errors).

The “sandwich estimator” gives standard errors for the coefficients that remain valid without the equal-variance assumption.

13
Q

What are the advantages of harmonic functions?

A

They are simple, exactly periodic, can be easily extrapolated.

Due to the Fourier Series, they require few terms to represent smooth changes

14
Q

What are the disadvantages of harmonic functions?

A

They are exactly periodic so can’t account for variation between seasons

They require a lot of terms for sharp changes

15
Q

What are splines?

A

Piece-wise polynomials joined at points called knots.

A spline of order d has d-1 continuous derivatives at the knots.

They cannot be reliably extrapolated into the future.

16
Q

How do linear splines and cubic splines compare?

A

Cubic splines have a smooth appearance

Linear splines have interpretable coefficients

17
Q

What is the apparent error?

A

An estimate of MSPE based on the residual variance from fitting the model.

It would be unbiased only if the model had been chosen without looking at the data.

18
Q

At a high-level, what leads to overfitting?

A

Adding more variables can only decrease (or leave unchanged) the apparent error, so the apparent error always favours more complex models

19
Q

How is overfitting prevented?

A

With a “complexity penalty”

20
Q

What is AIC?

A

Akaike’s information criterion is a measure of how bad a model is (lower is better):

AIC = n log(RSS) + 2p
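A base-R sketch on the built-in mtcars data. For a fixed data set, this RSS form ranks linear models the same way as R's built-in AIC(), differing only by an additive constant:

```r
# AIC computed from the residual sum of squares
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
p   <- length(coef(fit))        # number of fitted coefficients
rss <- sum(resid(fit)^2)

aic_rss <- n * log(rss) + 2 * p
```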

21
Q

How can an unbiased estimator be provided without wasting data?

A

Use cross-validation or bootstrapping

22
Q

What is the process for cross-validation?

A
  • Divide the data into k folds
  • Train the model on k-1 folds and predict on the k-th; compute the squared error for each observation in the hold-out set
  • Repeat, leaving each fold out in turn
  • Average all of the squared errors to give MSPE

Note: a new model is fitted for each fold
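The steps above can be sketched in base R (5-fold cross-validation for a linear model on the built-in mtcars data):

```r
set.seed(1)
k     <- 5
n     <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))   # assign each row to a fold

sq_err <- numeric(n)
for (f in 1:k) {
  hold <- folds == f
  fit  <- lm(mpg ~ wt + hp, data = mtcars[!hold, ])   # a new model per fold
  pred <- predict(fit, newdata = mtcars[hold, ])
  sq_err[hold] <- (mtcars$mpg[hold] - pred)^2
}
mspe <- mean(sq_err)   # average squared error over all hold-out predictions
```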

23
Q

How can the best model be found?

A

regsubsets (from the leaps package) can use “backward step-wise search” to find the best model for each number of variables
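A sketch, assuming the leaps package (which provides regsubsets) is installed:

```r
library(leaps)

# Backward step-wise search: best model for each number of variables
best <- regsubsets(mpg ~ ., data = mtcars, method = "backward")

summary(best)$which   # which variables are included at each model size
```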

24
Q

What is a meta-strategy?

A

A tuning parameter is used to optimise the model fitting strategy.

This is done by choosing the lambda that gives the smallest MSPE under cross-validation, then using it for the whole dataset

25
Q

What is bootstrapping?

A

Use random samples drawn with replacement to give unbiased estimates about the sample
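A base-R sketch: bootstrap the standard error of a sample mean (on made-up data) and compare it with the theoretical value:

```r
set.seed(1)
x <- rnorm(100, mean = 5)

# Resample the data with replacement many times; each resample gives a mean
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

se_boot   <- sd(boot_means)             # bootstrap standard error
se_theory <- sd(x) / sqrt(length(x))    # should be close to se_boot
```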
26
Q

What is optimism?

A

The expected difference between the true prediction error and the apparent (in-sample) error:

Opt = E[MSPE_true - MSPE_sample]

Bootstrap optimism:

Opt* = E[MSPE_sample - MSPE_boot]
27
Q

What is the overall approach to model fitting?

A

1) Look at the data and find sensible transformations (feature engineering)
28
Q

What is regularisation?

A

Shrinking coefficients towards zero, allowing variables to be partly in the model
29
Q

How can missing values be handled for regression?

A

Re-weighting, or imputation (fit another model to predict the missing values)
30
Q

What is linear discriminant analysis?

A

Classification based on the generative model X = alpha_Y + epsilon.

It assumes that the distribution of X within each group is multivariate normal, with the same variance matrix but different means
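A sketch assuming the MASS package (shipped with R) is available, on the built-in iris data:

```r
library(MASS)

# LDA: multivariate-normal groups with a shared covariance matrix
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)$class

mean(pred == iris$Species)   # training accuracy
```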
31
Q

What is a ROC curve?

A

It shows the trade-off between the true positive rate (y-axis) and the false positive rate (x-axis)