1) Data Science Practice Flashcards
(31 cards)
What is tidy data?
There is one row per observation and one column per variable
What are the basic tidyverse functions?
Gather, spread, separate and unite (all from the tidyr package)
What is ‘gather’ used for?
When one variable is spread across columns
What is ‘spread’ used for?
When one observation is split across rows
What is ‘separate’ used for?
When one cell contains more than one value
What is ‘unite’ used for?
When a value is split over multiple cells
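The four reshaping verbs above can be sketched on invented toy tibbles (the data here is made up for illustration):

```r
library(tidyr)
library(tibble)

# gather: one variable (population) is spread across the year columns
wide <- tibble(country = c("A", "B"), `1999` = c(1, 2), `2000` = c(3, 4))
long <- gather(wide, key = "year", value = "population", `1999`, `2000`)

# spread: one observation split across rows -> back to the wide form
spread(long, key = year, value = population)

# separate: one cell contains two values ("745/19987")
tibble(rate = "745/19987") |>
  separate(rate, into = c("cases", "pop"), sep = "/")

# unite: one value (the year) is split over two cells
tibble(century = "19", year = "99") |>
  unite(full_year, century, year, sep = "")
```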
What does the ‘summarise’ function do?
Compute aggregates based on the columns
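A minimal sketch with invented data: summarise collapses each group to one row of aggregates, here combined with group_by from dplyr:

```r
library(dplyr)
library(tibble)

flights <- tibble(carrier = c("AA", "AA", "UA"), delay = c(10, NA, 5))

by_carrier <- flights |>
  group_by(carrier) |>
  summarise(mean_delay = mean(delay, na.rm = TRUE),  # aggregate per group
            n = n())                                  # rows per group
by_carrier
```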
How are missing values in a join handled?
They are represented as NA
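A toy example (invented tables): in a left join, rows with no match in the other table get NA in the joined columns.

```r
library(dplyr)
library(tibble)

band  <- tibble(name = c("Mick", "John"),  band  = c("Stones", "Beatles"))
plays <- tibble(name = c("John", "Keith"), plays = c("guitar", "guitar"))

joined <- left_join(band, plays, by = "name")
joined  # Mick has no row in `plays`, so his `plays` value is NA
```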
How does mapping in R work?
map produces a list; the typed variants such as map_dbl return an atomic vector of that type
Pass na.rm=TRUE through to the mapped function (e.g. mean) to skip NA values
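A short purrr sketch (toy list): extra arguments after the function are passed through to it, which is how na.rm=TRUE reaches mean.

```r
library(purrr)

xs <- list(a = c(1, 2, NA), b = c(3, 4))

map(xs, mean)                    # list; the NA in `a` makes its mean NA
map_dbl(xs, mean, na.rm = TRUE)  # named double vector; NA is skipped
```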
How can linear regression be intuitively constructed?
By taking the average gradient between every pair of points, weighted by the squared horizontal (x) distance between them.
This is called “ordinary least squares regression”
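The pairwise-gradient view can be checked numerically on toy data (base R only): the weighted average of pairwise slopes matches the slope from lm.

```r
set.seed(1)
x <- runif(20)
y <- 2 * x + rnorm(20, sd = 0.1)

num <- 0; den <- 0
for (i in 1:(length(x) - 1)) {
  for (j in (i + 1):length(x)) {
    w   <- (x[i] - x[j])^2                # weight: squared x-distance
    s   <- (y[i] - y[j]) / (x[i] - x[j])  # gradient between the pair
    num <- num + w * s
    den <- den + w
  }
}
beta_pairwise <- num / den
beta_ols <- coef(lm(y ~ x))[["x"]]
all.equal(beta_pairwise, beta_ols)  # the two slopes agree
```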
What are the assumptions for ordinary least squares regression?
- Training data is representative as n approaches infinity
- Does NOT assume that alpha and beta are normally distributed
- Interval predictions require assuming that Y | X=x has a normal distribution
- Equal variance
How can the equality of variance assumption be side-stepped?
- Re-sampling (bootstrapping): take samples from the data set with replacement and refit, using the spread of the refitted estimates as the variance
- The “sandwich estimator” estimates the co-variance of the coefficients in a way that remains valid under unequal variance
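The bootstrap part can be sketched on deliberately heteroscedastic toy data (base R only; the error spread here grows with x):

```r
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 * x + rnorm(n, sd = abs(x))  # unequal variance by construction

B <- 1000
slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)   # resample rows with replacement
  coef(lm(y[idx] ~ x[idx]))[[2]]     # refit and keep the slope
})
sd(slopes)  # bootstrap standard error of the slope
```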
What are the advantages of harmonic functions?
They are simple, exactly periodic, and can be easily extrapolated.
Due to the Fourier Series, they require few terms to represent smooth changes
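A sketch of both points (the period of 12 is an assumption, as if for monthly data): harmonic terms are just extra columns in a linear model, so lm fits them and predict extrapolates them.

```r
set.seed(1)
t <- 1:48
y <- 10 + 3 * sin(2 * pi * t / 12) + rnorm(48, sd = 0.5)

fit <- lm(y ~ sin(2 * pi * t / 12) + cos(2 * pi * t / 12))
preds <- predict(fit, newdata = data.frame(t = 49:60))  # one year ahead
preds
```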
What are the disadvantages of harmonic functions?
They are exactly periodic so can’t account for variation between seasons
They require a lot of terms for sharp changes
What are splines?
Piece-wise polynomials joined at break-points called knots.
A spline of order d has d-1 continuous derivatives at the knots
They cannot be extrapolated to the future
Compare linear splines and cubic splines?
Cubic splines have a smooth appearance
Linear splines have interpretable coefficients
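The comparison can be sketched with bs() from the splines package (shipped with R); the knot positions here are arbitrary, and only the degree differs between the two fits.

```r
library(splines)

set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# degree 1: kinked pieces, but each coefficient is a slope change
fit_lin <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5), degree = 1))
# degree 3: smooth appearance, coefficients harder to interpret
fit_cub <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5), degree = 3))
```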
What is the apparent error?
An estimate of MSPE based on the residual variance from fitting the model.
It would be unbiased only if the model had been fitted without looking at the data; in practice it underestimates the true MSPE.
At a high-level, what leads to overfitting?
Adding more variables never increases the apparent error, so selecting by apparent error always favours the more complex model
How is overfitting prevented?
With a “complexity penalty”
What is AIC?
Akaike’s information criterion is a measure of how bad a model is (smaller is better):
AIC = n log(RSS) + 2p
where n is the number of observations, RSS the residual sum of squares and p the number of parameters
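The formula can be computed by hand on toy data (R's built-in AIC() uses a likelihood-based form that differs by an additive constant for fixed n, so model rankings agree):

```r
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- d$x1 + rnorm(50)  # x2 is pure noise

aic_hand <- function(fit) {
  rss <- sum(resid(fit)^2)
  n   <- length(resid(fit))
  p   <- length(coef(fit))
  n * log(rss) + 2 * p   # the flashcard's n log(RSS) + 2p
}

aic_hand(lm(y ~ x1, d))       # small model
aic_hand(lm(y ~ x1 + x2, d))  # pays the +2 penalty for the noise variable
```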
How can an unbiased estimator be provided without wasting data?
Use cross-validation or bootstrapping
What is the process for cross-validation?
- Divide the data into k folds
- Train the model on k-1 folds and predict on the k-th; compute the squared error for each observation in the hold-out set
- Repeat, leaving each fold out in turn
- Average all of the squared errors to give MSPE
Note: a new model is fitted for each fold
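The steps above can be sketched in base R on toy data (k = 5, linear model):

```r
set.seed(1)
k <- 5
d <- data.frame(x = runif(100))
d$y <- 2 * d$x + rnorm(100)
fold <- sample(rep(1:k, length.out = nrow(d)))  # assign each row to a fold

sq_err <- numeric(0)
for (i in 1:k) {
  train <- d[fold != i, ]                  # fit a fresh model on k-1 folds
  test  <- d[fold == i, ]
  fit   <- lm(y ~ x, data = train)
  sq_err <- c(sq_err, (test$y - predict(fit, newdata = test))^2)
}
mean(sq_err)  # cross-validated estimate of MSPE
```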
How can the best model be found?
The regsubsets function (from the leaps package) uses “backwards step-wise search” to find the best model for each number of variables
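A usage sketch (toy data with five candidate predictors; requires the leaps package):

```r
library(leaps)

set.seed(1)
d <- data.frame(matrix(rnorm(100 * 5), ncol = 5))  # columns X1..X5
d$y <- d$X1 + 0.5 * d$X2 + rnorm(100)

fit <- regsubsets(y ~ ., data = d, method = "backward")
summary(fit)$which  # one row per model size: which variables are included
```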
What is a meta-strategy?
A tuning parameter is used to optimise the model fitting strategy.
This is done by choosing the lambda which gives the smallest MSPE under cross-validation, then refitting with that lambda on the whole dataset
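A sketch of the meta-strategy on toy data, where the tuning parameter is polynomial degree (degrees 1 to 5 are arbitrary candidates): pick the degree with the smallest cross-validated MSPE, then refit on all the data.

```r
set.seed(1)
d <- data.frame(x = runif(100, -1, 1))
d$y <- d$x^3 + rnorm(100, sd = 0.1)
fold <- sample(rep(1:5, length.out = nrow(d)))

cv_mspe <- sapply(1:5, function(deg) {
  errs <- unlist(lapply(1:5, function(i) {
    fit <- lm(y ~ poly(x, deg), data = d[fold != i, ])
    (d$y[fold == i] - predict(fit, newdata = d[fold == i, , drop = FALSE]))^2
  }))
  mean(errs)  # cross-validated MSPE for this degree
})

best  <- which.min(cv_mspe)
final <- lm(y ~ poly(x, best), data = d)  # refit chosen degree on all data
```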