1) Data Science Practice Flashcards
(31 cards)
What is tidy data?
There is one row per observation and one column per variable
What are the basic tidyverse functions?
Gather, spread, separate and unite (all from the tidyr package)
What is ‘gather’ used for?
When one variable is spread across columns
What is ‘spread’ used for?
When one observation is split across rows
What is ‘separate’ used for?
When one cell contains more than one value
What is ‘unite’ used for?
When a value is split over multiple cells
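The four reshaping verbs above can be sketched on invented toy tibbles (the data here is made up for illustration):

```r
library(tidyr)
library(tibble)

# gather: one variable (population) is spread across the year columns
wide <- tibble(country = c("A", "B"), `1999` = c(1, 2), `2000` = c(3, 4))
long <- gather(wide, key = "year", value = "population", `1999`, `2000`)

# spread: one observation split across rows -> back to the wide form
spread(long, key = year, value = population)

# separate: one cell contains two values ("745/19987")
tibble(rate = "745/19987") |>
  separate(rate, into = c("cases", "pop"), sep = "/")

# unite: one value (the year) is split over two cells
tibble(century = "19", year = "99") |>
  unite(full_year, century, year, sep = "")
```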
What does the ‘summarise’ function do?
Compute aggregates based on the columns
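A minimal sketch with invented data: summarise collapses each group to one row of aggregates, here combined with group_by from dplyr:

```r
library(dplyr)
library(tibble)

flights <- tibble(carrier = c("AA", "AA", "UA"), delay = c(10, NA, 5))

by_carrier <- flights |>
  group_by(carrier) |>
  summarise(mean_delay = mean(delay, na.rm = TRUE),  # aggregate per group
            n = n())                                  # rows per group
by_carrier
```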
How are missing values in a join handled?
They are represented as NA
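A toy example (invented tables): in a left join, rows with no match in the other table get NA in the joined columns.

```r
library(dplyr)
library(tibble)

band  <- tibble(name = c("Mick", "John"),  band  = c("Stones", "Beatles"))
plays <- tibble(name = c("John", "Keith"), plays = c("guitar", "guitar"))

joined <- left_join(band, plays, by = "name")
joined  # Mick has no row in `plays`, so his `plays` value is NA
```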
How does mapping in R work?
map produces a list; the typed variants such as map_dbl return an atomic vector of that type
Pass na.rm=TRUE through to the mapped function (e.g. mean) to skip NA values
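A short purrr sketch (toy list): extra arguments after the function are passed through to it, which is how na.rm=TRUE reaches mean.

```r
library(purrr)

xs <- list(a = c(1, 2, NA), b = c(3, 4))

map(xs, mean)                    # list; the NA in `a` makes its mean NA
map_dbl(xs, mean, na.rm = TRUE)  # named double vector; NA is skipped
```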
How can linear regression be intuitively constructed?
By taking the average gradient between every pair of points, weighted by the squared horizontal (x) distance between them.
This is called “ordinary least squares regression”
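The pairwise-gradient view can be checked numerically on toy data (base R only): the weighted average of pairwise slopes matches the slope from lm.

```r
set.seed(1)
x <- runif(20)
y <- 2 * x + rnorm(20, sd = 0.1)

num <- 0; den <- 0
for (i in 1:(length(x) - 1)) {
  for (j in (i + 1):length(x)) {
    w   <- (x[i] - x[j])^2                # weight: squared x-distance
    s   <- (y[i] - y[j]) / (x[i] - x[j])  # gradient between the pair
    num <- num + w * s
    den <- den + w
  }
}
beta_pairwise <- num / den
beta_ols <- coef(lm(y ~ x))[["x"]]
all.equal(beta_pairwise, beta_ols)  # the two slopes agree
```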
What are the assumptions for ordinary least squares regression?
- Training data is representative as n approaches infinity
- Does NOT assume that alpha and beta are normally distributed
- Interval predictions require assuming that Y | X=x has a normal distribution
- Equal variance
How can the equality of variance assumption be side-stepped?
- Re-sampling (bootstrapping): take samples from the data set with replacement and refit, using the spread of the refitted estimates as the variance
- The “sandwich estimator” estimates the co-variance of the coefficients in a way that remains valid under unequal variance
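The bootstrap part can be sketched on deliberately heteroscedastic toy data (base R only; the error spread here grows with x):

```r
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 * x + rnorm(n, sd = abs(x))  # unequal variance by construction

B <- 1000
slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)   # resample rows with replacement
  coef(lm(y[idx] ~ x[idx]))[[2]]     # refit and keep the slope
})
sd(slopes)  # bootstrap standard error of the slope
```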
What are the advantages of harmonic functions?
They are simple, exactly periodic, and can be easily extrapolated.
Due to the Fourier Series, they require few terms to represent smooth changes
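A sketch of both points (the period of 12 is an assumption, as if for monthly data): harmonic terms are just extra columns in a linear model, so lm fits them and predict extrapolates them.

```r
set.seed(1)
t <- 1:48
y <- 10 + 3 * sin(2 * pi * t / 12) + rnorm(48, sd = 0.5)

fit <- lm(y ~ sin(2 * pi * t / 12) + cos(2 * pi * t / 12))
preds <- predict(fit, newdata = data.frame(t = 49:60))  # one year ahead
preds
```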
What are the disadvantages of harmonic functions?
They are exactly periodic so can’t account for variation between seasons
They require a lot of terms for sharp changes
What are splines?
Piece-wise polynomials joined at break-points called knots.
A spline of order d has d-1 continuous derivatives at the knots
They cannot be extrapolated to the future
Compare linear splines and cubic splines?
Cubic splines have a smooth appearance
Linear splines have interpretable coefficients
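The comparison can be sketched with bs() from the splines package (shipped with R); the knot positions here are arbitrary, and only the degree differs between the two fits.

```r
library(splines)

set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# degree 1: kinked pieces, but each coefficient is a slope change
fit_lin <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5), degree = 1))
# degree 3: smooth appearance, coefficients harder to interpret
fit_cub <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5), degree = 3))
```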
What is the apparent error?
An estimate of MSPE based on the residual variance from fitting the model.
It would be unbiased only if the model had been fitted without looking at the data; in practice it underestimates the true MSPE.
At a high-level, what leads to overfitting?
Adding more variables never increases the apparent error, so selecting by apparent error always favours the more complex model
How is overfitting prevented?
With a “complexity penalty”
What is AIC?
Akaike’s information criterion is a measure of how bad a model is (smaller is better):
AIC = n log(RSS) + 2p
where n is the number of observations, RSS the residual sum of squares and p the number of parameters
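The formula can be computed by hand on toy data (R's built-in AIC() uses a likelihood-based form that differs by an additive constant for fixed n, so model rankings agree):

```r
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- d$x1 + rnorm(50)  # x2 is pure noise

aic_hand <- function(fit) {
  rss <- sum(resid(fit)^2)
  n   <- length(resid(fit))
  p   <- length(coef(fit))
  n * log(rss) + 2 * p   # the flashcard's n log(RSS) + 2p
}

aic_hand(lm(y ~ x1, d))       # small model
aic_hand(lm(y ~ x1 + x2, d))  # pays the +2 penalty for the noise variable
```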
How can an unbiased estimator be provided without wasting data?
Use cross-validation or bootstrapping
What is the process for cross-validation?
- Divide the data into k folds
- Train the model on k-1 folds and predict on the k-th; compute the squared error for each observation in the hold-out set
- Repeat, leaving each fold out in turn
- Average all of the squared errors to give MSPE
Note: a new model is fitted for each fold
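The steps above can be sketched in base R on toy data (k = 5, linear model):

```r
set.seed(1)
k <- 5
d <- data.frame(x = runif(100))
d$y <- 2 * d$x + rnorm(100)
fold <- sample(rep(1:k, length.out = nrow(d)))  # assign each row to a fold

sq_err <- numeric(0)
for (i in 1:k) {
  train <- d[fold != i, ]                  # fit a fresh model on k-1 folds
  test  <- d[fold == i, ]
  fit   <- lm(y ~ x, data = train)
  sq_err <- c(sq_err, (test$y - predict(fit, newdata = test))^2)
}
mean(sq_err)  # cross-validated estimate of MSPE
```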
How can the best model be found?
The regsubsets function (from the leaps package) uses “backwards step-wise search” to find the best model for each number of variables
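A usage sketch (toy data with five candidate predictors; requires the leaps package):

```r
library(leaps)

set.seed(1)
d <- data.frame(matrix(rnorm(100 * 5), ncol = 5))  # columns X1..X5
d$y <- d$X1 + 0.5 * d$X2 + rnorm(100)

fit <- regsubsets(y ~ ., data = d, method = "backward")
summary(fit)$which  # one row per model size: which variables are included
```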
What is a meta-strategy?
A tuning parameter is used to optimise the model fitting strategy.
This is done by choosing the lambda which gives the smallest MSPE under cross-validation, then refitting with that lambda on the whole dataset
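A sketch of the meta-strategy on toy data, where the tuning parameter is polynomial degree (degrees 1 to 5 are arbitrary candidates): pick the degree with the smallest cross-validated MSPE, then refit on all the data.

```r
set.seed(1)
d <- data.frame(x = runif(100, -1, 1))
d$y <- d$x^3 + rnorm(100, sd = 0.1)
fold <- sample(rep(1:5, length.out = nrow(d)))

cv_mspe <- sapply(1:5, function(deg) {
  errs <- unlist(lapply(1:5, function(i) {
    fit <- lm(y ~ poly(x, deg), data = d[fold != i, ])
    (d$y[fold == i] - predict(fit, newdata = d[fold == i, , drop = FALSE]))^2
  }))
  mean(errs)  # cross-validated MSPE for this degree
})

best  <- which.min(cv_mspe)
final <- lm(y ~ poly(x, best), data = d)  # refit chosen degree on all data
```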