Session 6 - Model Assessment Flashcards

1
Q

What do we observe in statistical modelling?

How can this relationship be written?

A

A response variable Y and one or more predictors X1, X2, …, Xp, and we assume there is some relationship between Y and X1, X2, …, Xp.

Y = f(X) + ε

with
X = (X1, X2, …, Xp)
ε = a random error term which is independent of X and has a mean of 0
f(X) = a function that describes the systematic relationship between Y and X

2
Q

What does statistical learning refer to?

A

A set of approaches for estimating f()

3
Q

In statistical learning f() can be what?

A

Unknown or known (or, better, "assumed")

4
Q

In statistical learning if f() is known, then we just need to estimate what?

A

The parameters:

Example: Linear regression
Y = B0 + B1X1 + B2X2 + … + BkXk + ε
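A minimal sketch of this case (f() assumed linear, so only the coefficients need estimating), using scikit-learn with simulated illustrative data; the variable names and numbers are not from the course material:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated illustrative data: two predictors and a known linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

# "Learning" f() here means estimating the parameters B0, B1, B2
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # estimates of B0 and (B1, B2)
```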

5
Q

In statistical learning, what do we need to identify if f() is unknown?

A

Variables, linear and non-linear relationships and the best methodology (“machine”) to estimate f()

Examples: (feature) variable selection, polynomial regression, generalized additive models, regularized regression, random forests, support vector machines

6
Q

If our aim is inference what are we describing?

This is usually summarised as what?

A

The way Y is affected by changes in X

Usually summarized as average changes (if we increase X1 by 1 unit, Y will change on average by B1 units).

7
Q

What is the aim of inference?

A

Understanding how Y changes as a function of X; we want to know the exact form of f().

8
Q

What kind of estimator do we want from an inferential model?

A

One that does not differ from the population parameter in a systematic manner (unbiased).

A consistent unbiased estimator reaches the true value with increasing sample size.

Among unbiased estimators we choose the one with the minimum sampling variance (efficient).

If we estimate the parameter many times, then the mean of all estimates is close to the true population value and the variance of all estimates is the smallest possible!

An unbiased estimator can be, for example, the mean or the median, but it has been shown that the mean has less sampling variance than the median.

9
Q

What does the Gauss-Markov Theorem tell us?

A

Within the class of linear and unbiased estimators, the Ordinary Least Squares (OLS) estimator is the most efficient.

10
Q

How is the Best linear unbiased estimator (BLUE) estimated?

A

By minimizing the sum of squares of the differences between observed and predicted values in a given data set (OLS method).

11
Q

With a given data set why does the OLS provide the smallest confidence intervals of all unbiased estimators?

A

Because it is unbiased and has the smallest possible Mean Squared Error (MSE) within the class of linear and unbiased estimators.

12
Q

How can we estimate prediction error?

A

An estimate of prediction accuracy for continuous outcomes is the Mean Squared Error (MSE):

MSE = Σ(Ŷi − Yi)² / n, with
Ŷi = estimated Y for case i
Yi = observed Y for case i
n = number of cases in the hold-out sample
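A minimal sketch of this calculation in Python; the observed and predicted values below are made-up illustrative numbers:

```python
import numpy as np

y_obs = np.array([3.1, 2.4, 5.0, 4.2])   # observed Y_i in the hold-out sample
y_hat = np.array([2.9, 2.8, 4.6, 4.5])   # estimated (predicted) Y_i

mse = np.mean((y_hat - y_obs) ** 2)      # MSE = sum((Yhat_i - Y_i)^2) / n
print(mse)
```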

13
Q

What are the assumptions for a linear regression?

A

Linearity: the response variable Y is linearly related to the independent variables (X’s)

Independence: errors (and hence the observations on Y) are independent of each other

Normality: errors (and hence the observations on Y) are normally distributed

Homoscedasticity: errors (and hence Y’s) have constant variance

14
Q

What is important for validity of inference?

A

Assumptions

15
Q

What happens if the assumptions for a linear regression model are fulfilled?

A

We can form 95% confidence intervals around the estimated regression coefficients B1, B2, B3..

We can test the null hypotheses that B1= 0, B2= 0, B3= 0…
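A minimal sketch of what this looks like in practice, assuming statsmodels and simulated illustrative data (not from the course material):

```python
import numpy as np
import statsmodels.api as sm

# Simulated illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

# OLS fit; add_constant adds the intercept B0
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.conf_int(alpha=0.05))  # 95% confidence intervals for B0, B1, B2
print(fit.pvalues)               # tests of the null hypotheses B_j = 0
```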

16
Q

What is statistical modelling?

A

The formalization of relationships between variables in the form of mathematical equations.

We infer the process by which the data were generated!

Theory-driven.

17
Q

What are statistical models, such as regression, typically used for?

A

Explanatory research to assess causal hypotheses that explain why and how empirical phenomena occur

Explanatory research usually infers from a random sample to an expected mean response of the underlying population.

We want unbiased estimates

18
Q

In prediction modelling what are we not interested in?

A

inference and expectations

19
Q

What is prediction modelling concerned with?

A

Reliable prediction of the outcome of unseen cases!

20
Q

What does prediction modelling aim to minimise?

A

Minimizing the difference or “loss” between predicted and observed outcomes of new cases!

21
Q

How can we estimate our model in prediction modelling?

A

By minimizing the error of new unseen cases and not the error of the original data set!

22
Q

What would we like to know once we have developed our prediction model?

A

How well it predicts new unseen cases.

23
Q

How can we get a reliable estimate of prediction accuracy?

A

Predict outcome of new cases that were not used for model development

24
Q

What is normal statistical modelling concerned with?

A

How well the model predicts seen cases: we calculate the difference between observed and predicted values, and this error is analyzed to see whether it fulfills the assumption of normally distributed errors.

25
Q

What does a prediction model usually perform better in?

Why is this a problem?

A

The sample used to develop the model (development or training sample) than in other samples, even if those samples are derived from the same population.

This is a problem because:
- We overestimate the predictive ability of our model
- Model performance is over-optimistic

26
Q

1) What is something that works well for statistical modelling but not prediction modelling?

2) What is needed instead?

A

1) Assessing the assumptions of our model does not allow us to assume that it is going to work well on data it has not seen before!

In other words, a high R2 (explained variance) in our training sample does not allow us to conclude that our model will have the desired performance if we apply it to new cases.

2) Some kind of assurance of the accuracy of the predictions our model produces: we need to validate our model.

To evaluate the performance of any prediction model, we need to test it on some unseen data.

27
Q

Whether a model performs well or not is based upon what?

A

The model's performance on unseen data.

28
Q

What are 3 model validation types?

A

Apparent validity
Internal validity
External Validity

29
Q

How do we estimate prediction error?

A

Mean Squared Error (MSE):

MSE = Σ(Ŷi − Yi)² / n, with
Ŷi = estimated Y for case i
Yi = observed Y for case i
n = number of cases in the hold-out sample

The MSE is an estimate of the expected mean squared error E[(Y − Ŷ)²] in the population!

30
Q

What can the mean squared error also be written as?

A

∑(𝑌𝑖̂− 𝑌𝑖)2 = e12 + e22+ e32…
= Residual Sum of Squares (RSS)

RSS – add up square of each residuals, sum of predicted minus observed squared

M𝑆𝐸= 𝑅𝑆𝑆/𝑛

31
Q

How do the aim of prediction modelling and inferential statistical modelling differ?

A

The aim of prediction modelling is to minimize the MSE of unseen cases, unlike inferential modelling, where we minimize the MSE of the apparent (seen) cases (the least squares method of linear regression).

32
Q

What is the problem with using the MSE, and what is a solution to this?

A

The MSE is an absolute measure of prediction error and is sometimes difficult to interpret.

Therefore we often present the root mean squared error (RMSE) = √MSE.

It represents the SD of the differences between predicted and observed values.

Still, it is often not easy to interpret.

33
Q

What is often preferred to explain variance?

A

A relative measure:
- unexplained variance (error variance) or explained variance R2 of unseen cases
- ranges from 0% to 100%
- the same as already learnt for linear regression (but there it is calculated using the same, i.e. apparent, dataset!)

34
Q

What is a relative measure of prediction accuracy?

A

RSS = residual sum of squares = Σ(Yi − Ŷi)² (observed minus predicted, squared)

TSS = total squared variation of the observed outcomes Y:

Total sum of squares (TSS) = Σ(Yi − Ȳ)²
with Ȳ = mean of all Y's

Unexplained variance = RSS / TSS = (error variance) / (variance of Y)

Explained variance of our model, R2:
R2 = 1 − unexplained variance = 1 − RSS/TSS = (TSS − RSS) / TSS
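A minimal sketch of these quantities in Python (made-up illustrative values for observed and predicted outcomes):

```python
import numpy as np

y_obs = np.array([3.1, 2.4, 5.0, 4.2])       # observed Y_i
y_hat = np.array([2.9, 2.8, 4.6, 4.5])       # predicted Y_i

rss = np.sum((y_obs - y_hat) ** 2)           # residual sum of squares
tss = np.sum((y_obs - y_obs.mean()) ** 2)    # total sum of squares
unexplained = rss / tss                      # error variance / variance of Y
r2 = 1 - rss / tss                           # explained variance R2
print(unexplained, r2)
```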

35
Q

What is an absolute measure of prediction accuracy?

A

MSE

36
Q

What model does prediction modelling try to identify?

A

One which minimizes the MSE or, equivalently, maximizes the R2 of unseen cases.

37
Q

What does inferential modelling try to identify?

A

A model which minimizes the MSE or, equivalently, maximizes the R2 of the apparent cases (the same cases as used for model building)!

38
Q

What is apparent validity?

A

Testing the model on the whole original model development sample.

39
Q

What is internal validation?

A

Reproducibility of the model for the underlying population and setting from which the development sample originated ("reproducibility").

40
Q

How can internal validation be measured?

A

Validation Set (or Hold-out or Split-Sample Validation)
k-fold Cross Validation
Bootstrap Validation

41
Q

What is external validation?

A

Generalizability of the model to populations that are plausibly related (the clinical population of interest, a different ethnicity, etc.). Testing the model on new subjects.

Assesses the transportability rather than the reproducibility of a model.

42
Q

Internal validity must always be computed, external validity is not always possible.

True or false

A

True

43
Q

When is the performance of a predictive model overestimated?

A

When simply determined on the sample of subjects that was used to construct the model.

44
Q

Usually we cannot easily collect new data to test the model on unseen cases.

How can this problem be overcome?

A

Several internal validation methods are available that aim to provide a more accurate estimate of model performance in new subjects.

Hold-out sample
n-fold cross-validation
Bootstrapping

45
Q

How can internal validation be distinguished?

A

a) with and b) without model selection

46
Q

How can we assess internal validity of a model?

A

A model’s predictive capacity can only be assessed when the model is tested on an independent data set that was not used to develop the model.

That is, there are two data sets:
1. A training set on which the model is derived (and which may be further split up into a training and a validation set for model selection), and

2. A test set: an independent data set on which the model is evaluated.
47
Q

What is a training set?

A

Data used to develop the model.

48
Q

What is a test set?

A

Data used to test the developed model.

49
Q

What is training error?

A

Error that results from applying the prediction model developed on the training set to the same training set.

50
Q

What is test error?

A

Expected error that results from applying the developed prediction model on a new observation.

51
Q

What does the hold out data or split sample approach involve?

A

The simplest approach to estimate prediction accuracy is to randomly split the data set into two.

In the larger set a model is trained, e.g. by estimating the parameters of a regression.

In the smaller test data set we then assess (test) the model by estimating the prediction accuracy.

The training set typically contains about 70% of the data; there we estimate the parameters of our regression model, and the resulting model is then tested in the test data set by estimating the prediction accuracy.

Thus we develop our model, estimate the regression coefficients, then use the model to predict the outcome for the cases in the test data set and calculate the MSE and R2.
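A minimal sketch of the split-sample approach with scikit-learn; the data are simulated for illustration and the 70/30 split mirrors the example above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Simulated illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

# ~70% training data, ~30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # develop the model on the training set
y_pred = model.predict(X_test)                    # predict the unseen test cases

print(mean_squared_error(y_test, y_pred))         # MSE in the test set
print(r2_score(y_test, y_pred))                   # R2 in the test set
```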

52
Q

What are the advantages of the split sample approach?

A

Simple

Easy to implement in any software

53
Q

What are the disadvantages of the split sample approach?

A

Validation result often depends on split: The test set error can be highly variable

Inefficient: only a subset of the observations is used to fit the model, so it is not well suited to small datasets.

54
Q

What does the non-random split sample involve?

A

Data set is split by some clustering factor, e.g. study site, geographic location, time, centre, etc.

Preferable to the random split-sample approach because the test data set is more likely to differ from the training data set.

In small samples, similar problems arise as with the random split-sample approach, so it should only be used if the training sample size is large.

55
Q
  1. What is not usually recommended if you have a small sample size?
  2. What is recommended instead?
A
  1. Split-sample approach
  2. Better: Resampling methods (Harrell, 2015, Hastie et al., 2009) -
    - repeatedly drawing samples from a training set
    - refitting a model of interest on each sample in order to obtain more information about the fitted model

All data are used for model development to improve statistical efficiency

Internal validation is done via cross-validation or bootstrapping

e.g., n-fold cross-validation

56
Q

What does N-fold cross-validation involve?

A
  • Usually the recommended method.
  • Use the whole data set to fit the model and then n-fold cross-validation to estimate our prediction accuracy.
  • This is pursued by dividing the single available dataset randomly into n folds (equal subsets), typically 5 to 10.
  • In turn, each fold is used as the unseen data (test set), with the remaining n−1 folds pooled together as the training set.
  • Prediction performance is the average over the n folds.
  • The difference between apparent and internal performance is the optimism.
  • Often the n-fold CV is repeated several times (e.g. 50 times) and the average of all repeats is used to get a more stable estimate.
  • The parameters of the final model are estimated using the whole data set!
  • Cross-validation simulates replication attempts to get an honest estimate of real-world performance.
  • With 5 folds we obtain 5 slightly different MSE or R2 values; the average MSE (or average R2) over the five folds is our estimate of the model's performance in unseen cases from the same population, i.e. our estimated prediction accuracy (see the sketch below).
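A minimal sketch of 5-fold cross-validation with scikit-learn (simulated illustrative data; the choice of 5 folds and a linear model is only an example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

# Each fold serves once as the test set; the other 4 folds form the training set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
mse_folds = -cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
print(mse_folds.mean())  # average MSE over the 5 folds = estimated prediction accuracy

# The parameters of the final model are estimated using the whole data set
final_model = LinearRegression().fit(X, y)
```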
57
Q

How many folds in cross validation provide nearly unbiased estimates of prediction accuracy?

A

5-10 folds

58
Q

When is N-fold CV for internal validation valid?

A

If no model selection or model tuning (e.g. identifying the best lambda; see next lecture) is performed, for example:

  • Theory-driven models
  • Models with a small number of predictors relative to the sample size
59
Q

Often we train a model in which hyperparameters also need to be optimized using cross-validation (example ridge, lasso, elastic net).

  1. What issue does this cause?
  2. How can this be overcome?
A
  1. Choosing the parameters based on maximizing prediction accuracy using CV biases the model to the dataset, yielding an overly-optimistic score!
    - This is because testing data are part of the model building process!
  2. Nested CV estimates the generalization error of the underlying model and its (hyper)parameter search!

Nested CV efficiently uses a series of train/validation/test set splits.
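A minimal sketch of nested cross-validation with scikit-learn, assuming a ridge regression whose penalty (lambda, called alpha in scikit-learn) is tuned in the inner loop; the data and grid values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Simulated illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # tunes the hyperparameter
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)  # estimates generalization error

tuned_model = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                           cv=inner_cv, scoring="neg_mean_squared_error")

# The outer CV evaluates the whole procedure (hyperparameter search + model fit)
nested_mse = -cross_val_score(tuned_model, X, y, cv=outer_cv,
                              scoring="neg_mean_squared_error")
print(nested_mse.mean())
```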

60
Q

What does bootstrapping involve?

A

Bootstrapping: take a bootstrap sample (sample with replacement):

  1. Fit your model in your bootstrap sample (on average 63.2% of our data)
  2. Assess your model in the cases not selected by the bootstrap (on average: 36.8% of our data): Predict outcome and calculate the MSE
  3. Repeat the bootstrap 100-1000 times and average the MSEs and use this as an estimate of internal validity!
  4. Again, the final model is estimated using the whole data set!

Better bootstrap approaches are available (e.g. the 0.632+ estimator).
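A minimal sketch of this plain bootstrap validation scheme in Python (simulated illustrative data; 200 repeats chosen as an example within the 100-1000 range mentioned above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Simulated illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

n = len(y)
mses = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)      # cases not selected (~36.8% on average)
    model = LinearRegression().fit(X[idx], y[idx])
    mses.append(mean_squared_error(y[oob], model.predict(X[oob])))

print(np.mean(mses))                           # averaged MSE = internal validity estimate

# The final model is again estimated using the whole data set
final_model = LinearRegression().fit(X, y)
```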

61
Q

Which one should be chosen: split-sample methods, cross-validation, or bootstrapping?

A

Research and simulations showed that split-sample analyses give overly pessimistic estimates of performance, with large variability. They are also inefficient, as they do not use the whole dataset to estimate the parameters.

5- to 10-fold cross-validation usually has low bias and low variability and often performs best.

Bootstrapping performs similarly well to k-fold cross-validation unless the sample size is small. It therefore usually does not matter whether k-fold cross-validation or bootstrapping is used, but cross-validation is typically preferred.

62
Q

Can bootstrapping be used for optimism correction?

A

Yes

63
Q
  1. The cross validation estimate is an estimate of what?
  2. What issues may arise with this?
A
  1. How the results will generalize to an independent data set of the same population (internal validation)
  2. However, internal validation may not be sufficient or indicative of the model's performance in future patients.

Problem: Our sample is usually not a random sample of the clinical population.

64
Q
  1. What is essential before implementing prediction models in clinical practice?
  2. How is this done?
A
  1. External validation
  2. By a completely new data set collected at a different time, different location and ideally by different researchers -> Assessing the clinical population of interest!
65
Q

What are three different types of external validity?

A

Temporal validation – e.g. the same investigators validate the model on data from more recent years

Spatial validation (other place) – e.g. the same investigators validate the model in other centres

Fully external validation – e.g. other investigators, another centre, a different time

66
Q

How is optimism calculated?

A

Optimism = Apparent validation estimate – Internal validation estimate
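For example (illustrative numbers): if the apparent R2 in the development sample is 0.80 and the internally validated (e.g. cross-validated) R2 is 0.72, the optimism is 0.80 − 0.72 = 0.08.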

67
Q

Why is the apparent validation estimate usually larger than the internal validation estimate?

A

Due to optimism

68
Q

We use all data for model development to get optimal efficiency and then apply internal validation procedures using cross-validation or bootstrapping to estimate performance in new data of the same underlying population.

What does this correct for?

A

The optimism of our performance measures derived from the apparent data set; we thereby obtain (nearly) unbiased performance estimates for the prediction of new cases from the same population.