Evaluation Flashcards
(5 cards)
1
Q
How did you conduct posterior predictive checking?
A
- performed posterior predictive model checking by generating replicate datasets from the posterior estimates of each parameter at each iteration of the model
- compared summary statistics (e.g. mean, median, sd) of the replicates with the summary statistics of the original data
- the aim is for the simulated summary statistics to match the original data statistics, showing that the observed value is plausible under the distribution of the simulated data
- visually examined this through density plots showing the distribution of the replicate values and where the original data fall within it (see the sketch below)
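A minimal Python sketch of this kind of check, assuming the posterior draws are already available as NumPy arrays; the Normal observation model and all array names here are illustrative placeholders, not the actual model or data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative placeholders only: posterior draws of a mean and sd from some fitted
# model, one value per MCMC iteration (not the actual thesis model or data).
rng = np.random.default_rng(1)
y_obs = rng.normal(5.0, 2.0, size=100)                 # stand-in for the observed data
post_mu = rng.normal(5.0, 0.2, size=2000)              # posterior draws of the mean
post_sigma = np.abs(rng.normal(2.0, 0.1, size=2000))   # posterior draws of the sd

# One replicate dataset per posterior draw, matching the size of the observed data
y_rep = rng.normal(post_mu[:, None], post_sigma[:, None], size=(2000, y_obs.size))

# Compare a summary statistic of the replicates with the observed statistic
rep_means = y_rep.mean(axis=1)
print("observed mean:", y_obs.mean())
print("posterior predictive p-value (mean):", np.mean(rep_means >= y_obs.mean()))

# Visual check: where does the observed mean sit in the replicate distribution?
plt.hist(rep_means, bins=40, density=True)
plt.axvline(y_obs.mean(), color="red")
plt.xlabel("replicate mean")
plt.show()
```

The same comparison can be repeated for other statistics (median, sd) to check different aspects of fit.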
2
Q
How did you design and interpret your simulation experiments?
A
- posterior predictive checks were designed by creating replicate datasets to identify whether these matched the original data values
- examined these using different measures (e.g. MAE, RMSE) to assess how well the replicate values matched the original values (see the sketch below)
- this allowed comparison with other models to examine how well my approach performed relative to them
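A minimal sketch of how such error measures can be computed; the arrays below are random placeholders standing in for the observed data and a model's predictions (e.g. the posterior predictive mean), and the same code would be applied to each competing model:

```python
import numpy as np

# Illustrative placeholders: observed values and model predictions
rng = np.random.default_rng(2)
y_obs = rng.normal(5.0, 2.0, size=100)
y_pred = y_obs + rng.normal(0.0, 0.5, size=100)   # stand-in for a model's predictions

mae = np.mean(np.abs(y_obs - y_pred))
rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
print(f"MAE:  {mae:.3f}")    # in the data's original units; lower is better
print(f"RMSE: {rmse:.3f}")   # penalises large errors more heavily than MAE
```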
3
Q
What are the strengths and weaknesses of your evaluation metrics?
A
- Brier Score: balances both accuracy and confidence, but can be sensitive to class imbalance
- Expected Calibration Error (ECE): effectively captures probabilistic calibration, but requires binning and may be unstable in sparse datasets
- Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE): provide intuitive measures of prediction error, but do not reflect uncertainty or probabilistic structure, so they need to be interpreted alongside other metrics
- Bayesian Coverage of Uncertainty Intervals: validates that the model’s uncertainty quantification is well calibrated, but can be difficult to interpret in high dimensions and misleading unless reported alongside another measure such as interval width (see the sketch below)
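A minimal sketch of how these metrics can be computed from predicted probabilities and 95% credible intervals; every array here is a random placeholder, and the 10 equal-width bins for ECE are just one common choice, not the settings used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(3)

# --- Brier score: mean squared gap between predicted probability and binary outcome ---
p_hat = rng.uniform(0, 1, size=500)        # placeholder predicted probabilities
y = rng.binomial(1, p_hat)                 # placeholder binary outcomes
brier = np.mean((p_hat - y) ** 2)

# --- Expected Calibration Error: binned gap between mean confidence and accuracy ---
bins = np.linspace(0, 1, 11)               # 10 equal-width bins (a common choice)
idx = np.digitize(p_hat, bins[1:-1])
ece = 0.0
for b in range(10):
    mask = idx == b
    if mask.any():
        ece += mask.mean() * abs(p_hat[mask].mean() - y[mask].mean())

# --- Coverage and width of 95% credible intervals ---
lower = rng.normal(-2, 0.1, size=500)      # placeholder interval bounds
upper = rng.normal(2, 0.1, size=500)
truth = rng.normal(0, 1, size=500)
coverage = np.mean((truth >= lower) & (truth <= upper))   # near 0.95 if well calibrated
mean_width = np.mean(upper - lower)                        # report alongside coverage

print(f"Brier: {brier:.3f}  ECE: {ece:.3f}  coverage: {coverage:.3f}  width: {mean_width:.2f}")
```

Reporting coverage together with mean interval width guards against intervals that achieve nominal coverage only by being uninformatively wide.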
4
Q
How confident are you in your models’ generalisability?
A
- the models are generalisable to a wide range of applications
- model checking showed strong performance in comparison with other methods
- models consistently showed robust performance and good calibration, supporting their broader applicability
5
Q
Were there any surprising or counter-intuitive findings?
A
- Forensic: the headlamp data performed poorly under the integrated clustering approach, whereas other methods did not perform as poorly
- COVID-19: the GDM-HMM captured the virus transitions / dynamics even without explicit covariates, showing that the clustering captured information such as movement between countries
- Trees: sparse species predictions were better captured under the GDM than the GAM, showing that the modelling had stronger benefits, especially when species counts are small