Evaluation Flashcards
(5 cards)
1
Q
How did you conduct posterior predictive checking?
A
- performed posterior predictive model checking by generating replicate datasets from the posterior estimates of each parameter at each iteration of the model
- compared summary statistics (e.g. mean, median, sd) of the replicates with the summary statistics of the original data
- the aim is for the simulated summary statistics to match the original data statistics, showing that the observed value is plausible under the distribution of the simulated data
- visually examined this through density plots showing the distribution of the replicate values and where the original data fall within it (see the sketch below)
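A minimal Python sketch of this kind of check, assuming the posterior draws are already available as NumPy arrays; the Normal observation model and all array names here are illustrative placeholders, not the actual model or data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative placeholders only: posterior draws of a mean and sd from some fitted
# model, one value per MCMC iteration (not the actual thesis model or data).
rng = np.random.default_rng(1)
y_obs = rng.normal(5.0, 2.0, size=100)                 # stand-in for the observed data
post_mu = rng.normal(5.0, 0.2, size=2000)              # posterior draws of the mean
post_sigma = np.abs(rng.normal(2.0, 0.1, size=2000))   # posterior draws of the sd

# One replicate dataset per posterior draw, matching the size of the observed data
y_rep = rng.normal(post_mu[:, None], post_sigma[:, None], size=(2000, y_obs.size))

# Compare a summary statistic of the replicates with the observed statistic
rep_means = y_rep.mean(axis=1)
print("observed mean:", y_obs.mean())
print("posterior predictive p-value (mean):", np.mean(rep_means >= y_obs.mean()))

# Visual check: where does the observed mean sit in the replicate distribution?
plt.hist(rep_means, bins=40, density=True)
plt.axvline(y_obs.mean(), color="red")
plt.xlabel("replicate mean")
plt.show()
```

The same comparison can be repeated for other statistics (median, sd) to check different aspects of fit.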
2
Q
How did you design and interpret your simulation experiments?
A
- posterior predictive checks were designed by creating replicate datasets to identify whether these matched the original data values
- examined these using different measures (e.g. MAE, RMSE) to assess how well the replicate values matched the original values (see the sketch below)
- this allowed comparison with other models to examine how well my approach performed relative to them
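A minimal sketch of how such error measures can be computed; the arrays below are random placeholders standing in for the observed data and a model's predictions (e.g. the posterior predictive mean), and the same code would be applied to each competing model:

```python
import numpy as np

# Illustrative placeholders: observed values and model predictions
rng = np.random.default_rng(2)
y_obs = rng.normal(5.0, 2.0, size=100)
y_pred = y_obs + rng.normal(0.0, 0.5, size=100)   # stand-in for a model's predictions

mae = np.mean(np.abs(y_obs - y_pred))
rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
print(f"MAE:  {mae:.3f}")    # in the data's original units; lower is better
print(f"RMSE: {rmse:.3f}")   # penalises large errors more heavily than MAE
```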
3
Q
What are the strengths and weaknesses of your evaluation metrics?
A
- Brier Score: balances both accuracy and confidence, but can be sensitive to class imbalance
- Expected Calibration Error (ECE): effectively captures probabilistic calibration, but requires binning and may be unstable in sparse datasets
- Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE): provide intuitive measures of prediction error, but do not reflect uncertainty or probabilistic structure, so they need to be interpreted alongside other metrics
- Bayesian Coverage of Uncertainty Intervals: validates that the model’s uncertainty quantification is well calibrated, but can be difficult to interpret in high dimensions and misleading unless reported alongside another measure such as interval width (see the sketch below)
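A minimal sketch of how these metrics can be computed from predicted probabilities and 95% credible intervals; every array here is a random placeholder, and the 10 equal-width bins for ECE are just one common choice, not the settings used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(3)

# --- Brier score: mean squared gap between predicted probability and binary outcome ---
p_hat = rng.uniform(0, 1, size=500)        # placeholder predicted probabilities
y = rng.binomial(1, p_hat)                 # placeholder binary outcomes
brier = np.mean((p_hat - y) ** 2)

# --- Expected Calibration Error: binned gap between mean confidence and accuracy ---
bins = np.linspace(0, 1, 11)               # 10 equal-width bins (a common choice)
idx = np.digitize(p_hat, bins[1:-1])
ece = 0.0
for b in range(10):
    mask = idx == b
    if mask.any():
        ece += mask.mean() * abs(p_hat[mask].mean() - y[mask].mean())

# --- Coverage and width of 95% credible intervals ---
lower = rng.normal(-2, 0.1, size=500)      # placeholder interval bounds
upper = rng.normal(2, 0.1, size=500)
truth = rng.normal(0, 1, size=500)
coverage = np.mean((truth >= lower) & (truth <= upper))   # near 0.95 if well calibrated
mean_width = np.mean(upper - lower)                        # report alongside coverage

print(f"Brier: {brier:.3f}  ECE: {ece:.3f}  coverage: {coverage:.3f}  width: {mean_width:.2f}")
```

Reporting coverage together with mean interval width guards against intervals that achieve nominal coverage only by being uninformatively wide.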
4
Q
How confident are you in your models’ generalisability?
A
- the models are generalisable to a wide range of applications
- model checking showed strong performance in comparison with other methods
- models consistently showed robust performance and good calibration, supporting their broader applicability
5
Q
Were there any surprising or counter-intuitive findings?
A
- Forensic: the headlamp data performed poorly under the integrated clustering approach, whereas other methods did not perform as poorly
- COVID-19: the GDM-HMM captured the virus transitions / dynamics even without explicit covariates, showing that the clustering captured information such as movement between countries
- Trees: sparse species predictions were better captured under the GDM than the GAM, showing that the modelling had stronger benefits, especially when species counts are small