Chemometrics Flashcards
Define chemometrics
computationally intensive, multivariate (many variables) statistical analysis that is specifically applied to chemical systems or processes
What six things can chemometrics do?
1 - primarily to reduce complex datasets
- spectrometers output large amounts of data, which would take too long to work through manually
2 - identify and quantify sample groupings
- can see which samples are similar/dissimilar through a quantitative value
3 - optimise experimental parameters
4 - isolate important variables and identify covariance
- some peaks/regions are not important as they may be common to all samples (e.g. a flat baseline region)
- covariance: which parts of the data are reliant on each other, i.e. which are not independent/tend to vary together
5 - provide reproducible measures of data
- anyone can take the raw data from a spectrum, put it into a chemometrics package run on any instrument with the same technique and parameters, and get the same results
- removes subjectivity
6 - allows for better visualisation of data
- easier to show a picture when standing up as an expert witness
Briefly explain the history of chemometrics
- it is not new (new in forensics but not in general)
- originally there was scepticism (it was seen that if you needed algorithms then you must have collected bad data = poor science involved)
- chemometrics was routinely used in industry for process optimisation and quality control e.g. food and pharmaceutical industries
- aim is to maximise output and quality with minimal cost
- growing use in chemical engineering, biomedical sciences and materials science
- since 2009 - emerging use in forensic science (improved efficiency in forensic workflow and better quality of forensic provision)
What did the national academy of sciences (NAS) report published in 2009 say?
- forensic science is a mess and needs sorting out
- a need for statistical framework - hence chemometrics
- wants terms like unique/indistinguishable/match replaced with numerical probabilities in questioned vs known (Q vs K) comparisons
- need standard terminology across all disciplines of forensic science
- analysts not knowing the full context of what happened at the scene helps counteract cognitive bias
- chemometrics is quicker than manual data interpretation (cost efficient)
- chemometrics can help use models to predict trace behaviour (background, transfer, persistence and activity level) - model how a trace would be expected to transfer/persist in an environment given certain factors
- chemometrics does not negate need for expert - eliminates lot of subjectivity but still need human to interpret final result
Describe the difference between univariate and multivariate
- multivariate means many variables
- univariate means one variable (e.g. melting/boiling points)
What are the spectra in forensics (univariate or multivariate)?
- in forensics they are multivariate
- a univariate approach is too simplistic for complex data: it doesn't take covariance into account
- for example, some faults can only be detected when MVA is applied
- we must also consider transfer, background, persistence and activity level
Describe three situations where MVA might be beneficial
1 - considering pollen as a form of TE
- likelihood of finding pollen will be much higher at certain times of year
2 - when someone puts a fingermark down, how sweaty they are relates to temperature
- on a hot day there will be more sebum in the mark, leaving a more patent rather than latent print
3 - titanium dioxide pigment is used in makeup in two ways: as an active ingredient (sunscreen) or for an interference effect (shimmer/sheen)
- if the spectrum shows the sample contains both titanium dioxide and mica = covariant, and points in the direction of an interference pigment
- if the spectrum shows it contains titanium dioxide, no mica, but zinc oxide (another sunscreen blocking a different UV range) = would indicate an SPF makeup with broad-spectrum coverage
What are the four categories of chemometrics?
1 - design of experiments (helps you to design better experiments more effectively to get maximum amount of data out of it)
2 - exploratory data analysis (what is data showing me, how do samples compare, similar/dissimilar)
3 - classification (building models for things like transfer/persistence using the model created in EDA)
4 - regression
Explain design of experiments/DOE (what will it improve in future of FS, what will it effect, how does it work)
- relates to experimental set up
- will be used in future to streamline FS provision - improve efficiency, quality and reproducibility (how many experiments can be done in one day; ensuring the data and the way it is analysed are scientifically robust; getting the same answer with a different analyst on a different day on a different machine; getting the same interpretation from a different analyst)
- DOE will affect evidence collection, storage, instrument selection, parameter optimisation etc.
- a typical DOE image shows dots in the 4 corners (the measurements we have taken); DOE interpolates the response between these measured parameter settings (see the sketch below)
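A minimal sketch of that interpolation idea, assuming two hypothetical instrument parameters and made-up corner responses (illustrative only, not from the source material):

# DOE sketch: a 2^2 full factorial design with responses measured only at the
# 4 corners, then a fitted linear model used to interpolate between them
import numpy as np

design = np.array([          # the 4 corners in coded units (-1 = low, +1 = high)
    [-1, -1],
    [-1, +1],
    [+1, -1],
    [+1, +1],
])
response = np.array([10.2, 12.8, 14.1, 19.5])   # made-up measured responses

# Fit response = b0 + b1*x1 + b2*x2 + b12*x1*x2 by least squares
X = np.column_stack([np.ones(4), design[:, 0], design[:, 1], design[:, 0] * design[:, 1]])
coeffs, *_ = np.linalg.lstsq(X, response, rcond=None)

# Predict the expected response at an unmeasured mid-point setting
x1, x2 = 0.0, 0.5
predicted = coeffs @ np.array([1, x1, x2, x1 * x2])
print(f"predicted response at ({x1}, {x2}): {predicted:.2f}")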
Explain regression analysis (what is it, how does it work, what does it allow for, give an example)
- chemometric version of a calibration curve based on y = mx + c linear relationship but this time multivariate
- it maps the effect of multiple independent variables (predictors) upon dependent variable (response)
- allows prediction of quantitative sample properties (puts numbers on things)
- for example
- ink deposition on paper, and someone asks how the ink would change over time
- using regression analysis, if we have seen day 6 we can use regression to suggest what it would have looked like on day 5 (see the sketch below)
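A minimal sketch of multivariate regression in Python, assuming synthetic "spectra" whose intensities drift with ink age (the variables and values are invented for illustration, not real ink data):

# Multivariate regression: multiple predictors (intensities) -> one response (age)
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
days = np.arange(1, 11)                      # response: age in days
spectra = np.outer(days, [0.5, -0.2, 0.8])   # 3 "wavenumber" intensities that drift with age
spectra += rng.normal(scale=0.05, size=spectra.shape)   # instrument noise

model = LinearRegression().fit(spectra, days)

# Predict the age of a questioned sample from its spectrum alone
questioned = np.array([[2.5, -1.0, 4.0]])    # roughly what a "day 5" spectrum looks like here
print(f"predicted age: {model.predict(questioned)[0]:.1f} days")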
Explain exploratory data analysis/EDA (what is it, how does it work, what three things does it allow for, what are two most commonly used EDA techniques)
- dimensionality reduction (data mining)
- it reduces data that has many variables, e.g. Raman spectra, into just a few measures called principal components
- pattern recognition technique - identify groupings and patterns
- helps visualise trends that may have otherwise gone unnoticed
- determination of sample similarity in complex data and gives this a number
- cluster analysis (CA)
- principal component analysis (PCA)
Other than presence/absence of peaks, what else can be a sign if a sample is similar or dissimilar
it is not always the presence/absence of peaks that is the tell-tale as to whether a sample is similar or dissimilar – sometimes it is the relationship between peaks
Describe the difference between an unsupervised and a supervised technique?
unsupervised - exploring the data without any prior assumptions or knowledge of the samples
supervised - building classification rules for known sample groupings
describe cluster analysis (supervised/unsupervised, what is it, two types, what is the output, why is it not entirely objective, what does analyst have to decide, 3 positives and 1 negative)
- unsupervised
- samples grouped into clusters based on calculated distance (measure of their similarity)
- agglomerative - merges individual samples up into clusters
- divisive - splits one cluster down into individual samples
- they are opposites (both are forms of hierarchical clustering)
- output is dendrogram
- there are different ways of calculating distances and linking criteria and this is a decision a human needs to make (introduces subjectivity)
- the analyst decides on stopping rules to determine the number of clusters, which is somewhat arbitrary (the stopping rule must be stated)
- good initial technique as it simplifies complex data (it is a dimensionality reduction technique)
- not limited to quantitative data (can use categorical data, e.g. numbers and types of animals)
- visualisation of relationships
- however it can only tell you there are groupings, not why (see the sketch below)
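A minimal cluster analysis sketch in Python using SciPy's hierarchical (agglomerative) clustering, assuming a toy data matrix in place of real spectra; the distance metric, linkage criterion and stopping rule shown are illustrative choices the analyst would have to state:

# Agglomerative cluster analysis with a dendrogram output
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 4))   # three similar samples
group_b = rng.normal(loc=1.0, scale=0.1, size=(3, 4))   # three different samples
data = np.vstack([group_a, group_b])

# The analyst chooses the distance metric and linkage criterion
# (a source of subjectivity) -- here Euclidean distance with Ward linkage
Z = linkage(data, method="ward", metric="euclidean")

# Stopping rule: cut the tree into 2 clusters (this choice must be stated)
labels = fcluster(Z, t=2, criterion="maxclust")
print("cluster labels:", labels)

dendrogram(Z)      # the standard output of cluster analysis
plt.show()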
describe principal component analysis (supervised/unsupervised, better/worse than CA, what is process)
- unsupervised
- superior to cluster analysis
- assesses all variables within a dataset e.g. spectrum and then decides which are relevant
- it then determines which variables are correlated
- where the algorithm finds correlated variation, it defines this as a principal component (PC)
- the PC that describes the largest variation between samples is labelled PC1
- if first PC not sufficient to describe spread of data (it isn’t), then calculation repeated to find PC2 (at right angle to PC1) - looks at residual variance to find next amount of variation
- process continued until all variability within dataset has been accounted for and modelled
- stop when the model starts describing only noise (see the sketch below)
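A minimal PCA sketch in Python using scikit-learn, assuming toy data standing in for preprocessed (baseline-corrected, normalised) spectra:

# PCA: scores (sample coordinates along each PC) and loadings (variable weights)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
samples = rng.normal(size=(10, 50))          # 10 samples x 50 "wavenumbers"

pca = PCA()                                   # by default computes all possible PCs
scores = pca.fit_transform(samples)           # each sample's position along each PC
loadings = pca.components_                    # each PC as a weighting of the original variables

# PC1 explains the largest share of the variance, PC2 (orthogonal to PC1) the
# next largest share of what remains, and so on
print("explained variance ratio:", pca.explained_variance_ratio_[:3])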
Why is it not good if data can be described by 1 PC?
there is not enough variation in data
What is the aim of our principal component analysis model?
to create a model that captures as much info in dataset in as few PCs as possible
this is counterintuitive, as we have just said more PCs = more variation described
What 2 components are data (spectra, chromatograms etc) comprised of?
What does PC describe?
What info is left over?
What is ideal model based on these two components?
When can non-ideal model occur?
structure and noise
- PC describes structure (explained variance)
- important bits - tell you where sample groupings are
- whatever info is left over is random noise (residual variance) from instruments in the lab, temperature fluctuations etc.
- this info is not useful when modelling the data
- the ideal model captures the structure with no noise
- a non-ideal model can occur when using more and more PCs, so noise is modelled too (using that model for classification will be tricky, as an unknown sample is forced into a model with loads of noise)
What is each PC?
Each PC is a linear combination of the original variables (wavenumbers from spectral output)
Define score
- the distance along each PC from the mean to the sample (where mean = mean of all samples)
- can be positive or negative
- each sample will have a different score on each PC; we stop using further PCs once all we are modelling is noise (see the sketch below)
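A minimal sketch showing that a score is just the projection of a mean-centred sample onto a PC loading vector, assuming the same kind of toy data as the PCA sketch above:

# score on PC1 = (sample - mean of all samples) dotted with the PC1 loading vector,
# i.e. a signed distance along PC1 from the data mean
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
samples = rng.normal(size=(10, 50))

pca = PCA(n_components=2).fit(samples)

centred = samples[0] - samples.mean(axis=0)        # mean-centre the first sample
manual_score = centred @ pca.components_[0]        # project onto the PC1 loadings

print(manual_score, pca.transform(samples)[0, 0])  # the two values agree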
What is the correlation between number of PCs and number of original variables
the model can have as many PCs as original variables
Where is majority of good information?
In what scenario will we need more PCs?
- in the first few PCs (especially if we have a good, robust data set)
- if we have vastly different samples and lots of variance, then we need more PCs to accurately account for all the information and variance in the set
What helps us determine the number of optimum PCs that we want to retain in a model?
How is this done?
What are two better ways to do this?
- an explained variance (or scree) plot
- the >1 % variance rule: keep adding PCs until you reach a point where each additional PC adds less than 1 % explained variance
- two better ways: inspection of the scores pattern, or checking the loadings (see the sketch below)
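A minimal sketch of an explained variance (scree) plot and the >1 % rule, assuming toy data; the 1 % cut-off is the rule of thumb described above:

# Scree plot and ">1 % rule" for choosing how many PCs to retain
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
samples = rng.normal(size=(20, 30))

pca = PCA().fit(samples)
explained = pca.explained_variance_ratio_ * 100     # percent variance per PC

# Keep PCs while each additional PC still adds at least 1 % explained variance
n_keep = int(np.sum(explained >= 1.0))
print("PCs retained by the >1 % rule:", n_keep)

plt.plot(np.arange(1, len(explained) + 1), np.cumsum(explained), marker="o")
plt.xlabel("number of PCs")
plt.ylabel("cumulative explained variance (%)")
plt.show()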
What is scores plot?
How is this used to help determine optimum PCs we want to remain in model?
- a scores plot maps the samples, where each data point is one sample and similar samples cluster together; we can choose which PCs to plot
- the scores plot will change depending on which PCs are mapped
- for example, if plotting PC1 vs PC4 and PC1 vs PC5 shows the same pattern, we know not to include PC5 as it isn't adding anything useful (see the sketch below)
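A minimal scores plot sketch, assuming two synthetic sample groups; each point is one sample, and the PCs plotted on the axes can be swapped to see which ones actually separate the groups:

# Scores plot: samples plotted by their scores on two chosen PCs
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
group_a = rng.normal(loc=0.0, scale=0.3, size=(10, 20))
group_b = rng.normal(loc=1.0, scale=0.3, size=(10, 20))
samples = np.vstack([group_a, group_b])

scores = PCA(n_components=3).fit_transform(samples)

# Plot PC1 vs PC2; swapping in other PCs changes the picture -- if a higher PC
# shows no extra separation, it is not adding anything useful
plt.scatter(scores[:10, 0], scores[:10, 1], label="group A")
plt.scatter(scores[10:, 0], scores[10:, 1], label="group B")
plt.xlabel("PC1 score")
plt.ylabel("PC2 score")
plt.legend()
plt.show()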