topic 4 Flashcards
(25 cards)
requirements of a good descriptor
- correlated with properties
- obeys physics
- adapted to size of the molecules
- cheap
- not redundant with (not strongly correlated to) other descriptors
what is a zero-dimensional (0D) descriptor
- provides atom and bond counts
- tells us the molecular weight
- doesn’t provide info about molecular structure or connectivity
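A minimal sketch of 0D descriptors, assuming the composition and bond count are already known. The helper name `zero_d_descriptors` and the small mass table are illustrative, not from any library. It also shows the limitation noted above: two isomers get identical 0D descriptors.

```python
# Approximate standard atomic masses in g/mol (only the elements used here).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def zero_d_descriptors(composition, n_bonds):
    """Return atom count, bond count and molecular weight (hypothetical helper)."""
    n_atoms = sum(composition.values())
    mol_weight = sum(ATOMIC_MASS[element] * count
                     for element, count in composition.items())
    return {"n_atoms": n_atoms, "n_bonds": n_bonds, "mol_weight": mol_weight}

# Ethanol (CH3-CH2-OH) and dimethyl ether (CH3-O-CH3) are both C2H6O
# with 9 atoms and 8 bonds, so their 0D descriptors cannot tell them apart.
ethanol = zero_d_descriptors({"C": 2, "H": 6, "O": 1}, n_bonds=8)
dimethyl_ether = zero_d_descriptors({"C": 2, "H": 6, "O": 1}, n_bonds=8)
```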
what is a one-dimensional (1D) descriptor
- lists of substructures/fragments such as functional groups
what is a two-dimensional (2D) descriptor
- provides info on molecular topology
- often based on the graph representation of the molecule
what is a three-dimensional (3D) descriptor
- provides information about the spatial coordinates of the atoms of a molecule
what is a four-dimensional (4D) descriptor
- grid-based descriptor that introduces a fourth dimension to a 3D descriptor
what is SMILES
Simplified Molecular Input Line Entry System
- a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings
what is a molecular graph
a connected, undirected graph in one-to-one correspondence with the structural formula
- vertices = atoms
- edges = chemical bonds
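A minimal sketch of a molecular graph as an adjacency structure, assuming hydrogens are left implicit. The function names are illustrative. It stores atoms as vertices and bonds as undirected edges, and checks the "connected" property with a breadth-first search.

```python
from collections import deque

# Ethanol heavy atoms (hydrogens implicit): C-C-O
atoms = ["C", "C", "O"]          # vertices
bonds = [(0, 1), (1, 2)]         # edges as pairs of atom indices

def build_adjacency(n_atoms, bonds):
    """Map each atom index to the set of atoms it is bonded to."""
    adjacency = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)      # undirected: store both directions
    return adjacency

def is_connected(adjacency):
    """Breadth-first search from atom 0; connected if every atom is reached."""
    if not adjacency:
        return True
    seen, queue = {0}, deque([0])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)

adjacency = build_adjacency(len(atoms), bonds)
connected = is_connected(adjacency)
```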
what's a Morgan fingerprint
- representation of a molecule that identifies the presence of substructures/fragments within it
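A simplified sketch of the idea behind a Morgan-style fingerprint. This is not the RDKit implementation, and `morgan_like_fingerprint` and `stable_hash` are hypothetical names. The assumption is that each atom starts with its element symbol as identifier, each round extends the identifier with the sorted identifiers of its neighbours, and every identifier seen is hashed into a fixed-length bit vector.

```python
import hashlib

def stable_hash(text, n_bits):
    """Deterministic hash of a string into the range [0, n_bits)."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    return int(digest, 16) % n_bits

def morgan_like_fingerprint(atoms, bonds, radius=2, n_bits=64):
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    # Radius 0: the identifier of each atom is just its element symbol.
    identifiers = list(atoms)
    on_bits = {stable_hash(ident, n_bits) for ident in identifiers}
    for _ in range(radius):
        updated = []
        for i, ident in enumerate(identifiers):
            # Sorting makes the result independent of atom numbering.
            environment = sorted(identifiers[j] for j in neighbours[i])
            updated.append(ident + "(" + ",".join(environment) + ")")
        identifiers = updated
        on_bits.update(stable_hash(ident, n_bits) for ident in identifiers)
    return [1 if bit in on_bits else 0 for bit in range(n_bits)]

# Ethanol heavy atoms written in two different atom orders.
fp_a = morgan_like_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_b = morgan_like_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
```

Both atom orderings give the same bit vector, because the fingerprint depends only on which atom environments are present.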
what is a Coulomb matrix
a simple global descriptor which mimics the electrostatic interaction between nuclei
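A minimal sketch of the commonly used Coulomb matrix definition: diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. The H2 geometry and its units are illustrative assumptions.

```python
import math

def coulomb_matrix(charges, positions):
    """charges: nuclear charges Z_i; positions: 3D coordinates R_i."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: fitted self-interaction term of atom i.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j.
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix

# H2 with an assumed internuclear distance of 0.74 (illustrative units).
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)])
```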
what does SOAP stand for
smooth overlap of atomic positions
what is SOAP
a descriptor that encodes regions of atomic geometries by using a local expansion of a Gaussian-smeared atomic density with orthonormal functions based on spherical harmonics and radial basis functions
why is feature selection important
- model explainability
- model debugging
- improve model performance
Gini importance (mean decrease in impurity)
- evaluates how much a feature reduces the impurity of nodes in a decision tree
- Gini impurity (0 to 0.5 for two classes) = likelihood of new data being misclassified
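A minimal sketch of the quantities behind Gini importance: the Gini impurity 1 - sum(p_k^2) of a node, and the impurity decrease produced by one split. The function names are illustrative. A tree library would sum such decreases over all splits that use a feature.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

mixed = gini_impurity([0, 0, 1, 1])                          # 50/50 node
pure = gini_impurity([1, 1, 1])                              # single class
gain = impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1])       # perfect split
```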
permutation feature importance
- measures the increase in the model's prediction error after a feature's values are shuffled, which breaks the relationship between the feature and the true output
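A minimal sketch of permutation importance, assuming a toy model that uses only the first feature and mean squared error as the error measure. The names `model`, `mse` and `permutation_importance` are illustrative here, not a library API.

```python
import random

def mse(model, X, y):
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, n_repeats=10, seed=0):
    """Mean increase in error after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the link between this feature and y
        X_perm = [row[:feature] + [value] + row[feature + 1:]
                  for row, value in zip(X, column)]
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats

def model(row):
    # Toy model: depends only on feature 0 and ignores feature 1.
    return 2.0 * row[0]

X = [[float(i), float((7 * i) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]

importance_x0 = permutation_importance(model, X, y, feature=0)
importance_x1 = permutation_importance(model, X, y, feature=1)
```

Shuffling the feature the model relies on raises the error, while shuffling the ignored feature leaves it unchanged.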
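SHapley Additive exPlanations (SHAP)
- game-theoretic approach (measures each player's contribution, where the players are the features)
- SHAP values show how each feature affects the final prediction, how significant each feature is compared with the others, and how much the model relies on each feature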
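A minimal sketch of exact Shapley values for a tiny model, computed by averaging each feature's marginal contribution over all feature orderings. This is the game-theoretic definition, not how the SHAP library computes its values in practice. The toy model and the zero baseline for "absent" features are assumptions made for illustration.

```python
from itertools import permutations

def shapley_values(value, n_features):
    """value(coalition) -> payoff for a frozenset of feature indices."""
    orderings = list(permutations(range(n_features)))
    phi = [0.0] * n_features
    for order in orderings:
        coalition = set()
        for feature in order:
            before = value(frozenset(coalition))
            coalition.add(feature)
            after = value(frozenset(coalition))
            # Marginal contribution of this feature in this ordering.
            phi[feature] += (after - before) / len(orderings)
    return phi

x = [1.0, 2.0, 3.0]          # instance to explain
baseline = [0.0, 0.0, 0.0]   # assumed values for features that are "absent"

def f(z):
    # Toy model with an interaction between features 0 and 2.
    return 3 * z[0] + 2 * z[1] + z[0] * z[2]

def value(coalition):
    z = [x[i] if i in coalition else baseline[i] for i in range(len(x))]
    return f(z)

phi = shapley_values(value, n_features=3)
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, which is the "additive" part of the name.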
what is the curse of dimensionality
- the number of data points is small relative to the intrinsic dimension of the data
what does the curse of dimensionality cause
- data sparsity (mostly empty space, which makes clustering and classification challenging)
- increased computation time
- overfitting (the model fits noise)
- visualisation challenges
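A small sketch of one symptom of data sparsity, assuming uniformly random points in a unit hypercube. As the dimension grows, the nearest and farthest neighbours of a point end up at almost the same distance, which is why distance-based clustering and classification become hard. `relative_contrast` is an illustrative name.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from the first point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    distances = [math.dist(query, point) for point in others]
    return (max(distances) - min(distances)) / min(distances)

contrast_2d = relative_contrast(n_points=100, n_dims=2)
contrast_200d = relative_contrast(n_points=100, n_dims=200)
```

With the same number of points, the contrast in 200 dimensions is far smaller than in 2 dimensions.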
wrappers
- use a model to score different subsets of features and finally select the best one
- performance is evaluated on a hold-out set
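A minimal sketch of a wrapper, assuming an exhaustive search over subsets and a nearest-centroid classifier as the scoring model. Both are illustrative choices, as is the toy data. Each subset is scored by accuracy on a hold-out set, and smaller subsets win ties because they are tried first.

```python
from itertools import combinations

def centroid(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]

def holdout_accuracy(train, holdout, subset):
    """Fit a nearest-centroid classifier on the chosen features, score on hold-out."""
    def pick(row):
        return [row[i] for i in subset]
    labels = {label for _, label in train}
    centroids = {label: centroid([pick(x) for x, lab in train if lab == label])
                 for label in labels}
    correct = 0
    for x, label in holdout:
        z = pick(x)
        predicted = min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(z, centroids[c])))
        correct += predicted == label
    return correct / len(holdout)

def wrapper_select(train, holdout, n_features):
    best_subset, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            score = holdout_accuracy(train, holdout, subset)
            if score > best_score:   # strict: keeps the smaller subset on ties
                best_subset, best_score = subset, score
    return best_subset, best_score

# Toy data: feature 0 separates the classes, feature 1 is noise.
train = [([0.0, 5.0], 0), ([0.2, 1.0], 0), ([1.0, 1.2], 1), ([0.8, 5.2], 1)]
holdout = [([0.1, 3.3], 0), ([0.9, 2.9], 1), ([0.3, 3.4], 0), ([0.7, 2.8], 1)]

best_subset, best_score = wrapper_select(train, holdout, n_features=2)
```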
filters
- evaluate the usefulness of each feature without training a model
- based on the feature's statistical relation with the model target
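A minimal sketch of a filter, assuming the absolute Pearson correlation with the target as the statistical score. Other statistics could be used instead. The feature names and values are made up for illustration.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_features(columns, target):
    """Sort feature names by |correlation with the target|, strongest first."""
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

target = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = {
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],            # weakly related
    "mol_weight": [10.0, 21.0, 29.0, 41.0, 50.0],  # strongly related
}
ranking = rank_features(columns, target)
```

No model is trained at any point, which is what makes filters fast.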
embedders
- embed feature selection into the training of the ML model
- combine the speed of filters with getting the best subset for a particular model
dimensionality reduction
- reduces the number of random variables by obtaining a set of principal variables
- retains the most important info
principal component analysis (PCA)
- unsupervised machine learning method
- transforms correlated original variables into uncorrelated linear combinations of the original variables, called principal components
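A minimal sketch of finding the first principal component, assuming power iteration on the covariance matrix. This is a simple stand-in chosen so the example needs no libraries; numerical packages typically use an eigendecomposition or SVD instead. The data are made up and lie close to the line y = 2x.

```python
import math

def first_principal_component(data, n_iter=100):
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalise.
    # Assumes the start vector is not orthogonal to the top eigenvector.
    vec = [1.0] * d
    for _ in range(n_iter):
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Scores: each sample projected onto the principal component.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centred]
    return vec, scores

data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
component, scores = first_principal_component(data)
```

The component is a linear combination of the two original variables, pointing along the direction in which the data vary most.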
linear discriminant analysis (LDA)
- supervised
- separates two or more groups/classes
- focuses on maximizing separability among known categories by creating a new linear axis and projecting data points onto it
- useful for classification
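A minimal sketch of the two-class case in two dimensions, assuming Fisher's formulation: the new axis is w = S_W^-1 (mean_b - mean_a), where S_W is the within-class scatter matrix. The function names and data are illustrative.

```python
def mean_vector(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]

def scatter(rows, mean):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix of one class."""
    sxx = sum((x - mean[0]) ** 2 for x, _ in rows)
    sxy = sum((x - mean[0]) * (y - mean[1]) for x, y in rows)
    syy = sum((y - mean[1]) ** 2 for _, y in rows)
    return sxx, sxy, syy

def fisher_direction(class_a, class_b):
    mean_a, mean_b = mean_vector(class_a), mean_vector(class_b)
    # Within-class scatter: sum of the per-class scatter matrices.
    sxx, sxy, syy = (p + q for p, q in zip(scatter(class_a, mean_a),
                                           scatter(class_b, mean_b)))
    det = sxx * syy - sxy * sxy
    dx, dy = mean_b[0] - mean_a[0], mean_b[1] - mean_a[1]
    # w = S_W^-1 (mean_b - mean_a), using the closed-form 2x2 inverse.
    return [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]

def project(rows, w):
    return [x * w[0] + y * w[1] for x, y in rows]

class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0)]

w = fisher_direction(class_a, class_b)
proj_a, proj_b = project(class_a, w), project(class_b, w)
```

On the new axis the two classes do not overlap, so a single threshold between them acts as the classifier.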