topic 4 Flashcards

(25 cards)

1
Q

requirements of a good descriptor

A
  • correlated with the target properties
  • obeys physics
  • adapted to the size of the molecules
  • cheap to compute
  • not correlated with other descriptors (non-redundant)
2
Q

what is a zero-dimensional (0D) descriptor

A
  • provides atom and bond counts
  • tells us the molecular weight
  • provides no information about molecular structure or connectivity
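Example (a minimal sketch, not a reference implementation): a 0D descriptor can be computed from the molecular formula alone. The helper name `zero_d_descriptor` and the small table of rounded atomic masses are assumptions introduced here, and the parser only handles simple formulas such as C2H6O (no brackets or charges).

```python
import re

# Rounded standard atomic masses (g/mol); a small illustrative table only.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def zero_d_descriptor(formula):
    """Return (atom counts, total atom count, molecular weight) for a simple formula."""
    counts = {}
    # Each match is an element symbol followed by an optional integer count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    weight = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return counts, sum(counts.values()), weight

# Ethanol and dimethyl ether share the formula C2H6O.
counts, n_atoms, weight = zero_d_descriptor("C2H6O")
```

Ethanol and dimethyl ether both have the formula C2H6O, so they get identical 0D descriptors even though their connectivity differs.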
3
Q

what is a one-dimensional (1D) descriptor

A
  • lists substructures/fragments, such as functional groups
4
Q

what is a two-dimensional (2D) descriptor

A
  • provides information on molecular topology
  • often based on the graph representation of the molecule
5
Q

what is a three-dimensional (3D) descriptor

A
  • provides information about the spatial coordinates of the atoms in a molecule
6
Q

what is a four-dimensional (4D) descriptor

A
  • grid-based descriptor that introduces a fourth dimension to a 3D descriptor
7
Q

what is SMILES

A

Simplified Molecular Input Line Entry System
- a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings

8
Q

what is a molecular graph

A

a connected, undirected graph in one-to-one correspondence with the structural formula of a molecule
- vertices = atoms
- edges = chemical bonds
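Example (a minimal sketch under stated assumptions): building a molecular graph from a SMILES string. The function `smiles_to_graph` is a hypothetical helper written for this card. It only understands a tiny SMILES subset (single-letter atoms C, N, O, F, S, P, the bond symbols `-`, `=`, `#`, and parentheses for branches), so it is not a real SMILES parser. Hydrogens stay implicit, as in SMILES.

```python
def smiles_to_graph(smiles):
    """Parse a tiny SMILES subset into (atoms, edges).

    atoms: list of element symbols (graph vertices)
    edges: list of (i, j, bond_order) tuples (graph edges)
    """
    bond_orders = {"-": 1, "=": 2, "#": 3}
    atoms, edges = [], []
    stack = []      # atoms to return to when a branch closes
    prev = None     # index of the atom the next atom bonds to
    order = 1       # bond order of the next bond
    for ch in smiles:
        if ch in bond_orders:
            order = bond_orders[ch]
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch in "CNOFSP":
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                edges.append((prev, idx, order))
            prev = idx
            order = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, edges

# Acetic acid: CH3-C(=O)-OH
atoms, edges = smiles_to_graph("CC(=O)O")
# atoms -> ['C', 'C', 'O', 'O']
# edges -> [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
```

Rings, aromatic atoms, charges and two-letter elements such as Cl are deliberately out of scope here; a cheminformatics toolkit would be the usual choice for real molecules.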

9
Q

what is a Morgan fingerprint

A
  • a representation of a molecule that identifies the presence of substructures/fragments within it
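Example (a simplified sketch, not the algorithm used by any particular toolkit): the idea behind a Morgan-style fingerprint is to give each atom an identifier, repeatedly update it from its neighbours' identifiers, and fold every identifier into a fixed-length bit set. The function `morgan_like_bits`, the CRC32 hash and the 64-bit length are all assumptions chosen for illustration. Real implementations also encode bond orders, charges and more, and remove duplicate environments.

```python
import zlib

def morgan_like_bits(atoms, bonds, radius=2, n_bits=64):
    """Return the set of 'on' bit positions for a molecular graph.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Radius 0: each atom's identifier depends only on its element.
    ids = [zlib.crc32(a.encode()) for a in atoms]
    on_bits = {x % n_bits for x in ids}

    # Each iteration grows the described environment by one bond.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            # Sorting makes the result independent of atom numbering.
            env = (ids[i], tuple(sorted(ids[j] for j in neighbours[i])))
            new_ids.append(zlib.crc32(repr(env).encode()))
        ids = new_ids
        on_bits |= {x % n_bits for x in ids}
    return on_bits

# Ethanol heavy atoms written in two different atom orders.
ethanol_bits = morgan_like_bits(["C", "C", "O"], [(0, 1), (1, 2)])
reordered_bits = morgan_like_bits(["O", "C", "C"], [(0, 1), (1, 2)])
```

Both atom orderings give the same bit set, which is the property that makes such fingerprints usable for comparing molecules.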
10
Q

what is a coulomb matrix

A

a simple global descriptor which mimics the electrostatic interaction between nuclei
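Example (a minimal sketch): the commonly used definition sets the diagonal to 0.5 * Z_i^2.4 and the off-diagonal entries to Z_i * Z_j / |R_i - R_j|, where Z is the nuclear charge and R the position. The function name `coulomb_matrix`, the use of atomic units and the H2 bond length of 1.4 bohr are assumptions made for this illustration.

```python
import math

def coulomb_matrix(charges, positions):
    """Coulomb matrix for nuclear charges Z and Cartesian positions R."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: fitted self-interaction term of the free atom.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j.
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix

# H2 molecule, assumed bond length 1.4 bohr.
cm = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

The matrix depends on atom ordering, so in practice rows and columns are often sorted (or the eigenvalues are used) before feeding it to a model.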

11
Q

what does SOAP stand for

A

Smooth Overlap of Atomic Positions

12
Q

what is SOAP

A

a descriptor that encodes regions of atomic geometries using a local expansion of a Gaussian-smeared atomic density in orthonormal functions based on spherical harmonics and radial basis functions
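In symbols, one common way to write this is sketched below. Notation and normalisation vary between references and implementations, so the symbols here (smearing width \sigma, radial basis g_n, coefficients c_{nlm}) are assumptions rather than the course's own notation.

```latex
% Gaussian-smeared density around a central atom, expanded in
% radial basis functions g_n and spherical harmonics Y_{lm}
\rho(\mathbf{r}) = \sum_i \exp\!\left(-\frac{|\mathbf{r}-\mathbf{r}_i|^2}{2\sigma^2}\right)
                 = \sum_{n l m} c_{nlm}\, g_n(r)\, Y_{lm}(\hat{\mathbf{r}})

% Rotationally invariant power spectrum built from the coefficients
p_{n n' l} \propto \sum_{m} c_{nlm}^{*}\, c_{n' l m}
```

The sum over i runs over the neighbours of the central atom, and the power spectrum vector p is what is usually used as the descriptor.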

13
Q

why is feature selection important

A
  • model explainability
  • model debugging
  • improve model performance
14
Q

Gini importance (mean decrease in impurity)

A
  • evaluates how much a feature reduces the impurity of the nodes it splits in a decision tree
    Gini impurity (0 to 0.5 for two classes) = likelihood of new data being misclassified
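Example (a minimal sketch): Gini impurity of a node and the impurity decrease produced by one split. The function names are hypothetical. In tree ensembles, a feature's Gini importance is typically obtained by summing such decreases (weighted by the fraction of samples reaching the node) over every node that splits on that feature.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum of squared class fractions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - children

parent = [0, 0, 1, 1]
parent_gini = gini_impurity(parent)                   # 0.5, the two-class maximum
decrease = impurity_decrease(parent, [0, 0], [1, 1])  # 0.5, a perfect split
```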
15
Q

permutation feature importance

A
  • measures the increase in the model's prediction error after a feature's values are shuffled, which breaks the relationship between the feature and the true output
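Example (a minimal sketch): shuffle one feature column, re-score the model, and report the average increase in error. The function names, the mean-squared-error metric and the toy model are assumptions for illustration; the model here is any callable that maps a row to a prediction.

```python
import random

def mse(model, X, y):
    """Mean squared error of a callable model on rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # breaks the feature/target relationship
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, column)]
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats

# Toy data: the target depends only on feature 0; feature 1 is irrelevant.
X = [[float(i), float((7 * i) % 20)] for i in range(20)]
y = [2.0 * row[0] for row in X]
model = lambda row: 2.0 * row[0]

imp_used = permutation_importance(model, X, y, feature=0)
imp_unused = permutation_importance(model, X, y, feature=1)
```

Shuffling the feature the model relies on raises the error, while shuffling the ignored feature leaves it unchanged, so its importance is zero.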
16
Q

SHapley Additive exPlanations (SHAP)

A
  • game-theoretic approach (measures each player's contribution)
  • SHAP values show how each feature affects the final prediction, how significant each feature is compared with the others, and how much the model relies on it
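Example (a minimal sketch): exact Shapley values computed by enumerating every coalition of features. This is only feasible for a handful of features, and practical tools approximate it. The function `shapley_values`, the choice of replacing "absent" features with a baseline value, and the toy model are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley value of each feature for the prediction f(x).

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Toy model with one interaction term between features 0 and 2.
f = lambda z: 3 * z[0] + 2 * z[1] + z[0] * z[2]
phi = shapley_values(f, x=[1, 1, 1], baseline=[0, 0, 0])
# phi is approximately [3.5, 2.0, 0.5]
```

The values sum to f(x) minus f(baseline), here 6, and the interaction term is shared equally between the two features involved.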
17
Q

what is the curse of dimensionality

A
  • arises when the number of data points is small relative to the intrinsic dimension of the data
18
Q

what does the curse of dimensionality cause

A
  • data sparsity (mostly empty space, which makes clustering and classification challenging)
  • increased computation time
  • overfitting (fitting noise)
  • visualisation challenges
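Example (a minimal sketch of the sparsity effect): split each axis into 10 bins and ask what fraction of the resulting grid cells a fixed number of points could occupy at best. The function name and the choice of 1000 points and 10 bins are assumptions for illustration.

```python
def max_occupied_fraction(n_points, n_dims, bins_per_dim=10):
    """Upper bound on the fraction of grid cells that can contain a point."""
    n_cells = bins_per_dim ** n_dims
    return min(1.0, n_points / n_cells)

# 1000 points: enough to cover a 3D grid, hopelessly sparse in 10D.
fractions = {d: max_occupied_fraction(1000, d) for d in (1, 2, 3, 6, 10)}
```

With 1000 points the bound is 1.0 up to three dimensions, about 0.001 in six dimensions and about 1e-7 in ten, so almost all of the space is empty.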
19
Q

wrappers

A
  • use a model to score different subsets of features and select the best one
  • performance is evaluated on a hold-out set
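Example (a minimal sketch): an exhaustive wrapper that scores every feature subset with a 1-nearest-neighbour classifier on a hold-out set. The function names, the choice of 1-NN as the model and the toy data are assumptions for illustration; real wrappers usually search greedily because the number of subsets grows exponentially.

```python
from itertools import combinations

def predict_1nn(train_X, train_y, row, features):
    """Label of the nearest training row, using only the given features."""
    def sq_dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in features)
    nearest = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], row))
    return train_y[nearest]

def holdout_accuracy(features, train_X, train_y, test_X, test_y):
    hits = sum(predict_1nn(train_X, train_y, row, features) == t
               for row, t in zip(test_X, test_y))
    return hits / len(test_y)

def wrapper_select(train_X, train_y, test_X, test_y):
    """Score every non-empty feature subset; ties go to the smaller subset."""
    n = len(train_X[0])
    subsets = [s for k in range(1, n + 1) for s in combinations(range(n), k)]
    scores = {s: holdout_accuracy(s, train_X, train_y, test_X, test_y)
              for s in subsets}
    best = max(subsets, key=lambda s: scores[s])
    return best, scores

# Feature 0 separates the classes; feature 1 is misleading noise.
train_X = [[0, 5], [1, 0], [10, 4], [11, 1]]
train_y = [0, 0, 1, 1]
test_X = [[0.5, 1], [10.5, 5]]
test_y = [0, 1]

best, scores = wrapper_select(train_X, train_y, test_X, test_y)
# best -> (0,)
```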
20
Q

filters

A
  • evaluate the usefulness of each feature individually
  • based on its statistical relationship with the model target
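Example (a minimal sketch): a filter that ranks features by the absolute Pearson correlation with the target, without training any model. The function names, the choice of Pearson correlation as the statistic and the toy data are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_rank(columns, target):
    """Rank feature names by |correlation| with the target, best first."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

columns = {
    "informative": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 0.0, -1.0, 2.0],
}
target = [2.1, 3.9, 6.2, 7.8, 10.1]

ranking, scores = filter_rank(columns, target)
```

A filter like this is fast, but it scores each feature on its own, so it can miss features that are only useful in combination.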
21
Q

embedders (embedded methods)

A
  • embed feature selection into the training of the ML model
  • combine the speed of filters with selecting a feature subset suited to a particular model
22
Q

dimensionality reduction

A
  • reduces the number of random variables by obtaining a set of principal variables
  • retains the most important information
23
Q

principal component analysis (PCA)

A
  • unsupervised machine learning method
  • transforms correlated original variables into uncorrelated linear combinations of the original variables, called principal components
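Example (a minimal sketch for two variables only): the principal components are the eigenvectors of the covariance matrix, which has a closed form in 2D. The function `pca_2d` is a hypothetical helper; for more than two variables a numerical eigen-solver would be used instead.

```python
import math

def pca_2d(points):
    """First principal direction and both eigenvalues of the 2x2 covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Angle of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - c)
    pc1 = (math.cos(theta), math.sin(theta))
    half_gap = math.hypot((a - c) / 2, b)
    eigenvalues = ((a + c) / 2 + half_gap, (a + c) / 2 - half_gap)
    return pc1, eigenvalues

# Perfectly correlated data lying on the line y = x.
pc1, (var1, var2) = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
explained = var1 / (var1 + var2)
```

For this data the first principal component points along the diagonal and explains all of the variance, so one component is enough to describe two correlated variables.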
24
Q

linear discriminant analysis

A
  • supervised
  • separates two or more groups/classes
  • focuses on maximizing separability among known categories by creating a new linear axis and projecting the data points onto it
  • useful for classification
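Example (a minimal sketch for two classes in two dimensions): Fisher's linear discriminant takes the new axis as w = Sw^-1 (m1 - m0), where Sw is the within-class scatter matrix and m0, m1 are the class means. The function `fisher_lda_2d` and the toy data are assumptions for illustration.

```python
def fisher_lda_2d(class0, class1):
    """Direction w that best separates two 2D classes (Fisher's criterion)."""
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    m0, m1 = mean(class0), mean(class1)
    # Within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, m in ((class0, m0), (class1, m1)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    dmx, dmy = m1[0] - m0[0], m1[1] - m0[1]
    # w = Sw^-1 (m1 - m0), using the closed-form 2x2 inverse.
    return ((syy * dmx - sxy * dmy) / det, (-sxy * dmx + sxx * dmy) / det)

class0 = [(0, 0), (1, 0), (0, 1), (1, 1)]
class1 = [(4, 4), (5, 4), (4, 5), (5, 5)]
w = fisher_lda_2d(class0, class1)   # (2.0, 2.0)

# Project every point onto the new axis.
proj0 = [w[0] * x + w[1] * y for x, y in class0]
proj1 = [w[0] * x + w[1] * y for x, y in class1]
```

After projection the two classes occupy disjoint ranges on the single new axis, so a threshold between them classifies new points.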
25
Q

t-distributed stochastic neighbour embedding (t-SNE)

A
  • unsupervised
  • non-linear dimensionality reduction
  • used to visualise high-dimensional data sets