Machine Learning and Protein Structure Flashcards

Week 10 Lecture 2

1
Q

What are artificial neural networks?

A

Artificial neural networks comprise many switching units (artificial neurons) that are connected according to a specific network architecture. The objective of an artificial neural network is to learn how to transform inputs into meaningful outputs.

2
Q

Activation functions

A
  • tanh
  • sigmoid: originally the most popular non-linear activation function used in neural nets
  • rectifier (ReLU): now the most commonly used because it works well in deep networks; although piecewise linear, it is technically still a non-linear function (all three are sketched below)
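
A minimal NumPy sketch of these three functions (my own illustration, not from the lecture):

  import numpy as np

  def tanh(x):
      # Squashes inputs into the range (-1, 1)
      return np.tanh(x)

  def sigmoid(x):
      # Squashes inputs into the range (0, 1)
      return 1.0 / (1.0 + np.exp(-x))

  def relu(x):
      # Rectifier: zero for negative inputs, identity for positive inputs
      return np.maximum(0.0, x)

  x = np.array([-2.0, 0.0, 2.0])
  print(tanh(x), sigmoid(x), relu(x))
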
3
Q

What is a transformer?

A
  • Take a set of vectors representing something and transform them into a new set of vectors which have extra contextual information added
  • These are sequence-to-sequence methods
  • They treat data as sets and so are permutation invariant (they don’t take order into account, although this can be fixed by adding a position term to each token’s vector encoding; see the sketch below)
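
A small sketch of one common way to add that position term, the sinusoidal encoding from the original transformer paper (the sequence length and embedding dimension here are just illustrative):

  import numpy as np

  def sinusoidal_positions(seq_len, d_model):
      # One row per token position, one column per embedding dimension
      pos = np.arange(seq_len)[:, None]
      i = np.arange(d_model)[None, :]
      angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
      enc = np.zeros((seq_len, d_model))
      enc[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
      enc[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
      return enc

  tokens = np.random.randn(5, 8)                        # 5 token vectors of dimension 8
  tokens_with_pos = tokens + sinusoidal_positions(5, 8)
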
4
Q

Simple 2-layer neural network

A
  • A network is made smarter by adding layers
  • Connecting every node to every node in the adjacent layers means each layer can be computed as a single matrix multiplication (see the sketch below)
  • In the accompanying diagram, the matrices drawn with blue lines are the hidden layers
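
A minimal sketch of a fully connected 2-layer network written as two matrix multiplications (the layer sizes, random weights, and tanh non-linearity are chosen for illustration):

  import numpy as np

  rng = np.random.default_rng(0)
  W1 = rng.standard_normal((4, 8))   # input dimension 4 -> hidden dimension 8
  W2 = rng.standard_normal((8, 2))   # hidden dimension 8 -> output dimension 2

  def two_layer_net(x):
      hidden = np.tanh(x @ W1)       # fully connected layer + non-linearity
      return hidden @ W2             # fully connected output layer

  y = two_layer_net(rng.standard_normal(4))
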
5
Q

How do we initially embed tokens?

A
  • Tokens need to be converted into numbers before they can be used in the neural network
  • Embed them in a high-dimensional space by assigning each token a vector of numbers
  • The numbers are initially random and the embeddings are stored in a lookup table; the same token always maps to the same vector
  • Each vector is a position in the high-dimensional space (see the sketch below)
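
A minimal sketch of an embedding lookup table (the toy vocabulary and the 16-dimensional vectors are my own illustration):

  import numpy as np

  vocab = {"the": 0, "cat": 1, "sat": 2}
  embedding_table = np.random.randn(len(vocab), 16)   # one random 16-dim vector per token

  def embed(tokens):
      # The same token always maps to the same row of the table
      return embedding_table[[vocab[t] for t in tokens]]

  vectors = embed(["the", "cat", "sat"])   # shape (3, 16)
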
6
Q

How do we measure the similarity between two vectors?

A
  • Represent the tokens as vectors; Euclidean distance then works in any number of dimensions
  • Cosine similarity is better because:
    1. It measures the angle between the vectors
    2. It is independent of the vector lengths
  • The smaller the angle, the more similar the vectors are to each other (see the sketch below)
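
A short sketch comparing the two measures (the example vectors are made up):

  import numpy as np

  def euclidean_distance(a, b):
      return np.linalg.norm(a - b)

  def cosine_similarity(a, b):
      # Cosine of the angle between a and b; independent of their lengths
      return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

  a = np.array([1.0, 2.0, 3.0])
  b = np.array([2.0, 4.0, 6.0])        # same direction as a, twice the length
  print(euclidean_distance(a, b))      # non-zero
  print(cosine_similarity(a, b))       # 1.0: identical direction
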
7
Q

How are word embeddings done?

A
  • Place the words randomly initially in high dimensional space
  • Process the words such that similar or related words are close together
8
Q

Transformer encoder

A

Input vectors → transformed vectors → final outputs

9
Q

Scaled dot-product attention

A
  • Calculates the dot product of every pair of vectors in the input
  • Q (queries) and K (keys) are tensors built from the same set of vectors: the vectors assigned to the words in the sentence
  • The system is effectively calculating how similar the words are to each other
  • K is transposed, Q and the transposed K are multiplied, and the result is normalised by dividing by the square root of the number of dimensions
  • The result is run through a softmax function so that the values in each row add up to 1
  • These values become the attention weights, which are used to form a weighted average of the input tensor V (the values)
  • Each final output vector is therefore a weighted sum of the original value vectors, with the softmax outputs as the weights (see the sketch below)
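
A minimal NumPy sketch of scaled dot-product attention as described above (the shapes are illustrative; in self-attention Q, K, and V are all built from the same input vectors):

  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  def scaled_dot_product_attention(Q, K, V):
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)        # dot product of every pair of vectors, scaled
      weights = softmax(scores, axis=-1)   # each row sums to 1: the attention weights
      return weights @ V                   # weighted averages of the value vectors

  X = np.random.randn(5, 16)                        # 5 token vectors of dimension 16
  out = scaled_dot_product_attention(X, X, X)       # self-attention: Q = K = V = X
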
10
Q

Scaled dot-product self-attention

A
  • Dot products are calculated between all pairs of input vectors
  • The dot products are scaled by dividing by the square root of the input vector dimension
  • The similarities are normalised with the softmax function so that each row of the similarity matrix adds up to 1. The result is called the attention matrix.
  • Each row of the attention matrix is used as the weights in a weighted average of the value input vectors.
  • These weighted averages become the new output vectors.
11
Q

Multi-head attention

A
  • Scaled dot-product attention on its own has no trainable parameters and is not adjustable in any way
  • Q, K, and V can each be run through linear layers (perceptrons), which have weights and can be trained
  • SDPA takes the place of the non-linear activation function in ordinary neural nets
  • You can have multiple SDPA blocks (heads) that all take in the same input but focus on different aspects of it (see the sketch below)
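
A sketch of multi-head attention with learned linear projections for Q, K, and V (the head count, dimensions, and random weights are illustrative; in practice the weights are learned during training):

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  def attention(Q, K, V):
      return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

  d_model, n_heads, d_head = 16, 4, 4
  rng = np.random.default_rng(0)
  # One set of trainable projection weights per head (random here, learned in practice)
  Wq = rng.standard_normal((n_heads, d_model, d_head))
  Wk = rng.standard_normal((n_heads, d_model, d_head))
  Wv = rng.standard_normal((n_heads, d_model, d_head))
  Wo = rng.standard_normal((n_heads * d_head, d_model))

  def multi_head_attention(X):
      # Each head projects the same input differently, so it can focus on different aspects
      heads = [attention(X @ Wq[h], X @ Wk[h], X @ Wv[h]) for h in range(n_heads)]
      return np.concatenate(heads, axis=-1) @ Wo   # combine the heads with a final projection

  out = multi_head_attention(rng.standard_normal((5, d_model)))
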
12
Q

Bidirectional Encoder Representations from Transformers (BERT)

A
  • Hide parts of the input sentence and make the transformer return a probability distribution over the likely missing tokens
  • Training on this task updates the weights, which makes the model capture context better
  • This can make the transformer understand context and language
13
Q

BERT loss

A
  • The transformer is used to encode randomly masked versions of the same text
  • The original word/letter/amino acid is replaced with a placeholder (mask) token meaning "unknown"
  • The transformer is trained to correctly predict the tokens that have been masked out (see the sketch below)
  • Predictions are scored with a cross-entropy loss function
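
A toy sketch of the masking step (the sequence, the mask rate, and the <mask> placeholder are made up for illustration; a real model is then trained to predict the masked tokens):

  import random

  sequence = list("MKTAYIAKQR")          # a toy amino-acid sequence
  masked = sequence.copy()

  # Randomly replace a couple of positions (in practice roughly 15%) with a mask token
  for i in random.sample(range(len(sequence)), k=2):
      masked[i] = "<mask>"

  # The transformer is trained so that its predicted distribution at each masked
  # position puts high probability on the original token (scored with cross-entropy)
  print("".join(sequence))
  print(" ".join(masked))
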
14
Q

Cross-entropy loss function

A

Assesses the probability assigned to the correct word in the probability distribution that the network outputs. If the correct word has a high probability in the distribution, you get a low loss value (see the worked example below).
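
A small worked example of that idea (the probability values are made up):

  import numpy as np

  def cross_entropy(predicted_probs, correct_index):
      # The loss is the negative log probability assigned to the correct token
      return -np.log(predicted_probs[correct_index])

  probs = np.array([0.05, 0.80, 0.10, 0.05])   # model's distribution over 4 tokens
  print(cross_entropy(probs, 1))   # correct token has high probability -> low loss (~0.22)
  print(cross_entropy(probs, 3))   # correct token has low probability  -> high loss (~3.0)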

15
Q

Different ways of using language models

A
  • Single-task supervised training
  • Unsupervised pre-training + supervised fine-tuning
  • Unsupervised pre-training + supervised training of small downstream classifier
  • Future: unsupervised pre-training at scale + prompting (few-shot learning)
16
Q

Single-task supervised training

A

Train a transformer LM on labelled sequences to predict correct labels

17
Q

Unsupervised pre-training + supervised fine-tuning

A

Train a transformer LM using BERT on unlabelled sequences. Then add a new output layer and continue training it on labelled sequences to predict correct labels.

18
Q

Unsupervised pre-training + supervised training of small downstream classifiers

A

Train a transformer LM using BERT on unlabelled sequences then freeze the weights. Use the frozen LM outputs on labelled sequences to generate inputs to train a new model to predict labels.

19
Q

Future: unsupervised pre-training at scale + prompting (few-shot learning)

A

Train a very large transformer LM autoregressively on very diverse data. Then try to find suitable prompts to induce it to predict correct labels.

20
Q

Using a pre-trained protein language model

A
  1. Input sequence
  2. A fixed pre-trained language model generates embeddings of each residue
  3. Average the embeddings to produce a summary vector
  4. The summary vector is used to train a specialised neural net (see the sketch below)
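
A sketch of steps 2-4 with a random stand-in in place of a real pre-trained protein language model (the embedding dimension and the frozen_language_model function are assumptions for illustration):

  import numpy as np

  EMB_DIM = 1024

  def frozen_language_model(sequence):
      # Stand-in for a fixed pre-trained protein LM: one embedding per residue
      rng = np.random.default_rng(0)
      return rng.standard_normal((len(sequence), EMB_DIM))

  def summary_vector(sequence):
      per_residue = frozen_language_model(sequence)   # shape (length, EMB_DIM)
      return per_residue.mean(axis=0)                 # average over residues

  # The summary vector becomes the input used to train a small task-specific network
  x = summary_vector("MKTAYIAKQR")   # shape (EMB_DIM,)
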
21
Q

Correlated mutations in proteins

A
  • Residues in close proximity have a tendency to covary, probably in order to maintain a stable microenvironment
  • Changes at one site can be compensated for by a mutation in another site
  • These spatial constraints leave an evolutionary record within the sequence
  • By observing patterns of covarying residues in deep MSAs of homologous sequences, we can infer this structural information
22
Q

Predicting the 3D structure of proteins by amino acid co-evolution

A
  • We can produce accurate lists of contacting residues from covariation observed in large MSAs
  • If we have an efficient way to project this information into 3D space whilst satisfying the physicochemical constraints of protein chains then we have everything we need to predict 3D structure
23
Q

What does AlphaFold2 do?

A
  • Encodes an MSA using transformer blocks to produce an embedding
  • Decodes the MSA embedding to generate 3D coordinates
24
Q

BERT for training AF

A
  • Mask out random amino acids
  • The system has to guess what they are
  • Learns the context of amino acids for this given protein and homologues
  • Network learns about co-variation
25
Q

AlphaFold2 Evoformer block

A

MSA representation tower:
- The neural network first looks for row-wise relationships between residue pairs within each sequence, before considering column-wise information, which evaluates each residue position in the context of the other sequences.

Pair representation tower:
- Evaluates the relationship between every pair of residues (nodes) to refine the proximities, or edges, between them.
- It achieves this by triangulating the relationship of each node in a pair relative to a third node.
- The goal is to help the network satisfy the triangle inequality.

26
Q

How does AlphaFold2 produce a 3D structure?

A
  • A simple neural network layer can reduce the number of input dimensions, for example from 10 to 3
  • Each of the 3 outputs is a weighted sum of the 10 inputs
  • This operation is called a projection and is used many times in AlphaFold2 (dimensions are usually 256 but can be 384 or 128); see the sketch below
  • Each layer in the network can either reduce or increase the number of dimensions
  • In this way the network converts the MSA representation into a 3D structure
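
A sketch of a projection in this sense: a single linear layer that changes the number of dimensions, here 10 in and 3 out (the weights are random rather than trained):

  import numpy as np

  rng = np.random.default_rng(0)
  W = rng.standard_normal((10, 3))   # trainable weights of the projection

  def project(x):
      # Each of the 3 outputs is a weighted sum of the 10 inputs
      return x @ W

  y = project(rng.standard_normal(10))   # 10 dimensions in, 3 dimensions out
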
27
Q

The structure module

A
  • A neural network that takes the refined representations and performs rotations and translations on each amino acid, producing an initial guess of the 3D protein structure (see the sketch below).
  • It also applies physical and chemical constraints determined by bond lengths, bond angles, and torsion angles.
  • The refined representations, together with the output of the structure module, are recycled back through the Evoformer and structure module 3 more times (4 cycles in total) before arriving at the final result: predicted 3D atomic coordinates for the protein structure.
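
A minimal sketch of what a rigid-body update of this kind looks like in isolation, rotating and translating a set of atom coordinates (the rotation, translation, and coordinates are made up; this is not AlphaFold2's actual implementation):

  import numpy as np

  def apply_rigid_transform(coords, rotation, translation):
      # Rotate, then translate, a set of 3D atom coordinates
      return coords @ rotation.T + translation

  theta = np.pi / 6                              # example rotation about the z-axis
  rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0,            0.0,           1.0]])
  translation = np.array([1.0, 0.0, -2.0])

  atoms = np.random.randn(5, 3)                  # toy coordinates for 5 atoms
  moved = apply_rigid_transform(atoms, rotation, translation)
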
28
Q

Training AlphaFold2

A

Things you need:
- Known 3D structures
- MSAs

AlphaFold2 is then trained to translate from a given MSA to the correct native 3D coordinates of the protein chain.

29
Q

Limitations of AlphaFold2

A
  • Model quality depends on having good MSAs
  • Reliance on evolutionary information means that AF2 cannot predict mutation effects or antibody structures
  • Only produces a single maximum likelihood conformation and doesn’t predict conformational change
  • Works in terms of MSAs and not single sequences
30
Q

AlphaFold-Multimer to model multimers

A
  • Co-evolution can be observed between protein chains that are in contact within a multimer
  • AF can model multimeric structures with properly paired MSAs
  • Success rate is about 50% (limited by MSA quality and interface region size)
  • Doesn’t work for antibody-antigen docking