Flashcards in Lectures Deck (87):

1

## What is Bioinformatics?

###
Bioinformatics is the computational analysis of

biological data in order to solve biological

questions

• Interdisciplinary discipline at the intersection of

Biology and computer science (and occasionally

chemistry, physics,mathematics)

• often used interchangeably with computational

biology

• strongly data driven

• development of algorithms and tools

2

## Topics in Bioinformatics?

###
• Sequence analysis

• Genome annotation

• Structural Bioinformatics

• Phylogenetics

• Gene expression analysis

• Genetics and Population Analysis

• Databases and Ontologies

• Data and Text Mining

• Systems Biology

3

## Data sources

###
• Sequences

Gene sequences, genomes, personal genomes,

metagenomes

• Structures

High resolution structures from X-ray crystallography,

NMR Low resolution from cryo-EM, EPR, SAX

• Gene expression data

microarrays and transcriptome sequencing

• Protein abundances

Mass spectrometry

• In all areas exponential growth of data

• Rapid development of high-throughput technologies

• Analysis of most experiments is impossible without

efficient algorithms

4

## Revolution in Sequencing Costs

###
• Sequencing costs fall exponentially

• Since 2008 faster than Moore’s law

• Latest technology < $1000 per human genome

5

##
Sequence Analysis --> Foundation of bioinformatics?

###
• Sequence determines structure and ultimately

function

• Sequences are easier to obtain than structures

• Homologous sequences are likely to have same or

similar function

• But, sequences will evolve faster than structures and

function

• We need to be able to detect remote homolgies

6

## Homology vs. Similarity ?

###
• Two sequences are homologous if they are derived

from acommon ancestor

• Similarity ̸= Homology

• Homologous sequences are likely to exhibit some

similarity

• Short sequences might be similar because of

convergent evolution

• Significant similarity implies homology

7

## Measuring Sequence Similarity

###
Distance measures:

Simplest distance between strings: Hamming distance, number of differences

• Works only for sequences of equal length

• Example for a simple edit distance

• Considers point mutations only

AGUUCGAUGG

AGUCCGGUCG

δ(a, b) = { 1 if a ̸= b, 0 else

dH (AGUUCGAUGG, AGUCCGGUCG) = 3

8

## Edit Distances

###
An edit distance between two objects x, y is the minimum number (or cost) of operations needed to convert x into y

Edit distances almost automatically satisfy axioms of a metric:

d(x, y) = 0 ⇐⇒ x = y

d(x, y) = d(y, x) symmetry

d(x, y) ≤ d(x, z) + d(z, y) triangle equation

9

## The Levenshtein distance

###
• Simplest (earliest) sequence alignment

algorithm(1965)

• Computes an edit distance between two strings x, y

using insertion, deletion or substitution of a single

character

• Result can be displayed as an alignment

Example:

Levenshtein distance between “computation” and

“complication” is 3

Needs 2 substitutions, 1 insertion

Corresponding alignment:

comput-ation

complication

Computable in O(n · m) time

10

## Computing the Levenshtein distance

###
Can be computed efficiently via dynamic programming

Given two sequences x = x1 . . . xn, y = y1 . . . ym,

let d[i,j] be the Levenshtein distance of the prefixes

x1 . . . xi and y1 . . . yj, then

d[i, 0] = i

d[0, j] = j

d[i, j] = min d[i − 1, j] + 1

d[i, j − 1] + 1

d[i − 1, j − 1] + [xi ̸= yj]

for i, j > 0

11

## What’s an Alignment?

###
Given two sequence x = x1x2 . . . xn, y = y1y2 . . . ym, with xi , yi ∈ A.

An Alignment of x and y is obtained by inserting gap characters (’–’) so that both resulting strings are equal length.

An alignment encodes a sequence of edit operations to convert a sequence x into y (or vice versa) using substitutions, insertions and deletions.

It can be interpreted as an hypothesis of the evolutionary history of the two sequences.

Example:

Y E − S T E R D A Y

− E A S T E R − − −

means (del(x, 1), ins(x, 2, A), del(x, 7), del(x, 8), del(x, 9))

Since we cannot distinguish between an insertion in x and and deletion in y, we usually talk about “indels”.

12

##
What’s an Alignment?

###
An alignment (also called alignment trace) is a “matching” between two sequences.

Matching positions are written above each other, unmatched positions are written with a gap (’–’) on

the opposite side.

This suggest a functional interpretation of the alignment:

matching positions should fulfill the equivalent functions in the two biomolecules

13

## Counting Alignments

###
Alternatively, count the number of traces, i.e. possible

matchings.

AU-C

A-GC and

A-UC

AG-C

have the same matches, but different alignment.

The number of traces for two sequences of length n and m is

n+m

n

14

## What is the best Alignment?

###
The biologically correct alignment:

• The Alignment reflecting the true evolutionary history

• The alignment matching functionally equivalent

positions

The optimal Alignment:

• The alignment fulfilling an optimization criterion such

as Levenshtein distance

optimal ̸= correct. But good optimization criteria can help.

15

## The biologically correct alignment:

###
• The Alignment reflecting the true evolutionary history

• The alignment matching functionally equivalent

positions

16

## The optimal Alignment:

###
• The alignment fulfilling an optimization criterion such

as Levenshtein distance

17

## Similarity Alignments

###
Making alignments more realistic

• Use different scores for different substitutions

• Gaps should be more costly than mismatches

• Instead of minimizing a distance we can maximize a

similarity

• Easier to derive realistic similarity scores.

18

## The Needleman Wunsch Algorithm

###
Let s(a, b) be the similarity score for aligning characters a, b, g the penalty for a gap (indel).

Let Si,j be the score of the optimal alignment of x1 . . . xi

with y1 . . . yj.

Si,j can be computed using the recursion

Si−1,j−1+ s(xi, yj)

Si,j = max Si,j−1+ g

Si−1,j+ g

with initialization S0,j= g · j and Si,0= g · i

19

## The Needleman Wunsch Algorithm

###
Example:

AATAC aligned with GATTAC

g = −3 s(a, a) = +2 s(a, b) = −1

- G A T T A C

- 0 -3 -6 -9 -12 -15 -18

A -3 -1 -1 -4 -7 -10 -13

A -6 -4 1 -2 -5 -5 -8

T -9 -7 -2 3 0 -3 -6

A -12 -10 -5 0 2 2 -1

C -15 -13 -8 -3 -1 1 4

The optimal alignment score 4 can be found in the lower right corner.

20

## Backtracing

###
• Filling the table S calculates the score of the optimal

alignment, but not the alignment itself.

• To obtain the alignment we backtrace, i.e. we ask how

each entry was obtained from the recursion.

• In the example, the final entry S5,6 = 4 was obtained

as S4,5 + s(C, C) = 2 + 2 = 4 Thus the alignment has to

end with a CC match and we continue backtracing

from S4,5.

• Backtracing yields a path through the DP matrix from

the lower right to the upper left corner

• There is a one-to-one correspondence between paths

through the matrix and alignment

• Often there will be more than one optimal alignment

21

## Variations of alignment

###
So far we considered global alignments, appropriate if both sequences are complete and homologous over the full length.

Often were this is not ideal because:

• Some of our sequences may be incomplete Leads to

lots of spurious gaps at beginning or end

• We may want to compare short with very long

sequences e.g. short sequence vs. a complete

genome

• We’re searching for local similarities

In the first two cases we need a semi-global alignment

In the last one a local alignment

22

## Semi-global or glocal Alignment

###
a. Aligning a short sequence (fragment) against a long

one

• Ignore gaps at the beginning and end of the short

sequence these do not represent insertion or

deletions

• Two changes to algorithm:

Change initialization S0,j = 0 (ignores leading gaps

in first seq)

Start backtracing at highest value in bottom row

(ignores trailing gaps)

b. Aligning two possibly incomplete sequences

• Solved by aligning with free end gaps

• Changes to algorithm:

Change initialization S0,j = Si,0 = 0 (ignore all

leading gaps)

Start backtracing at highest value in last row and

column (ignores trailing gaps)

Since we deal with incomplete sequences most of the time. Semi-global or “free endgap” alignment is often the default.

23

## Local Alignments

###
Often we are interested in only local similarity

• Sequences might be too diverged for meaningful

global alignment

• Otherwise unrelated sequences might share short

functional elements

• Helps focus on strongly conserved functionally

important regions

• Useful for aligning very long sequences

• local similarity is easily lost in global alignments

24

## Computing Local Alignments

###
At first glance local alignment seem very hard:

• Should compare every pair of subsequences

xi 1. . . xi 2 and yj 1. . . yj 2

• that would entail O(n^4) space and O(n^6) time

25

## Smith Waterman solution:

###
• Idea:

• We’re not interested in poor alignments (score < 0)

• Compute only 1 local alignment for every endpoint

xi and yj

• Same recursion as Needleman-Wunsch, except:

1 Initialize with 0 as in free end gap alignment

2 Replace negative scores with 0, marks start of a

new alignment

26

## Smith Waterman Alignment

###
Let s(a, b) be the similarity score for aligning characters a, b, g the penalty for a gap (indel). Let Si,j be the score of the best local alignment ending at position i in x and j in y.

Same effort as Needleman Wunsch O(n · m) in memory and time!

27

## Smith Waterman Backtracing

###
• Find the highest entry in the table

• Begin normal backtracing from highest entry

• Stop as soon as you encouter a 0

28

##
Smith Waterman Backtracing

What if we want to find additional (suboptimal)

alignments?

### Smith Waterman Deblocking

29

## Smith Waterman Deblocking

###
• remove cells used in the first alignment by setting

them 0

• “repair” table by recomputing cells that might have

changed

• backtrace in new table

• repeat for as many alignments as desired

30

## Global alignment:

###
• Input: two potentially equivalent sequences

• Goal: identify conserved regions and differences

• Algorithm: Needleman-Wunsch dynamic programming

• Applications: comparing two related genes/proteins

31

## Semi-global alignment:

###
• Input: two sequences, one short and one long

• Goal: is the short one a part of the long one?

• Algorithm: free end gap modification of NW DP

• Applications: sequence assembly, overlap detection

32

## Local alignment:

###
• Input: two sequences that may or may not be related

• Goal: find regions of similarity between the sequences

• Algorithm: Smith-Waterman dynamic programming

• Applications: searching for local similarities in large

sequences, identification of conserved domains/motifs

33

## Better Substitution Scores

###
For DNA/RNA mostly ad-hoc score models used

• positive scores for matches

• negative for mismatches

• sometimes better score for transitions (G-A, T-C) than

for transversions (C-G, C-A, T-G, T-A)

• Negative scores are essential for local alignments

• More precisely, the expected score for random

sequences should be negative

34

## Better Substitution Scores

###
For DNA/RNA mostly ad-hoc score models used

• positive scores for matches

• negative for mismatches

• sometimes better score for transitions (G-A, T-C) than

for transversions (C-G, C-A, T-G, T-A)

• Negative scores are essential for local alignments

• More precisely, the expected score for random

sequences should be negative

35

## Reminder Conditional Probabilities

###
• P(A, B) is the joint probability for A and B happening

• P(A|B) is the conditional probability, i.e. that A

happens given that we know B

• Bayes’ theorem for conditional probabilities

P(A|B) · P(B) = P(A, B) = P(B|A) · P(A)

36

## Better Substitution Scores

###
For amino acids we need something more sophisticated

• We want to distinguish between homologous and

non-homologous sequences

• Compare two models:

Is the alignment more likely due to homologous or

random sequences?

• For simplicity consider gap free alignments Compute

P(x, y|R) (random model) and P(x, y|M) (model for

homologous sequences

• Assume each column of the alignment is independent

37

## Log Odds Scores

###
Let q(a) by the frequency of the letter a in a random sequence, and p(a, b) the probability of a column a

b in an alignment of homologous sequences.

Working with the logarithm of probabilities is more convenient with

s(a, b) = log (p(a, b))/

(q(a)q(b))

Log odds scores such as this are used widely. In general, the take the form

score = log observed freq/

expected freq

38

##
Dayhoff (PAM) Matrices

Where to get substitution probabilities p(a, b)?

### They also need to depend on time, i.e. p(a, b|t)

39

## Dayhoff (PAM) Matrices

###
• Margaret Dayhoff (1978) used hand crafted alignments

of 71 protein families

• Estimated amino acid frequencies ∑a q(a) = 1

• Counted observed substitutions fa,b= f b,a

• Total number of AA in substitutions

• Introduced PAM as time unit

• 1 PAM = time for 1 substitution per 100 amino acids

Probability for substitution a ⇒ b after time t

P(b|a, t) = p(a,b|t)/

q(a)

• Substitution score for sequences with distance kPAM

s(a, b|k PAM) = log P(b|a,k PAM)/

q(b)

• Matrix M with Mab = P(b|a, 1PAM) then

P(b|a, k PAM) = Mab(k) = (Mk)ab

• Allows to compute substitution score for arbitrary PAM

distance!

40

## PAM vs distance

###
PAM is a time unit, not equal to the number of observed

mutations.

We cannot see multiple substitutions of a single site.

%-Difference PAM %-Difference PAM

1 1 45 67

5 5 50 80

10 11 55 94

15 17 60 112

20 23 65 133

25 30 70 159

30 38 75 195

35 47 80 246

40 56 85 328

41

## BLOSUM Matrices

###
• Modern alternative to PAM matrices

• Uses conserved regions of distantly related proteins

• Series matrices for different relatedness

• E.g.: BLOSUM 60 constructed from pairwise

alignments with at most 60% identity

• Closely related sequences: low PAM, high BLOSUM

matrix

• E.g.: PAM250 ⇔ BLOSUM45.

42

## What about gap costs?

###
A T A C − − − C C

− − A C G G G C −

• Example: Three gaps of length 2, 3 and 1

cost = g(2) + g(3) + g(1)

• So far we have used linear gap costs g(k) = k · g1

• Long gaps treated as k indels of length 1

• Unrealistic: A single mutation event often inserts /

deletes several characters

• An alignment with few long gaps is more plausible

than one with many short gaps

• Linear gap costs produce alginments with too many

short gaps and very few long ones

43

## What is a good gap penalty function g(k)?

###
• g(k) should be subadditive g(k + l) ≤ g(k) + g(l)

• Examples:

• g(k) = go + b · k^2 — not subadditive

• g(k) = go + b ln(k) — most realistic

• g(k) = go + gext · (k − 1) — Affine gap cost

• Affine costs are most commonly used.

• Best tradeoff between speed and accuracy

• go is called opening cost gext extension cost

44

## Arbitarary Gap Costs

###
The problem with general gap costs

• If we split an alignment within a gap, scores do not

add up

• Can’t use Needleman-Wunsch algorithm

Variant for arbitrary gap costs: Smith-Waterman-Beyer

Cost: Memory O(n^2) (same as NW)

CPU O(n^3)

45

## Affine Gap Costs

###
• Reasonable approximation to realistic gap costs

• Computationally efficient

46

## Gotoh’s Algorithm

###
To correctly score a gap in the last column we need to know what the previous column looked like.

Introduce 3 dynamic programming tables:

Mij score of best alignment ending in match

Eij score of best alignment ending with gap in y

Fij score of best alignment ending with gap in x

Gotoh’s Recursions siehe Vorlesung 4 letzen Folien

47

## Database Searching

###
• Given a query sequence, find all similar sequences in

a large database

• Perhaps the most common bioinformatics task

• Need to compute semi-global or local alignment of

query and database

• Usually local alignment (more general)

48

## BLAST and FASTA Heuristics

###
• Local alignment with Smith-Waterman in O(n · m) time

• Too slow for searching very large databases

• Need heuristics that run much faster

• Won’t have guarantee to find optimal alignment

• Idea: good alignments should contain stretches that

are identical

• seed-and-extend strategy:

find small exact matches, extend to full local

alignment

49

## FASTA

###
1) Find all exact matches of length k (word length)

between query and database

2) Combine consecutive and overlapping exact

matches

3) Join matches on the same diagonal to form diagonal

runs → short gap free alignments

4) Score diagonal runs (e.g. BLOSUM) keep top N runs

(e.g N = 10)

5) Combine diagonal runs by allowing gaps between

them

6) Compute final local alignment using banded

Smith-Waterman

• Speed gain comes from the first step. Exact matching

in linear (O(n + m)) time

• Recommended values for k, DNA: k = 6, protein k = 2

• Finding remote protein homologies may require k = 1

→ poor performance

G D G K G

G E G R G

No k = 2 match, but very similar sequences

Even with k = 1 some alignments can be missed

50

## BLAST

###
Most used bioinformatics program of all time

• Similar to FASTA but

• Initial matches need not be exact

• instead exceed a threshold score T

• Example: Word length w = 4, T = 17, PAM-120 matrix

Query contains the sub-sequence (word) ACDE

s (ACDE

ACDE ) = 3 + 9 + 5 + 5 = 22 good

s (ACDE

GCDE ) = 1 + 9 + 5 + 5 = 20 good

s (ACDE

AADE ) = 3 − 3 + 5 + 5 = 10 bad

. . .

• We can again use fast exact matching algorithms

by first making a list of all high-scoring words

• Typically ≈50 words per residue in query

1) Decompose query into overlapping words of length w

2) Generate list of all possible w-words that score > T to

some word of the query

3) Scan database for all words in this list

4) Extend these seed hits until score falls some amount

X below the maximum so far

5) Return highest score observed → locally maximal high

scoring pair (HSP)

6) Compute statistical significance of each HSP (below)

The original BLAST only generates gap-free alignments (HSPs)

Typically, w = 3 for proteins, w = 11 for DNA/RNA

51

## BLAST2

###
Classical BLAST spends too much time extending hits

BLAST2: Two-hit-method

• Combine seed hits on the same diagonal with

distance < A

• Require at least two hits before reporting an HSP

• Needs less stringent threshold T

52

## Gapped BLAST

###
Classical BLAST spends too much time extending hits

• If score of HSP exceeds threshold Sg

• Fix match in the middle of highest scoring 11 residue

segment

• Extend using modified Smith-Waterman in both

directions

• Stop when score drops below some threshold

53

## BLAST Programs

###
The BLAST program comes in several variants:

• BLASTN: compares a DNA query sequence to a DNA

sequence database

• BLASTP: compares a protein query sequence to a

protein sequence database,

• TBLASTN: compares a protein query sequence to a

DNA sequence database (6 frames translation)

• BLASTX: compares a DNA query sequence (6 frames

translation) to a protein sequence database

• TBLASTX: compares a DNA query sequence (6

frames translation) to a DNA sequence database (6

frames translation).

54

## BLAST Statistics: The E-value

###
Given an HSP with score S. How often do we expect to achieve a hit with score S or better when scanning a database of random sequences?

E-value plausibility arguments:

• Has to be proportional to database size n

• By symmetry, proportional to query m

• Should fall exponentially with score

(Probability of a hit with score 2S equals probability of two hits with score S in a row)

• → E(S) = Kmne−λS

55

## BLAST Statistics: Where to get the parameters K and λ?

###
• K < 1 expresses that the effective lengths n and m are

slightly smaller

• λ is related to the scoring matrix

Log odds scores:

s(a, b) = 1/ log pab/

λ qaqb

pab = qaqb e ^λs(a,b)

56

## Score Distributions: What should we expect for scores of random alignments?

###
• Align query to random position in database.

How would the scores be distributed?

Sum of many independent values→ normal distribution

• Now choose the best position in the database.

What is the distribution now?

Take n values from a normal distribution, choose the

maximum → Extreme value (Gumbel) distribution

f(x) = 1/ exp (−(x−µ)/− e^−(x−µ)/β)

β β

K and λ in E-value, related to µ and β parameters of the

Gumbel distribution

57

## Score Statistics

###
- Gap free HSPs: λ and k (µ and λ) follow from score

matrix

More complicated models (e.g. alignments with gaps):

Postulate that alignment scores should still be extreme value distributed

1) Produce a large number of random alignments

2) Fit parameters of the Gumbel distribution

3) Derive the values of K and λ for E-values

Allows to compute empirical E-values

58

## P-values

###
Alternative to E-value:

What is the probability of obtaining at least one hit with score S or better?

• Number of hits should be Poisson distributed

P(x = k) = µ^k/ e^−µ

k!

• Poisson distribution gives the probability of k rare

events

Limit of the Binomial distribution for n → ∞ np = µ

• In our case µ = E(S)

P(x > 0) = 1 − P(x = 0) = 1 − e^−E

• P-value and E-value are directly connected

59

##
E- and P-values

For small E-values (very good hits) E ≈ P

###
E p

10 0.9999546

5 0.9932621

2 0.8646647

1 0.6321206

0.1 0.0951626

0.05 0.0487706

0.001 0.0009995

0.0001 0.0001000

60

## Bit Scores

###
• E- and P-values depend on scoring matrix and

database size

• Values from BLAST with different parameters are not

comparable

• How to “normalize” scores to make them comparable?

• Bit scores:

S′= (λS − lnK)/ ln 2

• Using bit scores we have:

E = mn · 2^−S

61

##
Summary Blast scores:

Raw alignment scores

Bit scores

P-values, E-values

###
• Raw alignment scores

Not comparable between different scoring matrices,

unclear significance

• Bit scores

Normalized and therefore comparable, independent

of database size

• P-values, E-values

Measure statistical significance, take size of database

into account

• True homologs may well have E-values > 1

• Protein homologs become hard to detect below 24%

identity — so-called “twilight zone”

62

## Multiple Sequence Alignments

###
One or two homologous sequences whisper . . . a full multiple alignment shouts out loud.

Hubbard et al. 1996

63

## Why Multiple Sequence Alignments?

###
• Allows to infer phylogenetic relationships

• Identify conserved regions – these are usually

functionally important

• Infer facts about protein 3D structure

• Infer secondary structure

• Distinguish hydrophobic inside from polar outside

• Identify transmembrane helices

• Infer long-range interactions

• Infer which mutations are acceptable without

destroying function

64

## Multiple Sequence Alignments

###
-MSAs identify highly conserved (functionally important)

regions

-MSAs help predict secondary structure and solvent

accessible surface

-The structure of N-acetylglucosamine-binding proteins

is held together by 4 disulphid bridges formed by 8

conserved cysteins.

Given sequences A1 . . . Ar , Ai = ai,1, ai,2, . . . ai,ni

, a multiple sequence alignment (MSA) A∗ is obtained by inserting gaps (“-”) such that all resulting sequences A∗i have equal length n, and no column consists of gaps only.

65

## Definition: Multiple Sequence Alignments

### Given sequences A1 . . . Ar , Ai = ai,1, ai,2, . . . ai,ni , a multiple sequence alignment (MSA) A∗ is obtained by inserting gaps (“-”) such that all resulting sequences A∗i have equal length n, and no column consists of gaps only.

66

## Pairwise vs Multiple Alignment

###
A multiple sequence alignment induces pairwise alignments, but not vice versa.

− a p p l e −

p a p − − e r

p e p p − e r

These three pairwise alignments are inconsistent

There exists no MSA that includes all three

1)

− a p p l e −

p a p − − e r

2)

− a p p l e −

p e p p − e r

3)

p a p − e r

p e p p e r

67

## Sum of Pair Score

###
The most commonly used score for MSAs is the Sum of Pair Score,

i.e. the sum scores of all induced pairwise alignments

often used as weighted sum of pair score

68

## Sum of Pair Score(Shortcomings:)

###
• Does not consider phylogeny

• Does not treat almost conserved columns right

69

## Sum of Pair Score

###
. . . N . . .

.

.

.

. . . N . . .

. . . C . . .

• The more “N” we see the more confident we are that the “N” is important

• ... and the more we want to penalize a mismatch

• However, SoP score is r (r−1)/ σ(N, N) + rσ(N, C)

2

• ... the larger r the smaller the relative penalty for

adding a “C”

70

## Needleman Wunsch for 3 Sequences

###
• Multiple alignment can be solved by a multi

dimensional Needleman Wunsch dynamic

programming algorithm

• For 3 sequences, 7 possibilities for final column

• Effort in Memory and CPU O(n^3)

71

## Multiple Alignment for more then 3 Sequences

###
• Dynamic programming can be generalized to r

sequences

• Need to consider 2r−1 cases for the last alignment

column

• Need to fill a r dimensional table

• Effort in Memory and CPU O(n^r)

• MSA is exponential in the nmber of seqences

NP hard problem

• rarely used for more than 3 sequences

• Heuristics needed for fast computation of MSAs

72

##
Aligning Alignments

Suppose we have two good alignments, e.g. for two phylgenetic groups

How do we get an alginment of all sequences?

###
• Build a new MSA from the individual sequences

Would destroy all the good work on the two

subalignments

• Somehow join the two alignments . . .

• An alignment is just a sequence on a more

complicated alphabet

• We can adapt pairwise alignment algorithms to align

two alignments

• Only thing that changes is the scoring of matches

73

## Profile Alignments

###
• Gaps are inserted into all rwos of the first or second

alignment

• None of the the existing gaps are ever lost

“Once a gap, always a gap”

• Called profile (or profile-profile) alignments

74

## Profiles

###
Profiles compress a sequence alignment into a vector of frequencies

C G − A

C A T A

A A A A

− A − A

A 25% 75% 25% 100%

C 50% 0% 0% 0%

T 0% 0% 25% 0%

G 0% 25% 0% 0%

− 25% 0% 50% 0%

Profiles-profile alignments don’t need the full alignment

75

## Progressive Multiple Alignment

###
Profile alignments can be used to build MSAs in steps

• Start from single sequences

• In each step perform a pairwise alignment, i.e.:

Algin two sequences, a sequence with an

alignment,or two alignments

• Repeat until we’re left with one MSA of all sequences

• Number of sequences / alignments reduced by 1 in

each step

• r − 1 pairwise alignments to produce an MSA or r

sequences

• CPU effort O(r · n^2) for r sequences of length n

76

##
Progressive Multiple Alignment

In what order should we align sequences?

###
• Best start with sequences that are most similar

• Leave the difficult alignments for last, when we have

more sequences

• Most common choice:

Build a phylogentic tree, align along the tree

Typical progressive alignment

• Perform pairwise alignments of all-against-all

sequences

• Build a guide tree (e.g. neighbor joining) from

pairwise distances

• Perform a progressive alignment following the guide

tree

• Effort O(r^2· n^2+ r · n^2)

77

## Limitations of Progressive Multiple Alignment

###
Progressive is a heuristic that greatly reduces the effort for computing an MSA

• The progressive alignment is usually not optimal

• The is no overall score that is optimized

• Main problem: “Once a gap always a gap”

but gaps inserted early may be wrong

78

## Example: Progressive Multiple Alignment:

###
Example:

Sequences x = GAAGTT, y = GACTT, z = GTACTT

x : G A A G T T

y : G A C − T T

x : G A A G T T

y : G A − C T T

x : G A A G T T

y : G A C − T T

z : G T A C T T

x : G A A G T T

y : G A − C T T

z : G T A C T T

Correct alignment of x and y only apparent when considering z

79

##
Beyond simple progressive multiple

Alignment

Progressive alignment is fast, but can yields poor results. How to improve it?

###
- Iterative refinement

Start with a progressive alignment, then try to make it

better

- Improved progressive alignment schemes

Try to include information on all sequences in every

step

- Consistency based approaches

80

## MUSCLE – OVERVIEW

###
• Basic Idea: Improve progressive alignments

through iterative refinement

• 3 stages of the algorithm

• At the completion of each, a multiple alignment

is available and the algorithm can be terminated

• Significant improvement in accuracy and speed

81

## MUSCLE – THE ALGORITHM

###
Stage 1: Draft Progressive – Builds a progressive

alignment

• Similarity of each pair of sequences is computed

using

• K-mer counting

• Constructing a global alignment and

determining fractionalidentity of the

sequences

• A tree is constructed and a root is identified

• A progressive alignment is built by following the

branching order of the tree, yielding a multiple

alignment

82

## MUSCLE – THE ALGORITHM

###
Stage 2: Improved Progressive – Improves the tree

• Similarity of each pair of sequences is computed

using fractional identity from the mutual alignment

• A tree is constructed by applying a clustering method

to the distance matrix

• The trees are compared; a set of nodes for which the

branching order has changed is identified

• A new alignment is built, the existing one is retained if

the order is unchanged

83

## MUSCLE – THE ALGORITHM

###
Stage 3: Refinement – Iterative Refinement

• An edge is deleted from a tree, dividing the

sequences into two disjoint subsets

• The profile (MA) of each subset is extracted

• The profiles are re-aligned to each other

• The score is computed, if the score has improved, the

alignment is retained, otherwise it is discarded

• Algorithm terminates at convergence

84

## T-COFFEE

###
Consistency based progressive alignment

85

## PROBCONS what does it stand for?

### Probabilistic Consistency based

86

## PROBCONS

###
• Alignment generation can be directly modeled as a

first order Markov process involving state emissions

and transitions

• Uses maximum expected accuracy alignment method

• Probabilistic consistency used as a scoring function

• Model parameters obtained using unsupervised

maximum likelihood methods

• Incorporate multiple sequence information in scoring

pairwise alignments

87