Lectures Flashcards Preview

zzz..... uni......Bioinformatics > Lectures > Flashcards

Flashcards in Lectures Deck (87):
1

What is Bioinformatics?

Bioinformatics is the computational analysis of
biological data in order to solve biological
questions

• Interdisciplinary discipline at the intersection of
Biology and computer science (and occasionally
chemistry, physics,mathematics)

• often used interchangeably with computational
biology

• strongly data driven

• development of algorithms and tools

2

Topics in Bioinformatics?

• Sequence analysis
• Genome annotation
• Structural Bioinformatics
• Phylogenetics
• Gene expression analysis
• Genetics and Population Analysis
• Databases and Ontologies
• Data and Text Mining
• Systems Biology

3

Data sources

• Sequences
Gene sequences, genomes, personal genomes,
metagenomes

• Structures
High resolution structures from X-ray crystallography,
NMR Low resolution from cryo-EM, EPR, SAX

• Gene expression data
microarrays and transcriptome sequencing

• Protein abundances
Mass spectrometry

• In all areas exponential growth of data
• Rapid development of high-throughput technologies
• Analysis of most experiments is impossible without
efficient algorithms

4

Revolution in Sequencing Costs

• Sequencing costs fall exponentially
• Since 2008 faster than Moore’s law
• Latest technology < $1000 per human genome

5

Sequence Analysis --> Foundation of bioinformatics?

• Sequence determines structure and ultimately
function

• Sequences are easier to obtain than structures

• Homologous sequences are likely to have same or
similar function

• But, sequences will evolve faster than structures and
function

• We need to be able to detect remote homolgies

6

Homology vs. Similarity ?

• Two sequences are homologous if they are derived
from acommon ancestor

• Similarity ̸= Homology

• Homologous sequences are likely to exhibit some
similarity

• Short sequences might be similar because of
convergent evolution

• Significant similarity implies homology

7

Measuring Sequence Similarity

Distance measures:
Simplest distance between strings: Hamming distance, number of differences

• Works only for sequences of equal length
• Example for a simple edit distance
• Considers point mutations only

AGUUCGAUGG
AGUCCGGUCG

δ(a, b) = { 1 if a ̸= b, 0 else

dH (AGUUCGAUGG, AGUCCGGUCG) = 3

8

Edit Distances

An edit distance between two objects x, y is the minimum number (or cost) of operations needed to convert x into y

Edit distances almost automatically satisfy axioms of a metric:

d(x, y) = 0 ⇐⇒ x = y
d(x, y) = d(y, x) symmetry
d(x, y) ≤ d(x, z) + d(z, y) triangle equation

9

The Levenshtein distance

• Simplest (earliest) sequence alignment
algorithm(1965)

• Computes an edit distance between two strings x, y
using insertion, deletion or substitution of a single
character

• Result can be displayed as an alignment

Example:
Levenshtein distance between “computation” and
“complication” is 3

Needs 2 substitutions, 1 insertion
Corresponding alignment:
comput-ation
complication

Computable in O(n · m) time

10

Computing the Levenshtein distance

Can be computed efficiently via dynamic programming
Given two sequences x = x1 . . . xn, y = y1 . . . ym,
let d[i,j] be the Levenshtein distance of the prefixes
x1 . . . xi and y1 . . . yj, then

d[i, 0] = i
d[0, j] = j

d[i, j] = min d[i − 1, j] + 1
d[i, j − 1] + 1
d[i − 1, j − 1] + [xi ̸= yj]
for i, j > 0

11

What’s an Alignment?

Given two sequence x = x1x2 . . . xn, y = y1y2 . . . ym, with xi , yi ∈ A.

An Alignment of x and y is obtained by inserting gap characters (’–’) so that both resulting strings are equal length.

An alignment encodes a sequence of edit operations to convert a sequence x into y (or vice versa) using substitutions, insertions and deletions.

It can be interpreted as an hypothesis of the evolutionary history of the two sequences.


Example:

Y E − S T E R D A Y
− E A S T E R − − −

means (del(x, 1), ins(x, 2, A), del(x, 7), del(x, 8), del(x, 9))

Since we cannot distinguish between an insertion in x and and deletion in y, we usually talk about “indels”.

12

What’s an Alignment?

An alignment (also called alignment trace) is a “matching” between two sequences.

Matching positions are written above each other, unmatched positions are written with a gap (’–’) on
the opposite side.

This suggest a functional interpretation of the alignment:
matching positions should fulfill the equivalent functions in the two biomolecules

13

Counting Alignments

Alternatively, count the number of traces, i.e. possible
matchings.

AU-C
A-GC and

A-UC
AG-C

have the same matches, but different alignment.

The number of traces for two sequences of length n and m is

n+m
n

14

What is the best Alignment?

The biologically correct alignment:

• The Alignment reflecting the true evolutionary history
• The alignment matching functionally equivalent
positions

The optimal Alignment:

• The alignment fulfilling an optimization criterion such
as Levenshtein distance

optimal ̸= correct. But good optimization criteria can help.

15

The biologically correct alignment:

• The Alignment reflecting the true evolutionary history
• The alignment matching functionally equivalent
positions

16

The optimal Alignment:

• The alignment fulfilling an optimization criterion such
as Levenshtein distance

17

Similarity Alignments

Making alignments more realistic

• Use different scores for different substitutions

• Gaps should be more costly than mismatches

• Instead of minimizing a distance we can maximize a
similarity

• Easier to derive realistic similarity scores.

18

The Needleman Wunsch Algorithm

Let s(a, b) be the similarity score for aligning characters a, b, g the penalty for a gap (indel).

Let Si,j be the score of the optimal alignment of x1 . . . xi
with y1 . . . yj.

Si,j can be computed using the recursion

Si−1,j−1+ s(xi, yj)
Si,j = max Si,j−1+ g
Si−1,j+ g

with initialization S0,j= g · j and Si,0= g · i

19

The Needleman Wunsch Algorithm

Example:
AATAC aligned with GATTAC
g = −3 s(a, a) = +2 s(a, b) = −1

- G A T T A C
- 0 -3 -6 -9 -12 -15 -18
A -3 -1 -1 -4 -7 -10 -13
A -6 -4 1 -2 -5 -5 -8
T -9 -7 -2 3 0 -3 -6
A -12 -10 -5 0 2 2 -1
C -15 -13 -8 -3 -1 1 4

The optimal alignment score 4 can be found in the lower right corner.

20

Backtracing

• Filling the table S calculates the score of the optimal
alignment, but not the alignment itself.

• To obtain the alignment we backtrace, i.e. we ask how
each entry was obtained from the recursion.

• In the example, the final entry S5,6 = 4 was obtained
as S4,5 + s(C, C) = 2 + 2 = 4 Thus the alignment has to
end with a CC match and we continue backtracing
from S4,5.

• Backtracing yields a path through the DP matrix from
the lower right to the upper left corner

• There is a one-to-one correspondence between paths
through the matrix and alignment

• Often there will be more than one optimal alignment

21

Variations of alignment

So far we considered global alignments, appropriate if both sequences are complete and homologous over the full length.

Often were this is not ideal because:

• Some of our sequences may be incomplete Leads to
lots of spurious gaps at beginning or end

• We may want to compare short with very long
sequences e.g. short sequence vs. a complete
genome

• We’re searching for local similarities

In the first two cases we need a semi-global alignment
In the last one a local alignment

22

Semi-global or glocal Alignment

a. Aligning a short sequence (fragment) against a long
one

• Ignore gaps at the beginning and end of the short
sequence these do not represent insertion or
deletions

• Two changes to algorithm:
Change initialization S0,j = 0 (ignores leading gaps
in first seq)
Start backtracing at highest value in bottom row
(ignores trailing gaps)

b. Aligning two possibly incomplete sequences

• Solved by aligning with free end gaps
• Changes to algorithm:
Change initialization S0,j = Si,0 = 0 (ignore all
leading gaps)
Start backtracing at highest value in last row and
column (ignores trailing gaps)

Since we deal with incomplete sequences most of the time. Semi-global or “free endgap” alignment is often the default.

23

Local Alignments

Often we are interested in only local similarity

• Sequences might be too diverged for meaningful
global alignment

• Otherwise unrelated sequences might share short
functional elements

• Helps focus on strongly conserved functionally
important regions

• Useful for aligning very long sequences

• local similarity is easily lost in global alignments

24

Computing Local Alignments

At first glance local alignment seem very hard:
• Should compare every pair of subsequences
xi 1. . . xi 2 and yj 1. . . yj 2
• that would entail O(n^4) space and O(n^6) time

25

Smith Waterman solution:

• Idea:
• We’re not interested in poor alignments (score < 0)
• Compute only 1 local alignment for every endpoint
xi and yj

• Same recursion as Needleman-Wunsch, except:
1 Initialize with 0 as in free end gap alignment
2 Replace negative scores with 0, marks start of a
new alignment

26

Smith Waterman Alignment

Let s(a, b) be the similarity score for aligning characters a, b, g the penalty for a gap (indel). Let Si,j be the score of the best local alignment ending at position i in x and j in y.

Same effort as Needleman Wunsch O(n · m) in memory and time!

27

Smith Waterman Backtracing

• Find the highest entry in the table
• Begin normal backtracing from highest entry
• Stop as soon as you encouter a 0

28

Smith Waterman Backtracing

What if we want to find additional (suboptimal)
alignments?

Smith Waterman Deblocking

29

Smith Waterman Deblocking

• remove cells used in the first alignment by setting
them 0
• “repair” table by recomputing cells that might have
changed
• backtrace in new table
• repeat for as many alignments as desired

30

Global alignment:

• Input: two potentially equivalent sequences
• Goal: identify conserved regions and differences
• Algorithm: Needleman-Wunsch dynamic programming
• Applications: comparing two related genes/proteins

31

Semi-global alignment:

• Input: two sequences, one short and one long
• Goal: is the short one a part of the long one?
• Algorithm: free end gap modification of NW DP
• Applications: sequence assembly, overlap detection

32

Local alignment:

• Input: two sequences that may or may not be related
• Goal: find regions of similarity between the sequences
• Algorithm: Smith-Waterman dynamic programming
• Applications: searching for local similarities in large
sequences, identification of conserved domains/motifs

33

Better Substitution Scores

For DNA/RNA mostly ad-hoc score models used

• positive scores for matches
• negative for mismatches
• sometimes better score for transitions (G-A, T-C) than
for transversions (C-G, C-A, T-G, T-A)
• Negative scores are essential for local alignments
• More precisely, the expected score for random
sequences should be negative

34

Better Substitution Scores

For DNA/RNA mostly ad-hoc score models used

• positive scores for matches
• negative for mismatches
• sometimes better score for transitions (G-A, T-C) than
for transversions (C-G, C-A, T-G, T-A)
• Negative scores are essential for local alignments
• More precisely, the expected score for random
sequences should be negative

35

Reminder Conditional Probabilities

• P(A, B) is the joint probability for A and B happening

• P(A|B) is the conditional probability, i.e. that A
happens given that we know B

• Bayes’ theorem for conditional probabilities

P(A|B) · P(B) = P(A, B) = P(B|A) · P(A)

36

Better Substitution Scores

For amino acids we need something more sophisticated

• We want to distinguish between homologous and
non-homologous sequences

• Compare two models:
Is the alignment more likely due to homologous or
random sequences?

• For simplicity consider gap free alignments Compute
P(x, y|R) (random model) and P(x, y|M) (model for
homologous sequences

• Assume each column of the alignment is independent

37

Log Odds Scores

Let q(a) by the frequency of the letter a in a random sequence, and p(a, b) the probability of a column a
b in an alignment of homologous sequences.

Working with the logarithm of probabilities is more convenient with

s(a, b) = log (p(a, b))/
(q(a)q(b))

Log odds scores such as this are used widely. In general, the take the form

score = log observed freq/
expected freq

38

Dayhoff (PAM) Matrices

Where to get substitution probabilities p(a, b)?

They also need to depend on time, i.e. p(a, b|t)

39

Dayhoff (PAM) Matrices

• Margaret Dayhoff (1978) used hand crafted alignments
of 71 protein families
• Estimated amino acid frequencies ∑a q(a) = 1
• Counted observed substitutions fa,b= f b,a
• Total number of AA in substitutions

• Introduced PAM as time unit
• 1 PAM = time for 1 substitution per 100 amino acids

Probability for substitution a ⇒ b after time t

P(b|a, t) = p(a,b|t)/
q(a)

• Substitution score for sequences with distance kPAM

s(a, b|k PAM) = log P(b|a,k PAM)/
q(b)

• Matrix M with Mab = P(b|a, 1PAM) then
P(b|a, k PAM) = Mab(k) = (Mk)ab

• Allows to compute substitution score for arbitrary PAM
distance!

40

PAM vs distance

PAM is a time unit, not equal to the number of observed
mutations.

We cannot see multiple substitutions of a single site.

%-Difference PAM %-Difference PAM
1 1 45 67
5 5 50 80
10 11 55 94
15 17 60 112
20 23 65 133
25 30 70 159
30 38 75 195
35 47 80 246
40 56 85 328

41

BLOSUM Matrices

• Modern alternative to PAM matrices

• Uses conserved regions of distantly related proteins

• Series matrices for different relatedness

• E.g.: BLOSUM 60 constructed from pairwise
alignments with at most 60% identity

• Closely related sequences: low PAM, high BLOSUM
matrix

• E.g.: PAM250 ⇔ BLOSUM45.

42

What about gap costs?

A T A C − − − C C
− − A C G G G C −

• Example: Three gaps of length 2, 3 and 1
cost = g(2) + g(3) + g(1)

• So far we have used linear gap costs g(k) = k · g1
• Long gaps treated as k indels of length 1

• Unrealistic: A single mutation event often inserts /
deletes several characters

• An alignment with few long gaps is more plausible
than one with many short gaps

• Linear gap costs produce alginments with too many
short gaps and very few long ones

43

What is a good gap penalty function g(k)?

• g(k) should be subadditive g(k + l) ≤ g(k) + g(l)

• Examples:
• g(k) = go + b · k^2 — not subadditive
• g(k) = go + b ln(k) — most realistic
• g(k) = go + gext · (k − 1) — Affine gap cost

• Affine costs are most commonly used.
• Best tradeoff between speed and accuracy
• go is called opening cost gext extension cost

44

Arbitarary Gap Costs

The problem with general gap costs

• If we split an alignment within a gap, scores do not
add up

• Can’t use Needleman-Wunsch algorithm

Variant for arbitrary gap costs: Smith-Waterman-Beyer

Cost: Memory O(n^2) (same as NW)
CPU O(n^3)

45

Affine Gap Costs

• Reasonable approximation to realistic gap costs
• Computationally efficient

46

Gotoh’s Algorithm

To correctly score a gap in the last column we need to know what the previous column looked like.

Introduce 3 dynamic programming tables:

Mij score of best alignment ending in match
Eij score of best alignment ending with gap in y
Fij score of best alignment ending with gap in x

Gotoh’s Recursions siehe Vorlesung 4 letzen Folien

47

Database Searching

• Given a query sequence, find all similar sequences in
a large database

• Perhaps the most common bioinformatics task

• Need to compute semi-global or local alignment of
query and database

• Usually local alignment (more general)

48

BLAST and FASTA Heuristics

• Local alignment with Smith-Waterman in O(n · m) time

• Too slow for searching very large databases

• Need heuristics that run much faster

• Won’t have guarantee to find optimal alignment

• Idea: good alignments should contain stretches that
are identical

• seed-and-extend strategy:
find small exact matches, extend to full local
alignment

49

FASTA

1) Find all exact matches of length k (word length)
between query and database

2) Combine consecutive and overlapping exact
matches

3) Join matches on the same diagonal to form diagonal
runs → short gap free alignments

4) Score diagonal runs (e.g. BLOSUM) keep top N runs
(e.g N = 10)

5) Combine diagonal runs by allowing gaps between
them

6) Compute final local alignment using banded
Smith-Waterman

• Speed gain comes from the first step. Exact matching
in linear (O(n + m)) time

• Recommended values for k, DNA: k = 6, protein k = 2

• Finding remote protein homologies may require k = 1
→ poor performance

G D G K G
G E G R G

No k = 2 match, but very similar sequences
Even with k = 1 some alignments can be missed

50

BLAST

Most used bioinformatics program of all time
• Similar to FASTA but
• Initial matches need not be exact
• instead exceed a threshold score T

• Example: Word length w = 4, T = 17, PAM-120 matrix
Query contains the sub-sequence (word) ACDE

s (ACDE
ACDE ) = 3 + 9 + 5 + 5 = 22 good
s (ACDE
GCDE ) = 1 + 9 + 5 + 5 = 20 good
s (ACDE
AADE ) = 3 − 3 + 5 + 5 = 10 bad
. . .
• We can again use fast exact matching algorithms
by first making a list of all high-scoring words
• Typically ≈50 words per residue in query

1) Decompose query into overlapping words of length w

2) Generate list of all possible w-words that score > T to
some word of the query

3) Scan database for all words in this list

4) Extend these seed hits until score falls some amount
X below the maximum so far

5) Return highest score observed → locally maximal high
scoring pair (HSP)

6) Compute statistical significance of each HSP (below)

The original BLAST only generates gap-free alignments (HSPs)
Typically, w = 3 for proteins, w = 11 for DNA/RNA

51

BLAST2

Classical BLAST spends too much time extending hits

BLAST2: Two-hit-method

• Combine seed hits on the same diagonal with
distance < A

• Require at least two hits before reporting an HSP

• Needs less stringent threshold T

52

Gapped BLAST

Classical BLAST spends too much time extending hits

• If score of HSP exceeds threshold Sg

• Fix match in the middle of highest scoring 11 residue
segment

• Extend using modified Smith-Waterman in both
directions

• Stop when score drops below some threshold

53

BLAST Programs

The BLAST program comes in several variants:

• BLASTN: compares a DNA query sequence to a DNA
sequence database

• BLASTP: compares a protein query sequence to a
protein sequence database,

• TBLASTN: compares a protein query sequence to a
DNA sequence database (6 frames translation)

• BLASTX: compares a DNA query sequence (6 frames
translation) to a protein sequence database

• TBLASTX: compares a DNA query sequence (6
frames translation) to a DNA sequence database (6
frames translation).

54

BLAST Statistics: The E-value

Given an HSP with score S. How often do we expect to achieve a hit with score S or better when scanning a database of random sequences?

E-value plausibility arguments:
• Has to be proportional to database size n
• By symmetry, proportional to query m
• Should fall exponentially with score
(Probability of a hit with score 2S equals probability of two hits with score S in a row)
• → E(S) = Kmne−λS

55

BLAST Statistics: Where to get the parameters K and λ?

• K < 1 expresses that the effective lengths n and m are
slightly smaller

• λ is related to the scoring matrix

Log odds scores:

s(a, b) = 1/ log pab/
λ qaqb

pab = qaqb e ^λs(a,b)

56

Score Distributions: What should we expect for scores of random alignments?

• Align query to random position in database.
How would the scores be distributed?

Sum of many independent values→ normal distribution

• Now choose the best position in the database.
What is the distribution now?

Take n values from a normal distribution, choose the
maximum → Extreme value (Gumbel) distribution

f(x) = 1/ exp (−(x−µ)/− e^−(x−µ)/β)
β β

K and λ in E-value, related to µ and β parameters of the
Gumbel distribution

57

Score Statistics

- Gap free HSPs: λ and k (µ and λ) follow from score
matrix

More complicated models (e.g. alignments with gaps):
Postulate that alignment scores should still be extreme value distributed

1) Produce a large number of random alignments
2) Fit parameters of the Gumbel distribution
3) Derive the values of K and λ for E-values

Allows to compute empirical E-values

58

P-values

Alternative to E-value:

What is the probability of obtaining at least one hit with score S or better?

• Number of hits should be Poisson distributed
P(x = k) = µ^k/ e^−µ
k!
• Poisson distribution gives the probability of k rare
events

Limit of the Binomial distribution for n → ∞ np = µ

• In our case µ = E(S)

P(x > 0) = 1 − P(x = 0) = 1 − e^−E

• P-value and E-value are directly connected

59

E- and P-values
For small E-values (very good hits) E ≈ P

E p
10 0.9999546
5 0.9932621
2 0.8646647
1 0.6321206
0.1 0.0951626
0.05 0.0487706
0.001 0.0009995
0.0001 0.0001000

60

Bit Scores

• E- and P-values depend on scoring matrix and
database size

• Values from BLAST with different parameters are not
comparable

• How to “normalize” scores to make them comparable?

• Bit scores:
S′= (λS − lnK)/ ln 2

• Using bit scores we have:
E = mn · 2^−S

61

Summary Blast scores:

Raw alignment scores
Bit scores
P-values, E-values

• Raw alignment scores
Not comparable between different scoring matrices,
unclear significance

• Bit scores
Normalized and therefore comparable, independent
of database size

• P-values, E-values
Measure statistical significance, take size of database
into account

• True homologs may well have E-values > 1

• Protein homologs become hard to detect below 24%
identity — so-called “twilight zone”

62

Multiple Sequence Alignments

One or two homologous sequences whisper . . . a full multiple alignment shouts out loud.

Hubbard et al. 1996

63

Why Multiple Sequence Alignments?

• Allows to infer phylogenetic relationships

• Identify conserved regions – these are usually
functionally important

• Infer facts about protein 3D structure
• Infer secondary structure
• Distinguish hydrophobic inside from polar outside
• Identify transmembrane helices
• Infer long-range interactions

• Infer which mutations are acceptable without
destroying function

64

Multiple Sequence Alignments

-MSAs identify highly conserved (functionally important)
regions

-MSAs help predict secondary structure and solvent
accessible surface

-The structure of N-acetylglucosamine-binding proteins
is held together by 4 disulphid bridges formed by 8
conserved cysteins.

Given sequences A1 . . . Ar , Ai = ai,1, ai,2, . . . ai,ni
, a multiple sequence alignment (MSA) A∗ is obtained by inserting gaps (“-”) such that all resulting sequences A∗i have equal length n, and no column consists of gaps only.

65

Definition: Multiple Sequence Alignments

Given sequences A1 . . . Ar , Ai = ai,1, ai,2, . . . ai,ni , a multiple sequence alignment (MSA) A∗ is obtained by inserting gaps (“-”) such that all resulting sequences A∗i have equal length n, and no column consists of gaps only.

66

Pairwise vs Multiple Alignment

A multiple sequence alignment induces pairwise alignments, but not vice versa.

− a p p l e −
p a p − − e r
p e p p − e r

These three pairwise alignments are inconsistent
There exists no MSA that includes all three

1)
− a p p l e −
p a p − − e r

2)
− a p p l e −
p e p p − e r

3)
p a p − e r
p e p p e r

67

Sum of Pair Score

The most commonly used score for MSAs is the Sum of Pair Score,

i.e. the sum scores of all induced pairwise alignments

often used as weighted sum of pair score

68

Sum of Pair Score(Shortcomings:)

• Does not consider phylogeny
• Does not treat almost conserved columns right

69

Sum of Pair Score

. . . N . . .
.
.
.
. . . N . . .
. . . C . . .

• The more “N” we see the more confident we are that the “N” is important
• ... and the more we want to penalize a mismatch

• However, SoP score is r (r−1)/ σ(N, N) + rσ(N, C)
2
• ... the larger r the smaller the relative penalty for
adding a “C”

70

Needleman Wunsch for 3 Sequences

• Multiple alignment can be solved by a multi
dimensional Needleman Wunsch dynamic
programming algorithm

• For 3 sequences, 7 possibilities for final column

• Effort in Memory and CPU O(n^3)

71

Multiple Alignment for more then 3 Sequences

• Dynamic programming can be generalized to r
sequences

• Need to consider 2r−1 cases for the last alignment
column

• Need to fill a r dimensional table
• Effort in Memory and CPU O(n^r)

• MSA is exponential in the nmber of seqences
NP hard problem

• rarely used for more than 3 sequences

• Heuristics needed for fast computation of MSAs

72

Aligning Alignments

Suppose we have two good alignments, e.g. for two phylgenetic groups

How do we get an alginment of all sequences?

• Build a new MSA from the individual sequences
Would destroy all the good work on the two
subalignments

• Somehow join the two alignments . . .

• An alignment is just a sequence on a more
complicated alphabet

• We can adapt pairwise alignment algorithms to align
two alignments

• Only thing that changes is the scoring of matches

73

Profile Alignments

• Gaps are inserted into all rwos of the first or second
alignment

• None of the the existing gaps are ever lost
“Once a gap, always a gap”

• Called profile (or profile-profile) alignments

74

Profiles

Profiles compress a sequence alignment into a vector of frequencies

C G − A
C A T A
A A A A
− A − A

A 25% 75% 25% 100%
C 50% 0% 0% 0%
T 0% 0% 25% 0%
G 0% 25% 0% 0%
− 25% 0% 50% 0%

Profiles-profile alignments don’t need the full alignment

75

Progressive Multiple Alignment

Profile alignments can be used to build MSAs in steps

• Start from single sequences

• In each step perform a pairwise alignment, i.e.:
Algin two sequences, a sequence with an
alignment,or two alignments

• Repeat until we’re left with one MSA of all sequences

• Number of sequences / alignments reduced by 1 in
each step

• r − 1 pairwise alignments to produce an MSA or r
sequences

• CPU effort O(r · n^2) for r sequences of length n

76

Progressive Multiple Alignment

In what order should we align sequences?

• Best start with sequences that are most similar

• Leave the difficult alignments for last, when we have
more sequences

• Most common choice:
Build a phylogentic tree, align along the tree

Typical progressive alignment

• Perform pairwise alignments of all-against-all
sequences

• Build a guide tree (e.g. neighbor joining) from
pairwise distances

• Perform a progressive alignment following the guide
tree

• Effort O(r^2· n^2+ r · n^2)

77

Limitations of Progressive Multiple Alignment

Progressive is a heuristic that greatly reduces the effort for computing an MSA

• The progressive alignment is usually not optimal
• The is no overall score that is optimized
• Main problem: “Once a gap always a gap”

but gaps inserted early may be wrong

78

Example: Progressive Multiple Alignment:

Example:
Sequences x = GAAGTT, y = GACTT, z = GTACTT
x : G A A G T T
y : G A C − T T

x : G A A G T T
y : G A − C T T

x : G A A G T T
y : G A C − T T
z : G T A C T T

x : G A A G T T
y : G A − C T T
z : G T A C T T

Correct alignment of x and y only apparent when considering z

79

Beyond simple progressive multiple
Alignment

Progressive alignment is fast, but can yields poor results. How to improve it?

- Iterative refinement
Start with a progressive alignment, then try to make it
better

- Improved progressive alignment schemes
Try to include information on all sequences in every
step

- Consistency based approaches

80

MUSCLE – OVERVIEW

• Basic Idea: Improve progressive alignments
through iterative refinement

• 3 stages of the algorithm

• At the completion of each, a multiple alignment
is available and the algorithm can be terminated

• Significant improvement in accuracy and speed

81

MUSCLE – THE ALGORITHM

Stage 1: Draft Progressive – Builds a progressive
alignment

• Similarity of each pair of sequences is computed
using
• K-mer counting
• Constructing a global alignment and
determining fractionalidentity of the
sequences

• A tree is constructed and a root is identified

• A progressive alignment is built by following the
branching order of the tree, yielding a multiple
alignment

82

MUSCLE – THE ALGORITHM

Stage 2: Improved Progressive – Improves the tree

• Similarity of each pair of sequences is computed
using fractional identity from the mutual alignment

• A tree is constructed by applying a clustering method
to the distance matrix

• The trees are compared; a set of nodes for which the
branching order has changed is identified

• A new alignment is built, the existing one is retained if
the order is unchanged

83

MUSCLE – THE ALGORITHM

Stage 3: Refinement – Iterative Refinement

• An edge is deleted from a tree, dividing the
sequences into two disjoint subsets

• The profile (MA) of each subset is extracted

• The profiles are re-aligned to each other

• The score is computed, if the score has improved, the
alignment is retained, otherwise it is discarded

• Algorithm terminates at convergence

84

T-COFFEE

Consistency based progressive alignment

85

PROBCONS what does it stand for?

Probabilistic Consistency based

86

PROBCONS

• Alignment generation can be directly modeled as a
first order Markov process involving state emissions
and transitions

• Uses maximum expected accuracy alignment method

• Probabilistic consistency used as a scoring function
• Model parameters obtained using unsupervised
maximum likelihood methods

• Incorporate multiple sequence information in scoring
pairwise alignments

87

HIDDEN MARKOV MODEL

• Deletion penalties on Match => Gap transitions

• Extension penalties on Gap => Gap transitions

• Match/Mismatch penalties on Match emissions