Lectures Flashcards

Question

Smith Waterman solution:

Answer 1

• Idea: • We’re not interested in poor alignments (score < 0) • Compute only 1 local alignment for every endpoint xi and yj • Same recursion as Needleman-Wunsch, except: 1 Initialize with 0 as in free end gap alignment 2 Replace negative scores with 0, marks start of a new alignment

Answer 2

Let s(a, b) be the similarity score for aligning characters a, b, g the penalty for a gap (indel). Let Si,j be the score of the best local alignment ending at position i in x and j in y. Same effort as Needleman Wunsch O(n · m) in memory and time!

Answer 3

* Find the highest entry in the table * Begin normal backtracing from highest entry * Stop as soon as you encouter a 0

Answer 4

Smith Waterman Deblocking

Answer 5

• remove cells used in the first alignment by setting them 0 • “repair” table by recomputing cells that might have changed • backtrace in new table • repeat for as many alignments as desired

Answer 6

* Input: two potentially equivalent sequences * Goal: identify conserved regions and differences * Algorithm: Needleman-Wunsch dynamic programming * Applications: comparing two related genes/proteins

Answer 7

* Input: two sequences, one short and one long * Goal: is the short one a part of the long one? * Algorithm: free end gap modification of NW DP * Applications: sequence assembly, overlap detection

Answer 8

• Input: two sequences that may or may not be related • Goal: find regions of similarity between the sequences • Algorithm: Smith-Waterman dynamic programming • Applications: searching for local similarities in large sequences, identification of conserved domains/motifs

Answer 9

For DNA/RNA mostly ad-hoc score models used • positive scores for matches • negative for mismatches • sometimes better score for transitions (G-A, T-C) than for transversions (C-G, C-A, T-G, T-A) • Negative scores are essential for local alignments • More precisely, the expected score for random sequences should be negative

Answer 10

For DNA/RNA mostly ad-hoc score models used • positive scores for matches • negative for mismatches • sometimes better score for transitions (G-A, T-C) than for transversions (C-G, C-A, T-G, T-A) • Negative scores are essential for local alignments • More precisely, the expected score for random sequences should be negative

Answer 11

• P(A, B) is the joint probability for A and B happening • P(A|B) is the conditional probability, i.e. that A happens given that we know B • Bayes’ theorem for conditional probabilities P(A|B) · P(B) = P(A, B) = P(B|A) · P(A)

Answer 12

For amino acids we need something more sophisticated • We want to distinguish between homologous and non-homologous sequences • Compare two models: Is the alignment more likely due to homologous or random sequences? • For simplicity consider gap free alignments Compute P(x, y|R) (random model) and P(x, y|M) (model for homologous sequences • Assume each column of the alignment is independent

Answer 13

``` Let q(a) by the frequency of the letter a in a random sequence, and p(a, b) the probability of a column a b in an alignment of homologous sequences. ``` Working with the logarithm of probabilities is more convenient with s(a, b) = log (p(a, b))/ (q(a)q(b)) Log odds scores such as this are used widely. In general, the take the form score = log observed freq/ expected freq

Answer 14

They also need to depend on time, i.e. p(a, b|t)

Answer 15

• Margaret Dayhoff (1978) used hand crafted alignments of 71 protein families • Estimated amino acid frequencies ∑a q(a) = 1 • Counted observed substitutions fa,b= f b,a • Total number of AA in substitutions * Introduced PAM as time unit * 1 PAM = time for 1 substitution per 100 amino acids Probability for substitution a ⇒ b after time t P(b|a, t) = p(a,b|t)/ q(a) • Substitution score for sequences with distance kPAM s(a, b|k PAM) = log P(b|a,k PAM)/ q(b) • Matrix M with Mab = P(b|a, 1PAM) then P(b|a, k PAM) = Mab(k) = (Mk)ab • Allows to compute substitution score for arbitrary PAM distance!

Answer 16

PAM is a time unit, not equal to the number of observed mutations. We cannot see multiple substitutions of a single site. %-Difference PAM %-Difference PAM 1 1 45 67 5 5 50 80 10 11 55 94 15 17 60 112 20 23 65 133 25 30 70 159 30 38 75 195 35 47 80 246 40 56 85 328

Answer 17

* Modern alternative to PAM matrices * Uses conserved regions of distantly related proteins * Series matrices for different relatedness • E.g.: BLOSUM 60 constructed from pairwise alignments with at most 60% identity • Closely related sequences: low PAM, high BLOSUM matrix • E.g.: PAM250 ⇔ BLOSUM45.

Answer 18

A T A C − − − C C − − A C G G G C − • Example: Three gaps of length 2, 3 and 1 cost = g(2) + g(3) + g(1) * So far we have used linear gap costs g(k) = k · g1 * Long gaps treated as k indels of length 1 • Unrealistic: A single mutation event often inserts / deletes several characters • An alignment with few long gaps is more plausible than one with many short gaps • Linear gap costs produce alginments with too many short gaps and very few long ones

Answer 19

• g(k) should be subadditive g(k + l) ≤ g(k) + g(l) * Examples: * g(k) = go + b · k^2 — not subadditive * g(k) = go + b ln(k) — most realistic * g(k) = go + gext · (k − 1) — Affine gap cost * Affine costs are most commonly used. * Best tradeoff between speed and accuracy * go is called opening cost gext extension cost

Answer 20

The problem with general gap costs • If we split an alignment within a gap, scores do not add up • Can’t use Needleman-Wunsch algorithm Variant for arbitrary gap costs: Smith-Waterman-Beyer ``` Cost: Memory O(n^2) (same as NW) CPU O(n^3) ```

Answer 21

* Reasonable approximation to realistic gap costs | * Computationally efficient

Answer 22

To correctly score a gap in the last column we need to know what the previous column looked like. Introduce 3 dynamic programming tables: Mij score of best alignment ending in match Eij score of best alignment ending with gap in y Fij score of best alignment ending with gap in x Gotoh’s Recursions siehe Vorlesung 4 letzen Folien

Answer 23

• Given a query sequence, find all similar sequences in a large database • Perhaps the most common bioinformatics task • Need to compute semi-global or local alignment of query and database • Usually local alignment (more general)

Answer 24

* Local alignment with Smith-Waterman in O(n · m) time * Too slow for searching very large databases * Need heuristics that run much faster * Won’t have guarantee to find optimal alignment • Idea: good alignments should contain stretches that are identical • seed-and-extend strategy: find small exact matches, extend to full local alignment

Answer 25

1) Find all exact matches of length k (word length) between query and database 2) Combine consecutive and overlapping exact matches 3) Join matches on the same diagonal to form diagonal runs → short gap free alignments 4) Score diagonal runs (e.g. BLOSUM) keep top N runs (e. g N = 10) 5) Combine diagonal runs by allowing gaps between them 6) Compute final local alignment using banded Smith-Waterman ``` • Speed gain comes from the first step. Exact matching in linear (O(n + m)) time ``` • Recommended values for k, DNA: k = 6, protein k = 2 • Finding remote protein homologies may require k = 1 → poor performance G D G K G G E G R G No k = 2 match, but very similar sequences Even with k = 1 some alignments can be missed

Answer 26

Most used bioinformatics program of all time • Similar to FASTA but • Initial matches need not be exact • instead exceed a threshold score T • Example: Word length w = 4, T = 17, PAM-120 matrix Query contains the sub-sequence (word) ACDE s (ACDE ACDE ) = 3 + 9 + 5 + 5 = 22 good s (ACDE GCDE ) = 1 + 9 + 5 + 5 = 20 good s (ACDE AADE ) = 3 − 3 + 5 + 5 = 10 bad . . . • We can again use fast exact matching algorithms by first making a list of all high-scoring words • Typically ≈50 words per residue in query 1) Decompose query into overlapping words of length w 2) Generate list of all possible w-words that score > T to some word of the query 3) Scan database for all words in this list 4) Extend these seed hits until score falls some amount X below the maximum so far ``` 5) Return highest score observed → locally maximal high scoring pair (HSP) ``` 6) Compute statistical significance of each HSP (below) The original BLAST only generates gap-free alignments (HSPs) Typically, w = 3 for proteins, w = 11 for DNA/RNA

Answer 27

Classical BLAST spends too much time extending hits BLAST2: Two-hit-method • Combine seed hits on the same diagonal with distance < A * Require at least two hits before reporting an HSP * Needs less stringent threshold T

Answer 28

Classical BLAST spends too much time extending hits • If score of HSP exceeds threshold Sg • Fix match in the middle of highest scoring 11 residue segment • Extend using modified Smith-Waterman in both directions • Stop when score drops below some threshold

Answer 29

The BLAST program comes in several variants: • BLASTN: compares a DNA query sequence to a DNA sequence database • BLASTP: compares a protein query sequence to a protein sequence database, • TBLASTN: compares a protein query sequence to a DNA sequence database (6 frames translation) • BLASTX: compares a DNA query sequence (6 frames translation) to a protein sequence database • TBLASTX: compares a DNA query sequence (6 frames translation) to a DNA sequence database (6 frames translation).

Answer 30

Given an HSP with score S. How often do we expect to achieve a hit with score S or better when scanning a database of random sequences? E-value plausibility arguments: • Has to be proportional to database size n • By symmetry, proportional to query m • Should fall exponentially with score (Probability of a hit with score 2S equals probability of two hits with score S in a row) • → E(S) = Kmne−λS

Answer 31

• K < 1 expresses that the effective lengths n and m are slightly smaller • λ is related to the scoring matrix Log odds scores: s(a, b) = 1/ log pab/ λ qaqb pab = qaqb e ^λs(a,b)

Answer 32

• Align query to random position in database. How would the scores be distributed? Sum of many independent values→ normal distribution • Now choose the best position in the database. What is the distribution now? Take n values from a normal distribution, choose the maximum → Extreme value (Gumbel) distribution f(x) = 1/ exp (−(x−µ)/− e^−(x−µ)/β) β β K and λ in E-value, related to µ and β parameters of the Gumbel distribution

Answer 33

- Gap free HSPs: λ and k (µ and λ) follow from score matrix More complicated models (e.g. alignments with gaps): Postulate that alignment scores should still be extreme value distributed 1) Produce a large number of random alignments 2) Fit parameters of the Gumbel distribution 3) Derive the values of K and λ for E-values Allows to compute empirical E-values

Answer 34

Alternative to E-value: What is the probability of obtaining at least one hit with score S or better? • Number of hits should be Poisson distributed P(x = k) = µ^k/ e^−µ k! • Poisson distribution gives the probability of k rare events Limit of the Binomial distribution for n → ∞ np = µ • In our case µ = E(S) P(x > 0) = 1 − P(x = 0) = 1 − e^−E • P-value and E-value are directly connected

Answer 35

E p 10 0.9999546 5 0.9932621 2 0.8646647 1 0.6321206 0.1 0.0951626 0.05 0.0487706 0.001 0.0009995 0.0001 0.0001000

Answer 36

• E- and P-values depend on scoring matrix and database size • Values from BLAST with different parameters are not comparable • How to “normalize” scores to make them comparable? • Bit scores: S′= (λS − lnK)/ ln 2 • Using bit scores we have: E = mn · 2^−S

Answer 37

• Raw alignment scores Not comparable between different scoring matrices, unclear significance • Bit scores Normalized and therefore comparable, independent of database size • P-values, E-values Measure statistical significance, take size of database into account • True homologs may well have E-values > 1 • Protein homologs become hard to detect below 24% identity — so-called “twilight zone”

Answer 38

One or two homologous sequences whisper . . . a full multiple alignment shouts out loud. Hubbard et al. 1996

Answer 39

• Allows to infer phylogenetic relationships • Identify conserved regions – these are usually functionally important * Infer facts about protein 3D structure * Infer secondary structure * Distinguish hydrophobic inside from polar outside * Identify transmembrane helices * Infer long-range interactions • Infer which mutations are acceptable without destroying function

Answer 40

-MSAs identify highly conserved (functionally important) regions -MSAs help predict secondary structure and solvent accessible surface -The structure of N-acetylglucosamine-binding proteins is held together by 4 disulphid bridges formed by 8 conserved cysteins. Given sequences A1 . . . Ar , Ai = ai,1, ai,2, . . . ai,ni , a multiple sequence alignment (MSA) A∗ is obtained by inserting gaps (“-”) such that all resulting sequences A∗i have equal length n, and no column consists of gaps only.

Answer 41

Given sequences A1 . . . Ar , Ai = ai,1, ai,2, . . . ai,ni , a multiple sequence alignment (MSA) A∗ is obtained by inserting gaps (“-”) such that all resulting sequences A∗i have equal length n, and no column consists of gaps only.

Answer 42

A multiple sequence alignment induces pairwise alignments, but not vice versa. − a p p l e − p a p − − e r p e p p − e r These three pairwise alignments are inconsistent There exists no MSA that includes all three 1) − a p p l e − p a p − − e r 2) − a p p l e − p e p p − e r 3) p a p − e r p e p p e r

Answer 43

The most commonly used score for MSAs is the Sum of Pair Score, i.e. the sum scores of all induced pairwise alignments often used as weighted sum of pair score

Answer 44

* Does not consider phylogeny | * Does not treat almost conserved columns right

Answer 45

``` . . . N . . . . . . . . . N . . . . . . C . . . ``` * The more “N” we see the more confident we are that the “N” is important * ... and the more we want to penalize a mismatch • However, SoP score is r (r−1)/ σ(N, N) + rσ(N, C) 2 • ... the larger r the smaller the relative penalty for adding a “C”

Answer 46

• Multiple alignment can be solved by a multi dimensional Needleman Wunsch dynamic programming algorithm * For 3 sequences, 7 possibilities for final column * Effort in Memory and CPU O(n^3)

Answer 47

• Dynamic programming can be generalized to r sequences • Need to consider 2r−1 cases for the last alignment column * Need to fill a r dimensional table * Effort in Memory and CPU O(n^r) • MSA is exponential in the nmber of seqences NP hard problem * rarely used for more than 3 sequences * Heuristics needed for fast computation of MSAs

Answer 48

• Build a new MSA from the individual sequences Would destroy all the good work on the two subalignments • Somehow join the two alignments . . . • An alignment is just a sequence on a more complicated alphabet • We can adapt pairwise alignment algorithms to align two alignments • Only thing that changes is the scoring of matches

Answer 49

• Gaps are inserted into all rwos of the first or second alignment • None of the the existing gaps are ever lost “Once a gap, always a gap” • Called profile (or profile-profile) alignments

Answer 50

Profiles compress a sequence alignment into a vector of frequencies C G − A C A T A A A A A − A − A ``` A 25% 75% 25% 100% C 50% 0% 0% 0% T 0% 0% 25% 0% G 0% 25% 0% 0% − 25% 0% 50% 0% ``` Profiles-profile alignments don’t need the full alignment

Answer 51

Profile alignments can be used to build MSAs in steps • Start from single sequences • In each step perform a pairwise alignment, i.e.: Algin two sequences, a sequence with an alignment,or two alignments • Repeat until we’re left with one MSA of all sequences • Number of sequences / alignments reduced by 1 in each step • r − 1 pairwise alignments to produce an MSA or r sequences • CPU effort O(r · n^2) for r sequences of length n

Answer 52

• Best start with sequences that are most similar • Leave the difficult alignments for last, when we have more sequences • Most common choice: Build a phylogentic tree, align along the tree Typical progressive alignment • Perform pairwise alignments of all-against-all sequences • Build a guide tree (e.g. neighbor joining) from pairwise distances • Perform a progressive alignment following the guide tree • Effort O(r^2· n^2+ r · n^2)

Answer 53

Progressive is a heuristic that greatly reduces the effort for computing an MSA * The progressive alignment is usually not optimal * The is no overall score that is optimized * Main problem: “Once a gap always a gap” but gaps inserted early may be wrong

Answer 54

Example: Sequences x = GAAGTT, y = GACTT, z = GTACTT x : G A A G T T y : G A C − T T x : G A A G T T y : G A − C T T x : G A A G T T y : G A C − T T z : G T A C T T x : G A A G T T y : G A − C T T z : G T A C T T Correct alignment of x and y only apparent when considering z

Answer 55

- Iterative refinement Start with a progressive alignment, then try to make it better - Improved progressive alignment schemes Try to include information on all sequences in every step - Consistency based approaches

Answer 56

• Basic Idea: Improve progressive alignments through iterative refinement • 3 stages of the algorithm • At the completion of each, a multiple alignment is available and the algorithm can be terminated • Significant improvement in accuracy and speed

Answer 57

Stage 1: Draft Progressive – Builds a progressive alignment • Similarity of each pair of sequences is computed using • K-mer counting • Constructing a global alignment and determining fractionalidentity of the sequences • A tree is constructed and a root is identified • A progressive alignment is built by following the branching order of the tree, yielding a multiple alignment

Answer 58

Stage 2: Improved Progressive – Improves the tree • Similarity of each pair of sequences is computed using fractional identity from the mutual alignment • A tree is constructed by applying a clustering method to the distance matrix • The trees are compared; a set of nodes for which the branching order has changed is identified • A new alignment is built, the existing one is retained if the order is unchanged

Answer 59

Stage 3: Refinement – Iterative Refinement • An edge is deleted from a tree, dividing the sequences into two disjoint subsets * The profile (MA) of each subset is extracted * The profiles are re-aligned to each other • The score is computed, if the score has improved, the alignment is retained, otherwise it is discarded • Algorithm terminates at convergence

Answer 60

Consistency based progressive alignment

Answer 61

Probabilistic Consistency based

Answer 62

• Alignment generation can be directly modeled as a first order Markov process involving state emissions and transitions • Uses maximum expected accuracy alignment method • Probabilistic consistency used as a scoring function • Model parameters obtained using unsupervised maximum likelihood methods • Incorporate multiple sequence information in scoring pairwise alignments

Answer 63

* Deletion penalties on Match => Gap transitions * Extension penalties on Gap => Gap transitions * Match/Mismatch penalties on Match emissions

Lectures Flashcards

(87 cards)