Sequence Analysis Flashcards Preview

Flashcards in Sequence Analysis Deck (12):
1

Alignment Based Methods

• Goal: find best alignment.
• Measure/Score: introduce as few gaps and substitutions as possible.
• Question: How to achieve this?
• Approach: Pairwise vs multiple sequence alignment.

2

Edit distance invented by Levenshtein (1965)

Each single change (adding/deleting a letter, changing one, ...) = distance +1
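
The counting rule on this card can be sketched with the usual dynamic-programming recurrence. This is a minimal sketch; the function name `levenshtein` and the row-by-row layout are choices made for this example, not something the card specifies.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows are kept in memory at a time, which is enough when only the distance (and not the alignment itself) is needed.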

3

Damerau

• flip operations are one change
• brid (Old English) --> bird (modern English) --> 1 operation
• a mistyping such as "ebya" is more easily recognized by web search engines
• used as well in biology, spell checking, ...
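
The flip rule can be added to the edit-distance recurrence as one extra case. The sketch below implements the restricted variant (often called optimal string alignment), in which each substring is edited at most once; the name `osa_distance` is an assumption of this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where swapping two adjacent letters counts as one change."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            # Flip: the last two letters of both prefixes are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("brid", "bird"), osa_distance("ebya", "ebay"))
```

Under plain Levenshtein counting both examples would cost 2, because a flip has to be expressed as two substitutions.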

4

Global Alignment "Needleman-Wunsch"

• gaps can get different scoring points than edits
• exchange matrix for different letter changes
• find global alignment --> Needleman-Wunsch
• opening and extending a gap can be penalized differently (affine gap penalties) --> Needleman-Wunsch-Gotoh
• find best local alignment --> Smith-Waterman
• the exchange matrix has smaller penalties for more similar letters
• example: d/t are both dental sounds; leucine and isoleucine have similar biophysical properties
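
A global alignment along these lines can be sketched as follows. Assumptions of this example: a flat match/mismatch score stands in for a real exchange matrix, the gap penalty is linear (so this is plain Needleman-Wunsch, not the Gotoh variant), and the score values are illustrative only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align two letters
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = (match if a[i - 1] == b[j - 1] else mismatch) if i > 0 and j > 0 else None
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

To use an exchange matrix, the line computing `s` would look the letter pair up in a table instead of testing for equality.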

5

Smith-Waterman algorithm

• finding local (partial) optimal alignments
• align shorter with larger sequences
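• changing from a negative view (penalties, distance) to a positive view (similarity score); cell scores never drop below 0
• finding the maximal score in the matrix
• backtracking in the matrix from the maximal score to the starting point (a cell with score 0)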
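
The points above can be sketched in a few lines. Assumptions of this example: flat match/mismatch scores instead of an exchange matrix, a linear gap penalty, illustrative score values, and only one best local alignment is reported.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere (local alignment).
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Backtrack from the maximal score until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```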

6

Differences FASTA vs BLAST

FASTA: not so time-consuming; the first fast algorithm
• FASTA and BLAST start with small good alignments, try to extend them, and finally optimize the best hits
• FASTA is derived from the dot-plot
1) Identify common k-words (nucleotides: 6 letters, amino acids: 2 letters)
2) Score dot-plot diagonals
3) Rescore, possibly with an exchange matrix
4) Join regions across gaps, penalizing for gaps
5) Dynamic programming to finalize alignments
➔ BLAST follows a different principle: it first searches for perfect matches and then for other similar stretches of varying length…
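
Steps 1 and 2 of the FASTA list can be illustrated with a toy k-word lookup. This is a simplified sketch, not the FASTA program's actual implementation: the function name `kword_diagonals` is hypothetical, and each diagonal is scored simply by counting shared k-words.

```python
from collections import Counter, defaultdict


def kword_diagonals(query: str, subject: str, k: int) -> Counter:
    """Count shared k-words per dot-plot diagonal (subject pos - query pos)."""
    # Step 1: index every k-word of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    # Step 2: every shared k-word is a dot; dots on one diagonal share j - i.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


hits = kword_diagonals("ACGTACGT", "TTACGTACGTTT", k=6)
print(hits.most_common(1))
```

A diagonal with many hits marks a region where the two sequences match without gaps; the later steps rescore and join such regions.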

• Basic Local Alignment Search Tool
• compare single sequence to entire database of sequences
• compare two sequences
• much faster than FASTA
• BLAST statistics are based on Poisson and extreme value distributions
• heuristic approach (no brute force over all possible alignments)
• word size: 3 AA or 11 nucleotides by default, similarity
• gaps are not treated well
• Poisson distribution of score values --> P-value
• E-value = P-value * Number of entries in the database
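
The relation on the last bullet is simple arithmetic. The sketch below states it as code and adds the usual Poisson link between E-value and P-value (the chance of at least one hit when E hits are expected); the function names are assumptions of this example, and real BLAST computes its E-values from search-space size and scoring parameters rather than from this shortcut.

```python
import math


def evalue_from_pvalue(p_value: float, n_entries: int) -> float:
    """E-value as on the card: P-value times number of database entries."""
    return p_value * n_entries


def pvalue_from_evalue(e_value: float) -> float:
    """Poisson: probability of at least one chance hit when E are expected."""
    return 1.0 - math.exp(-e_value)


e = evalue_from_pvalue(1e-9, 1_000_000)
print(e, pvalue_from_evalue(e))
```

For small values the two numbers are almost identical, which is why they are often used interchangeably for strong hits.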

7

Alignment Significance

• generate random scores
• compute mean and sd from random scores
• compute the deviation from the real to the random
• convert the Z-score to an E-value (probability of observing such a Z-score by chance)
• E-value: 10e-6 significant
• E-value: 10e-3 might be ...
• E-value: > 10e-3 ignore ...
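
The first three steps can be sketched by shuffling one sequence. Assumptions of this example: random scores come from shuffling the second sequence, the score is a simple local alignment score with illustrative match/mismatch/gap values, and the names `local_score` and `z_score` are hypothetical.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (score only, two rows in memory)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            s = match if ca == cb else mismatch
            curr.append(max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def z_score(a, b, n_shuffles=200, seed=0):
    """Deviation of the real score from shuffled scores, in standard deviations."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    real = local_score(a, b)
    letters = list(b)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys the order
        random_scores.append(local_score(a, "".join(letters)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("random scores do not vary; Z-score undefined")
    return (real - mean) / sd


seq = "ACGTTGCAAGCTAGCTAGGCTAACGTAGCTAGCATCGATCG"
print(z_score(seq, seq))
```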

8

FASTA Variants

Protein:
• protein-protein FASTA (fasta).
• protein-protein Smith-Waterman (ssearch).
• global protein-protein Needleman-Wunsch (ggsearch)
Nucleotide:
• nucleotide-nucleotide (DNA/RNA fasta)
• ordered nucleotides vs nucleotide (fastm)
• unordered nucleotides vs nucleotide (fasts)

9

Multiple Sequence Alignment

MSA is for comparing homologous sequences
• Homologs: genes related to each other by descent from a common ancestral DNA sequence
- Orthologs: genes in different species that evolved from a common ancestral gene by speciation; they normally retain the same function
- Paralogs: genes related by duplication within a genome; they might acquire new functions

An MSA aligns three or more biological sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred and the evolutionary relationships between the sequences studied.
By contrast, pairwise sequence alignment tools are used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences.

10

Progressive Alignments

• combine pairwise alignments, starting with the most similar pair
• initial guide tree, then add more sequences
• not guaranteed to be globally optimal
• errors at the beginning might propagate to the end
• examples: ClustalW, MAFFT (fast but might give more errors), T-Coffee (slow but very accurate)
• state of the art: Clustal Omega
• tradeoff between speed and accuracy ...
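
The "most similar first" ordering can be illustrated with a toy guide tree. This sketch only builds the joining order and does not align anything. Its assumptions: `difflib.SequenceMatcher` is a stand-in for a real pairwise alignment score, clusters are joined by average distance, and at least one sequence is given. It is not how ClustalW or the other tools listed compute their trees.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a: str, b: str) -> float:
    """Stand-in pairwise distance: 1 - similarity ratio (0 means identical)."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_tree(seqs):
    """Nested tuples of sequence indices; the most similar are joined first."""
    dist = {(i, j): distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}
    # Each cluster is (subtree, list of member indices).
    clusters = [(i, [i]) for i in range(len(seqs))]

    def average(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters (first one wins on ties).
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average(clusters[p[0]][1], clusters[p[1]][1]))
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]


print(guide_tree(["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "CCCCCCCG"]))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the closest pairs first; an early mistake stays in every later step, which is the error propagation mentioned on the card.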

11

Iterative Alignment Methods

• similar to progressive methods
• but might realign initial alignments
• examples: MUSCLE, Dialign

12

Clustal Omega

Solves the problem of being both fast and accurate.

Clustal Omega is a multiple sequence alignment program.
It produces biologically meaningful multiple sequence alignments of divergent sequences. Evolutionary relationships can be explored by viewing cladograms or phylograms.