8. BLAST Flashcards
(52 cards)
What pairwise alignment approaches are there?
dynamic programming (local, global)
heuristics
Dynamic programming approaches for pairwise alignment:
What do they guarantee?
Are they feasible?
guarantee the optimal alignment for two sequences
Work for small applications, but not feasible for big datasets - too slow or memory intensive for many applications
However, part of many approaches/programs!
What are two heuristics for alignment that are based on dynamic programming?
BLAST, FASTA.
BLAST is a heuristic.
What are two dynamic programming algorithms which BLAST takes some inspiration from?
Can they be used for genomescale alignments or big datasets?
1st DP for sequence analysis: NeedlemanWunsch (global alignments)
Then: SmithWaterman (local alignments).
too computationally intensive for big datasets
Heuristics make alignments practical for big databases/ datasets eg:
- NCBI
- from NextGeneration Sequencing (NGS).
Heuristic approaches for pairwise alignment (vs DP):
Basic idea to speed up?
exclude unpromising regions to speed up the search
Heuristic approaches for pairwise alignment (vs DP):
Name 4 approaches to allow speeding up?
Name 3 Applications?
Approaches:
- suffix trees
- dotplots
- pre-processing
- using substitution matrices, etc
Applications:
- lots of short NGS reads & reference: read mapping
- very long sequences: genome alignment
- lots of (gene) sequences: database search
MUMmer: genome alignment
stands for?
Maximal Unique Matches between strains: series of matches in the same order, translocations
MUMmer: genome alignment
goal?
data structure used?
goal:
globally compare two (similar) genomes
data structure: suffix tree
MUMmer: genome alignment
use/aim?
approach?
speed?
used for comparing different genomes assemblies to one another, which allows scientists to determine how a genome has changed
approach:
* construct a suffix tree of a reference sequence
* stream a query sequence against the suffix tree
* identify short exact matches between the two sequences
- use these directly
- extend to longer inexact alignments
* very fast!
Sequence alignment starting with:
one sequence vs many sequences
we want to?
database search - find database sequences that are similar (homologous) to the query sequence:
- identify similar (regions between) sequences
- collect related sequences for comparative analysis
- (taxonomically) identify a sequence (fragment)
- …
What are the results of a database search?
- alignment(s) between a query sequence and one or more database sequences
- ranked by statistical significance (E-value)
BLAST
basic steps that you do before BLAST search?
You:
* select query sequence
* select database to be searched
* pre-process / format database
* decide which BLAST program to use
* select parameters for search
* execute BLAST search
* interpret biological significance
BLAST:
basic steps that BLAST does?
- seeding
- extension to a good longer alignment
- evaluation of statistical significance
- presentation of ranked alignments
What BLAST programs are there?
QUERY / DATABASE / PROGRAM nucleotide / nucleotide / blastn nucleotide (translated) / nucleotide (translated) / tblastx nucleotide (translated) / peptide / blastx peptide / peptide / blastp peptide / nucleotide (translated) / tblastn
What BLAST variants are there?
- MEGABLAST: find highly similar DNA sequences
- PSI-BLAST: find distant members of a protein family or build a custom position-specific score matrix
- and many more
How BLAST works: seeding
What are used as seeds and why?
compile short words from query that provide high scores, look for identical matches to these words in the database
the locations of all high scoring word neighbors (word hits) in the db sequences are identified and used as alignment seeds
Why?
- inexact matching is slow, exact matching is fast
- biologically significant matches contain short regions of identities or high-scoring matches
How BLAST works: seeding
What counts as a word hit?
- Query is broken down into short overlapping words - default: 11 nt or 3 (5) aa
- up to 50 high scoring word neighbors with minimum score T are determined for each word in the query
How BLAST works: seeding
eg which scoring matrix?
BLOSUM62
What is the most time-consuming step of BLAST?
extension
BLAST:
What is the name of the algorithm for extension?
two-hit algorithm (1997)
BLAST:
What can be discarded before extension and why?
Which segment pairs are extended?
- most database sequences can be discarded before the extension step (if they don’t have any segment pairs)
- segment pairs on the same diagonal and within c cells of each other are extended in both directions
BLAST:
when does extension stop?
extension proceeds until drop-off from highest score is too great:
- X: value of how much the score is allowed to drop off since the last maximum
- if X is reached, the alignment is trimmed back to the point with the maximum score
BLAST: What happens with extended regions once extension has stopped?
extended regions are joined to form a gapped alignment, or high-scoring segment pair (HSP)
What are HSPs and how are they evaluated?
High-scoring segment pair (HSP): extended regions joined to form a gapped alignment
each HSP score is evaluated
- the statistical significance of HSPs are determined
- are two sequence fragments significantly more similar than expected by chance?
- HSPs are ranked by their E(xpect)-value and reported