Sequence allignment Flashcards
(31 cards)
What is the goal of sequence alignment?
The goal of sequence alignment is to identify similarities between sequences, whether it be a global similarity or local similarity. This can be used to infer the homology, function, and structural traits of a sequence. This can be done with DNA, RNA, and proteins.
What is the difference between global and local alignment?
Global alignment compares the entire length of the sequences, while local alignment only compares regions of a sequence.
When is a global alignment preferred, and when is a local alignment preferred?
A global alignment compares the whole sequence. Therefore global alignment is best used on sequences with similar lengths and an overall higher similarity.
Local alignments are better suited for sequences that are evolutionarily distantly related and may therefore have a very low overall similarity. The local alignment is better suited to find conserved parts of the sequences since the low overall similarity would mask any conserved part when trying to align globally.
Which algorithms are used for global and local alignment?
For global alignment, the most common used algorithm is the needleman-wunsch algorithm. For local aligments, the most common is the smith-waterman algorithm. BLAST is also a commonly used algorithm for local alignment.
What is the log-odds ratio?
The log-odds ratio is a statistical method that calculates the probability that a given pair of residues will align based on the frequencies of those residues in a set of aligned sequences.
What is a gap in a sequence alignment?
A gap represents indels (insertions and deletions) in a sequence and is often negatively scored when trying to find the optimal alignment.
What is the difference between affine and linear gap penalty?
A linear gap penalty is a single negative score given to gaps in an alignment. This method does not differentiate between multiple and single residue gaps. This is preferred when single gaps are expected at a higher frequency than longer gaps.
The affine gap penalty introduces a gap opening and gap extension score. The extension score is often lower than the opening. This method treats multiple residue gaps worse than a single residue gap. However, the extension of a gap is treated more lightly than opening a new gap. This is preferred when longer gaps are expected to as frequent as single gaps.
What is the difference between the DNA/RNA gap penalty and the protein gap penalty?
For DNA sequences, the gap penalty is typically based on the number of nucleotides being inserted or deleted. For protein sequences, the gap penalty is typically based on the number of amino acids inserted or deleted and the sequence context. Some amino acids are more likely than others to appear in gaps.
What is dynamic programming in general?
Dynamic programming solves optimization problems by breaking them down into smaller subproblems and building up the solution from there. It is based on the idea of recursion, in which a problem is solved by solving smaller subproblems and then combining the solutions to the subproblems to solve the larger problem.
What are heuristic models in general?
Heuristic methods are techniques that provide a good solution to a problem in a reasonable amount of time but may not guarantee the optimal solution. Heuristic methods are often based on empirical data or experience. They are designed to find a solution that is close to the optimal solution in a relatively short amount of time. Heuristic models are often faster than dynamic programming but require some additional assumptions and will sometimes miss the best match for some sequence pairs.
What are scoring/substitution matrices?
A scoring matrix is a table that assigns a score to each possible pair of amino acid residues based on their similarity and likelihood of being substituted by each other.
The matrix is used to score the alignment and identify the optimal alignment based on the highest score. Different scoring matrices may be used depending on the type of aligned sequences and the desired sensitivity level.
What is Big O notation?
Big O notation is a mathematical notation used to describe the performance of an algorithm in terms of how the running time or space complexity grows as the input size increases.
What does it mean to say that an algorithm has a running time of O(n)?
An algorithm with a running time of O(n) has a complexity that grows linearly with the size of the input (n). This means that the running time of the algorithm will increase at a rate that is proportional to the size of the input. For example, if the input size doubles, the running time of the algorithm will also double.
What is the big O-notation for a global alignment with the Smith-Waterman algorithm?
The Big O notation of the Smith-Waterman algorithm for global alignment is O(mn), where m and n are the lengths of the two sequences being aligned. Since the sequences are approximately the same length O(N^2) can also be used.
What is the difference between the dynamic programming matrices from global and local alignment?
The best local alignment can end anywhere in the matrix. The best global alignment will always start from the bottom. In the local alignment, the cells in the matrix can take the value of 0 if all other options have a value less than 0. Taking option 0 corresponds to starting a new alignment.
How does BLAST work?
BLAST is a heuristic tool to find a good local alignment but not necessarily the best.
- BLAST creates a list of seeds or words from the query sequence. These words are usually 3 amino acids long or 8-12 base residues long. These seeds are spaced out along the query sequence.
- BLAST searches a database for matches to the seeds. If a seed matches a portion of a database sequence, the algorithm extends the seed on both sides to create a longer alignment.
- Continue extending the alignment until the best match is found: Unlike the FASTA algorithm, which stops the alignment once it reaches a certain threshold (e.g., 50% identity), BLAST continues to extend the alignment until it finds the best possible match. This can involve inserting gaps (indels) into the alignment to improve the overall match
- Once all of the seeds have been processed, the algorithm ranks the alignments based on the quality of the match, which is typically measured by the alignment score which is calculated by using a scoring matrix.
- The algorithm then outputs a list of the best alignments, ranked by the quality of the match.
How does FASTA work?
- FASTA creates a list of seeds from the query sequence. These seeds are spaced out evenly along the query sequence.
- FASTA then searches the sequences database for matches to these seeds. If a seed matches a portion of a database sequence, the algorithm extends the seed on both sides to create a longer alignment
- Unlike the BLAST algorithm, which continues to extend the alignment until it finds the best possible match, the FASTA algorithm stops the alignment once it reaches a certain threshold, such as 50% identity.
- Rank the alignments based on the quality of the match: Once all of the seeds have been processed, the algorithm ranks the alignments based on the quality of the match, which is typically measured by the percentage of identical residues in the alignment.
- The algorithm then outputs a list of the best alignments, ranked by the quality of the match.
What is the difference between FASTA and BLAST?
Sensitivity: BLAST is generally more sensitive than FASTA, meaning that it is more likely to find matches between a query sequence and a database of sequences.
Speed: FASTA is generally faster than BLAST
What is the BLAST E-value?
The BLAST E-value is the number of expected hits of similar quality (score) that could be found just by chance.
E-value of 10 means that up to 10 hits can be expected to be found just by chance, given the same size of a random database.
The smaller the E-value, the better the match.
What is a dot-plot used for?
To visually compare the similarity between two sequences. Each dot on the dot-plot represents a match between the residue in sequence x and y.
What are a benefit and a drawback of dot plots?
A dot plot can identify intrasequence repeats, meaning subsequences that appear multiple times in the sequence.
A dot plot suffers from background noise, making it hard to distinguish dot patterns arising from background noise from significant dot patterns.
How can the problem of background noise in a dot plot be solved?
The easiest way to overcome the background noise problem is to apply a filter that requires comparison to achieve some minimum identity score. This makes it so that only diagonals of a certain length will survive the filter and be shown in the final plot.
What is the minimum percentage of identity that can reasonably be accepted as significant as a measure of homology?
90% of sequence pairs with identity at or greater than 30% over their whole length were pairs of structurally similar proteins. Therefore 30% is a general threshold for an initial presumption of homology.
Between 30% and 20% is called the twilight zone, where homology may exist but cannot reliably be assumed.
Below 20% is the midnight zone, where homology is very unlikely.
What is the genetic or evolutionary distance?
A measure of the difference between two homologous sequences from different species.