Weeks 5-6: Genome Assembly; BLAST/FASTA Flashcards
(36 cards)
The IHGSC used a _____ approach to sequence the genome
hierarchical
Celera used ______ to sequence the genome
whole genome shotgun
Pseudogene
DNA sequence resembling a gene but mutated into an inactive form over the course of evolution.
Often lacks introns and other essential DNA sequences necessary for function. Pseudogenes do not result in functional proteins, but may have regulatory effects
True or false:
98% of the genome doesn’t encode proteins
True. Only 2% of the genome encodes proteins.
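The other 98% is non-coding: it includes introns, regulatory sequences, repetitive elements, and genes for non-coding RNAs, some of which regulate gene expression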
Output of Sanger sequencing
Single sequence ranging from 500-1000 bp
Output of Next-Gen Sequencing
Groups of sequences (reads) ranging from 25 to 500 bp
De novo assembly
Reconstruction of contiguous sequences without making use of any reference sequence.
Reads are partitioned into k-mers (substrings of the read sequence of length k). The k-mers form the nodes of a graph (network), and two k-mers are linked when they share a (k-1)-mer.
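A minimal Python sketch of the k-mer graph described above. It assumes error-free reads and links k-mer A to k-mer B when the last k-1 characters of A equal the first k-1 characters of B; the function name kmer_graph and the toy reads are illustrative only, and real assemblers add error correction and graph simplification on top of this.

```python
from collections import defaultdict

def kmer_graph(reads, k):
    """Return {kmer: [successor kmers]} where successors overlap by k-1 characters."""
    # Collect every distinct k-mer found in the reads (these are the nodes).
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Group k-mers by their (k-1)-length prefix for fast lookup.
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)

    # Link each k-mer to the k-mers whose prefix equals its (k-1)-length suffix.
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in sorted(kmers)}

graph = kmer_graph(["ATGGC", "GGCGT"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Following the links from ATG through TGG, GGC and GCG to CGT spells out the contig ATGGCGT, which neither read contains on its own.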
Genome annotation
Computationally expensive process attaching biologically relevant information to genome sequence data
Pre-Assembly Steps
FastQC: checks the quality of the sequencing data, overall GC content, repeat abundance, and the proportion of duplicated reads.
Trim sequences: adapter trimming (cutadapt), trim reads based on quality (sickle).
Remove contaminant sequences such as DNA from the PhiX phage: use a short read aligner (such as Burrows-Wheeler Aligner)
Demultiplex reads: Galaxy
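To make two of these checks concrete, here is a small Python sketch of a GC-content calculation and a 3' quality trim. It is a simplification, not how FastQC or sickle are implemented (sickle, for example, uses a sliding window). It assumes Phred+33 quality encoding; the function names and the min_q default of 20 are assumptions chosen for illustration.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q.

    qual is the FASTQ quality string; offset=33 assumes Phred+33 encoding.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

print(gc_content("GGCCAATT"))
print(trim_low_quality_tail("ACGTAC", "IIII##"))
```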
True or false:
Gel electrophoresis is required for NGS
False. No gel electrophoresis needed.
Smith-Waterman Search
Perform dynamic programming between query and each sequence in the collection.
Accurate - guaranteed to report the highest scoring alignments
Slow - searching a 52,000,000,000 basepair collection (entire GenBank database) takes around 3 days on a modern workstation
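The dynamic programming step can be sketched in Python as below. This version returns only the best local alignment score (no traceback) and uses a simple linear gap penalty; the scoring values (match=2, mismatch=-1, gap=-2) and the toy collection are assumptions for illustration, not parameters from the source.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Exhaustive search: score the query against every sequence in the collection.
collection = ["TTACGTT", "GGGGGG", "ACGT"]
query = "ACGT"
scores = {seq: smith_waterman_score(query, seq) for seq in collection}
print(scores)
```

Every cell of every query-by-subject matrix is filled in, which is why the search is exact but slow on large collections.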
FASTA
First heuristic search algorithm
~5x faster than Smith-Waterman
Four stage search process - first stage based on algorithm of Wilbur and Lipman for finding exact matches of length n between query and collection sequences
Wilbur-Lipman Approach
- Ignore indel events
- Extract intervals (fixed-length overlapping subsequences of length n) from the first sequence
- Store the intervals in a fast search structure (a hash table)
- For each interval in the second sequence, look it up in the hash table
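The steps above can be sketched in Python using a dictionary as the hash table. The function names are illustrative; the diagonal count at the end is an extra added here to show how exact matches that line up can be grouped into regions, and is not claimed to be part of the original Wilbur-Lipman description.

```python
from collections import Counter, defaultdict

def shared_words(seq1, seq2, n):
    """Return (i, j) pairs where seq1[i:i+n] == seq2[j:j+n]."""
    # Index every overlapping length-n interval of the first sequence.
    index = defaultdict(list)
    for i in range(len(seq1) - n + 1):
        index[seq1[i:i + n]].append(i)

    # Look up every length-n interval of the second sequence in the index.
    hits = []
    for j in range(len(seq2) - n + 1):
        for i in index.get(seq2[j:j + n], []):
            hits.append((i, j))
    return hits

def diagonal_counts(hits):
    """Count hits per diagonal (i - j); many hits on one diagonal suggest a shared region."""
    return Counter(i - j for i, j in hits)

hits = shared_words("ACGTAC", "TACG", n=2)
print(hits)
print(diagonal_counts(hits))
```

Because indels are ignored, matching intervals from the same ungapped region always fall on the same diagonal.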
FASTA steps
Step 1: Identify regions shared by the two sequences of length n = 1 (using the Wilbur-Lipman method)
Step 2: Rescan the top-ten regions, and rescore using a scoring matrix (protein only)
Step 3: Check to see if initial regions can be joined to form rough alignment with gaps
Step 4: Perform banded Smith-Waterman local alignment centred around all regions that score greater than a threshold
BLAST
~50x faster than Smith-Waterman, 10x faster than FASTA but not 100% accurate
Stage 1: BLAST searches for hits (matches of length W between query and subject). Location of each hit is passed to stage 2.
Stage 2: BLAST performs an ungapped alignment of region surrounding each hit. High-scoring ungapped alignments (where score > T) are passed to stage 3
Stages 3 and 4: BLAST performs a gapped alignment of region surrounding each high-scoring ungapped alignment
High-scoring alignments are displayed to the user
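Stages 1 and 2 can be sketched in Python for nucleotide sequences as below. This is a toy illustration, not the NCBI implementation: the scoring values, the default W and T, and the drop-off rule that stops an extension (x_drop) are all assumptions made for this example, and the gapped stages are omitted.

```python
from collections import defaultdict

def ungapped_extend(query, subject, qi, si, w, match=1, mismatch=-3, x_drop=5):
    """Extend an exact hit of length w in both directions without gaps; return the best score."""
    score = best = w * match
    # Extend to the right until the score falls more than x_drop below the best seen.
    i, j = qi + w, si + w
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        best = max(best, score)
        i, j = i + 1, j + 1
    # Extend to the left, starting again from the best score reached so far.
    score = best
    i, j = qi - 1, si - 1
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        best = max(best, score)
        i, j = i - 1, j - 1
    return best

def seed_and_extend(query, subject, w=4, t=6):
    """Stage 1: find exact hits of length w. Stage 2: keep hits whose ungapped score > t."""
    words = defaultdict(list)
    for qi in range(len(query) - w + 1):
        words[query[qi:qi + w]].append(qi)
    passed = []
    for si in range(len(subject) - w + 1):
        for qi in words.get(subject[si:si + w], []):
            score = ungapped_extend(query, subject, qi, si, w)
            if score > t:
                passed.append((qi, si, score))
    return passed

passed = seed_and_extend("ACGTACGT", "TTACGTACGTTT")
print(passed)
```

Only the hits that survive the threshold would be handed on to the more expensive gapped alignment, which is where most of the speed-up over exhaustive Smith-Waterman comes from.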
Difference between BLAST protein and nucleotide searches
BLAST protein search: two hits on the same diagonal (instead of just one, as in a nucleotide search) are required to trigger an ungapped alignment.
Why are index-based approaches not suitable for searching large collections?
Because the index ends up being much larger than the data itself
True or false:
Two proteins that are related in recent evolutionary terms will usually share sequence and structural similarity
True
PSI-BLAST
Used to detect distantly related homologues not detected by BLASTP. Several rounds (iterations) of BLAST are run. Between rounds, a Position-Specific Scoring Matrix (PSSM) is constructed from the current matches and used for the next iteration. If new matches are found, another matrix is constructed. If no new matches are found, hits with an E-value < 1x10^-6 are recorded.
Significance threshold for PSI-BLAST
Experimental tests of PSI-BLAST using default parameters have determined that proteins identified in the first 20 iterations with expect scores <1x10^-6 are most likely real
Define: Domain
A contiguous stretch of amino acids “that look as though they should have independent stability”.
Decreasing the E-value threshold reduces the likelihood of _______ but decreases ________.
false positives; sensitivity
False positives in PSI-BLAST searches are:
High-scoring alignments that are not in fact related to the query