Weeks 5-6: Genome Assembly; BLAST/FASTA Flashcards by Madiyar Zhanduisenov

The IHGSC used a _____ approach to sequence the genome

hierarchical

Celera used ______ to sequence the genome

whole genome shotgun

Pseudogene

DNA sequence resembling a gene but mutated into an inactive form over the course of evolution.
Often lacks introns and other essential DNA sequences necessary for function. Pseudogenes do not result in functional proteins, but may have regulatory effects

True or false:

98% of the genome doesn’t encode proteins

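True. Only about 2% of the genome encodes proteins.

The other 98% is non-coding. Part of it encodes RNAs (including small RNAs) that regulate gene expression; the rest includes introns, regulatory elements and repetitive sequences.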

Output of Sanger sequencing

Single sequence ranging from 500-1000 bp

Output of Next-Gen Sequencing

Groups of sequences (many reads produced in parallel), each ranging from 25-500 bp

De novo assembly

Reconstruction of contiguous sequences without making use of any reference sequence.
Reads are partitioned into k-mers (substrings of the read sequence of length k). The k-mers form the nodes of a graph (network), and two nodes are linked when they share a (k-1)-mer, i.e. when they overlap by k-1 bases.
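
The k-mer graph described above can be sketched in a few lines of Python. This is a minimal illustration, not a real assembler: the function names (`kmers`, `build_kmer_graph`, `walk_unambiguous`) and the toy reads are made up for this example, and it assumes error-free reads with no repeats.

```python
from collections import defaultdict

def kmers(read, k):
    """Return every substring of length k in the read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_kmer_graph(reads, k):
    """Nodes are k-mers; an edge a -> b exists when the last k-1 bases
    of a equal the first k-1 bases of b."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    # Group nodes by their (k-1)-base prefix so successors are a lookup.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    return {node: sorted(by_prefix[node[1:]]) for node in nodes}

def walk_unambiguous(graph, start):
    """Follow edges from start while there is exactly one successor,
    spelling out a contig one base at a time."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        nxt = graph[node][0]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Toy example: three overlapping reads (hypothetical data).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_kmer_graph(reads, k=4)
contig = walk_unambiguous(graph, "ATGG")
```

Real assemblers also have to cope with sequencing errors, repeats and both strands, which this sketch ignores.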

Genome annotation

Computationally expensive process of attaching biologically relevant information to genome sequence data

Pre-Assembly Steps

FastQC: To check the quality of the sequencing data, overall GC content, repeat abundance, the proportion of duplicated reads.
Trim sequences: adapter trimming (cutadapt), trim reads based on quality (sickle).
Remove contaminant sequences such as DNA from the PhiX phage: use a short read aligner (such as Burrows-Wheeler Aligner)
Demultiplex reads: Galaxy
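
To make the quality-trimming step concrete, here is a simplified sliding-window trim in Python. It is only a sketch in the spirit of quality trimmers such as sickle, not that tool's actual algorithm: the function names, the window size of 4 and the threshold of 20 are assumptions chosen for illustration, and the quality string is assumed to use the Phred+33 encoding.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores
    (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality_string]

def trim_by_quality(seq, quality_string, threshold=20, window=4):
    """Cut the read at the start of the first window whose mean quality
    falls below the threshold; return the kept bases and qualities."""
    scores = phred_scores(quality_string)
    for start in range(len(scores) - window + 1):
        mean_quality = sum(scores[start:start + window]) / window
        if mean_quality < threshold:
            return seq[:start], quality_string[:start]
    return seq, quality_string

# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed = trim_by_quality("ACGTACGTAC", "IIIIII####")
untouched = trim_by_quality("ACGTACGT", "IIIIIIII")
```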

True or false:

Gel electrophoresis is required for NGS

False. No gel electrophoresis needed.

Smith-Waterman Search

Perform dynamic programming between query and each sequence in the collection.
Accurate - guaranteed to report the highest scoring alignments
Slow - searching a 52,000,000,000 basepair collection (entire GenBank database) takes around 3 days on a modern workstation

FASTA

First heuristic search algorithm
~5x faster than Smith-Waterman
Four stage search process - first stage based on algorithm of Wilbur and Lipman for finding exact matches of length n between query and collection sequences

Wilbur-Lipman Approach

Ignore indel events
Extract intervals (fixed-length, overlapping subsequences of length n) from the first sequence
Store the intervals in a fast search structure (a hash table)
For each interval in the second sequence, look it up in the hash table
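
A rough Python sketch of the approach above. The function names are invented for this example, and counting matches per diagonal (first-sequence position minus second-sequence position) is one way to read "ignore indel events": without insertions or deletions, related regions stay on a single diagonal.

```python
from collections import Counter, defaultdict

def index_intervals(seq, n):
    """Hash table mapping each length-n interval to its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - n + 1):
        table[seq[i:i + n]].append(i)
    return table

def diagonal_hits(first, second, n):
    """Count exact length-n matches on each diagonal (i - j)."""
    table = index_intervals(first, n)
    diagonals = Counter()
    for j in range(len(second) - n + 1):
        for i in table.get(second[j:j + n], ()):
            diagonals[i - j] += 1
    return diagonals

hits = diagonal_hits("ACGTACGT", "TTACGT", n=3)
```

A diagonal with many hits marks a region where the two sequences match without gaps, which is where a more careful alignment is worth attempting.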

FASTA steps

Step 1: Identify regions shared by the two sequences by finding exact matches of length n = 1 (using the Wilbur-Lipman method); in FASTA, n is typically 1 or 2 for proteins and up to 6 for DNA
Step 2: Rescan the top ten regions and rescore them using a scoring matrix (protein only)
Step 3: Check whether the initial regions can be joined to form a rough alignment with gaps
Step 4: Perform a banded Smith-Waterman local alignment centred on each region that scores above a threshold

BLAST

~50x faster than Smith-Waterman, 10x faster than FASTA but not 100% accurate
Stage 1: BLAST searches for hits (matches of length W between query and subject). Location of each hit is passed to stage 2.
Stage 2: BLAST performs an ungapped alignment of region surrounding each hit. High-scoring ungapped alignments (where score > T) are passed to stage 3
Stages 3 and 4: BLAST performs a gapped alignment of region surrounding each high-scoring ungapped alignment
High-scoring alignments are displayed to the user
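
Stages 1 and 2 can be sketched as a seed-and-extend loop. This is an illustration under stated assumptions, not BLAST's implementation: the function names, the match/mismatch scores and the `x_drop` cut-off value are chosen for the example, and the gapped stages are left out. W and T are used as on this card (hit length and ungapped score threshold).

```python
def find_hits(query, subject, W):
    """Stage 1: exact matches of length W, as (query_pos, subject_pos)."""
    words = {}
    for i in range(len(query) - W + 1):
        words.setdefault(query[i:i + W], []).append(i)
    hits = []
    for j in range(len(subject) - W + 1):
        for i in words.get(subject[j:j + W], ()):
            hits.append((i, j))
    return hits

def ungapped_extend(query, subject, i, j, W, match=1, mismatch=-3, x_drop=5):
    """Stage 2: extend a hit left and right without gaps, giving up in a
    direction once the score falls more than x_drop below the best seen.
    Returns the best score and the (start, end) span in the query."""
    best = running = W * match
    right = left = 0
    k = 0
    while i + W + k < len(query) and j + W + k < len(subject):
        running += match if query[i + W + k] == subject[j + W + k] else mismatch
        k += 1
        if running > best:
            best, right = running, k
        elif best - running > x_drop:
            break
    running, k = best, 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        running += match if query[i - 1 - k] == subject[j - 1 - k] else mismatch
        k += 1
        if running > best:
            best, left = running, k
        elif best - running > x_drop:
            break
    return best, (i - left, i + W + right)

def high_scoring_segments(query, subject, W, T):
    """Keep ungapped alignments with score > T for the gapped stages."""
    segments = set()
    for i, j in find_hits(query, subject, W):
        score, span = ungapped_extend(query, subject, i, j, W)
        if score > T:
            segments.add((score, span))
    return sorted(segments, reverse=True)

segments = high_scoring_segments("GATTACA", "CCGATTACACC", W=4, T=5)
```

Only the short list of segments that survive the threshold is passed on to the expensive gapped alignment, which is where the speed-up over Smith-Waterman comes from.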

Difference between BLAST protein and nucleotide searches

BLAST protein search: two hits on the same diagonal (instead of just one, as in a nucleotide search) are required to trigger an ungapped alignment.

Why are index-based approaches not suitable for searching large collections?

Because the index ends up being much larger than the data itself

True or false:

Two proteins that are related in recent evolutionary terms will usually share sequence and structural similarity

True

PSI-BLAST

Used to detect distantly related homologues not detected by BLASTP.
Several rounds (iterations) of BLAST are run. After each round, a Position-Specific Scoring Matrix (PSSM) is constructed from the matches and used for the subsequent iteration. If new matches are found, another matrix is constructed. If no new matches are found, hits with E < 1x10^-6 are recorded.
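
To show what a PSSM holds, here is a small Python sketch that builds one from a gap-free alignment as log-odds scores. It is a simplification: the uniform background frequency, the single pseudocount and the function names are assumptions for this example, whereas PSI-BLAST itself weights sequences and uses a more elaborate pseudocount scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """One dict per alignment column, mapping each residue to
    log2(observed frequency / background frequency)."""
    background = 1 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment of equal length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Toy alignment of three matched sequences (hypothetical data).
pssm = build_pssm(["ACD", "ACE", "ACD"])
```

Unlike a single substitution matrix, the score for a residue depends on its position, so conserved columns are rewarded more strongly in the next iteration.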

Significance threshold for PSI-BLAST

Experimental tests of PSI-BLAST using default parameters have determined that proteins identified in the first 20 iterations with expect scores <1x10^-6 are most likely real

Define: Domain

A contiguous stretch of amino acids “that look as though they should have independent stability”.

Decreasing the e-value threshold reduces the likelihood of _______ but decreases ________.

false positives; sensitivity

False positives in PSI-BLAST searches are:

High-scoring alignments that are not in fact related to the query

What is required to trigger an ungapped alignment in BLASTn?

One hit (a single match is enough)

What is required to trigger an ungapped alignment in BLASTP?

Two hits on the same diagonal

Stages of BLAST

1. BLAST searches for hits.
2. BLAST performs ungapped alignment. High-scoring ungapped alignments are passed to stage 3.
3. BLAST performs gapped alignment of the region surrounding each high-scoring alignment.
4. High-scoring alignments are presented to the user.

What types of reads are generated using shotgun sequencing?

whole-genome shotgun reads

What types of reads are generated using hierarchical sequencing?

BAC shotgun reads

Advantages of MSA over pairwise alignments

More information than pairwise alignment
Can create phylogenetic trees
Can identify conserved regions

ClustalW steps

1. Begins with pairwise alignment and scoring of all the pairs.
2. Builds a phylogenetic tree.
3. The most closely related sequences are aligned and form a consensus using dynamic programming. The next most closely related sequences are then aligned and form a consensus, and so forth.

Iterative search

1. Search database with query sequence.
2. Construct multiple alignment from high-scoring aligned sequences.
3. Construct a profile using the multiple alignment.
4. Search database with profile. Repeat.

E-value

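The number of alignments scoring at least as well as this one that would be expected purely by chance in a database of this size. It is not strictly a probability (it can exceed 1), although for very small values the two are nearly equal.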
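
The E-value is commonly computed with the Karlin-Altschul formula E = K * m * n * e^(-lambda * S), where S is the alignment score, m the query length and n the database length. The sketch below applies it; the values of K and lambda are illustrative placeholders only (in practice they depend on the scoring system), and the function names are made up for this example.

```python
import math

def e_value(score, query_len, db_len, K=0.1, lam=0.3):
    """Expected number of chance alignments scoring at least `score`.
    K and lam are placeholder values, not real BLAST parameters."""
    return K * query_len * db_len * math.exp(-lam * score)

def p_value(e):
    """Probability of seeing at least one such chance alignment,
    P = 1 - e^(-E); close to E when E is very small."""
    return -math.expm1(-e)

strong = e_value(60, query_len=100, db_len=1_000_000)
weak = e_value(10, query_len=100, db_len=1_000_000)
```

A higher score gives a smaller E-value, and a larger database gives a larger E-value for the same score.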

Which profile does PSI-BLAST use?

Position-specific score matrices (PSSMs)

What profile does SAM use?

Hidden Markov Models (HMMs)

Examples of iterative searches

PSI-BLAST and SAM

(36 cards)