Genome assembly Flashcards

Question

What is reference sequencing?

Answer 1

- The genome has already been sequenced, so a reference is available
- For subsequent re-sequencing, the reference can be used as a scaffold for the assembly

Answer 2

For n reads there are 2n² - 2n possible overlaps
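
The count above factors as 2n(n - 1). A minimal sketch of the arithmetic follows; the interpretation in the comments (ordered read pairs, doubled for the two possible strands) is an assumption, not something the card states.

```python
def possible_overlaps(n):
    """Number of candidate overlaps among n reads: 2n^2 - 2n = 2n(n - 1)."""
    # Assumed reading: n * (n - 1) ordered pairs (suffix of read A against
    # prefix of read B), doubled because B may come from either strand.
    return 2 * n * (n - 1)
```

For example, `possible_overlaps(1_000_000)` is roughly 2 x 10^12, which is why all-against-all comparison does not scale to short-read data sets.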

Answer 3

- The simplest assembly method
- Finds the two sequences with the largest overlap and merges them
- Repeats until no further assembly is possible
- The choices made by the assembler are local and do not take into account the global relationship between reads
- Limited to simple assemblies due to read lengths and the local assembly method
- Cannot easily use global information such as paired-end reads/mate pairs, which help resolve repetitive genomes
- Phrap uses the cross_match program, which implements the Smith-Waterman algorithm
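
The merge-the-largest-overlap loop described above can be sketched as follows. This is a toy illustration, not how Phrap works: it assumes exact suffix/prefix matches, ignores reverse complements, and the names `overlap_len`, `greedy_assemble` and `min_len` are made up for the example.

```python
def overlap_len(a, b, min_len=1):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the largest overlap until none remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_len(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no further assembly possible
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

Each merge is chosen purely on the current best overlap, which is the local decision-making that limits this method on repetitive genomes.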

Answer 4

- Find the best match between the suffix of one read and the prefix of another
- Mismatches are allowed in overlaps to account for sequencing errors
- Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
- Determine a path through the reads to create the layout
- Create local multiple alignments from the overlapping reads
- Derive the consensus from the alignments
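
The overlap step above, with mismatches tolerated, might look like the sketch below. It is a simplification under stated assumptions: substitutions only (no indels), a fixed mismatch rate, and hypothetical parameter names `min_len` and `max_mismatch_rate`. Real assemblers use proper alignment for this step.

```python
def best_overlap(a, b, min_len=4, max_mismatch_rate=0.1):
    """Longest suffix(a)/prefix(b) overlap whose mismatch rate is tolerated.

    Returns (overlap_length, mismatches), or (0, 0) if none qualifies.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        # Count positions where the suffix of a and the prefix of b differ.
        mismatches = sum(x != y for x, y in zip(a[-size:], b[:size]))
        if mismatches <= max_mismatch_rate * size:
            return size, mismatches
    return 0, 0
```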

Answer 5

K-mer - all the possible substrings of length k that are contained in a string
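
As a small illustration of the definition, the sketch below lists the k-mers of a string with a sliding window (the function name `kmers` is just for this example).

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

For instance, `kmers("ATGGCG", 3)` gives `["ATG", "TGG", "GGC", "GCG"]`.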

Answer 6

- Sort all k-mers in the reads (typically 16-24 bases) and index them
- K-mer - all the possible substrings of length **k** that are contained in a string
- Identify pairs of reads that share a k-mer
- Extend to a full alignment and discard if not >95% similar
- This technique drastically reduces the search space and has been widely used
- Even with this improvement, the computational requirement to identify all possible overlaps from next-gen short reads is a significant limitation
- OLC is suitable for Sanger sequencing reads (1 kb) and long PacBio reads (up to a few tens of kilobases)
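
The index-then-pair idea above can be sketched with a dictionary from k-mer to the reads containing it. This is a minimal illustration that uses a hash map rather than sorting, and stops before the alignment and 95% similarity check; `candidate_pairs` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads sharing at least one k-mer."""
    index = defaultdict(set)  # k-mer -> ids of reads that contain it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the returned pairs would then be extended to a full alignment, instead of aligning every read against every other read.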

Answer 7

- Sanger sequencing reads are represented as nodes in a graph, and edges represent alignments between reads
- Following a Hamiltonian cycle reconstructs the genome by concatenating each read
- Note this forms a circular genome
- A Hamiltonian cycle visits all nodes (reads) once only and returns to the start position
- However, this does not scale to the millions of reads from next-gen genome sequencing

Answer 8

- For any genome we can use the same approach to reconstruct it
- For assembly we ideally need all k-mers present in the genome to be assembled
- Each k-mer should appear at most once in the genome
- The genome can then theoretically be assembled by following the graph through the k-mers
- The larger the genome, the larger the required k-mer
- This is the basis of de Bruijn graph assembly

Answer 9

- Split reads into all possible k-mers, which removes redundancy in the reads
- Follow a Hamiltonian cycle in which each successive node (k-mer) is shifted by one nucleotide
- Using k-mers means that, even though an individual k-mer may overlap with more than one other, there is only one overlap that provides a path through the graph passing through each k-mer only once

Answer 10

- The Hamiltonian graph approach is used by numerous assemblers: SOAPdenovo, SGA and ABySS, among others
- Finding a path that visits every node exactly once is a nondeterministic polynomial time (NP)-complete problem, which becomes intractable as the number of nodes increases
- As the size of the genome increases, the computation time required to solve the graph problem grows prohibitively
- To compensate for this, assembly programs adjust and simplify the graph, for example by reducing branching nodes
- An alternative approach used by other assemblers (Velvet, EULER, SPAdes etc.) is to use an Eulerian path
- This scales better to larger genomes

Answer 11

- All k-mer prefixes and suffixes are represented as nodes
- Each prefix and suffix can only occur once in the graph (note they will be much longer than 2 nucleotides in a full genome assembly graph)
- Edges represent k-mers having particular prefixes and suffixes
- e.g. the k-mer edge ATG has prefix AT and suffix TG
- Perform an Eulerian cycle through the graph, which visits every edge of the graph exactly once
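
The construction above can be sketched in a few lines. This is a toy version under stated assumptions: error-free k-mers, one strand only, and a linear sequence, so it walks an Eulerian path rather than a cycle. The walk uses Hierholzer's algorithm, and the function names are made up for the example.

```python
from collections import defaultdict


def de_bruijn_graph(kmers):
    """Nodes are (k-1)-mer prefixes/suffixes; each k-mer is one edge."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])  # e.g. ATG: AT -> TG
    return graph


def eulerian_walk(graph):
    """Visit every edge exactly once (Hierholzer's algorithm)."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(iter(graph))
    for node, targets in graph.items():
        if len(targets) - in_degree[node] == 1:
            start = node
            break
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            walk.append(stack.pop())
    return walk[::-1]


def spell(walk):
    """Concatenate the first node with the last base of each later node."""
    return walk[0] + "".join(node[-1] for node in walk[1:])
```

With the 3-mers of `ATGGCGTA` supplied in any order, `spell(eulerian_walk(de_bruijn_graph(kmers)))` recovers `ATGGCGTA`, because every 2-mer node in that sequence is unique. Repeated nodes would allow more than one valid walk.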

Answer 12

- Hamiltonian and Eulerian approaches have the same requirements in order to assemble a complete genome
- If the following requirements are met, a path through the graph visiting each edge once is possible:
  - The graph contains all k-mers in the genome (unlikely to occur). This ensures the graph is balanced: in a directed graph, the number of edges in is the same as the number out
  - All k-mers are error free (next-gen sequences contain errors)
  - Each k-mer occurs at most once in the genome (a problem with repeats, but paired-end reads help to overcome this)
- Assembly programs adapt the method to compensate for these issues, e.g. by removing branches
- Low coverage areas will lead to multiple contigs
- The final stage of assembly is scaffolding, using paired-end reads to join contigs

Answer 13

- Assembly requires the presence of all (or nearly all) k-mers in the genome
- Illumina reads are approx. 100-200+ bp, so whole reads would correspond to k-mers of 100+
- Reads will not contain all possible 100-mers etc. present in the genome, however deep the coverage
- Assemblers therefore break each read into overlapping k-mers, e.g. 46 overlapping 55-mers (for a 100 bp read)
- This ensures that nearly all 55-mers in the genome are detected
- The k-mer size can be set when running the assembly, so different options can be tried, as the optimum depends on the genome sequence
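
The figure of 46 overlapping 55-mers follows from the sliding-window count L - k + 1 for a read of length L. A one-line check (the function name is only for illustration):

```python
def kmers_per_read(read_length, k):
    """Number of overlapping k-mers in one read: L - k + 1 (0 if k > L)."""
    return max(read_length - k + 1, 0)
```

So a 100 bp read yields 100 - 55 + 1 = 46 55-mers, while a smaller k yields more k-mers per read.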

Answer 14

- **Unitig** - The initial assembly of sequences using a de Bruijn graph approach
- **Contig** - Paired-end reads are aligned to the unitigs and the pair information is used to orient and merge overlapping unitigs
- **Scaffold** - Mate-pair reads are aligned to the contigs to orient and join them into scaffolds
- "N" characters are inserted at any gaps in coverage and for unresolved repeats

Answer 15

- The most resource-demanding stage of the de Bruijn assembly, including the memory requirement
- All k-mers from the sequence reads are stored in a hash table
- Additional information for each k-mer is also stored:
  - Number of k-mer occurrences in the reads
  - Presence or absence of possible neighbour k-mers in the de Bruijn graph

Answer 16

- A Bloom filter is a compact data structure for representing a set of elements that supports two operations:
  - (1) inserting an element into the set (here, the elements are the k-mers)
  - (2) querying for the presence of an element in the set
- Used by ABySS, where it reduces the memory requirement
- The Bloom filter structure consists of a bit vector and one or more hash functions
- The hash functions map each k-mer to a corresponding set of positions within the bit vector, known as its bit signature
- A k-mer is added to the Bloom filter by setting every position in its bit signature to one
- A k-mer is queried by testing whether all positions of its bit signature are one
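
A minimal sketch of the two operations described above. The hashing scheme (salted SHA-256) and the default sizes are assumptions chosen for readability, not what ABySS uses, and one byte is stored per bit to keep the code short.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        """The item's bit signature: one position per hash function."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```

After `bf = BloomFilter(); bf.add("ATGGC")`, the test `"ATGGC" in bf` is always true. A k-mer that was never added can occasionally also test true, which is the price paid for the reduced memory.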

Answer 17

- To filter out the majority of k-mers caused by sequencing errors, all k-mers with an occurrence count below a user-specified threshold are discarded
- The optimum minimum count is typically 2-4
- Retained k-mers are called solid k-mers
- In the second pass through the reads, those that consist entirely of solid k-mers (solid reads) are extended left and right within the de Bruijn graph to create unitigs
- During the read extension phase of assembly it's possible for multiple solid reads to result in the same unitig
- This is avoided by using an additional tracking Bloom filter to record k-mers included in previous unitigs
- A solid read is only extended if it has at least one k-mer that is not already in the tracking Bloom filter
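
The thresholding step above can be sketched with an exact counter. This is an illustration only: a real Bloom-filter-based assembler would not hold exact counts in memory like this, and `solid_kmers`, `is_solid_read` and `min_count` are hypothetical names.

```python
from collections import Counter


def solid_kmers(reads, k, min_count=2):
    """Keep k-mers seen at least min_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


def is_solid_read(read, k, solid):
    """A solid read consists entirely of solid k-mers."""
    return len(read) >= k and all(
        read[i:i + k] in solid for i in range(len(read) - k + 1)
    )
```

A k-mer created by a single sequencing error usually appears only once, so it falls below the threshold and the read containing it is not used as a seed for extension.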

Answer 18

- Longer reads have enabled a return to the overlap graph approach
- A string graph uses the same methodology as the overlap graph, but simplified
- First, contained reads (reads that are substrings of some other read) are removed
- The resulting graph, called a string graph, shares many properties with the de Bruijn graph without the need to break the reads into k-mers
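
Removing contained reads can be sketched as below. This is a quadratic toy version that assumes exact substring matches on one strand only; `remove_contained` is a made-up name.

```python
def remove_contained(reads):
    """Drop duplicates and any read that is a substring of another read."""
    unique = list(dict.fromkeys(reads))  # remove exact duplicates, keep order
    return [
        read
        for read in unique
        if not any(read != other and read in other for other in unique)
    ]
```

For example, `GGCG` is dropped when `ATGGCGT` is present, since it adds no information to the graph.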

Answer 19

- Theoretical work on efficiently constructing the string graph using the FM index led to memory-efficient assemblers for large genomes
- The FM index is based on the Burrows-Wheeler transform and the suffix array
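
To make the two building blocks concrete, here is a naive sketch that derives the Burrows-Wheeler transform from a suffix array. It sorts whole suffixes, which is only practical for tiny strings, and it omits the rank and occurrence tables that a real FM index adds on top. The `$` sentinel is a common convention assumed here.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    text += sentinel  # the sentinel sorts before A, C, G, T and lowercase letters
    return "".join(text[i - 1] for i in suffix_array(text))
```

On the classic example, `bwt("banana")` gives `annb$aa`. Runs of identical characters like these are what make the transform compressible.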


(43 cards)