Genome sequencing and assembly 3 Flashcards
What are contigs?
Contiguous, gapless sequences of DNA that have been assembled from overlapping reads
What are scaffolds?
Larger structures made by linking contigs together in order and orientation, using additional information to span the gaps between them
What is gap filling?
Filling in gaps between contigs/scaffolds that exist after shotgun sequencing
How does gap filling work?
If the sequences at the start and end of the gap are known, primers are designed from them and the gap is sequenced step by step, using new internal primers along its whole length (primer walking)
What is a physical gap?
A stretch of the sequence that isn’t present in the clone library
Why may physical gaps exist?
The gap regions may have been unstable when cloned, so they are missing from the clone library
Which types of sequence are difficult to sequence?
Repetitive sequences, e.g. transposons, tandem repeats, centromeres
Issue with sequencing repetitive elements?
A read that lies partly or wholly within a repeat element may be wrongly assigned as an overlap with a different copy of the repeat elsewhere in the genome
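The ambiguity can be seen in a small sketch. The toy genome, the repeat sequence and the helper `find_all` are all made up for illustration; the point is only that a read lying inside a repeat matches more than one position, while a read that extends into unique flanking sequence does not.

```python
def find_all(genome, read):
    """Return every start position at which read occurs in genome."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)  # allow overlapping matches
    return hits


# Hypothetical toy genome: the same 9 bp repeat occurs twice.
repeat = "GATTACAGG"
genome = "ACCT" + repeat + "TTGCA" + repeat + "CGA"

print(find_all(genome, "TTACAG"))   # read inside the repeat -> [6, 20]
print(find_all(genome, "CCTGATT"))  # read anchored in unique sequence -> [1]
```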
Frequency of repetitive elements in prokaryotes and eukaryotes?
Uncommon in prokaryotes, very common in eukaryotes
Challenges in genome assembly?
Long reads: high error rate (e.g. Oxford Nanopore)
Short reads: difficult to assemble large genomes with them
Examples of genome assembly algorithms?
Overlap layout consensus, De Bruijn Graph
Overlap layout consensus?
Overlap: identifies overlapping regions between fragments
Layout: the overlaps are used to create a layout that connects the reads in order
Consensus: the most likely overall sequence is determined
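A minimal sketch of the overlap and layout steps, assuming error-free reads from one strand. It uses a simple greedy merge rather than a real overlap graph, and it skips the consensus step entirely; the function names and the `min_len` threshold are illustrative choices, not taken from any assembler.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
# -> ['ATGGCGTACCTGA']
```

Greedy merging is easy to follow but can collapse repeats incorrectly, which is one reason real assemblers build and simplify a graph instead.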
Pros of Overlap layout consensus?
Useful for assembling genomes from long reads
De Bruijn Graph?
Breaks reads down into much shorter sequences (k-mers)
(k is the number of nucleotides in each k-mer)
Connects nodes (the k-mers) whose sequences overlap by k-1 nucleotides
De Bruijn graph example?
TGA -> GAC -> ACC -> CCG (k = 3; each k-mer overlaps the next by 2 bases, spelling TGACCG)
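The chain above can be reproduced with a short sketch. It follows the convention used on these cards, where k-mers are the nodes and an edge joins two k-mers that overlap by k-1 bases; many assemblers instead use (k-1)-mers as nodes and k-mers as edges. The two input reads and the function names are assumptions made up for the example.

```python
def kmers(seq, k):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Map each distinct k-mer to the k-mers that follow it.

    An edge a -> b exists when the last k-1 bases of a equal
    the first k-1 bases of b.
    """
    nodes = []
    for read in reads:
        for km in kmers(read, k):
            if km not in nodes:
                nodes.append(km)
    return {a: [b for b in nodes if a[1:] == b[:-1]] for a in nodes}


def spell_path(edges):
    """Follow an unbranched path from the single start node and spell it."""
    incoming = {b for succs in edges.values() for b in succs}
    starts = [n for n in edges if n not in incoming]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    path, seen = [starts[0]], {starts[0]}
    while len(edges[path[-1]]) == 1 and edges[path[-1]][0] not in seen:
        path.append(edges[path[-1]][0])
        seen.add(path[-1])
    # First k-mer in full, then the last base of each following k-mer.
    return path, path[0] + "".join(n[-1] for n in path[1:])


edges = build_graph(["TGACC", "ACCG"], k=3)
print(spell_path(edges))
# -> (['TGA', 'GAC', 'ACC', 'CCG'], 'TGACCG')
```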
Ways to assess genome assembly quality?
Completeness, Depth and Coverage, Contiguity
Completeness of genome assembly?
How much of the genome has been assembled without gaps
Depth and coverage of genome assembly?
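Depth: the number of times each base is sequenced (usually quoted as an average, e.g. 30x). Coverage: how much of the genome is covered by at least one read. Higher depth makes the consensus sequence more accurate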
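As a rough sketch, expected depth is often estimated as (number of reads x read length) / genome size, and per-base depth can be counted from where reads align. The function names, the half-open `(start, end)` interval format and the toy numbers are assumptions for illustration only.

```python
def mean_depth(num_reads, read_length, genome_size):
    """Expected average depth: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def per_base_depth(genome_len, alignments):
    """Count how many reads cover each position.

    alignments is a list of half-open (start, end) intervals.
    """
    depth = [0] * genome_len
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def breadth(depth):
    """Fraction of positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)


print(mean_depth(1_000_000, 150, 5_000_000))  # -> 30.0 (i.e. 30x)

depth = per_base_depth(12, [(0, 6), (4, 10), (5, 8)])
print(depth)  # -> [1, 1, 1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
```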
Contiguity of genome assembly?
How continuous the assembly is: the fewer and longer the contigs or scaffolds, the higher the contiguity
What is the N50 score?
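The contig/scaffold length such that contigs/scaffolds of that length or longer contain at least 50% of the total assembly length (it is not the average length)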
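N50 is conventionally computed by sorting contigs from longest to shortest and adding up their lengths until half of the total assembly length is reached; the length of the contig that crosses that halfway point is the N50. A minimal sketch, with a made-up list of contig lengths:

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


contigs = [80, 70, 50, 40, 30, 20, 10]  # toy lengths, total 300
print(n50(contigs))  # -> 70 (80 + 70 = 150, half of 300)
```

For the same toy lengths the mean is about 43, which shows how far N50 can sit from the average when a few long contigs hold most of the assembly.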
What is currently considered a good N50 score?
> 1Mb
What does BUSCO stand for?
Benchmarking Universal Single-Copy Orthologs
What is BUSCO?
A method of measuring the completeness of a genome assembly by checking it against a set of highly conserved single-copy orthologous genes
e.g. every organism has an RNA polymerase, so it checks whether that gene is present in the assembly, and so on
Pangenome?
The complete set of genes within a species, encompassing both core genes shared by all individuals and variable genes present in some but not others
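The core/variable split maps directly onto set operations: the pangenome is the union of every individual's genes and the core genome is the intersection. A small sketch, where the strain names and gene lists are toy data chosen for illustration and `split_pangenome` is a hypothetical helper:

```python
def split_pangenome(gene_sets):
    """Return (pangenome, core, variable) gene sets.

    gene_sets maps each individual/strain to the genes it carries.
    """
    genomes = [set(genes) for genes in gene_sets.values()]
    if not genomes:
        raise ValueError("need at least one genome")
    pan = set().union(*genomes)         # genes found in any individual
    core = set.intersection(*genomes)   # genes found in every individual
    variable = pan - core               # genes found in some but not all
    return pan, core, variable


# Toy data, not real strains.
strains = {
    "strain_A": {"rpoB", "gyrA", "recA", "blaTEM"},
    "strain_B": {"rpoB", "gyrA", "recA"},
    "strain_C": {"rpoB", "gyrA", "recA", "mecA"},
}
pan, core, variable = split_pangenome(strains)
print(sorted(core))      # -> ['gyrA', 'recA', 'rpoB']
print(sorted(variable))  # -> ['blaTEM', 'mecA']
```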