Genome assembly Flashcards
(43 cards)
Why do we sequence?
- we are still sequencing new genomes
- can be a new individual
- for DNA-protein interactions
- metagenomics
- Sequence new genome (no previous version)
- Sequence new individuals - how does it differ to reference
- Sequence population - look at variation across population
- Sequence tumour cells and compare to ‘normal’ tissue: where are the cancer mutations, and do they change over a time course?
- Sequence transcripts: survey gene-space, also relative quantification by tissue / time / condition
- Sequence as read-out to identify DNA-protein interaction (e.g. chromatin precipitation)
- Metagenomic sequencing of mixed, co-habiting populations of organisms: genome fragments, transcripts or rRNAs are sequenced to determine which organisms are present and their relative abundance
What are the next-gen sequencing technologies?
- Illumina
- Oxford Nanopore
- PacBio
How do you get high quality in Illumina?
The reads are short, but the volume of reads you can get through is very large, so each position is covered by many reads
What is the length of the reads in PacBio?
Shorter than Nanopore but longer than Illumina
How do you deal with high error rates in PacBio?
Single-pass reads have a very high error rate. To solve that, you sequence the same molecule multiple times; because the errors are random, you can align the passes and take the consensus, which gives high accuracy
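The consensus idea can be sketched as a per-position majority vote. This is only an illustration: it assumes the passes are already aligned to the same length and contain substitution errors only, whereas real reads also need insertions and deletions handled by a proper aligner. The function name `consensus` and the toy reads are hypothetical.

```python
from collections import Counter

def consensus(passes):
    """Majority-vote consensus over equal-length, already-aligned passes."""
    if not passes:
        return ""
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be aligned to the same length")
    # For each aligned column, keep the most frequent base across passes.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))

# Hypothetical toy data: three passes over the same molecule,
# two of them carrying one random error each.
passes = ["ACGTTGCA", "ACGATGCA", "ACCTTGCA"]
print(consensus(passes))  # ACGTTGCA
```

Because the errors fall at different positions in different passes, each wrong base is outvoted by the correct one.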
What are quality scores particularly important for?
If you are trying to find SNPs, you need the quality score to judge whether a mismatch is a sequencing error or an actual variant
What do we need quality scores for?
- Quality scores are assigned to estimate confidence of a given base call
- They are expressed as Phred scores
- Aim for a quality score of 30 or higher (an error probability of 1 in 1000 or lower)
- The quality scores are used for filtering and trimming of reads
- Also used for assembly
- Base quality scores are essential for variant calling to distinguish a true variant from a sequencing error
Where does the quality deteriorate?
Quality deteriorates towards the ends of reads
What effect does AT and GC content have?
High AT or GC content reduces complexity and can lead to higher error rates
What is the formula for QV?
- The quality value (QV) is related to the base call error probability by the formula
- QV = -10 × log10(Pe), where Pe is the probability that the base call is an error
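The formula can be checked numerically with a short sketch. The function names here are illustrative, not taken from any particular tool.

```python
import math

def error_prob_to_qv(pe):
    """QV = -10 * log10(Pe)."""
    return -10 * math.log10(pe)

def qv_to_error_prob(qv):
    """Inverse of the formula: Pe = 10 ** (-QV / 10)."""
    return 10 ** (-qv / 10)

# Print the error probability behind a few common quality values.
for qv in (10, 20, 30, 40):
    print(f"QV {qv}: Pe = {qv_to_error_prob(qv):.4f}")
```

Each step of 10 in QV divides the error probability by 10, so QV 30 means 1 error in 1000 base calls.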
What is base calling?
- In Illumina sequencing, base calling algorithms turn raw fluorescence intensities into A, T, C, G or N base calls
What is the Chastity Filter?
- The usual method for base calling in Illumina systems is known as the Chastity Filter
- The chastity filter calls a base if the highest intensity divided by the sum of the highest and second-highest intensities is at least a threshold (usually 0.6). Otherwise the base is marked as N
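The rule on this card can be written as a small sketch. It follows the card's description only; the per-channel intensity dictionary and the function names are assumptions for illustration, and production pipelines apply the filter differently in detail.

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities.values(), reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0

def call_base(intensities, threshold=0.6):
    """Return the brightest base if chastity meets the threshold, otherwise N."""
    if chastity(intensities) >= threshold:
        return max(intensities, key=intensities.get)
    return "N"

# Hypothetical intensities for one cycle of one cluster.
clear_signal = {"A": 900, "C": 100, "G": 50, "T": 20}   # chastity 0.9
mixed_signal = {"A": 500, "C": 450, "G": 10, "T": 5}    # chastity about 0.53
print(call_base(clear_signal), call_base(mixed_signal))  # A N
```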
What is FASTQ format?
- The standard output format for next-gen sequencing
- Nearly all downstream programs now rely on that format
What do they use for quality scores in FASTQ?
ASCII characters encode the quality scores, one character per base, so each quality character corresponds to the base at the same position in the sequence
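A minimal decoding sketch, assuming the common Phred+33 encoding (the quality value is the character's ASCII code minus 33). The offset is an assumption not stated on the card; some older Illumina files used an offset of 64.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred quality values."""
    return [ord(ch) - offset for ch in quality_string]

# 'I' is ASCII 73, '5' is 53, '+' is 43, '!' is 33.
print(decode_qualities("II5+!"))  # [40, 40, 20, 10, 0]
```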
Describe the standard output
- 4 lines per sequence
- Line 1 begins with the @ character, a sequence ID and an optional description
- Line 2 is the sequence
- Line 3 begins with the + character, optionally followed by the same sequence ID and description
- Line 4 encodes the quality values for the sequence letters in line 2 and must contain the same number of characters
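The four-line layout can be read with a short parser. This is a sketch that assumes strictly four lines per record (no wrapped sequences); the function name `read_fastq` and the example record are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (seq_id_and_description, sequence, quality) for each 4-line record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        # Line 1 must start with '@' and line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        # Line 4 must have one quality character per base in line 2.
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality

example = io.StringIO("@read1 hypothetical example\nACGTN\n+\nII5+!\n")
records = list(read_fastq(example))
print(records)
```

For real data an established parser is the safer choice; this sketch only mirrors the structure described on the card.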
What is depth of coverage useful for?
Sequencing errors are corrected by depth of coverage: when many overlapping sequence fragments cover the same position, an error in one read is outvoted by the others
What was the depth of coverage in the Human Genome Project?
- For the Human Genome Project, most of the genome was sequenced at 12X or greater coverage
- Each base was present in 12 reads on average
- Even with 12X coverage, approximately 1% of the genome was not accurately assembled
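Average coverage is the total number of sequenced bases divided by the genome size. The sketch below shows that arithmetic; the read length and genome size are illustrative round numbers, not figures from the Human Genome Project.

```python
import math

def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

# Hypothetical example: 3 Gb genome, 500 bp reads, 12X target.
n = reads_needed(12, 500, 3_000_000_000)
print(n, mean_coverage(n, 500, 3_000_000_000))
```

This is only an average: some regions end up with far fewer reads than others, which is one reason a small part of the genome can still be poorly assembled.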
Describe paired end sequencing
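- Each fragment is sequenced from both ends, so you get two reads per fragment
- The reads are shorter than the fragment, so the middle of the fragment may not be sequenced
- Because the approximate fragment length is known, this tells you how far apart the two reads are in the genome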
What do we do with repeats in paired end sequencing?
- It is tricky to assemble a genome when you have repeats, because you can’t tell which copy of the repeat a read came from
- To solve that, you anchor the reads using other, unique sequences that overlap with or are paired to them
- If one read is unmappable because it falls in a very repetitive region, but the other is unique, you can use the known distance between the pair to map both reads
- One read can be mapped and the second can then be positioned within the repeat
- With enough paired end reads the entire repeat can be mapped
- With large repeats (LINEs etc.) paired ends can’t span the entire repeat, so it can’t be fully mapped
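The anchoring step can be sketched as computing the window in which the unmapped mate must lie, given where the unique read mapped. This assumes the insert size means the full fragment length and is roughly normally distributed; all names and numbers are hypothetical.

```python
def mate_window(anchor_start, read_length, insert_mean, insert_sd,
                anchor_forward=True, n_sd=3):
    """Return the (start, end) interval expected to contain the mate read."""
    shortest = insert_mean - n_sd * insert_sd
    longest = insert_mean + n_sd * insert_sd
    if anchor_forward:
        # Anchor is the left end of the fragment; the mate is at the right end.
        return (anchor_start + shortest - read_length, anchor_start + longest)
    # Anchor is the right end of the fragment; the mate is at the left end.
    anchor_end = anchor_start + read_length
    return (anchor_end - longest, anchor_end - shortest + read_length)

# Hypothetical pair: unique read mapped at 10,000; 100 bp reads; 400 +/- 30 bp inserts.
print(mate_window(10_000, 100, 400, 30))  # (10210, 10490)
```

A repeat copy inside that window is the likely origin of the repetitive read; copies elsewhere in the genome can be ruled out.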
Describe mate pair sequencing
- Mate pairs are similar to paired ends but the insert size is much greater
- Paired-end inserts are a few hundred bp, but mate-pair inserts are kb long
- DNA is fragmented into 2-5 kb fragments and the ends are repaired with biotin-labelled dNTPs
- The fragments are then circularised and fragmented again
- Biotin-labelled fragments are captured, adapters are added and the fragments are sequenced from both ends; as with paired-end reads, the distance between the reads is known
What do you need for scaffolding?
- contig
- scaffold
What are contigs?
- Contiguous sequence where base order is known
- Assembled from sequence reads
What are scaffolds?
- Genome sequence reconstructed from contigs and gaps
- Gaps are bridged where the two reads from at least one fragment (paired end or mate pair, depending on gap length) overlap with reads in two different contigs
- The approximate length of the fragments is known, so the number of bases between the contigs can be estimated
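The gap estimate can be sketched as the fragment length minus the part of each fragment that lies inside the two contigs, averaged over all linking pairs. The input layout and the numbers are hypothetical; real scaffolders also model the spread of fragment lengths.

```python
from statistics import mean

def estimate_gap(links, insert_size):
    """Estimate the gap between two contigs from read pairs that span it.

    links: list of (bases_in_contig_a, bases_in_contig_b), the portion of each
    fragment that falls inside the left and right contig respectively.
    """
    gaps = [insert_size - in_a - in_b for in_a, in_b in links]
    return round(mean(gaps))

# Hypothetical mate pairs from 3 kb fragments linking contig A to contig B.
links = [(1200, 1300), (900, 1600), (1500, 1030)]
print(estimate_gap(links, 3000))  # 490
```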
What is de novo sequencing?
- The genome is sequenced and assembled for the first time so there is no reference.
- When the human genome was first sequenced it was de novo.
- De novo is the more difficult and challenging of the two methods (the other being assembly against an existing reference).
- De novo projects may use multiple technologies to sequence the full genome