De Novo Assembly Flashcards

Question 1

Q

What are mate-pairs (MP)? What types are there?

Answer

A

reads obtained by sequencing the two ends of long genomic fragments
usually only reads that map at unique positions are considered
the two reads come from positions that are distal to each other in the genome, and their orientation is opposite to that of a paired-end fragment
Paired-out
- mates map on 2 different contigs -> merge the contigs and estimate the gap
Paired-in
- mates map on the same contig -> validate assembly, identify structural variations
Single
- one read maps and its mate does not -> the mapped read is close to a gap
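
The three types above can be sketched as a small classifier. This is only an illustration: the `(contig, position)` tuple used for a mapped read, the function name `classify_pair`, and the extra "unmapped" label are assumptions, not part of the original card.

```python
from typing import Optional, Tuple

# Hypothetical representation: a mapped read is (contig_name, position);
# None means the read did not map anywhere.
Hit = Optional[Tuple[str, int]]

def classify_pair(read1: Hit, read2: Hit) -> str:
    """Classify one mate pair by where its two reads map."""
    if read1 is None and read2 is None:
        return "unmapped"      # neither mate maps (label added for completeness)
    if read1 is None or read2 is None:
        return "single"        # only one mate maps: likely close to a gap
    if read1[0] == read2[0]:
        return "paired-in"     # same contig: validate assembly, look for SVs
    return "paired-out"        # different contigs: candidate for merging

pairs = [
    (("contig1", 100), ("contig1", 3100)),
    (("contig1", 9800), ("contig2", 250)),
    (("contig2", 4000), None),
]
labels = [classify_pair(a, b) for a, b in pairs]
print(labels)  # ['paired-in', 'paired-out', 'single']
```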

Question 2

Q

What is a contig?

Answer

A

series of overlapping DNA sequences

- merged, they make a contiguous fragment of DNA

Question 3

Q

What are structural variations (SV)?

Answer

A

genomic differences involving large segments of DNA
- deletions
- insertions
- inversions
can be normally occurring in the population or pathological

Question 4

Q

What is scaffolding?

Answer

A

linking together a non-contiguous series of genomic sequences into a scaffold
- bridge gaps between contigs

Question 5

Q

What can paired-end and mate-pair sequencing be used for?

Answer

A

scaffolding (and gap filling)
validation of de novo genomic sequencing
detection of structural variations
two things are needed: a genome and the mate-paired reads

Question 6

Q

What is physical coverage?

Answer

A

average number of times a base is read or spanned by mate-paired reads
mate pairs obtained from long physical inserts are the most effective for scaffolding

Question 7

Q

What is sequence coverage?

Answer

A

average number of times a base is read
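
The difference between sequence coverage and the physical coverage of the previous card can be shown with a little arithmetic. This is a minimal sketch with made-up numbers; it assumes that every read has the same length and that the insert length counts the whole fragment, reads included. The function names are illustrative.

```python
def sequence_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times a base is read."""
    return n_reads * read_len / genome_len

def physical_coverage(n_pairs: int, insert_len: int, genome_len: int) -> float:
    """Average number of times a base is read or spanned by a pair."""
    return n_pairs * insert_len / genome_len

# Made-up example: 1 Mbp genome, 100,000 mate pairs,
# 100-base reads, 3 kbp inserts.
genome_len = 1_000_000
n_pairs = 100_000
seq_cov = sequence_coverage(2 * n_pairs, 100, genome_len)   # two reads per pair
phys_cov = physical_coverage(n_pairs, 3_000, genome_len)
print(seq_cov, phys_cov)  # 20.0 300.0
```

With the same number of reads, longer inserts raise the physical coverage without changing the sequence coverage.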

Question 8

Q

What does paired-end mean?

Answer

A

short fragments (short inserts)

- set of fragments read on both ends

Question 9

Q

Detection of structural variations with mate pair libraries

Answer

A

insert length statistics are used to identify structural variations
“broken” reads are used to identify points of insertion/deletion
- a read should appear broken where a junction occurs
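
One way the insert length statistics could be used is to flag mate pairs whose mapped distance is far from the typical one. The sketch below is an assumption about how that might be done, not a description of any particular tool: it uses the median and the median absolute deviation (MAD) so that the outliers themselves do not distort the estimate, and the threshold `n_mad` is arbitrary.

```python
from statistics import median

def flag_discordant(distances, n_mad=5.0):
    """Return (index, distance, label) for pairs with an unusual mapped distance."""
    med = median(distances)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(d - med) for d in distances])
    flagged = []
    for i, d in enumerate(distances):
        if abs(d - med) > n_mad * mad:
            flagged.append((i, d, "too_long" if d > med else "too_short"))
    return flagged

# Made-up mapped distances for a library with inserts of about 3 kbp.
distances = [2950, 3010, 3000, 2980, 3050, 2990, 3020, 1200, 3005, 5400]
flagged = flag_discordant(distances)
print(flagged)  # [(7, 1200, 'too_short'), (9, 5400, 'too_long')]
```

A distance that is too long is compatible with a deletion in the sequenced sample relative to the sequence the reads were mapped on, and a distance that is too short with an insertion. A single pair is weak evidence, so in practice several concordant pairs would be expected.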

Question 10

Q

Sanger sequencing

Answer

A

1977, Frederick Sanger
- with fluorescent dyes, the first “automatic” DNA sequencers appeared
- quality deteriorates with the length of the fragment
based on creating many subfragments of the region to be sequenced, each in many copies
- all subfragments start at the same position
- each subfragment terminates at a random base with a dideoxy terminator, giving 4 classes (one per base)

Question 11

Q

Variable loci, what are those? What is the difference between polymorphism and mutation?

Answer

A

a variable locus differs between individuals, e.g. it could be found as a C in 60% of the individuals and as a T in 40%
about 1 out of 500 bases
polymorphism -> variant present in at least 1% of the population
mutation -> variant present in less than 1% of the population

Question 12

Q

What are the problems and solutions of whole genome sequencing?

Answer

A

genomes are millions/billions of bases long
Sanger -> reads 500/1000 bases long
possible solutions:
- generate contiguous fragments and sequence them (unfeasible)
- shotgun approach (sequence random fragments)

Question 13

Q

Hierarchical shotgun assembly

Answer

A

BACs (Bacterial Artificial Chromosomes) can host 150 kbp of DNA
transferred into E. coli, they replicate
BAC clones are sequenced independently, either chosen randomly or chosen to obtain minimum overlap
assembling is difficult: finding a Hamiltonian path is NP-hard
Gaps:
- some regions might not be covered
- repeats make association of contigs difficult

Question 14

Q

Poisson distribution

Answer

A

f(v) = (e^(-r) * r^v) / v!
- f(v): expected fraction of bases that are found (covered) v times
- r: redundancy, average coverage
1 - e^(-r): fraction of the genome covered at least once
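
The formula can be evaluated directly to see how quickly the uncovered fraction shrinks as the redundancy grows. A minimal sketch follows; the function names are illustrative, and it assumes reads are placed uniformly at random, which is what the Poisson model implies.

```python
from math import exp, factorial

def poisson_fraction(v: int, r: float) -> float:
    """f(v): expected fraction of bases covered exactly v times at redundancy r."""
    return exp(-r) * r**v / factorial(v)

def fraction_covered(r: float) -> float:
    """1 - e^(-r): expected fraction of the genome covered at least once."""
    return 1.0 - exp(-r)

for r in (1, 2, 5, 10):
    print(f"r={r:>2}  uncovered={poisson_fraction(0, r):.6f}  "
          f"covered={fraction_covered(r):.6f}")
```

Note that f(0) = e^(-r) is exactly the fraction left uncovered, so the two functions are complementary.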

Question 15

Q

Better assembly strategies

Answer

A

from shotgun reads to contigs
Greedy algorithm
- no need to calculate all possible paths
about 1/2 * n^2 possible overlaps are calculated for n reads
Process:
- pairwise alignments of all fragments
- choose two fragments with the largest overlap
- merge
- repeat until only one fragment left
complexity increases as n^2
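
The process above can be sketched as a toy greedy assembler. It is a simplification under stated assumptions: reads are error-free and from one strand, only exact suffix/prefix overlaps of at least `min_len` bases count, and reads contained in the middle of another read are not handled. The names `overlap` and `greedy_assemble` are illustrative.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_len, best_i, best_j = 0, -1, -1
        # Pairwise comparison of all fragments: this is the n^2 step.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: remaining fragments stay separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags

reads = ["CGTAATG", "AGCTTGA", "TGACCGT"]
contigs = greedy_assemble(reads)
print(contigs)  # ['AGCTTGACCGTAATG']
```

The loop stops early when no overlap is left, so the result can be several contigs rather than a single fragment. With repeats, the largest overlap is not always the correct one, which is one reason greedy assembly can go wrong.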

Question 16

Q

Scaffolding

Answer

A

from contigs to scaffolds
mate-pairs:
- pairing the ends can help merge contigs and resolve difficult repeated regions
a complex genomic sequence cannot be completed using sequence overlaps alone
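
A first step from contigs to scaffolds could be to count how many mate pairs link each pair of contigs and keep only the well-supported links. The sketch below assumes each mate pair has already been reduced to the names of the two contigs its reads map on; the input format, the function name and the `min_links` threshold are all hypothetical.

```python
from collections import Counter

def contig_links(pair_hits, min_links=2):
    """Count mate pairs joining two different contigs; keep supported links."""
    links = Counter()
    for c1, c2 in pair_hits:
        if c1 != c2:  # pairs on the same contig carry no scaffolding information
            links[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}

pair_hits = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3"), ("c1", "c2")]
supported = contig_links(pair_hits)
print(supported)  # {('c1', 'c2'): 3}
```

Requiring more than one link is a simple guard against pairs that map to the wrong place, for example inside repeats. Ordering and orienting the linked contigs would need the read positions and strands as well, which this sketch leaves out.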

Question 17

Q

What are the uses of mate pairs?

Answer

A

Scaffolding
- unique paired-out mates are quite useful in closing gaps
Assembly validation [contigs]
- both mates align on the contig and their distance is compatible with the size of the genomic insert
Gap closure

Question 18

Q

Read assembly and Eulerian paths

Answer

A

Graph: reads are vertices and read overlaps are arcs, with their occurrences
- sequences occurring at a higher rate may be repeated more than once
Eulerian path: similar but easier to compute
- De Bruijn graphs: k-mers of length k, one-out-one-in
- if k is long enough, there shouldn't be repeats (each k-mer present only once)

Question 19

Q

De Bruijn graphs practical application

Answer

A

shotgun library sequenced at high coverage (60x)
fastq -> k-mers are counted (k-mers and their frequency)
start from any k-mer and extend one position at a time by looking at the list:
- no repeats -> only one possible extension is present
- k-mers with few counts could be errors
- k-mers with more counts could be repeats (more edges)
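
The counting and extension steps above can be sketched in a few lines. This is a toy version under stated assumptions: reads are plain strings rather than a parsed fastq file, only one strand and only rightward extension are considered, k is at least 2, and k-mers seen fewer than `min_count` times are treated as errors. The function names and the threshold are illustrative.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer in every read."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def extend_right(seed, counts, k, min_count=2):
    """Extend seed one base at a time while exactly one solid k-mer follows."""
    contig = seed
    seen = {seed}
    while True:
        suffix = contig[-(k - 1):]
        candidates = [suffix + base for base in "ACGT"
                      if counts.get(suffix + base, 0) >= min_count]
        if len(candidates) != 1 or candidates[0] in seen:
            break  # dead end, branch (possible repeat) or cycle
        seen.add(candidates[0])
        contig += candidates[0][-1]
    return contig

genome = "AGCTTGACCGTAATG"
reads = [genome] * 3 + ["AGCTTGAG"]  # the last read ends with an error
counts = count_kmers(reads, 4)
contig = extend_right("AGCT", counts, 4)
print(counts["TGAG"], contig)  # 1 AGCTTGACCGTAATG
```

The erroneous k-mer `TGAG` is seen only once, so it is ignored and the extension follows the well-supported path.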

Question 20

Q

De Bruijn graphs and mate pairs

Answer

A

the same fragment could be sequenced from both ends
- inverted (reverse-complement) sequences should be counted as one
best k-mer length is 20/30 bases
genomic sequences -> lots of repeats -> many branches
using mate-pair libraries could help identify the right path
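
Counting inverted sequences as one is often done by reducing each k-mer to a "canonical" form. The sketch below assumes that "inverted" means reverse complement and picks the lexicographically smaller of the two forms as the canonical one, which is a common convention rather than something stated on the card. It also assumes reads contain only A, C, G and T.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def count_canonical_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

forward = "AAGTC"
reverse = reverse_complement(forward)  # the same fragment read from the other end
counts = count_canonical_kmers([forward, reverse], 3)
print(reverse, dict(counts))  # GACTT {'AAG': 2, 'ACT': 2, 'GAC': 2}
```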
