De Novo Assembly Flashcards

1
Q

What are mate pairs (MP)? What types are there?

A
  • sequencing the two ends of genomic fragments
  • usually only reads mapping to unique positions are considered
  • long sequences
  • two reads that are distal to each other in the genome and in the opposite orientation to that of a paired-end fragment
  • Paired-out
    • mates map on 2 different contigs -> merge and assess gap
  • Paired-in
    • mates map on the same contig -> validate assembly, identify structural variations
  • Single
    • one read maps and its mate-pair does not -> close to a gap
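The three categories above can be sketched as a small classifier. This is only an illustration: the function name and the convention of passing `None` for an unmapped read are assumptions, not part of the original notes.

```python
from typing import Optional


def classify_mate_pair(contig_a: Optional[str], contig_b: Optional[str]) -> str:
    """Classify a mate pair by the contigs its two reads map to.

    Each argument is a contig name, or None if that read did not map
    (hypothetical convention chosen for this sketch).
    """
    if contig_a is None and contig_b is None:
        return "unmapped"      # neither read maps: not informative
    if contig_a is None or contig_b is None:
        return "single"        # only one mate maps: likely close to a gap
    if contig_a == contig_b:
        return "paired-in"     # same contig: validate assembly, look for SVs
    return "paired-out"        # different contigs: merge and assess the gap


print(classify_mate_pair("contig_1", "contig_2"))
```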
2
Q

What is a contig?

A
  • series of overlapping DNA sequences
  • merged, they form a contiguous fragment of DNA

3
Q

What are Structural variations (SV)?

A
  • genomic differences involving large segments of DNA
    • deletions
    • insertions
    • inversions
  • can be normally occurring in the population or pathological
4
Q

What is scaffolding?

A
  • linking together a non-contiguous series of genomic sequences into a scaffold
    • bridge gaps between contigs
5
Q

What can paired-end and mate-pair sequencing be used for?

A
  • scaffolding (and gap filling)
  • validation of de novo genomic sequencing
  • detection of structural variations
  • two things needed: a genome and the mate paired reads
6
Q

What is physical coverage?

A
  • average number of times a base is read or spanned by mate paired reads
  • mate pairs obtained from long physical inserts would be most effective for scaffolding
7
Q

What is sequence coverage?

A
  • average number of times a base is read
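To make the difference between sequence coverage and physical coverage concrete, here is a minimal sketch. It assumes the usual approximations (total bases read divided by genome length, and total insert length divided by genome length). The function names and the example numbers are illustrative only.

```python
def sequence_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times a base is read."""
    return n_reads * read_len / genome_len


def physical_coverage(n_pairs: int, insert_len: int, genome_len: int) -> float:
    """Average number of times a base is read or spanned by a mate pair."""
    return n_pairs * insert_len / genome_len


# Hypothetical library: 10,000 mate pairs, 100-base reads, 3,000-base inserts,
# on a 1,000,000-base genome.
genome = 1_000_000
print(sequence_coverage(2 * 10_000, 100, genome))   # two reads per pair
print(physical_coverage(10_000, 3_000, genome))
```

With these numbers the physical coverage is much higher than the sequence coverage, which is why long inserts are effective for scaffolding.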
8
Q

What does paired-end mean?

A
  • short sequence
  • set of fragments read on both ends

9
Q

Detection of structural variations with mate pair libraries

A
  • insert length statistics are used to identify structural variations
  • use of “broken” reads to identify points of insertion/deletion
    • a read should appear broken where the junction occurs
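The insert-length idea can be sketched as a simple threshold test. This assumes the mean and standard deviation of the library insert size are already known and that a cutoff of 3 standard deviations is acceptable. The function name, labels and cutoff are illustrative assumptions, not a description of any specific tool.

```python
def classify_insert(observed: float, expected_mean: float,
                    expected_sd: float, n_sd: float = 3.0) -> str:
    """Compare the mapped distance between two mates with the library insert size.

    Mates mapping much farther apart than expected suggest a deletion in the
    sample relative to the reference; much closer suggests an insertion.
    """
    if observed > expected_mean + n_sd * expected_sd:
        return "deletion candidate"
    if observed < expected_mean - n_sd * expected_sd:
        return "insertion candidate"
    return "concordant"


# Hypothetical library: 3,000-base inserts with a standard deviation of 300.
for distance in (3100, 5000, 1000):
    print(distance, classify_insert(distance, 3000, 300))
```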
10
Q

Sanger sequencing

A
  • 1977, Frederick Sanger
    • fluorescent dyes, first “automatic” DNA sequencers appeared
    • quality deteriorates with length of fragment
  • based on creating many subfragments of the region to be sequenced, each in many copies
    • all start at the same position
    • each terminates at a random base, in one of 4 classes (A, C, G, T), because of a dideoxy terminator
11
Q

What are variable loci? What is the difference between polymorphism and mutation?

A
  • a locus could be found as a C in 60% of individuals and as a T in 40%
  • about 1 out of 500 bases
  • polymorphism -> variant found in at least 1% of the population
  • mutation -> variant found in less than 1% of the population
12
Q

What are the problems and solutions of whole genome sequencing?

A
  • genomes are millions/billions of bases long
  • Sanger -> reads 500/1000 bases long
  • possible solutions:
    • generate contiguous fragments and sequence them (unfeasible)
    • shotgun approach (sequence random fragments)
13
Q

Hierarchical shotgun assembly

A
  • BACs (Bacterial Artificial Chromosomes) can host 150 kbp of DNA
  • transferred into E. coli, they replicate
  • BAC clones are sequenced independently, either randomly or to obtain minimum overlap
  • assembly is difficult (NP-hard): it corresponds to finding a Hamiltonian path
  • Gaps:
    • some regions could not be covered
    • repeats make association of contigs difficult
14
Q

Poisson distribution

A
  • f(v) = (e^(-r) * r^v) / v!
    • f(v): expected frequency with which a base is found v times
    • r: redundancy, i.e. average coverage
  • 1 - e^(-r): fraction of the genome covered at least once
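The formula above can be evaluated directly. The sketch below uses only the Python standard library. The function names are illustrative.

```python
import math


def poisson_frequency(v: int, r: float) -> float:
    """Expected fraction of bases covered exactly v times at average coverage r."""
    return math.exp(-r) * r ** v / math.factorial(v)


def fraction_covered(r: float) -> float:
    """Fraction of the genome covered at least once: 1 - f(0) = 1 - e^(-r)."""
    return 1.0 - math.exp(-r)


for r in (1, 5, 10):
    print(f"coverage {r}x: {fraction_covered(r):.5f} of the genome covered")
```

Even at high redundancy the uncovered fraction e^(-r) is never exactly zero, which is one reason gaps remain after shotgun sequencing.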
15
Q

Better assembly strategies

A
  • from shotgun reads to contigs
  • Greedy algorithm
    • no need to calculate all possible paths
  • about n^2/2 possible overlaps to calculate for n reads
  • Process:
    • pairwise alignments of all fragments
    • choose two fragments with the largest overlap
    • merge
    • repeat until only one fragment left
  • complexity increases as n^2
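The process above can be sketched in a few lines. This is a toy version under strong assumptions: error-free reads, exact suffix-prefix overlaps only, all reads on the same strand, and no handling of reads contained inside other reads. The function names are illustrative.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 1):
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        # Pairwise comparison of all fragments: this is the n^2 step.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining fragments stay separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGGC", "GGCTA", "CTAAC"]))
```

Because the largest overlap is always taken first, repeats can cause wrong merges. That is the price of not exploring all possible paths.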
16
Q

Scaffolding

A
  • from contigs to scaffolds
  • mate-pairs:
    • pairing the ends can help merge contigs and resolve difficult repeated regions
  • a complex genomic sequence cannot be completed using sequence overlaps alone
17
Q

What are the uses of mate pairs?

A
  • Scaffolding
    • unique paired-out mates are quite useful in closing gaps
  • Assembly validation [contigs]
    • both mates align on the same contig, at a distance compatible with the length of the genomic insert
  • Gap closure
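As an illustration of how paired-out mates help size a gap, the sketch below subtracts the parts of the insert that lie on the two contigs from the expected insert length and takes the median over several pairs. The function name, the input layout and the use of a median are assumptions made for this example.

```python
from statistics import median


def estimate_gap(insert_size: int, spans) -> float:
    """Estimate the gap between contig A and contig B from paired-out mates.

    spans holds one (tail_on_a, head_on_b) tuple per mate pair:
      tail_on_a: bases from the start of the mate on A to the end of contig A
      head_on_b: bases from the start of contig B to the end of the mate on B
    A negative result would suggest that the two contigs overlap.
    """
    return median(insert_size - tail - head for tail, head in spans)


# Hypothetical 3,000-base insert library, three pairs bridging the same gap.
print(estimate_gap(3000, [(1200, 1000), (1500, 700), (900, 1250)]))
```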
18
Q

Read assembly and Eulerian paths

A
  • Overlap graph: reads are vertices, read overlaps are arcs, with occurrence counts
    • sequences occurring at a higher rate may be repeats
  • Eulerian path: similar, but easier to compute
    • De Bruijn graphs: k-mers of length k, one-in-one-out
    • if k is long enough, k-mers shouldn’t be repeated (each present only once)
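A De Bruijn graph can be built in one pass over the reads. The sketch below uses the common convention in which each k-mer is an edge joining its (k-1)-base prefix to its (k-1)-base suffix. That convention and the function name are assumptions here, since the notes do not say which formulation is meant.

```python
from collections import defaultdict


def de_bruijn(reads, k: int):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


graph = de_bruijn(["ATGGC", "TGGCT"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

A node with exactly one incoming and one outgoing edge lies on an unambiguous path. Nodes with more edges mark branches, typically caused by repeats or sequencing errors.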
19
Q

De Bruijn graphs practical application

A
  • shotgun library sequenced at high coverage (60x)
  • fastq -> kmers are counted (kmers and frequency)
  • start from any kmer, extend one position at a time looking at list:
    • no repeats -> only one present
    • few counts could be errors
    • more counts could be repeats (more edges)
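The counting and extension steps above can be sketched as follows. The sketch assumes reads are plain strings already parsed from the FASTQ file, k is at least 2, and a fixed minimum count separates likely errors from real k-mers. The names and the threshold are illustrative.

```python
from collections import Counter


def count_kmers(reads, k: int) -> Counter:
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seed: str, counts: Counter, min_count: int = 2) -> str:
    """Extend a seed k-mer one base at a time while the next k-mer is unambiguous."""
    k = len(seed)
    contig, seen = seed, {seed}
    while True:
        suffix = contig[-(k - 1):]
        # k-mers seen fewer than min_count times are treated as likely errors.
        candidates = [suffix + base for base in "ACGT"
                      if counts.get(suffix + base, 0) >= min_count]
        # Stop at a dead end, at a branch (possible repeat), or on a cycle.
        if len(candidates) != 1 or candidates[0] in seen:
            break
        seen.add(candidates[0])
        contig += candidates[0][-1]
    return contig


# Three correct reads plus one read with an error in its last base.
reads = ["ATGGCTA"] * 3 + ["ATGGCTT"]
counts = count_kmers(reads, k=4)
print(extend_right("ATGG", counts))
```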
20
Q

De Bruijn graphs and mate pairs

A
  • same fragment could be sequenced from both ends
    • a sequence and its reverse complement should be counted as one
  • best k-mer length is 20-30 bases
  • genomic sequences -> lots of repeats -> many branches
  • using mate pair libraries could help identify the right path
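One common way to count a k-mer and its reverse complement as a single entry is to store only a "canonical" form, taken here as the alphabetically smaller of the two. This is a sketch of that convention, and the function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


# Both strands of the same fragment end up under the same key.
print(canonical("GCAT"), canonical("ATGC"))
```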