Genomes and Genome Sequencing Flashcards

(95 cards)

1
Q

Applications of studying genomics

A

Research
Health (e.g. diagnostic)
Environment (e.g. pollutants)
Agriculture (e.g. livestock, nutrients)

2
Q

Health example for genomics

A

Identifying causes of severe intellectual disability in children (a genetic cause found in 42% of cases using genomics, compared with 12% using other methods)

3
Q

Disease example for genomics

A

Inflammatory Bowel Disease (Crohn’s disease)
more viral DNA = more viruses
viruses were bacteriophages
they infected gut bacteria and affected gut bacteria population -> Crohn’s disease

4
Q

Disease Outbreak Tracking for genomics (only need one)

A

Ebola - finding point of origin, watching it change over time

HIV - identified known origin, identified species crossovers

Influenza - track current outbreaks of influenza to inform vaccine choices for coming winter in opposite hemisphere/ identify crossover/ crossover potential for strains

5
Q

The third generation of DNA sequencing

A

Longer DNA sequences

6
Q

Sanger Sequencing

A

Chain termination sequencing

Uses ddNTPs (dideoxynucleotides: fluorescently labelled, chain-terminating nucleotides)

7
Q

How does Sanger Sequencing work

A

Polymerase rebuilds the double helix using normal nucleotides until it randomly incorporates a fluorescently labelled ddNTP; the polymerase then stops, so the strand terminates at that point

-> strands of DNA of varying lengths, each ending with a fluorescently labelled base
(repeated as many times as required, so that termination occurs at every base position along the length)

Then separate the fragments by size using capillary gel electrophoresis

Record fluorescence

Each base is a different colour

8
Q

Downsides of Sanger Sequencing

A

Slow

Expensive

Not high throughput

Errors in repetitive regions (lots of bases similar to each other, next to each other)

Bias in sequencing (certain regions better amplified than others)

9
Q

Library Preparation

A

Extract DNA from cells

Fragment DNA (50-1000bp)

Add adaptors (to either end of each fragment): one binds the fragment to the sequencer (e.g. the flow cell), the other is the start point for the seq. reaction

Amplification

10
Q

Issues with Library Preparation

A

Bias in amplification

11
Q

How does Illumina Sequencing work?

A

Fragments are added to the flow cell and bind to it (adapter to flow cell)

Polymerase starts at the top (furthest from the flow cell) and adds fluorescently labelled nucleotides one at a time
After each addition: laser excitation, fluorescence recorded

12
Q

Benefits of Illumina Sequencing

A

Fast

Cheap

High throughput

13
Q

Issues with Illumina Sequencing

A

Repetitive regions

Amplification

Length restrictions

14
Q

Third generation sequencing

A

Removes the read-length restriction

Removes the need for amplification

15
Q

PacBio SMRT

A

uses Single Molecule, Real-time Technology

Zero-mode waveguides

One piece of DNA per well

Polymerase in the well adds fluorescently labelled nucleotides (as in Illumina) to the single piece of DNA

16
Q

PacBio Considerations

A

Higher error rates
No need for amplification
Longer, but not genome-length

17
Q

Oxford Nanopore MinION

A

Very small

Membrane with many pores

Feeds a single length of DNA through a pore; changes in electrical current across the membrane indicate the base, and this signal is read

19
Q

Oxford Nanopore MinION Considerations

A

Does not use fluorescently-labelled nucleotides

Not as accurate as Illumina (99.9%), but close (95%)

Long read (up to 2 million bp)

20
Q

What is the PromethION?

A

48 MinIONs

large amounts of sequencing

21
Q

Single-Cell Sequencing

A

uses Illumina
BUT with diff. lib preparation - single-cell

Each cell is enclosed in a ‘GEM’ (gel bead-in-emulsion): when the gel bead is broken open, all of the cell's contents are labelled with the barcode for that individual GEM

Can say where DNA comes from -> cell types/spatial transcriptomics

22
Q

Challenges to genome projects

A

Sequencing technologies not perfect (e.g. Illumina 99.9% not 100%)

Some DNA harder to seq. than others (e.g. centromere/telomere) - secondary structures

Population representation (variation)

Gaps, errors, lack of variation

Accuracy of assembly

Genomes keep being corrected (diff. versions from same individual)

23
Q

Alignments

A

Reference genome available

Compare and align

24
Q

Assembly

A

Does not have an available reference genome

Assemble reads into a reference genome

Is a BEST REPRESENTATION not exact

25
Steps in an alignment
Find an appropriate reference genome (different versions exist) | Find fragment matches on the reference genome
26
Steps of the alignment analysis
Base calling | Quality control | Alignment/Mapping | Alignment post-processing
27
Base calling
process of determining bases in the sequencing data
28
Quality control
Phred score (Q value)
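As a hedged illustration of what the Phred score means: the standard definition is Q = -10 * log10(P), where P is the probability that the base call is wrong. The function names below are hypothetical, and the ASCII offset of 33 is an assumption (the common Sanger/Illumina 1.8+ FASTQ encoding).

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P into a Phred score: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(character) - offset for character in quality_string]
```

So Q20 corresponds to a 1 in 100 chance of error, and Q30 to 1 in 1000 (99.9% accuracy).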
29
Mapping vs. Alignment
Mapping = position of the sequence on the reference genome | Alignment = position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)
30
Alignment
position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)
31
Options for Alignment Post-Processing
Variant calling | Methylation studies | RNA seq. expression | Structural variants
33
Mapping
position of the sequence on the reference genome
34
Ways to align fragment sequence to a reference genome
Brute Force Method: by eye, move along the reference a base pair at a time until it matches | Alignment Software
35
What is the "Brute Force" method?
by eye, move along reference a base pair at a time until matches
36
Considerations with the "Brute Force" method
Easy to do | Very slow | Requires a lot of repetitive computations (inefficient)
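A minimal Python sketch of the brute force idea, assuming exact matching only. The function name is hypothetical. It slides the read along the reference one base at a time, which is why the method is simple but slow: every position is checked again for every read.

```python
def brute_force_positions(reference, read):
    """Return every 0-based position where the read matches the reference exactly."""
    hits = []
    # Try each possible start position in turn, one base at a time.
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits
```

For example, `brute_force_positions("ACGTACGT", "CGT")` finds the read at two places, which is also a small case of a multi-mapping sequence.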
37
Alignment Software Types
RNA/DNA/bisulphite sequencing
38
Alignment Software Algorithms
Burrows Wheeler Transform | Suffix Arrays
39
Considerations with Alignment Software
Works as replacement for BLAST (BLAST-like methods do not scale well) | Trade-off between speed and accuracy (quicker software may be less accurate) | Some newer tools use kmers (only mapping data)
41
How do you make a Suffix Array?
Add a $ to the end of the seq. | List every suffix in positional order and number it | Sort the suffixes lexicographically (alphabetically with $ first) | The suffix array is the positional information taken in the lexicographical order of the new list
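The construction in this card can be sketched in a few lines of Python. This is a naive illustration (the function name is hypothetical); real aligners build suffix arrays with far more memory-efficient algorithms.

```python
def build_suffix_array(sequence):
    """Return the suffix array of sequence + '$' as a list of start positions."""
    text = sequence + "$"
    # Every suffix paired with its starting position (positional order).
    suffixes = [(text[position:], position) for position in range(len(text))]
    # Lexicographic sort; '$' sorts before A, C, G and T in ASCII.
    suffixes.sort()
    # Keep only the positional information, now in sorted order.
    return [position for _, position in suffixes]
```

For `"GATTACA"` the suffixes sort as `$`, `A$`, `ACA$`, `ATTACA$`, `CA$`, `GATTACA$`, `TACA$`, `TTACA$`, giving the positions `[7, 6, 4, 1, 5, 0, 3, 2]`.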
42
Alignment to a suffix array
Compare the substring (fragment) with the suffix at the middle point of the list (is it higher or lower lexicographically?) | If it does not match, cut the list in half and discard the half that cannot contain it | Repeat until the location is found (matches at that point)
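A hedged sketch of that halving search in Python. The function names are hypothetical, the suffix array is built naively, and only exact matches are handled. It returns one matching position, not all of them.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda position: text[position:])


def find_fragment(reference, fragment):
    """Binary-search the suffix array for one exact occurrence of fragment."""
    text = reference + "$"
    positions = suffix_array(text)
    low, high = 0, len(positions) - 1
    while low <= high:
        middle = (low + high) // 2
        start = positions[middle]
        # Compare the fragment with the start of the middle suffix.
        window = text[start:start + len(fragment)]
        if window == fragment:
            return start
        if fragment < window:
            high = middle - 1  # fragment sorts earlier: discard the upper half
        else:
            low = middle + 1   # fragment sorts later: discard the lower half
    return None
```

Each step halves the remaining list, so the number of comparisons grows with the logarithm of the reference length instead of with the length itself.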
43
How do you make a Burrows Wheeler Transform?
Uses rotations | Uses $ symbol: the seq. ends with a dollar | List all rotations in positional order, then sort them lexicographically (alphabetically with $ first) | DOES NOT store positional information | Stores last column (last character in each line)
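The rotation-based recipe can be sketched as follows. This is a naive illustration with a hypothetical function name; it stores every rotation in memory, which real tools avoid.

```python
def burrows_wheeler_transform(sequence):
    """Return the BWT of sequence + '$' using sorted rotations."""
    text = sequence + "$"
    # All rotations of the text, in positional order.
    rotations = [text[shift:] + text[:shift] for shift in range(len(text))]
    # Sort lexicographically; '$' sorts first.
    rotations.sort()
    # Keep only the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)
```

For `"banana"` this gives `"annb$aa"`: identical letters end up grouped together, which is what makes the output easy to compress.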
44
Considerations with Burrows Wheeler Transform
More efficient: binary storage (FM index) | Compressed further | Uses last-first principle | Makes substring search quicker (too complex to explain)
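As a rough illustration of the last-first principle only (not of a full FM index), the sketch below rebuilds the original sequence from a BWT string. It relies on the fact that the i-th occurrence of a character in the last column is the same character as its i-th occurrence in the first (sorted) column. The function name is hypothetical.

```python
def invert_bwt(bwt):
    """Recover the original sequence from its BWT using last-first mapping."""
    first_column = sorted(bwt)
    # Row in the first column where each character first appears.
    first_row = {}
    for row, character in enumerate(first_column):
        first_row.setdefault(character, row)
    # Rank of each character in the last column (0 for its first occurrence).
    seen = {}
    ranks = []
    for character in bwt:
        ranks.append(seen.get(character, 0))
        seen[character] = seen.get(character, 0) + 1
    # Row 0 starts with '$', so its last character ends the original sequence.
    row = 0
    recovered = []
    while bwt[row] != "$":
        character = bwt[row]
        recovered.append(character)
        # Last-first step: jump to the row that starts with this occurrence.
        row = first_row[character] + ranks[row]
    return "".join(reversed(recovered))
```

The same jump between columns is what lets an FM index extend a match one base at a time without storing the rotations.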
45
SAM format
Sequence Alignment/Map Format | Tab-delimited file (columns) | Information about mapping of the read
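A small sketch of reading one alignment line, assuming the 11 mandatory SAM columns in their standard order. The function name and the example line are made up for illustration; real pipelines would normally use an established SAM/BAM library.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one tab-delimited SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    # Anything after the 11 mandatory columns is an optional tag.
    record["TAGS"] = columns[len(SAM_FIELDS):]
    return record


# Hypothetical example line (not from a real data set).
example = "read1\t0\tchr1\t101\t60\t8M\t*\t0\t0\tGATTACAG\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
```

Here `POS` is the 1-based mapping position, `MAPQ` the mapping quality and `CIGAR` the base-to-base description of the alignment.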
46
Difficulties during alignments
Exact VS Inexact matching | Multi mapping sequences
47
Exact vs Inexact matches
Reads are compared to find differences, so some mismatch must be allowed (up to a set limit of X%) | This is balanced against certainty that the read comes from that location | [Software will have a default value, but it is changeable]
48
Multi mapping sequences
Regions of ref. genome will be identical in more than one place - repetitive regions - gene families (have similar sequences)
49
Alignment visualisation steps
Software IGV: reference genome along the bottom, reads aligned above, with base differences highlighted | Software Tablet: reference at the top, shows all bases, highlights differences
50
depth/coverage (alignments)
number of reads aligned to that region
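A minimal sketch of counting depth per base, assuming each alignment is given as a hypothetical `(start, read_length)` pair with 0-based starts and no gaps in the alignment.

```python
def coverage_per_position(reference_length, alignments):
    """Count how many reads cover each reference position.

    alignments is a list of (start, read_length) tuples (0-based start).
    """
    depth = [0] * reference_length
    for start, read_length in alignments:
        # Clip reads that run past the end of the reference.
        for position in range(start, min(start + read_length, reference_length)):
            depth[position] += 1
    return depth
```

With three reads at `(0, 4)`, `(2, 4)` and `(3, 5)` on a 10 bp reference, position 3 is covered by all three reads and the last two positions by none.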
51
biological reasons for sequence alignments
differential gene expression | studying the regulome
52
Differential gene expression using alignments
Number of reads aligned to that region = level of expression
53
Studying the regulome
Regulatory regions in the genome | ChIP Seq (Chromatin Immunoprecipitation): studying sequences where proteins are bound (e.g. transcription factors) | BIS Seq: studying methylation
54
ChIP Seq
Chromatin Immunoprecipitation | Looks at regions bound by proteins (e.g. transcription factors) | Fix protein to DNA | Use antibody to pull those bits of DNA out | Unfix DNA | Sequence those bits of DNA
55
BIS Seq
Studies methylation of bases | Treat with bisulphite, which converts non-methylated Cs to U | Sequence and compare to ref. genome | Any reference C that is read as a T (U is read as T in DNA) is unmethylated
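A toy sketch of the comparison step, assuming the bisulphite-treated read is already aligned to the reference base for base (same strand, no insertions or deletions). The function name is hypothetical and real methylation callers handle both strands and sequencing errors.

```python
def call_methylation(reference, bisulphite_read):
    """Classify each reference C as methylated or unmethylated."""
    calls = []
    for position, (ref_base, read_base) in enumerate(zip(reference, bisulphite_read)):
        if ref_base != "C":
            continue  # only cytosines are informative
        if read_base == "C":
            calls.append((position, "methylated"))    # protected from conversion
        elif read_base == "T":
            calls.append((position, "unmethylated"))  # converted C -> U, read as T
    return calls
```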
56
Variant Calling
Detecting single nucleotide polymorphisms (SNPs) or insertions/deletions compared to the reference genome | Work out biological implications
57
How to detect variation (variant calling)?
Software e.g. GATK (human), FreeBayes (others) | Uses SAM formatted file | Number of reads at a location | Quality of reads | Certainty of alignment | -> probability
58
Challenges in variant calling
Sequencing error rate (e.g. Illumina 99.9% accurate) | PCR duplications (amplification of an error), usually based on location | Poor coverage | Polyploidy (differences due to different alleles, not a functional (phenotype) difference) | Missing regions of reference genome
59
How does the GATK software overcome variant calling issues?
"Gold standards": sequence a sample with known variants; these variants should be seen in that sample
60
How do variant callers work?
```
x number of reads out of y total are different
+ read quality
+ mapping probability
+ genotype calculation
+ standards information
```
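The "x reads out of y are different" idea can be illustrated with a deliberately simplified Python sketch. This is not how GATK or FreeBayes work internally (they use probabilistic genotype models); the function name, the pileup format and both thresholds are assumptions made for illustration.

```python
from collections import Counter


def call_site(reference_base, pileup, min_base_quality=20, min_alt_fraction=0.2):
    """Toy variant call at one position.

    pileup is a list of (base, phred_quality) pairs for reads covering the site.
    Thresholds are illustrative assumptions, not defaults of any real tool.
    """
    # Ignore low-quality base calls.
    usable = [base for base, quality in pileup if quality >= min_base_quality]
    if not usable:
        return None
    counts = Counter(usable)
    alt_counts = {base: n for base, n in counts.items() if base != reference_base}
    if not alt_counts:
        return None
    # Most common non-reference base and the fraction of reads supporting it.
    alt_base, alt_reads = max(alt_counts.items(), key=lambda item: item[1])
    alt_fraction = alt_reads / len(usable)
    if alt_fraction < min_alt_fraction:
        return None
    return {"ref": reference_base, "alt": alt_base, "alt_reads": alt_reads,
            "depth": len(usable), "alt_fraction": alt_fraction}
```

A site with 6 good reads showing A and 4 good reads showing G would be reported as a candidate A to G variant with an alternate fraction of 0.4.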
61
Different approaches of variant callers depend on...
single individual or multiple individuals | each variant locus independently or as a haplotype
62
variant locus independent approach to variant calling
variant is unrelated to everything else
63
haplotype approach to variant calling
looks for consistency in variants within a haplotype | looks for links between variants (e.g. if change at x always a change at y)
64
how to choose variant calling software
Species speciality, e.g. GATK best for humans | FreeBayes better for everything else
65
Filtering Variants
Make sure that the variant is certain | Variant Quality Score (like read quality) | Coverage (min. req. number of reads) | Fraction of reads showing the alternate allele (those with a diff. base) | Base quality of alternate allele
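A hedged sketch of applying these four filters in Python. The dictionary keys and every threshold value are hypothetical; in practice tools such as vcflib or vcftools apply equivalent filters to VCF files.

```python
def passes_filters(variant, min_quality=30.0, min_depth=10,
                   min_alt_fraction=0.25, min_alt_base_quality=20.0):
    """Return True if a variant passes all four illustrative filters.

    variant is a dict with hypothetical keys:
    quality, depth, alt_reads, alt_base_quality.
    """
    depth = variant["depth"]
    alt_fraction = variant["alt_reads"] / depth if depth else 0.0
    return (variant["quality"] >= min_quality                      # variant quality score
            and depth >= min_depth                                 # coverage
            and alt_fraction >= min_alt_fraction                   # fraction of alternate reads
            and variant["alt_base_quality"] >= min_alt_base_quality)  # alternate base quality
```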
66
Tools for filtering variants
vcflib or vcftools | NOT the variant calling software itself
67
Interpreting filtered variants
Location in genome: coding/non-coding | Does it alter the protein product? (synonymous/non-synonymous) | What sort of seq. is affected, e.g. transcription factor binding site (non-coding region)/stop codon (coding region) | Type of impact (e.g. frameshift/INDEL...etc.)
68
contigs
Contiguous pieces of genome sequence in a genome assembly
69
scaffolds
Pieces two contigs together, spanning the gap between the two contigs
70
Genome Assembly Output
FASTA formatted sequence
71
Challenges to Genome Assembly
Common sequences (repetitive, e.g. the word 'the' in a book) | Repetitive regions | Gene families/pseudogenes (multiple copies of genes) | Sequencing errors | Uneven coverage
72
Single end sequencing data
e.g. DNA fragment ~1000bp, first 300bp sequenced (Illumina limit)
73
Paired-end sequencing data
e.g. DNA fragment ~1000bp | first 300 bp sequenced and last 300 bp sequenced, with gap for middle sequence
74
Mate pair sequencing data
Similar to paired-end | Used for scaffolding | Can have larger middle gap | Up to 20kbp
75
Mate pair sequencing data
Similar to paired-end: know that two seq. (contigs) should be near each other | Used for scaffolding | Can have larger middle gap | Up to 20kbp
76
Long reads sequencing data
Using new tech., e.g. PacBio/MinION | Up to 2Mbp | Not as accurate | Initial assembly with Illumina + long reads for scaffolding
77
Types of Assemblers
String Graph | de Bruijn Graph
78
String Graphs
```
theory for sequence assembly
Look for overlaps in reads - set minimum overlap requirement (e.g. 3 base pairs)
Add nodes and edges
Remove redundancy
-> graph
```
79
Concept of overlaps
take sequences and see how they overlap with each other, based on whether the overlapping bases are identical
80
Concept of graphs
idea of nodes joined with edges | e.g. node = known sequence, edge = overlap in sequences (drawn as lines between sequences)
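Overlaps, nodes and edges can be sketched together in Python. This is a toy version with hypothetical function names: it uses exact suffix-to-prefix overlaps only, compares every pair of reads, and does not perform the redundancy-removal step that a real string graph assembler applies.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right.

    Returns 0 if the overlap is shorter than min_overlap.
    """
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Nodes are reads; an edge (a, b) stores how far a's end overlaps b's start."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                length = overlap_length(a, b, min_overlap)
                if length:
                    edges[(a, b)] = length
    return edges
```

For the reads `ATGCGT`, `GCGTAC` and `TACGGA` the graph has two edges, which chain the reads into `ATGCGTACGGA`.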
81
de Bruijn Graphs
Split sequence into kmers (shorter strings of length k, e.g. k = 3 gives 3bp) | Looks for overlap of kmers, sets minimum overlap of k-1 (e.g. 3-1 = 2), e.g. atc-tcg-cgt...etc
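A minimal Python sketch of this card, with kmers as nodes and an edge wherever the last k-1 bases of one kmer equal the first k-1 bases of another. The function names are hypothetical. Note that many assemblers use the alternative formulation in which (k-1)-mers are the nodes and kmers are the edges.

```python
def kmers(sequence, k):
    """Split a sequence into all overlapping substrings of length k."""
    return [sequence[start:start + k] for start in range(len(sequence) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each kmer to the kmers it overlaps by k-1 bases."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    graph = {}
    for node in sorted(nodes):
        # Edge when the suffix of length k-1 equals the other kmer's prefix.
        graph[node] = sorted(other for other in nodes if node[1:] == other[:-1])
    return graph
```

With `k = 3`, the read `atcgt` gives the chain `atc -> tcg -> cgt`.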
82
de Bruijn Graphs and repeating regions
Ambiguity over which way to read the graph, e.g. atg cat gta (two repeating atg seq.), so the atg seq. could line up with the same region of the genome
83
How do assemblers use graphs?
Find the path that goes through each node of the graph at least once, with minimal length -> rebuilds genome (contigs) | Contigs arise where two regions cannot be joined
84
How to choose an assembler
Type of work: single cell genomes/transcriptomics and metagenomics | Sequencing data: length of reads/Illumina (types: single-end...etc.) | Species: eukaryotic/prokaryotic
85
Long read data assemblers
e.g. Peregrine | e.g. Shasta
86
Examples of Assembler Software
e.g. SPAdes (bacterial genome) | A5 (sequencer-specific) | ALLPATHS-LG (humans) | Canu (long reads)
87
Importance of kmer length
Affects the number of nodes and edges | Smaller kmer = more nodes and edges | Trade-off between quality and contiguity (length of contigs) of the assembly
88
What is a kmer?
Short piece of length k that a DNA sequence is split into for the assembler graph (de Bruijn), e.g. k = 3 gives 3 bp sections
89
How to determine best kmer length?
```
Assembly quality
Metrics/statistics
- number of contigs
- length of assembly (close to length of expected genome - related species)
- is the number of genes what is expected
- accuracy of assembly
```
90
Assembly quality determination
Assembly quality metrics/statistics | Number of contigs | Length of assembly (close to length of expected genome, judged from related species) | Is the number of genes what is expected (marker genes) | Accuracy of assembly (coverage and contamination) | Consider heterozygosity (diploid vs haploid)
91
What is the N50?
Contig size x at which 50% of the assembly is covered by contigs of x size or larger | e.g. contig lengths 20 16 12 10 8 5: N50 = 16 | Higher value is better | Does not take into account missing regions
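The N50 calculation can be sketched in Python as below (the function name is hypothetical). It sorts contigs from longest to shortest and returns the length of the contig at which the running total first reaches half of the total assembly length.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_total:
            return length
    return 0
```

For the card's example the total is 71, so half is 35.5; 20 + 16 = 36 passes that point, which gives N50 = 16.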
92
Presence of marker genes...
Looks for orthologues in related species | Shows whether the expected number of genes is present | BUSCO: relies on evolutionary data (prone to error)
93
Coverage and contamination...
Based on GC content and coverage | GC content differs between species (identifier) & sequencing depth differs for diff. species
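GC content itself is simple to compute. A minimal sketch with a hypothetical function name, assuming the sequence contains only unambiguous bases:

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_bases = sequence.count("G") + sequence.count("C")
    return gc_bases / len(sequence)
```

Contigs whose GC content or coverage sits far from the rest of the assembly are candidates for contamination from another species.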
94
Assembly annotation
Promoters | Telomeres/centromeres
95
Levels of genome annotation
Look for start and stop codons (ORF) | Compare start/stop locations to a database for another species to try and find orthologues (BLAST) | Look at transcriptomic data: shows that this is transcribed
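The first annotation level, looking for start and stop codons, can be sketched as a simple ORF scan. This toy version (hypothetical function name) checks the three forward reading frames only, uses ATG as the start codon and the standard stop codons, and ignores the reverse strand and introns.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end, orf_sequence) for ORFs on the forward strand.

    start is 0-based and end is exclusive; min_codons counts the start and
    stop codons as well.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for index in range(frame, len(sequence) - 2, 3):
            codon = sequence[index:index + 3]
            if start is None and codon == "ATG":
                start = index  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = index + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, sequence[start:end]))
                start = None  # look for the next ORF in this frame
    return orfs
```

In `CCATGAAATAGGG` the only ORF is `ATGAAATAG` in the third reading frame.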