Flashcards in Chaudhuri Deck (130)
What 1st gen sequencing platforms are there?
- Maxam Gilbert (but no longer used)
- Sanger dideoxy seq
Why did Sanger become dominant in the field?
- indiv, long (approx 1kb), high quality reads
What 2nd gen sequencing platforms are there?
- Illumina (most popular)
- also Helicos, SOLiD, 454, IonTorrent (still used but infreq)
What 3rd gen sequencing platforms are there?
- PacBio
- Oxford Nanopore
- NABsys (no longer used)
- more to come?
What do PacBio and Oxford Nanopore have in common?
- both these are single mol seq, fewer reads than Illumina but v long (over 10kb), individual reads have quite high error rate
What key properties of sequencing reads determine the approp applications?
- no. of reads (related to data output and running costs) and the read length
Can you get good read lengths and read depth?
- originally was trade off between these, eg. Illumina has many short reads, vs Sanger produced few long reads
- now there are technologies which can produce many long reads (>10kb), eg. Nanopore ProMethION and PacBio Sequel II
What is the read length for Sanger?
- up to 1kb
What is the read length for Illumina?
What is the read length for PacBio?
- up to 50kb
What is the read length for Oxford Nanopore?
- can be >2Mb (theoretically unlimited)
How many reads can Sanger prod per run?
- 1 (some machines up to 96)
How many reads can Illumina prod per run?
- millions (MiSeq), billions (HiSeq)
How many reads can PacBio prod per run?
- approx 500k (Sequel)
How many reads can Oxford Nanopore prod per run?
- up to 1 mil (MinION)
What is the accuracy of Sanger?
- highly accurate (>99.9%)
What is the accuracy of Illumina?
- highly accurate (>99.9%)
What is the accuracy of PacBio?
- raw reads approx 85% accurate, can be improved to >99.8% w/ CCS (circular consensus sequencing)
What is the accuracy of Oxford Nanopore?
- raw reads now approx 95% accurate
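A rough illustration of what the accuracy figures above mean for a single read. This is a sketch that assumes errors are independent and spread evenly along the read (a simplification); the function name and the example read lengths are illustrative, not from the deck.

```python
def expected_errors(read_length_bp: int, accuracy: float) -> float:
    """Expected number of miscalled bases in one read.

    Assumes every base has the same, independent chance of being wrong.
    """
    return read_length_bp * (1.0 - accuracy)


# Illustrative 10 kb PacBio read: raw (approx 85%) vs CCS (>99.8%)
print(expected_errors(10_000, 0.85))
print(expected_errors(10_000, 0.998))
# Illustrative 1 kb Sanger read at 99.9%
print(expected_errors(1_000, 0.999))
```

So a long raw read can carry far more errors in total than a short accurate one, which is why consensus approaches matter for single mol seq.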
What are the applications of Sanger?
- PCR products
What are the applications of Illumina?
- draft genome seqs (w/ gaps)
- resequencing and variant detection
- functional genomics (RNA-seq, ChIP-seq)
What are the applications of PacBio?
- complete genome sequencing (ie. finished genomes, as reads are longer than repeats)
- detection of DNA meth (ie. base mods, epigenetics)
What are the applications of Oxford Nanopore?
- complete genome sequencing
- direct RNA-seq
How does Illumina work?
- cut gDNA into 200-600bp fragments
- add adapters (know seq of and can use to amplify fragments of genome which don't know seq of)
- DNA fragments which bind adapters are made ss
- adapters able to bind to oligos on flow cell surface
- unlabelled nt bases and DNA pol added to lengthen and join DNA seqs
- adapter seq at other end binds another type of oligo on surface and creates ‘bridges’ of ds DNA on flowcell surface (by seqs folding over and hybridising to oligos)
- in situ PCR = bridge amplification
--> amplify original DNA to form small clusters of DNA w/ same seq
- dsDNA bridges broken down to ssDNA w/ heat
- primers and fluorescently labelled bases added to flowcell
- primer binds DNA being seq and allows DNA pol to bind
- DNA pol adds bases to DNA
- lasers used to activate fluorescent label and camera detects this fluorescence
- each base gives off diff colour
What are the clusters in illumina flow cells and how are they distrib?
- each cluster derived from single initial mol and corresponds to separate read
- clusters distrib randomly on flow cell surface
What limits the max no. of reads it is poss to obtain from a single run of Illumina?
- density of clusters determines total yield of reads, but if adj clusters too close seq cannot be resolved (ie. want them as close as poss whilst still being able to resolve)
What has helped improve the no. of reads can obtain from a single Illumina run?
- software improvements have allowed increased cluster density
- also technical improvements --> eg. higher res cameras in machines, which means can have smaller and closer clusters
How have patterned flow cells allowed increased cluster density?
- instead of flat surface, flow cell covered w/ tiny nanowells
- primers for bridge amp only present w/in each well, so get single cluster gen in each well from a single starting mol
- amp rapid, so well fills up, preventing other mols from entering
- know exact position of each well, so cluster can be identified unambiguously
- cluster cannot spread outside well, so no overlapping clusters, means clusters can be packed tightly
What recent devs have there been for Illumina?
- Illumina X ten released in 2015
- new system targeted exclusively at seq human genomes
- machines cost $1 mil and have to buy 10
What is the significance of $1000 human genome, and what achieved this goal?
- long standing goal of genomics, as it is the point at which it becomes feasible to offer genome sequencing as a routine service in healthcare
- Illumina X ten can achieve this, inc consumables, labour and depreciation
What other systems are available for patterned flow cells?
- HiSeq X Five and HiSeq 4000
What is the main problem w/ Illumina, and what is this due to?
- reads are limited in length as the quality of base calls reduces later in read, resulting in more errors
- due to problems of phasing
What is phasing?
- by chance a random base will not incorporate into 1 of the reads, then this read will be lagging behind by 1 base, so start to get a mixed signal
- this can happen repeatedly and as this develops become less confident in the colour of the cluster, as more mixed
What is pre-phasing?
- early incorp of bases
- essentially the opp problem to phasing
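A minimal sketch of why phasing and pre-phasing limit read length. It assumes each molecule in a cluster independently lags with prob `p_lag` or runs ahead with prob `p_lead` in every cycle, and ignores molecules that drift and later catch up. The probabilities used are made-up illustrative values, not Illumina specs.

```python
def in_phase_fraction(cycles: int, p_lag: float, p_lead: float = 0.0) -> float:
    """Fraction of molecules in a cluster still in sync after `cycles` cycles.

    p_lag  = per-cycle chance a molecule fails to incorporate (phasing)
    p_lead = per-cycle chance a molecule incorporates early (pre-phasing)
    """
    return (1.0 - p_lag - p_lead) ** cycles


# Illustrative values only: signal purity falls steadily as the read gets longer
for n in (50, 100, 300):
    print(n, round(in_phase_fraction(n, p_lag=0.005, p_lead=0.002), 3))
```

The in-sync fraction shrinks geometrically w/ cycle number, so the cluster colour becomes more mixed later in the read.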
How can the problem of phasing be solved?
- often reads will be trimmed to remove low quality seq prior to analysis (gets v low quality after around 100bp)
- recent software improvements on Illumina MiSeq allowed dynamic correction of phasing problems, increasing read length to 300bp
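Trimming low quality seq can be sketched as below. This is a deliberately simple rule (cut at the first base under a quality threshold); real trimming tools typically use sliding windows or other schemes, and the threshold of 20 is an assumed example value.

```python
from typing import Sequence


def trim_read(seq: str, quals: Sequence[int], min_q: int = 20) -> str:
    """Cut the read at the first base whose quality score is below min_q.

    `quals` holds one per-base quality score per base of `seq`.
    """
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i]  # keep only the bases before the first low-quality call
    return seq  # no low-quality base found, keep the whole read


print(trim_read("ACGTACGT", [30, 30, 30, 30, 12, 30, 30, 30]))  # prints ACGT
```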
Why can corrections to solve problems of phasing not be used for the HiSeq?
- correction computationally intensive, so can only be used on a small scale
How can 2 colour Illumina seq be carried out?
- 4 bases sequenced using only 2 colours, rather than 1 for each base
- allows simpler optics in machine, therefore lower costs
- T = green, C = red, A = green and red, G = no colour
- used in NextSeq 500 and MiniSeq
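The 2 colour scheme above amounts to a lookup from (green, red) signal to base. A minimal sketch, assuming each cluster's signal has already been reduced to on/off per channel (the function and variable names are illustrative):

```python
# (green detected, red detected) -> base, following the colour scheme in the card above
TWO_COLOUR = {
    (True, False): "T",   # green only
    (False, True): "C",   # red only
    (True, True): "A",    # green and red
    (False, False): "G",  # no colour
}


def call_bases(signals):
    """Turn a list of (green, red) on/off pairs, one per cycle, into a base string."""
    return "".join(TWO_COLOUR[(bool(g), bool(r))] for g, r in signals)


print(call_bases([(1, 0), (0, 1), (1, 1), (0, 0)]))  # prints TCAG
```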
What is PacBio RSII and its apps?
- single molecule real time sequencing (SMRT)
- can gen v long reads (>10kb)
- apps inc finishing small genomes, microbial epigenetics, targeted seqs
How has PacBio advanced?
- PacBio Sequel allowed more reads at lower cost
- now Sequel II released
- becoming more practical to get such long reads at a lower cost, this is mainly done by improvements in optics and chemistry
Can the read length from Oxford Nanopore be improved?
- already at theoretical max --> limit is length of DNA mol, w/ some reads reported of >2Mb
What Oxford Nanopore developments have happened?
- highly portable MinIONs now commercially available, w/ a few early publications
- scale of sequencing has been improved by the release of the larger PromethION and GridION systems
What is the PromethION system?
- essentially lots of MinIONs together (25) --> potential to prod lots of high quality long reads
What is the GridION system?
- 5 MinION flow cells at a time
What is VolTRAX?
- automated Oxford Nanopore library prep, goal is to take any biological sample (eg. blood, bacterial culture), deposit straight on machine, will extract DNA suitable to pass straight onto MinION sequencer
In 1983 at the dawn of bacterial genomics what was known in the field?
- only 2.3 x 10^6 bp had been seq, less than the size of most bacterial genomes
- largest genome sequenced was phage lambda, around 50,000 bp
- aimed to next seq bacterial genomes
- and eventually humans
What was the 1st bacterial genome sequencing project to be initiated, and how was this done?
- E. coli K-12
- sequencing ordered clones based on a genetic map
Why was E. coli K-12 not the 1st bacterium to be sequenced, despite being the 1st to be initiated?
- method was slow and laborious
What were the 1st bacterial genomes to be sequenced, and how was this done?
- Venter adopted a shotgun sequencing approach and sequenced Haemophilus influenzae and Mycoplasma genitalium in 1995 (both have small genomes, <2Mb)
What does shotgun sequencing rely on?
- computational assembly of seq from random clone libs
How is whole genome shotgun sequencing carried out?
- take (usually) circular bacterial chromosome and use sonication/enzymatic methods to randomly shear DNA into small fragments
- size select, so all about same size
- clone indiv fragments into plasmid vectors
- pick colonies to create shotgun lib
- plasmid preps
- seq each insert w/ 2 primers (using Sanger)
- get whole chunks of genome which are representative, but also gaps
- PCR over gap regions, to fill them in (can be most time consuming part)
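Why random fragments leave gaps can be sketched as follows. This toy treats the genome as linear (the real chromosome is usually circular) and assumes the position of each read is already known, whereas in real shotgun seq positions are inferred by assembling overlapping reads. The coordinates are made up.

```python
def find_gaps(genome_len, reads):
    """Return uncovered regions as (start, end) pairs.

    reads: iterable of (start, end) half-open intervals on a linear genome.
    """
    gaps = []
    covered_to = 0  # everything before this coordinate is covered or already reported
    for start, end in sorted(reads):
        if start > covered_to:
            gaps.append((covered_to, start))  # nothing covers this stretch
        covered_to = max(covered_to, end)
    if covered_to < genome_len:
        gaps.append((covered_to, genome_len))  # uncovered tail
    return gaps


print(find_gaps(100, [(0, 30), (20, 50), (60, 90)]))  # prints [(50, 60), (90, 100)]
```

Each reported gap is a region that would need filling in, eg. by PCR across it.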
What did Kimelman et al (2012) do in the field of bacterial genomics?
- reinvestigated Sanger seq data of bacterial genomes
- tested hypothesis that assembly gaps correspond to seqs toxic to E. coli
- identified many compounds toxic to E. coli
- found novel toxins and restriction enzs, and new classes of small noncoding RNAs that reproducibly inhibit E. coli growth
- suggests new modes of antimicrobial intervention
How was E. coli K-12 originally iso?
- from convalescent diphtheria patient in 1922
Why was E. coli K-12 the obvious candidate for the 1st bacterial genome sequencing project?
- standard E. coli for lab studies, as grows rapidly
What is the significance of regions of low GC content in E. coli K-12?
- may have been acquired by horizontal transfer
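Regions of low GC content can be picked out w/ a sliding window. A minimal sketch: the window size, step and threshold are assumed example values, and low GC alone is only a hint of horizontal transfer, not proof.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def low_gc_windows(seq, window, step, threshold):
    """Return (start, gc) for each window whose GC fraction is below threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc = gc_content(seq[start:start + window])
        if gc < threshold:
            hits.append((start, gc))
    return hits


# Toy sequence: GC-rich first half, AT-rich second half
print(low_gc_windows("GCGCGCATATAT", window=6, step=6, threshold=0.4))  # prints [(6, 0.0)]
```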
How freq is it for E. coli K-12 to acquire genes by horizontal transfer?
- paper found 755 candidates
- at least 234 horizontal transfer events since diverged from Salmonella
- these genes tend to cluster together, can be acquired together
As well as a lab model organism, what else is E. coli?
- normal component of gut flora of humans and animals
- also a wide range of pathogenic strains
What was the 2nd E. coli genome to be seq, and what is the significance of this strain?
- E. coli O157:H7
- emergent human pathogen assoc w/ haemorrhagic colitis and haemolytic uraemic syndrome (HUS), which can lead to kidney failure and sometimes can be fatal
How did the genome size of E. coli O157:H7 compare to K-12?
- genome approx 5.5Mb, around 1Mb bigger than K-12
What was found when comparing E. coli K-12 and O157:H7?
- presence of O- and K- islands
- O-islands are regions of O157:H7 which do not have comparable seq in K-12, so likely derived from horizontal transfer
- also presence of K- islands, which was surprising
What was the 3rd E. coli genome to be sequenced, and what is this bacterium responsible for?
- strain of uropathogenic E. coli (UPEC)
- eg. of extraintestinal E. coli (ExPEC), assoc w/ UTIs
- ExPEC can be harmless when in intestines but become pathogens when invade urinary tract, blood or CSF
- UPEC strains responsible for 70-90% of 7 mil cases of acute cystitis and 250,000 cases of pyelonephritis reported annually in US
How did UPEC CFT073 genome compare to O157:H7?
- similar in size
- 3 way comparison found extra seqs diff from those in O157:H7
What were the results of estimating the size of the E. coli core genome?
- carried out by taking 1 strain at a time and identifying seqs present in all strains looked at so far
- around 2200 genes
What were the results of estimating the size of the E. coli pangenome?
- look at no. of unique genes, on av find approx 300 new genes in each (seq 1st and all unique, then less in second etc. and this continues until get to around 300 new genes)
- so effectively infinite in size, no matter how many strains seq, will continue to find new genes
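The core genome and pangenome estimates described above can be sketched w/ set operations, adding 1 strain at a time. The gene names and strain contents below are hypothetical, and each genome is assumed to be already reduced to a set of gene identifiers.

```python
def core_and_pan(strains):
    """strains: list of sets of gene names, in the order the strains were sequenced.

    Returns one (core size, pangenome size, new genes) tuple per strain added.
    """
    core, pan = None, set()
    history = []
    for genes in strains:
        new = genes - pan          # genes not seen in any earlier strain
        pan |= genes               # pangenome = union of all genes seen so far
        core = set(genes) if core is None else core & genes  # core = present in all
        history.append((len(core), len(pan), len(new)))
    return history


toy = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "e"}]
print(core_and_pan(toy))  # prints [(3, 3, 3), (2, 4, 1), (1, 5, 1)]
```

The core shrinks and the pangenome grows as strains are added; an open pangenome is one where the "new genes" column never reaches 0.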
What is the relevance of draft bacterial genomes?
- short read Illumina seq has dramatically increased no. of available bacterial genomes
- however, finishing (filling in gaps) and annotation remain laborious processes, so most genomes are left as drafts
What do most gaps in draft genomes correspond to?
- repeats, eg. IS elements (transposable elements), rRNA operons
How can gaps in genomes be filled in more efficiently?
- contigs can be reordered relative to complete reference genome, making assumption that genome structure is conserved
- automated annotation pipelines, eg. Prokka, are increasingly used
How has the no. of bacterial genomes in GenBank changed over the years?
- complete genomes are increasing
- but draft genomes increasing much more rapidly
Why might the no. of complete genomes soon increase much more rapidly?
- due to increased long read sequencing techs
- may soon be poss in automated manner, eg. Oxford Nanopore MinION and PacBio Sequel
What can you do w/ lots of bacterial genomes?
- GWAS studies to find causal genetic factors underlying important phenotypes
- eg. identified vitamin B5 biosynthesis as a key host specificity factor in Campylobacter (common cause of gastroenteritis) t/ GWAS
--> always present in cattle derived strains, but not chickens (may be due to diffs in adaption to host diet)
How are E. coli and Shigella distinguished?
- on basis of motility, metabolic profile and clinical manifestation
How does the motility of E. coli and Shigella differ?
- E. coli = motile
- Shigella = non-motile
Are E. coli and Shigella pathogenic?
- E. coli = usually commensal
- Shigella = obligate pathogens
What did serotyping studies show about differentiating between bacterial strains?
- demonstrated that the O-antigen, H (flagellar) antigen and the K (capsular) antigen are useful for distinguishing between strains
What is serotyping, and what Abs are used?
- typing based on immune recognition of cell surface antigens
- bacteria of same serotype cross-react to the same Abs
- for E. coli the O, H (and sometimes K) antigens are used for serotyping
What was an early molecular study of diversity?
- Milkman (1973) began quantitative study of E. coli pop genetics by measuring electrophoretic mobility of enzs derived from diff E. coli strains (MLEE)
What is multi-locus enzyme electrophoresis (MLEE)?
- involves assessing the electrophoretic mobility of a series of purified enzs
- produces quantitative mol data which can be used to understand the evolutionary relationships between strains
Does serotyping correlate well w/ MLEE results for genetic diversity?
- no, genetically similar strains can have diff serotypes and distantly related strains can share the same serotype
What is the ECOR collection?
- set of phylogenetically diverse E. coli strains chosen based on MLEE data
- selected to represent full diversity of species, but doesn't fully represent pathotypes, as mostly commensal
- 5 phylogenetic groups: A, B1, B2, D, E
What did comparison of E. coli and Shigella gene seqs show?
- Shigella has arisen on multiple occasions from E. coli, inc some of the specific biochemical properties
= convergent evolution
Apart from convergent evo what else affects genes in bacterial strains, and how?
- indiv strains can pick up copy of gene from diff strain, and will therefore have diff evolutionary history to rest of genome
- looked at diff genes in diff strains to see how related they were
- if eg. take K12 and RM191F then they were closely related for most genes, but for another gene were quite divergent, suggesting recombination has occurred in this gene
What is multi-locus sequence typing (MLST)?
- an alt to MLEE
- involves amplification and sequencing of regions of around 400bp (convenient amount to PCR amplify) of 7/8 housekeeping genes distrib around genome
- housekeeping genes as thought less likely to be affected by recomb as so important (strong purifying selection)
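How MLST turns seqs into a strain type can be sketched as below. By the usual convention each distinct seq at a locus gets an allele number, and the combination of allele numbers is the strain's profile. This toy uses 2 loci and made-up seqs rather than 7/8 real housekeeping genes, and numbers alleles in order of first appearance rather than from a curated database.

```python
def allele_profiles(strains):
    """strains: {strain_name: [seq_at_locus_1, seq_at_locus_2, ...]}

    Returns {strain_name: tuple of allele numbers}, numbering each distinct
    sequence at a locus 1, 2, 3... in order of first appearance.
    """
    n_loci = len(next(iter(strains.values())))
    allele_ids = [dict() for _ in range(n_loci)]  # one seq -> number table per locus
    profiles = {}
    for name, seqs in strains.items():
        profile = []
        for locus, seq in enumerate(seqs):
            ids = allele_ids[locus]
            if seq not in ids:
                ids[seq] = len(ids) + 1  # new allele at this locus
            profile.append(ids[seq])
        profiles[name] = tuple(profile)
    return profiles


toy = {"s1": ["AAA", "CCC"], "s2": ["AAA", "CCG"], "s3": ["AAA", "CCC"]}
print(allele_profiles(toy))  # prints {'s1': (1, 1), 's2': (1, 2), 's3': (1, 1)}
```

Strains w/ identical profiles (s1 and s3 here) would be given the same type.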
What evidence is there for convergent evo of bacterial strains?
- carried out MLST
- chose pathogenic and nonpathogenic strains
- EHEC is pathogenic and contains a toxin, 2 strains have acquired this separately
- EPEC are similar, but cause less serious disease, but also have 2 indep groups, have acquired genes necessary for virulence on 2 sep occasions
What is the significance of core genome phylogenetics?
- now can take core genome and understand relationships between bacteria
- but do sometimes recombine, so can be diffs in phylogeny when look at indiv genes, but get correct phylogeny when take all of the core genome as a whole
Why is 16s rRNA used to study microbial diversity?
- found in all bacteria
- has important function in cell so tends to evolve slowly as constantly being used
What parts of 16s rRNA vary and what is the significance of these regions?
- 9 V loops
- conserved regions which don't change much so can design primers to them, spanning one of the variable regions, amplify up variable region and seq it
- V loops seem to be in right amount of diversity that can use them to look at diffs, but can still align them
How was 16s rRNA profiling carried out?
- universal primers used to amp V regions of 16s rRNA gene, which are variable
- seq using Sanger, Illumina or PacBio
- seqs comp w/ rRNA databases such as Greengenes to identify taxa
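Amplifying a variable region from primers placed in the flanking conserved regions can be sketched as a toy in silico PCR. The primer and template seqs below are made up (not real 16s primers), and only exact matches are considered, whereas real primers tolerate some mismatch.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def amplicon(template, fwd_primer, rev_primer):
    """Return the region from fwd_primer to the rev_primer binding site, or None.

    rev_primer is given 5'->3' as it would be ordered, so its binding site on
    the template strand is its reverse complement.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None  # forward primer does not bind
    rev_site = revcomp(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None  # reverse primer does not bind downstream
    return template[start:end + len(rev_site)]


# Hypothetical template: conserved flank + variable region (TTTTTT) + conserved flank
print(amplicon("GGGACGTACTTTTTTGATCCAGGG", "ACGTAC", "TGGATC"))  # prints ACGTACTTTTTTGATCCA
```

A `None` result corresponds to the limitation that primers are not completely universal.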
What were the limitations of 16s rRNA profiling?
- primers not completely universal, so may not work in some organisms
- contamination, amplify up their 16s genes instead, even low levels are a problem as amplified
- sequencing errors can result in overestimation of diversity of organisms present
- some organisms have multiple copies of 16s rRNA gene, vary in seq and can result in overestimation of no. of taxa present
- PCR bias may result in incorrect quantification of species
What did Woese's phylogenetic tree of life show?
- showed archaea have separate kingdom (thought to be part of bacteria) and actually more closely related to euks
Why do we have a skewed model of the microbial world?
- most of what's known about bacteria, is from species that can be grown in the lab and studied
- up to 99% of bacteria cannot be cultured in the lab, these species referred to as ‘microbial dark matter’
- big problem in understanding microbial diversity
What approach has been used to explore microbial dark matter?
What did a study into MOs in the Sargasso Sea involve and what was found?
- environmental whole genome shotgun sequencing
- seq all samples well to get deep coverage
- identified 1.2 mil unknown genes and found many novel species
What is an important env to study in terms of metagenomics?
What did studying the microbiome of humans find?
- revealed links between health conditions that weren’t thought to be assoc w/ bacteria, but there was a diff in microbial diversity between people w/ and w/o the disease → inc cardiovascular disease, obesity, IBD
How does metagenomics differ from genomics?
- DNA extracted from env sample
How is metagenomics carried out?
- DNA extracted, fragmented and sequenced, on eg. Illumina
- deep sequencing req, as lots of organisms present and otherwise will just see those present at high abundance
- indiv seq reads can be identified using software such as Kraken
- alt, de novo assembly can be used to piece together larger contigs
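Identifying indiv reads can be sketched w/ a toy k-mer match against reference seqs. This is only in the spirit of k-mer based classifiers and is not how Kraken is actually implemented; the reference seqs, taxon names and k = 4 are made up for illustration.

```python
def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, references, k=4):
    """Assign a read to the reference sharing the most k-mers, or None if none match.

    references: {taxon_name: reference_sequence}
    """
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & kmers(ref, k)) for name, ref in references.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


refs = {"taxonA": "ACGTACGTAC", "taxonB": "TTTTGGGGCC"}  # hypothetical references
print(classify_read("GTACGT", refs))  # prints taxonA
print(classify_read("CCCCCC", refs))  # prints None
```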
What is the problem with assembly of metagenomic data, and how can this be reduced?
- assembly is complex, computationally intensive and error prone
- helped by long read techs, eg. PacBio and Oxford Nanopore --> Oxford Nanopore PromethION prod enough data to make long read metagenomics poss
How was it attempted to carry out a proper control for metagenomic studies?
- seq w/o any original DNA
- saw lots of species, due to reagents and kits used to extract DNA, ie. was a base level of contamination of microbial DNA
- known as the ‘kit-ome’
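Using the no-DNA control can be sketched as a crude filter: drop any taxon that also shows up in the blank. The taxon names and read counts are hypothetical, and simply discarding shared taxa is an assumed simplification (a genuine sample organism could also be a kit contaminant).

```python
def remove_kitome(sample_counts, blank_counts):
    """Keep only taxa that were not detected in the no-DNA control.

    Both arguments map taxon name -> read count.
    """
    return {taxon: n for taxon, n in sample_counts.items() if taxon not in blank_counts}


sample = {"taxon_a": 120, "taxon_b": 15, "taxon_c": 3}  # hypothetical env sample
blank = {"taxon_b": 9}                                   # hypothetical reagent-only control
print(remove_kitome(sample, blank))  # prints {'taxon_a': 120, 'taxon_c': 3}
```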
What happened as a result of a paper seq the biome of the NY subway?
- evidence for anthrax and bubonic plague
- but several later papers debunked this
- was no gene which caused the plague or anthrax present
- original paper seq strains and looked at database for those most closely related, but the genes assoc w/ these diseases weren’t actually present
How is single cell genomics carried out?
- indiv cells iso by eg. laser microdissection, micropipetting, optical tweezers, cell sorting (FACS)
- the single copy of the genome is PCR amp, seq and assembled
What is the issue w/ single cell genomics?
- amp is challenging and assembled genomes will often have patchy coverage
What is iChip and how is it carried out?
- microfluidic tech
- samples diluted so on av 1 bacterial cell ends up in each chamber in the iChip
- chambers filled w/ molten agar and covered w/ semi perm mem
- iChip placed back in native env
- allows colony to grow from single cell, which provides enough material for DNA seq
- method to culture but w/in native env, and can still carry out sequencing
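What "on av 1 cell per chamber" means in practice can be sketched w/ a Poisson model. This assumes cells land in chambers independently and at random, which is an assumption of this sketch rather than something stated in the deck.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a chamber receives exactly k cells when the average is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# With an average of 1 cell per chamber:
for k in range(4):
    print(k, round(poisson_pmf(k, 1.0), 3))
```

Under this model only about 37% of chambers hold exactly 1 cell at a mean of 1; a similar share are empty and the rest hold 2 or more.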
What article has been published w/ iChip?
- claiming a new antibiotic found that kills pathogen w/o detectable resistance = teixobactin
- v effective against gram +ve bacteria
What was the 1st pathogenic E. coli to be characterised?
- EPEC (enteropathogenic E. coli)
How is EHEC pathogenic?
- encodes type III secretion system which it uses to inject host cells w/ effector molecules, to allow it to attach to the gut wall
- subverts host cellular machinery
What emergent pathogen is an eg. of EHEC?
What are the characteristics of O157:H7?
- similar to EPECs, but also encode a shiga toxin, which is assoc w/ the most severe disease (HUS)
What are the symptoms of EAEC infections, and how?
- aggregate in a characteristic “stacked-brick” conformation to aid survival in gut
- assoc w/ mild self-limiting diarrhoea
Why was the E. coli outbreak in Germany 2011 unusual?
- usually affects children and elderly (weaker IS), but this affected otherwise healthy adults and in particular young women and was unusually severe
How was the source of the 2011 Germany E. coli outbreak found?
- tracked diets of infected t/ filling in forms
- infected cucumbers 1st implicated --> salads often infected
- spread beyond Germany --> t/ people travelling, export etc.
- claimed it was Spanish cucumbers --> had huge economic impact, as Spanish cucumber crops weren’t bought
- but no evidence of E. coli in Spanish cucumbers
- eventually narrowed down to bean sprouts
At the time of the 2011 E. coli outbreak what was the most popular seq tech and why?
- ion torrent
- at the time Illumina took 2 weeks and this was o/n
- published data from ion torrent
How did sequencing play an important role in identifying the source of the 2011 E. coli outbreak?
- race to study indiv sequence reads and piece them together to make complete seq of lethal E. coli and find out why this outbreak was so severe
- crowdsourced analysis
--> v quickly got genome seq data (the 1st time it was poss to do this during an outbreak)
What role did phylogenetic analysis play in the 2011 E. coli outbreak?
- phylogenetic analysis found the outbreak seq is EAEC (not EHEC) --> found as almost identical to complete genome seq of an existing E. coli based on core genome analysis
What seq was done after the 2011 E. coli outbreak?
- PacBio seq of O104:H4 strain
- comp seq of TY2482 w/ 55989
What were the notable features of the 2011 E. coli outbreak?
- high incidence in adults
- greatly increased incidence of haemolytic uraemic syndrome (HUS) relative to other EHEC outbreaks (25% vs. 15%)
- particularly high prevalence in women, inc most of HUS cases
- strains iso from cases exhibited a rare serotype (O104:H4)
- recognition of outbreak strains was hampered by inapprop use of diagnostic tests focused on O157:H7
- characterisation of the causative agent t/ whole genome sequencing was performed whilst outbreak still in progress
What seq would we now do if there was a severe E. coli outbreak?
- rapid genome seq w/ eg. MinION
What are the advantages of Oxford Nanopore over PacBio and how was this put to use?
- much smaller, despite prod similar data
- v rapid
- rapid draft sequencing carried out for hospital outbreak of Salmonella
--> gen data in less than half a day and could thus distinguish outbreak from sporadic cases
Why was portable sequencing needed during the Ebola outbreak?
- sequencing was needed in the field but limited infrastructure in many of the outbreak locations
- so MinIONs were used (also needed PCR machines, reagents etc.)
What was the importance of sequencing Ebola in the field?
- started from 1 initial virus but diverged into diff lineages as spread --> quickly gens mutations t/ error prone rep
- so useful to know what particular lineage people had, and therefore where they got it from, so could find others at risk and quarantine them to minimise spread
What were the challenges of sequencing in the field?
- intermittent electricity supply so have to back everything up w/ UPS, to stop sequencing run failing
- poor internet connection
What could be seen through real time analysis of Ebola virus?
- can see spread and how virus evolved
- also map of genome and see how diff regions vary over the course of the outbreak
What did MinION become the 1st tech to do?
- to seq DNA in space
What is the SmidgION?
- plugs into phone, input sample and can seq it
How can ultra-long nanopore reads be achieved?
- mainly to do w/ how extract DNA, as need to keep v long fragments intact so they can be sequenced
- dev of protocols to extract v high mol weight DNA
If recomb between E. coli is less freq, then what happens?
- get periodic selection
- allows neutral variant to be fixed w/in pop by ‘hitch-hiking’ w/ genetically beneficial mutation
- forces pop t/ bottleneck and reduces diversity at that locus
How was PacBio seq carried out after 2011 E. coli outbreak?
- seq 11 related EAEC strains for comparative analysis
- av of 2700 bp reads, w/ some much longer
- combined w/ CCS to get 99.9% accuracy
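Why repeated passes over the same mol raise accuracy can be sketched as a per-position majority vote. This assumes the passes are already aligned, of equal length, and differ only by substitutions; real CCS has to deal w/ insertions and deletions too, so this is an illustration rather than the actual PacBio method.

```python
from collections import Counter


def consensus(passes):
    """Majority base at each position across aligned, equal-length passes."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Three hypothetical passes over the same molecule, one with a single error
print(consensus(["ACGT", "ACCT", "ACGT"]))  # prints ACGT
```

A random error in 1 pass is outvoted by the other passes, provided the errors fall at different positions.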
How was rapid sequencing carried out for a hospital outbreak of Salmonella?
- initially investigated w/ Illumina MiSeq, then w/ MinION
- both yielded reliable and actionable clinical info in less than half a day
- for Illumina used new draft sequencing protocol to reduce time, inc by reducing read length and cycle time --> enough coverage to conclude all part of same outbreak
- w/ MinION could unambiguously identify strain in under 30 mins
What is now poss w/ NGS, demonstrated by seq of Ebola epidemic?
- can gen genomic data directly from diagnostic patient samples
How does IonTorrent work?
- seq DNA w/ semi conductor chip w/ millions of wells
- DNA cut into millions of fragments
- each fragment attaches to a bead, and copied until covers bead
- beads washed over chip and deposited into well
- chip flooded w/ 1 nt at a time, if incorp then H+ released and alters pH, which can be detected
- this is repeated w/ diff nts to get seq
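The flow-by-flow readout above can be sketched for a single well. Here `new_strand` is the seq being synthesised, the signal for each flow is taken as the no. of bases incorporated (more H+ when the same base repeats), and the flow order "TACG" is an assumed example, not a stated IonTorrent setting.

```python
def flow_signals(new_strand, flow_order="TACG", n_flows=8):
    """Return (nucleotide, bases incorporated) for each flow in one well."""
    pos = 0
    signals = []
    for i in range(n_flows):
        nt = flow_order[i % len(flow_order)]  # nucleotide flooded over the chip this flow
        count = 0
        # the polymerase keeps incorporating while the next base matches the flooded nt
        while pos < len(new_strand) and new_strand[pos] == nt:
            count += 1
            pos += 1
        signals.append((nt, count))
    return signals


print(flow_signals("TTACG", n_flows=4))  # prints [('T', 2), ('A', 1), ('C', 1), ('G', 1)]
```

Flows w/ a count of 0 give no pH change, and reading off the non-zero flows in order recovers the seq.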
How does PacBio work?
- utilises power of DNA pol
- diff fluorescent labels added to each base on terminal phosphate, so the pol cleaves label as part of rep process, leaving a natural DNA strand
- label then visualised in zero mode waveguide (ZMW) chamber