Chaudhuri Flashcards
(87 cards)
What is bioinformatics?
- science of collecting and analysing complex biological data, eg. genetic codes
What does bioinformatics exist at the interface of?
- computing, biology and maths
Why are bioinformatics skills so in demand?
- seq data accum faster than ability to analyse it and even to store it
- skills transferable and necessary
How quickly has cost of sequencing decreased?
- quicker than Moore’s Law (= computational power ≈ x2 every 18 months)
What caused a large decrease in seq cost in 2008?
- Illumina
How is Illumina paired-end sequencing carried out? (overview)
- in library prep, fragments of ≈500bp selected
- bridge amplification results in clusters, each w/ many copies of both strands of fragment
- sequencing reads gen separately using primers complementary to both adaptors
- expect those read pairs to map 500bp apart on opp strands
- 3rd primer used to seq index barcode present in 1 of adaptors to enable identification of sample
What does homologous mean?
- same relation, relative position or structure; for seqs = share common ancestor
If 2 seqs have 12/16 of the same bases what can you say about identity and homologous?
- 75% identity
- NOT 75% homologous –> either homologous or not
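The 12/16 arithmetic can be sketched in a few lines. This is a minimal illustration only: the function name and the two 16-base seqs are made up, and it assumes the seqs are already aligned to the same length.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned columns where both seqs carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned seqs must be the same length")
    if not seq_a:
        raise ValueError("seqs must not be empty")
    # A gap character never counts as a match.
    matches = sum(x == y and x != "-" for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical pair of 16-base seqs that differ at 4 positions (12/16 identical).
a = "ACGTACGTACGTACGT"
b = "ACGAACGTTCGTACCA"
print(percent_identity(a, b))  # 75.0
```

The number is a % identity only. Whether the seqs are homologous is a separate yes/no conclusion about ancestry.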
What does aligning 2 seqs tell you?
- how many changes would be req to get those seqs, under assumption that aligned positions share common origin
What does introducing gaps when aligning seqs allow?
- max no. matches
- represents insertions/deletions = indels (imposs to distinguish between the two)
What is seq alignment used for in bioinformatics?
- identify homologous seqs w/ common ancestor
- assess how similar homologous seqs are to infer evo relationships between groups of seqs
- assemble short reads into contiguous seqs and ultimately seq entire chromosomes/genomes
- map seq reads to reference genome
What is the more likely option when deciding how to align seqs?
- one explained by fewer evolutionary events
How is seq alignment decided?
- scoring system
- matches assigned +ve score
- mismatches/gaps assigned -ve score
- sometimes 1 penalty for opening new gap and 2nd lower penalty for extending it (as 1 bigger gap favoured over several small gaps)
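A rough sketch of how such a scoring system ranks two candidate alignments. The function name and the scores (+1 match, -1 mismatch, -5 for the first position of a gap, -1 for each further position) are assumptions picked for illustration, not values from any particular tool.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned rows ('-' = gap) with separate open/extend gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must be the same length")
    score = 0
    in_gap_a = in_gap_b = False  # are we currently inside a gap in row a / row b?
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score


# Same 6 matches and 2 gap positions, arranged as 1 gap vs 2 gaps.
print(score_alignment("ACGTACGT", "AC--ACGT"))  # 0  (6 - 5 - 1)
print(score_alignment("ACGTACGT", "A-GT-CGT"))  # -4 (6 - 5 - 5)
```

With these assumed penalties the single 2-base gap outscores two 1-base gaps, which is the behaviour the open/extend split is meant to give.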
What are scoring matrices and why are they used?
- for nt alignment mismatches usually all treated the same
- for AAs, scoring matrix used so biochemically conservative AA subs penalised less than subs likely to affect protein structure
- eg. BLOSUM62, PAM70
- constructed empirically by examining freq of each AA sub across large collection of protein alignments
- eg. Ile for Leu scores +ve, as biochemically conservative sub
- eg. Trp v unique, so subs involving it penalised heavily
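A toy lookup to make the Ile/Leu vs Trp point concrete. The six scores below were transcribed by hand as a small subset of BLOSUM62 and should be checked against a published copy before being relied on; the dictionary and function names are made up for this sketch.

```python
# Hand-copied subset of BLOSUM62 (assumption: verify against a published matrix).
# Keys are alphabetically sorted residue pairs, so each pair is stored only once.
BLOSUM62_SUBSET = {
    ("I", "I"): 4,
    ("L", "L"): 4,
    ("W", "W"): 11,
    ("I", "L"): 2,   # conservative sub: still scores +ve
    ("I", "W"): -3,  # sub involving Trp: penalised
    ("L", "W"): -2,
}


def sub_score(aa_x: str, aa_y: str) -> int:
    """Look up the score for aligning two AAs (the matrix is symmetric)."""
    key = tuple(sorted((aa_x, aa_y)))
    return BLOSUM62_SUBSET[key]


print(sub_score("L", "I"))  # 2
print(sub_score("W", "L"))  # -2
print(sub_score("W", "W"))  # 11
```

Keeping Trp scores much higher than keeping Ile or Leu, which reflects how rarely Trp is substituted in the alignments the matrix was built from.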
What is global alignment, and when is it suitable?
- attempt made to align seqs across entire length
- assumes seqs equivalent
- not suitable for aligning full length seq w/ partial seq
- 1st global alignment algorithm proposed by Needleman and Wunsch
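A minimal sketch of global alignment by dynamic programming, in the style of Needleman and Wunsch. It assumes a simple linear gap penalty and made-up scores (+1 match, -1 mismatch, -2 gap); it is not the exact scheme from the original publication.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j], end to end.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Both seqs are forced to align end to end, which is why this approach suits seqs expected to be equivalent over their whole length.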
What is local alignment?
- searches subsequences of full length seq to max alignment score
- 1st algorithm by Smith and Waterman
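A minimal sketch of local alignment in the style of Smith and Waterman. The scores (+2 match, -1 mismatch, -2 gap) and the example seqs are assumptions for illustration. The differences from a global alignment are that cell scores are floored at 0 and the traceback starts at the highest-scoring cell.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Flooring at 0 lets an alignment start afresh anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# The shared subsequence ACGTAC is embedded in unrelated flanking seq.
print(smith_waterman("GGGGACGTACGGGG", "TTACGTACTT"))  # (12, 'ACGTAC', 'ACGTAC')
```

Only the matching region is reported, so this style of alignment copes with a partial seq matched against a full length one.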
How does BLAST work?
- widely used method of searching database to rapidly identify seqs similar to query seqs
- user supplies query seq and BLAST searches for similar seqs
- performs local alignment to identify regions of hit that match query seq
What is the output of BLAST?
- E value = P value normalised to database size and length of query seq
- effectively no. hits expected to be found by chance in this database
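The dependence on database size and query length can be sketched with the relationship commonly quoted for BLAST, E = m × n × 2^(-S'), where S' is the bit score of the hit, m the query length and n the total database length. The function name and numbers below are made up, and real BLAST works with edge-corrected effective lengths, so this will not reproduce its output exactly.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approx no. of hits with at least this bit score expected by chance."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: 100-residue query vs a database of 1,000,000 residues.
e_small_db = expected_hits(30, 100, 1_000_000)
e_big_db = expected_hits(30, 100, 2_000_000)
print(e_small_db, e_big_db)
```

Doubling the database doubles E for the same hit, and each extra bit of score halves it.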
What are the difficulties w/ de novo assembly?
- unknown target
- coverage bias
- sequencing errors
- repeats
- multiple replicons
- contamination
- circular genomes
Why is genome assembly difficult w/ short reads?
- resolving repeats esp hard –> paired end reads can help, but only for repeats smaller than insert size
What is overlap layout consensus (OLC) assembly, and when would it work?
- looks for overlaps between adj reads
- would work well if genomes non repetitive and reads free of sequencing errors
- repeats can result in mis-assembly errors
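The overlap step can be sketched as a suffix-prefix comparison between two reads. This is a naive illustration only (exact matches, no sequencing errors, made-up reads and function names), not how any particular assembler implements it.

```python
def overlap(read_a: str, read_b: str, min_len: int = 3) -> int:
    """Length of longest suffix of read_a equal to a prefix of read_b (0 if < min_len)."""
    longest_possible = min(len(read_a), len(read_b))
    for length in range(longest_possible, min_len - 1, -1):
        if read_a.endswith(read_b[:length]):
            return length
    return 0


def merge(read_a: str, read_b: str, min_len: int = 3) -> str:
    """Join two overlapping reads into 1 contig; raise if they do not overlap."""
    length = overlap(read_a, read_b, min_len)
    if length == 0:
        raise ValueError("reads do not overlap")
    return read_a + read_b[length:]


print(overlap("ACGTTGCA", "TGCAGGTC"))  # 4
print(merge("ACGTTGCA", "TGCAGGTC"))    # ACGTTGCAGGTC
```

If the overlapping bases come from a repeat present at 2 places in the genome, the same test would happily join reads from different loci, which is the mis-assembly risk noted above.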
What are de Bruijn graphs?
- common approach to assembling short reads, to take account of seq errors and repeats in genome
- break read up into Kmers
- K = no. of bases per Kmer, eg. 51 or 99 (usually odd no.)
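A small sketch of breaking reads into Kmers and linking them into a de Bruijn graph. It uses K = 3 and made-up reads purely so the output is readable. It also uses one common construction (nodes are (K-1)-mers, each distinct Kmer is an edge), which is an assumption rather than something the cards specify.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list:
    """All overlapping substrings of length k, in order along the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k: int) -> dict:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some Kmer."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


print(kmers("ACGTC", 3))  # ['ACG', 'CGT', 'GTC']
graph = de_bruijn(["ACGTC", "GTCA"], 3)
print(graph)
```

The Kmer GTC occurs in both reads but adds only one edge, so overlapping reads collapse onto a shared path without any pairwise read comparison.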
What are the advs of de Bruijn graphs?
- reduces mis-assembly errors as allows repeats to be identified
- each Kmer appears in graph once; expect at least 30x coverage for each Kmer, and even more for Kmers from repeat seqs
- Kmers only need to be stored in memory once so less RAM needed
- removal of rare Kmers corrects for seq errors
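The counting and rare-Kmer filtering can be sketched as follows. The reads, K = 4 and the cut-off of 2 are all made up for illustration; real assemblers choose thresholds from the observed coverage.

```python
from collections import Counter


def count_kmers(reads, k: int) -> Counter:
    """Count how many times each Kmer is seen across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare(counts, min_count: int) -> dict:
    """Keep only Kmers seen at least min_count times (rare ones treated as errors)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# 3 correct copies of a read plus 1 copy with a hypothetical error (T -> A at base 4).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, 4)
kept = drop_rare(counts, min_count=2)
print(sorted(kept))  # ['ACGT', 'CGTA', 'GTAC']
```

Each distinct Kmer is stored once with a count, however many reads contain it. The 3 Kmers created by the error are each seen only once and fall below the cut-off.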
How can bubbles be resolved using read pairs in de Bruijn graphs?
- read pairs can provide info which spans repeat seqs, helping resolve order of contigs and close the assembly
- resolving bubbles is 1 of key functions of genome assembly software
- if can’t be resolved, results in break in assembly
- as reads get longer, graph gets simpler