Genomes and Genome Sequencing Flashcards

Question

Steps in an alignment

Answer 1

Find an appropriate reference genome (different versions exist). Find fragment matches on the reference genome.

Answer 2

Base calling | Quality control | Alignment/Mapping | Alignment post-processing

Answer 3

process of determining bases in the sequencing data

Answer 4

Phred score Q value
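As a rough illustration of the Q value: a Phred score relates to the base-calling error probability P by Q = -10 * log10(P). The sketch below assumes the common Phred+33 ASCII encoding for FASTQ quality strings; the function names are illustrative and not taken from any library.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into an error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P into a Phred quality score Q."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (assumed Phred+33) into Q values."""
    return [ord(ch) - offset for ch in qual]
```

For example, Q = 30 corresponds to an error probability of 1 in 1000 (99.9% accuracy per base).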

Answer 5

Mapping = position of the sequence on the reference genome | Alignment = position of the sequence on the reference genome and base-to-base correspondence (whether each base matches or not)

Answer 6

position of the sequence on the reference genome and base-to-base correspondence (whether matches or not)

Answer 7

Variant calling | Methylation studies | RNA-seq expression | Structural variants

Answer 8

Mapping = position of the sequence on the reference genome | Alignment = position of the sequence on the reference genome and base-to-base correspondence (whether each base matches or not)

Answer 9

position of the sequence on the reference genome

Answer 10

Brute-force method (by eye: move along the reference one base pair at a time until the read matches) | Alignment software

Answer 11

by eye: move along the reference one base pair at a time until the read matches

Answer 12

Easy to do, but very slow: requires a lot of repetitive computation, so it is inefficient
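The brute-force idea can be sketched in a few lines: slide the read along the reference one base at a time and compare at every position. This is a minimal sketch for exact matches only; `brute_force_positions` is an illustrative name.

```python
def brute_force_positions(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    hits = []
    # Try every possible start position, one base at a time.
    for start in range(len(reference) - len(read) + 1):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits
```

Every read is compared against every reference position, which is why this approach is simple but does not scale to millions of reads.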

Answer 13

RNA/DNA/bisulphite sequencing

Answer 14

Burrows Wheeler Transform | Suffix Arrays

Answer 15

Works as a replacement for BLAST (BLAST-like methods do not scale well). Trade-off between speed and accuracy (quicker software may be less accurate). Some newer tools use k-mers (mapping only).

Answer 16

Works as a replacement for BLAST (BLAST-like methods do not scale well). Trade-off between speed and accuracy (quicker software may be less accurate). Some newer tools use k-mers (mapping only).

Answer 17

All sequences end with a dollar sign ($). List every suffix with its starting position, then sort the suffixes lexicographically (alphabetically, with $ first). The suffix array is the list of starting positions taken in that sorted order.
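A minimal sketch of this construction, assuming plain Python string comparison (where `$` sorts before the letters A, C, G, T). It is a naive version for illustration; real aligners use much faster construction algorithms.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes, in lexicographic order."""
    if not text.endswith("$"):
        text += "$"  # terminator; sorts before A, C, G, T
    # Sort the positions by the suffix that starts at each one.
    return sorted(range(len(text)), key=lambda i: text[i:])
```

For the sequence `ACAACG`, the sorted suffixes are `$`, `AACG$`, `ACAACG$`, `ACG$`, `CAACG$`, `CG$`, `G$`, so the suffix array is `[6, 2, 0, 3, 1, 4, 5]`.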

Answer 18

Compare the substring (fragment) with the suffix at the midpoint of the sorted list: is it lexicographically higher or lower? If it does not match, cut the list in half and discard the half that cannot contain it. Repeat until the location is found (the fragment matches at that point).
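The halving search can be sketched as two binary searches over a suffix array, one for the first matching suffix and one for the end of the matching block. This is an illustrative sketch that rebuilds the suffix array naively on each call; `find_occurrences` is a hypothetical name.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Find all 0-based positions of pattern in text via a suffix array."""
    if not text.endswith("$"):
        text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1  # pattern is higher: discard the lower half
        else:
            hi = mid      # pattern is lower or equal: discard the upper half
    first = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])
```

Each search halves the candidate range at every step, so the number of comparisons grows with the logarithm of the genome length rather than with the length itself.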

Answer 19

Uses rotations of the sequence. Uses the $ symbol: all sequences end with a dollar sign. List the rotations in positional order, then sort them lexicographically (alphabetically, with $ first). DOES NOT store positional information. Stores only the last column (the last character of each sorted rotation).
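A naive sketch of the transform, building every rotation explicitly. This is fine for a short example but far too memory-hungry for a genome, where the transform is derived from a suffix array instead.

```python
def bwt(text: str) -> str:
    """Return the Burrows-Wheeler transform (last column of sorted rotations)."""
    if not text.endswith("$"):
        text += "$"  # terminator; sorts before A, C, G, T
    # All rotations of the text, sorted lexicographically.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # Keep only the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)
```

For `ACAACG` the result is `GC$AAAC`. Identical characters tend to cluster in the output, which is what makes it compress well.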

Answer 20

More efficient: binary storage (FM index). Can be compressed further. Uses the last-first principle. Makes substring search quicker (too complex to explain).
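The last-first principle can be illustrated by reversing a Burrows-Wheeler transform: the i-th occurrence of a character in the last column is the same character as its i-th occurrence in the first (sorted) column. The sketch below is not an FM index, only the mapping it relies on; a real FM index adds rank tables and sampled positions on top.

```python
def inverse_bwt(last: str) -> str:
    """Rebuild the original text (ending in $) from its BWT using last-first mapping."""
    # Rank of each character among identical characters in the last column.
    counts: dict[str, int] = {}
    ranks = []
    for ch in last:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1

    # Row where each character first appears in the sorted first column.
    first_row: dict[str, int] = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    # Start at row 0 (the rotation beginning with $) and walk backwards.
    row = 0
    reversed_text = []
    for _ in range(len(last) - 1):
        ch = last[row]
        reversed_text.append(ch)
        row = first_row[ch] + ranks[row]  # last-first mapping

    return "".join(reversed(reversed_text)) + "$"
```

The same mapping lets a search step backwards through a pattern one character at a time without storing the rotations themselves.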

Answer 21

Sequence Alignment/Map format. Tab-delimited file (columns). Holds information about the mapping of each read.
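A minimal sketch of reading one alignment line, using the eleven mandatory SAM columns. It ignores header lines and optional tags, and `parse_sam_line` is an illustrative helper rather than part of any library (in practice a library such as pysam would normally be used).

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line: str) -> dict:
    """Split one tab-delimited SAM alignment line into its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    # Numeric columns are stored as text in the file.
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # FLAG is a bit field: 0x4 = read unmapped, 0x10 = reverse strand.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record
```

POS is the 1-based leftmost mapping position and MAPQ is the mapping quality, i.e. the certainty that the read belongs at that location.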

Answer 22

Exact VS Inexact matching | Multi mapping sequences

Answer 23

Reads are compared to look for differences, so some mismatch must be allowed (up to X%, a set limit); this is traded against certainty that the read comes from that location. [Software has a default value, but it is changeable.]

Answer 24

Regions of the reference genome will be identical in more than one place: repetitive regions, gene families (which have similar sequences)

Answer 25

IGV (software): reference genome along the bottom, reads aligned above, with base differences highlighted | Tablet (software): reference at the top, shows all bases, highlights differences

Answer 26

number of reads aligned to that region
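As a small sketch, coverage can be counted per base from the start position and length of each aligned read. This assumes 0-based starts and ungapped alignments; the function name and input layout are illustrative.

```python
def depth_per_base(ref_length: int, alignments: list[tuple[int, int]]) -> list[int]:
    """Count how many reads cover each reference position.

    alignments holds (start, read_length) pairs with 0-based starts.
    """
    depth = [0] * ref_length
    for start, length in alignments:
        # Clip reads that run past the end of the reference.
        for pos in range(start, min(start + length, ref_length)):
            depth[pos] += 1
    return depth
```

Three reads at `(0, 3)`, `(1, 3)` and `(4, 5)` on a 6 bp reference give depths `[1, 2, 2, 1, 1, 1]`.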

Answer 27

differential gene expression | studying the regulome

Answer 28

Number of reads aligned to that region = level of expression

Answer 29

Regulatory regions in the genome. ChIP-seq (Chromatin Immunoprecipitation): studying the sequences where proteins are bound (e.g. transcription factors). BIS-seq: studying methylation.

Answer 30

Chromatin Immunoprecipitation: looks at regions bound by proteins (e.g. transcription factors). Fix protein to DNA. Use an antibody to pull those bits of DNA out. Unfix the DNA. Sequence those bits of DNA.

Answer 31

Methylation of bases. Treat with bisulphite, which converts unmethylated Cs to U. Sequence and compare to the reference genome: any reference C that reads as a T (U is read as T in DNA) is unmethylated.
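The comparison step can be sketched as follows, assuming a bisulphite-treated read that is already aligned to the same strand of the reference without gaps. The function name is illustrative, and real tools also handle the reverse strand and sequencing errors.

```python
def call_methylation(reference: str, read: str) -> dict[int, str]:
    """Classify each reference C by what the bisulphite-treated read shows there."""
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue  # only cytosines are informative
        if read_base == "C":
            calls[pos] = "methylated"    # protected from conversion
        elif read_base == "T":
            calls[pos] = "unmethylated"  # converted C -> U, read as T
    return calls
```

For reference `ACGCTC` and read `ACGTTT`, position 1 is called methylated and positions 3 and 5 unmethylated.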

Answer 32

Detecting single nucleotide polymorphisms (SNPs) or insertions/deletions compared to the reference genome. Work out the biological implications.

Answer 33

Software, e.g. GATK (human), FreeBayes (others). Uses a SAM-format file. Considers the number of reads at a location, the quality of the reads and the certainty of alignment -> probability.

Answer 34

Sequencing error rate (e.g. Illumina is 99.9% accurate). PCR duplication (amplification of an error), usually identified by location. Poor coverage. Polyploidy (differences due to different alleles, not a functional (phenotype) difference). Missing regions of the reference genome.

Answer 35

"Gold standards": sequence a sample with known variants; those variants should be seen in that sample

Answer 36

```
x number of reads out of y total are different
+ read quality
+ mapping probability
+ genotype calculation
+ standards information
```

Answer 37

single individual or multiple individuals | each variant locus independently or as a haplotype

Answer 38

variant is unrelated to everything else

Answer 39

looks for consistency of variants within a haplotype | looks for links between variants (e.g. if there is a change at x, there is always a change at y)

Answer 40

Species speciality, e.g. GATK is best for humans; FreeBayes is better for everything else

Answer 41

Make sure the variant call is certain. Variant quality score (like read quality). Coverage (minimum required number of reads). Fraction of reads with the alternate allele (those with a different base). Base quality of the alternate allele.
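A hedged sketch of how such filters might be combined. The thresholds below (minimum depth 10, alternate-allele fraction 0.2, variant quality 30) are hypothetical defaults chosen for illustration, not recommendations from any tool.

```python
def passes_filters(depth: int, alt_reads: int, variant_quality: float,
                   min_depth: int = 10, min_alt_fraction: float = 0.2,
                   min_quality: float = 30.0) -> bool:
    """Return True if a variant call meets all (hypothetical) thresholds."""
    if depth <= 0 or depth < min_depth:
        return False  # not enough coverage to trust the call
    if variant_quality < min_quality:
        return False  # variant quality score too low
    # Fraction of reads that carry the alternate allele.
    return alt_reads / depth >= min_alt_fraction
```

A site with 40 reads, 18 of them carrying the alternate allele, passes; a site with only 5 reads fails on coverage regardless of the other values.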

Answer 42

vcflib or vcftools; NOT the variant calling software itself

Answer 43

Location in genome: coding or non-coding. Does it alter the protein product (synonymous/non-synonymous)? What sort of sequence is it in, e.g. a transcription factor binding site (non-coding region) or a stop codon (coding region)? Type of impact (e.g. frameshift, INDEL, etc.).

Answer 44

Pieces of genome in genome assembly

Answer 45

piecing two contigs together using scaffolds (with a gap between the two contigs)

Answer 46

FASTA formatted sequence

Answer 47

Common sequences (repetitive, e.g. the word 'the' in a book). Repetitive regions. Gene families/pseudogenes (multiple copies of genes). Sequencing errors. Uneven coverage.

Answer 48

e.g. DNA fragment ~1000bp, first 300bp sequenced (Illumina limit)

Answer 49

e.g. DNA fragment ~1000bp | first 300bp and last 300bp sequenced, with a gap for the middle sequence

Answer 50

Similar to paired-end. Used for scaffolding. Can have a larger middle gap, up to 20kbp.

Answer 51

Similar to paired-end: know that the two sequences (contigs) should be near each other. Used for scaffolding. Can have a larger middle gap, up to 20kbp.

Answer 52

Uses new technology, e.g. PacBio/MinION. Reads up to 2Mbp. Not as accurate. Initial assembly from Illumina reads + long reads for scaffolding.

Answer 53

String Graph | de Bruijn Graph

Answer 54

```
Theory for sequence assembly
Look for overlaps in reads - set a minimum overlap requirement (e.g. 3 base pairs)
Add nodes and edges
Remove redundancy -> graph
```

Answer 55

take sequences and see how they overlap with each other, based on whether they are identical

Answer 56

Idea of nodes joined with edges, e.g. node = known sequence, edge = overlap between sequences (drawn as lines between sequences)

Answer 57

Split the sequence into k-mers (shorter strings of length k, e.g. k = 3 gives 3bp). Look for overlaps between k-mers, with a minimum overlap of k-1 (e.g. 3-1 = 2): atc-tcg-cgt...etc.
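A minimal sketch of the k-mer splitting and the k-1 overlap rule, with k-mers as nodes and an edge wherever the last k-1 bases of one k-mer equal the first k-1 bases of another. The function names are illustrative, and real assemblers use far more compact data structures.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Split a sequence into all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_overlap_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Link k-mers whose suffix of length k-1 matches another k-mer's prefix."""
    # Collapse duplicate k-mers so each one is a single node.
    nodes = sorted({kmer for read in reads for kmer in kmers(read, k)})
    return {a: [b for b in nodes if a[1:] == b[:-1]] for a in nodes}
```

For the read `ATCGA` with k = 3, the k-mers are `ATC`, `TCG`, `CGA`, and the edges run `ATC -> TCG -> CGA`.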

Answer 58

which way to read the graph: atg cat gta (atg is a repeating sequence), so the atg sequences could line up with the same region of the genome

Answer 59

Path that goes through each node of the graph at least once, with minimal length -> rebuilds the genome (contigs). Contigs arise where two regions cannot be joined.

Answer 60

Type of work: single-cell genomics/transcriptomics and metagenomics. Sequencing data: length of reads, Illumina read type (single-end etc.). Species: eukaryotic/prokaryotic.

Answer 61

e.g. Peregrine | e.g. Shasta

Answer 62

e.g. SPAdes (bacterial genomes); A5 (sequencer-specific); ALLPATHS-LG (humans); Canu (long reads)

Answer 63

Number of nodes and edges: a smaller k-mer gives more nodes and edges. Quality vs contiguity (length of contigs) of the data.

Answer 64

Length of the pieces that a DNA sequence is split into for the assembler graph (de Bruijn), e.g. k = 3 gives 3bp sections

Answer 65

```
Assembly quality
Matrix statistics
- number of contigs
- length of assembly (close to length of expected genome - related species)
- is number of genes what expected
- accuracy of assembly
```

Answer 66

Assembly quality. Matrix statistics: number of contigs; length of assembly (close to length of expected genome - related species); is the number of genes what is expected (marker genes); accuracy of assembly (coverage and contamination). Consider heterozygosity (diploid vs haploid).

Answer 67

Contig length x at which 50% of the genome is covered by contigs of size x or larger, e.g. contigs of 20, 16, 12, 10, 8, 5 give N50 = 16 (a higher value is better). Does not take missing regions into account.
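The calculation can be sketched as below, taking 50% of the total assembly length as the threshold (an assumption: the true genome length is usually unknown, so the summed contig length stands in for it).

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the contig length at which the running total reaches half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Add contigs from longest to shortest until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")
```

With contigs of 20, 16, 12, 10, 8 and 5 the total is 71, half is 35.5, and the running total passes that at the second contig (20 + 16 = 36), so N50 = 16.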

Answer 68

Looks for orthologues in related species to show that the expected number of genes is present, e.g. BUSCO. Relies on evolutionary data (prone to error).

Answer 69

Based on GC content and coverage. GC content differs between species (an identifier), and sequencing depth differs between species.
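GC content itself is a simple fraction, sketched here per sequence (for example per contig). The function name is illustrative.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Plotting this value against coverage for each contig is one way to make contigs from a contaminating species stand out as a separate cluster.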

Answer 70

Promoters | Telomeres/centromeres

Answer 71

Look for start and stop codons (ORFs). Compare start/stop locations to a database from another species to try to find orthologues (BLAST). Look at transcriptomic data: this shows what is transcribed.
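The first step, scanning for open reading frames, can be sketched as follows. This is a simplified version that assumes the standard start codon ATG and stop codons TAA, TAG, TGA, and searches only the three forward-strand frames; real gene finders also check the reverse strand and use statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 1) -> list[tuple[int, int]]:
    """Return (start, end) pairs of forward-strand ORFs, end exclusive, stop included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

For `CCATGAAATAGGG` this finds a single ORF at `(2, 11)`, i.e. `ATGAAATAG`.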

(95 cards)