Week 7 (Variant Calling) Flashcards
after aligning reads to the reference genome, we will want to identify ___________
variants
what is a variant?
differences between our sample and the reference
what are the types of variants we are exploring in this class?
- SNPs
- Indels
SNPs
single nucleotide polymorphisms
Indels
insertions/deletions (small <50 bp)
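A minimal sketch of how the two variant types could be told apart, assuming each variant is given as a reference allele string and an alternate allele string (the way a VCF record stores them). The function name `classify_variant` and the `max_indel_len` parameter are hypothetical, chosen only for illustration.

```python
def classify_variant(ref: str, alt: str, max_indel_len: int = 50) -> str:
    """Label a variant as 'SNP', 'indel', or 'other' from its REF/ALT alleles.

    Assumption: alleles are plain nucleotide strings, and "small" means a
    length change under max_indel_len bp, matching the <50 bp cutoff above.
    """
    # One base swapped for a different base: single nucleotide polymorphism.
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    # A length difference means bases were inserted or deleted.
    size = abs(len(ref) - len(alt))
    if 0 < size < max_indel_len:
        return "indel"
    # Anything else (e.g. larger structural changes) is outside these cards.
    return "other"


print(classify_variant("A", "G"))     # single base change
print(classify_variant("A", "ATTC"))  # 3 bp insertion
print(classify_variant("GCT", "G"))   # 2 bp deletion
```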
unaligned sequence data file formats
- FASTA
- FASTQ
Aligned sequence data file formats
- SAM
- BAM
- BAI
- CRAM
SAM
sequence alignment map
BAM
binary (compressed) version of SAM
BAI
index for BAM
variant calls (SNPs and Indels)
- VCF
- BCF
VCF
variant call format
BCF
binary (compressed) version of VCF
mandatory fields of SAM
- what [4]
- where [5]
- how good (or bad) [2]
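The 4 + 5 + 2 counts add up to the 11 mandatory tab-separated columns of a SAM alignment line. The sketch below splits one line into those columns. The field names follow the SAM specification, but the `SamRecord` type, the `parse_sam_line` helper, and the example read are hypothetical. The what/where/how-good labels in the comments are one plausible reading of the grouping, not something these cards confirm.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # what (assumed): read name
    flag: int    # what (assumed): bitwise flag describing the read
    rname: str   # where (assumed): reference sequence name
    pos: int     # where (assumed): 1-based leftmost mapping position
    mapq: int    # how good (assumed): mapping quality
    cigar: str   # how good (assumed): alignment operations
    rnext: str   # where (assumed): reference name of the mate
    pnext: int   # where (assumed): position of the mate
    tlen: int    # where (assumed): observed template length
    seq: str     # what (assumed): read sequence
    qual: str    # what (assumed): per-base quality string


def parse_sam_line(line: str) -> SamRecord:
    """Split one SAM alignment line into its 11 mandatory fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    # Optional tags (column 12 onward) are ignored in this sketch.
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


example = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
record = parse_sam_line(example)
print(record.rname, record.pos, record.mapq)
```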
aligners output a _____ file that is then compressed to a ______ file
SAM; BAM
what is MAPQ?
MAPQ (mapping quality) is a field in the SAM file that tells you how confident the aligner is that the read was mapped to the correct place in the reference. the lower the MAPQ, the less reliable the mapping.
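In the SAM specification, MAPQ is Phred-scaled: MAPQ = -10 * log10(probability that the mapping position is wrong). A small sketch of the conversion follows. The helper name `mapq_to_error_prob` is hypothetical, and how a given aligner actually assigns MAPQ values varies.

```python
def mapq_to_error_prob(mapq: int) -> float:
    """Convert a Phred-scaled MAPQ to the probability the mapping is wrong.

    Note: the SAM specification reserves 255 to mean "MAPQ not available",
    so that value should be handled separately rather than converted.
    """
    if mapq == 255:
        raise ValueError("MAPQ 255 means the mapping quality is unavailable")
    return 10 ** (-mapq / 10)


for q in (0, 10, 20, 60):
    print(f"MAPQ {q}: P(wrong position) = {mapq_to_error_prob(q):.6f}")
```

Each 10-point drop in MAPQ corresponds to a tenfold increase in the chance that the read is misplaced.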
what parts of the genome often have low MAPQ?
repetitive regions and low-complexity (simple sequence) regions
do we prefer systematic errors or random errors? why?
- we prefer random error
- random errors occur unpredictably and can be overcome (with enough coverage they tend to average out). systematic errors are difficult to detect and can bias all of the data.
________ errors are preferred. ________ errors cause issues with downstream analysis.
random; systematic
what happens to your data if errors occur during PCR?
the error is propagated through all of your downstream analysis
two types of duplicate reads
- PCR
- optical
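what is an optical duplicate read?
a duplicate that is created on the sequencer rather than during library prep. on older (non-patterned) flow cells it comes from imaging, when one cluster is read as two; on patterned flow cells it comes from ExAmp chemistry. more prevalent on patterned flow cells.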
how are duplicates identified?
based on the starting positions of the reads: reads that map to the same start position are treated as duplicates of one another
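A toy sketch of position-based duplicate marking, assuming each read has already been aligned and reduced to a tuple of name, chromosome, start position, strand, and mean base quality. The `mark_duplicates` function, the tuple layout, and the rule of keeping the highest-quality read in each group are assumptions made for illustration; real tools apply more detailed criteria.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    reads: iterable of (name, chrom, start, strand, mean_quality) tuples.
    Reads sharing chromosome, start position, and strand form one group;
    the best-quality read is kept and the rest are flagged (assumed rule).
    """
    groups = defaultdict(list)
    for name, chrom, start, strand, quality in reads:
        groups[(chrom, start, strand)].append((quality, name))

    duplicates = set()
    for members in groups.values():
        members.sort(reverse=True)  # highest quality first
        duplicates.update(name for _, name in members[1:])
    return duplicates


reads = [
    ("r1", "chr1", 100, "+", 30.0),
    ("r2", "chr1", 100, "+", 35.0),  # same start and strand as r1
    ("r3", "chr1", 100, "-", 30.0),  # same start, opposite strand
    ("r4", "chr1", 250, "+", 20.0),
]
print(sorted(mark_duplicates(reads)))
```

Here only `r1` is flagged, because `r2` shares its start position and strand and has the higher quality.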
what is a downside to using the patterned flow cell?
you end up with far more optical duplicates