Week 3.1 Data Types in Genomics Flashcards
(12 cards)
Nucleotide Group Codes
12
- R -> G or A (purines)
- Y -> T or C (pyrimidines)
- K -> G or T
- M -> A or C
- S -> G or C
- W -> A or T
- B -> G, T, or C
- D -> G, A, or T
- H -> A, C, or T
- V -> G, C, or A
- N -> Any base
- "-" -> gap
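The group codes above can be sketched as a lookup table. This is a minimal illustration, not a standard library API: the names `GROUP_CODES`, `expand`, and `matches` are hypothetical, and treating a gap as matching no base is an assumption made for the example.

```python
# Nucleotide group codes from the list above, mapped to the bases they stand for.
GROUP_CODES = {
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "ACGT",
}

def expand(symbol):
    """Return the set of bases one symbol can represent (hypothetical helper)."""
    symbol = symbol.upper()
    if symbol == "-":
        return set()          # assumption: a gap matches no base
    if symbol in ("A", "C", "G", "T"):
        return {symbol}       # a plain base stands only for itself
    return set(GROUP_CODES[symbol])

def matches(pattern, sequence):
    """True if every base in `sequence` is allowed by the code at the same position."""
    return len(pattern) == len(sequence) and all(
        base.upper() in expand(code) for code, base in zip(pattern, sequence)
    )

print(sorted(expand("R")))    # the two purines
print(matches("ARY", "AGC"))  # G is a purine, C is a pyrimidine
```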
FASTA file format
2 lines per record (the sequence data may wrap over several lines)
- Sequence Identifier (“>” through end of line)
- Sequence data
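A rough sketch of reading this layout in Python follows. The function name `parse_fasta` and the example records are made up for illustration, and the sketch assumes that sequence data may be wrapped over several lines, which it joins back together.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if identifier is not None:    # finish the previous record
                yield identifier, "".join(chunks)
            identifier, chunks = line[1:], []
        elif identifier is not None:
            chunks.append(line)           # sequence data, possibly wrapped
    if identifier is not None:
        yield identifier, "".join(chunks)

# Hypothetical two-record example; the first sequence is wrapped over two lines.
example = ">seq1 hypothetical example\nATGCRY\nNNAT\n>seq2\nGGCC\n"
records = list(parse_fasta(example))
for name, seq in records:
    print(name, len(seq))
```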
Phred (Q) Score
Examples
- Q10 = 1:10 = 90% accuracy
- Q20 = 1:100 = 99% accuracy
- Q30 = 1:1000 = 99.9% accuracy
Error probability = 10^(-Q/10)
Accuracy = (1 - error probability) * 100
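The two formulas above can be checked against the Q10, Q20, and Q30 examples with a few lines of Python. The function names are illustrative only.

```python
def error_probability(q):
    """Error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def accuracy_percent(q):
    """Accuracy = (1 - error probability) * 100."""
    return (1 - error_probability(q)) * 100

for q in (10, 20, 30):
    print(f"Q{q}: error = {error_probability(q):.4f}, "
          f"accuracy = {accuracy_percent(q):.1f}%")
```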
Phred Quality Score
- prediction of probability of an error in base calling
- Q = -10 * log10(e)
- e = base calling error probability
- scores generally range from 0 to 40
- base calls are more error prone in regions with extreme GC bias, specific sequence patterns, or homopolymers
- bases scoring below Q20 are generally considered unreliable
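Going the other way, from an error probability to a score, might look like the sketch below. The helper names are hypothetical, and the default threshold of 20 simply mirrors the Q20 cutoff mentioned on this card.

```python
import math

def phred_score(e):
    """Q = -10 * log10(e), where e is the base-calling error probability."""
    return -10 * math.log10(e)

def passes_threshold(e, threshold=20):
    """True if the score for error probability `e` reaches the threshold."""
    return phred_score(e) >= threshold

for e in (0.1, 0.001):
    print(e, round(phred_score(e), 1), passes_threshold(e))
```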
Phred
Analyzing process
- calculates several parameters related to peak shape and peak resolution at each base position
- uses these parameters to predict error probabilities, calibrated on sequence traces where the correct sequence was known
- each score (gray bar) corresponds to one colored peak (base)
- No score = N base assignment (peak not good enough for a base call)
Quality Score
FASTQ format
4 lines per sequence read, one field per line
- @Sequence_Identifier
- ATGC Raw Sequencing Data
- + (optionally followed by the sequence ID or a description again), marks the end of the sequence
- Quality values for each base (ASCII)
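The four-line layout above can be read with a short loop. This is a sketch only: `parse_fastq` and the example reads are invented for illustration, and the code assumes every record occupies exactly four lines with no wrapping.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality) tuples from FASTQ-formatted text."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Step through the file four lines at a time; an incomplete tail is ignored.
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality

# Hypothetical reads; the second repeats its identifier after the "+".
example = "@read1\nATGC\n+\nIIII\n@read2\nGGTA\n+read2\n!5?I\n"
records = list(parse_fastq(example))
for name, seq, qual in records:
    print(name, seq, qual)
```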
FASTQ file
- text-based format for storing both the nucleotide sequence and its corresponding quality data
- quality scores are encoded as ASCII characters, so each quality value takes only one byte
- Quality score = ASCII character code - 33
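Decoding a quality string with this rule takes one line per character. The sketch below uses hypothetical names (`PHRED_OFFSET`, `decode_quality`) and an invented four-character quality string.

```python
PHRED_OFFSET = 33  # quality score = ASCII character code - 33

def decode_quality(quality_string):
    """Convert a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]

scores = decode_quality("!5?I")
print(scores)  # [0, 20, 30, 40]
```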
FASTQ file
Quality Score code
lowest to highest (each character is worth 1 point more than the previous one)
- 0-10: !"#$%&'()*+
- 11-20: ,-./012345
- 21-30: 6789:;<=>?
- 31-40: @ABCDEFGHI
- 41-50: JKLMNOPQRS
- 51-60: TUVWXYZ[\]
- 61-70: ^_`abcdefg
- 71-80: hijklmnopq
- 81-93: rstuvwxyz{|}~
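The character for any score can be regenerated from its ASCII code instead of being memorized. The sketch below assumes the same offset of 33 used for FASTQ quality scores, so the lowest character, `!`, has score 0; the names `encode_quality` and `table` are illustrative.

```python
def encode_quality(scores):
    """Convert integer quality scores into a FASTQ quality string."""
    return "".join(chr(q + 33) for q in scores)

# Score -> character for the full printable range, 0 ("!") through 93 ("~").
table = {q: chr(q + 33) for q in range(94)}

print(encode_quality([0, 10, 20, 30, 40]))
print(table[31], table[93])
```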
Non FASTA/FASTQ file formats
3
- SAM / BAM
- VCF
- GFF3, BED, & GenBank
File formats
SAM / BAM
useful when you are aligning sequences to each other, e.g. reads to a reference genome (BAM is the compressed binary version of SAM)
File formats
VCF
used to store information on sequence variants (helpful if you are studying mutations that may cause disease)
File formats
GFF3, BED, and GenBank formats
useful when you want to associate features with specific regions of sequence in a file (e.g. gene annotations)