Week 3.1 Data Types in Genomics Flashcards

Question 1

Q

Nucleotide Group Codes

12

Answer

A

R -> G or A (purines)
Y -> T or C (pyrimidines)
K -> G or T
M -> A or C
S -> G or C
W -> A or T
B -> G, T, or C
D -> G, A, or T
H -> A, C, or T
V -> G, C, or A
N -> Any base
"-" -> gap
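
The codes above can be held in a lookup table. This is a minimal sketch, not an official library API: the names `IUPAC_CODES` and `matches` are made up for illustration, and the four plain bases are added as an assumption so that unambiguous letters can be looked up too.

```python
# Group codes from this card, plus the four plain bases (added for convenience).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "ACGT",
}


def matches(code, base):
    """Return True if the concrete base is one of those allowed by the code."""
    # Unknown symbols (including the gap "-") map to an empty set of bases.
    return base.upper() in IUPAC_CODES.get(code.upper(), "")
```

For example, `matches("R", "A")` is true because R covers the purines, while `matches("Y", "G")` is false.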

Question 2

Q

FASTA file format

2 lines

Answer

A

Sequence Identifier (“>” through end of line)
Sequence data
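
A rough sketch of reading records laid out this way. `read_fasta` and the example record are hypothetical, and the sketch assumes the sequence data may wrap over more than one line before the next ">" line.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical record, for illustration only.
example = [">seq1 example record", "ATGCRYN", "GGTA"]
records = list(read_fasta(example))
```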

Question 3

Q

Phred (Q) Score

Examples

Answer

A

Q10 = 1:10 = 90% accuracy
Q20 = 1:100 = 99% accuracy
Q30 = 1:1000 = 99.9% accuracy

Error probability = 10^(-Q/10)
Accuracy = (1 - error probability) * 100
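
The two formulas above, written as a small sketch. The function names are chosen here for illustration only.

```python
def error_probability(q):
    """Error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def accuracy_percent(q):
    """Accuracy = (1 - error probability) * 100."""
    return (1 - error_probability(q)) * 100
```

Calling `accuracy_percent` with 10, 20, and 30 gives approximately 90, 99, and 99.9, matching the examples on this card (up to floating-point rounding).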

Question 4

Q

Phred Quality Score

Answer

A

predicts the probability of an error in base calling
Q = -10 * log10(e)
e = base-calling error probability
scores generally range from 0 to 40
base calls are more error prone in regions with extreme GC bias, specific sequence patterns, or homopolymers
data below Q20 is generally not considered valuable
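
A small sketch of the definition, going from an error probability to a score. `phred_score` is a name picked for this example, and the input check is an added assumption (a probability must be greater than 0 and at most 1).

```python
import math


def phred_score(e):
    """Q = -10 * log10(e), where e is the base-calling error probability."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(e)
```

An error probability of 0.01 (1 in 100) gives a score of about 20, the cutoff mentioned on this card.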

Question 5

Q

Phred

Analyzing process

Answer

A

calculates several parameters related to peak shape and peak resolution at each base position
uses those parameters to predict error probabilities, based on sequence traces where the correct sequence was known
each score (gray bar) corresponds to one colored peak (base)

- No score = N base assignment (peak not good enough for a base call)

Question 6

Q

Quality Score

FASTQ format

Answer

A

4 lines per sequence read, each holding a separate field

@Sequence_Identifier
ATGC Raw Sequencing Data
+ (optionally followed by the sequence ID again or a description); marks the end of the sequence
Quality values for each base (ASCII)
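
A minimal sketch of reading this four-line layout. `read_fastq` and the example read are hypothetical, and the sketch assumes each field sits on exactly one line.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, quality) tuples, four lines per read."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the @ / + layout")
        if len(seq) != len(qual):
            raise ValueError("need one quality character per base")
        yield header[1:], seq, qual


# Hypothetical read, for illustration only.
example = ["@read1", "ATGC", "+", "IIII"]
reads = list(read_fastq(example))
```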

Question 7

Q

FASTQ file

Answer

A

text-based format for storing both the nucleotide sequence and its corresponding quality data
quality scores are encoded as single ASCII characters, so each quality value uses only one byte
Quality score = ASCII character code - 33
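
The decoding rule above as a short sketch. `decode_quality` is an illustrative name, and the offset is a parameter only so the 33 is visible.

```python
def decode_quality(quality_string, offset=33):
    """Quality score = ASCII character code - 33, one score per character."""
    return [ord(ch) - offset for ch in quality_string]


scores = decode_quality("!5?I")  # -> [0, 20, 30, 40]
```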

Question 8

Q

FASTQ file

Quality Score code

lowest to highest (1 point each)

Answer

A

0-9: !"#$%&'()*
10-19: +,-./01234
20-29: 56789:;<=>
30-40: ?@ABCDEFGHI
41-50: JKLMNOPQRS
51-60: TUVWXYZ[\]
61-70: ^_`abcdefg
71-80: hijklmnopq
81-93: rstuvwxyz{|}~
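
The character for any score can be derived rather than memorized by adding 33 to the score and taking the ASCII character with that code. A minimal sketch, with `quality_char` as a made-up name:

```python
def quality_char(q, offset=33):
    """Return the ASCII character that encodes quality score q (0 to 93)."""
    if not 0 <= q <= 93:
        raise ValueError("score must be between 0 and 93")
    return chr(q + offset)


# All 94 characters in order, from score 0 ("!") to score 93 ("~").
all_chars = "".join(quality_char(q) for q in range(94))
```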

Question 9

Q

Non FASTA/FASTQ file formats

3

Answer

A

SAM / BAM
VCF
GFF3, BED, & GenBank

Question 10

Q

File formats

SAM / BAM

Answer

A

used to store the results of aligning sequences with each other (e.g., reads aligned to a reference sequence)

Question 11

Q

File formats

VCF

Answer

A

used to store information on sequence variants (helpful if you are studying mutations that may cause disease)

Question 12

Q

File formats

GFF3, BED, and GenBank formats

Answer

A

useful when you want to associate features with specific regions of sequence in a file (e.g., gene annotations)

(12 cards)