Week 3.1 Data Types in Genomics Flashcards

(12 cards)

1
Q

Nucleotide Group Codes

12

A
  1. R -> G or A (purines)
  2. Y -> T or C (pyrimidines)
  3. K -> G or T
  4. M -> A or C
  5. S -> G or C
  6. W -> A or T
  7. B -> G, T, or C
  8. D -> G, A, or T
  9. H -> A, C, or T
  10. V -> G, C, or A
  11. N -> Any base
  12. "-" -> gap
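
The codes above can be written as a lookup table. The following is a minimal Python sketch, not part of the original card; the names `IUPAC` and `expand` are hypothetical, and gap characters are simply passed through unchanged.

```python
from itertools import product

# Nucleotide group codes from the card, mapped to the bases they stand for.
# The four plain bases map to themselves so any sequence can be looked up.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "ACGT",
}

def expand(seq):
    """Return every concrete sequence that an ambiguous sequence can stand for."""
    # Characters not in the table (such as the gap "-") are kept as they are.
    choices = [IUPAC.get(base, base) for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]

print(expand("ARY"))  # R is G or A, Y is T or C, so four sequences result
```

Note that the number of expansions grows quickly: every N in the input multiplies the result count by four.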
2
Q

FASTA file format

2 lines

A
  1. Sequence Identifier (“>” through end of line)
  2. Sequence data (one or more lines following the identifier)
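
As an illustration of the two parts, here is a minimal Python sketch of a FASTA reader. It is not from the original card: `parse_fasta` and the example records are hypothetical, and it assumes the common convention that sequence data may wrap across several lines until the next ">" line.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if identifier is not None:
                yield identifier, "".join(chunks)
            identifier, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence data, possibly wrapped
    if identifier is not None:
        yield identifier, "".join(chunks)

# Made-up example data, for illustration only.
example = ">seq1 hypothetical example\nATGCGT\nNNRY\n>seq2\nGGCC\n"
records = list(parse_fasta(example))
print(records)
```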
3
Q

Phred (Q) Score

Examples

A
  • Q10 = 1:10 = 90% accuracy
  • Q20 = 1:100 = 99% accuracy
  • Q30 = 1:1000 = 99.9% accuracy

Error probability = 10^(-Q/10)
Accuracy = (1 - error probability) * 100
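
The two formulas can be checked against the Q10, Q20, and Q30 examples with a short Python sketch. The function names are hypothetical and chosen only for this illustration.

```python
def error_probability(q):
    """Error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def accuracy_percent(q):
    """Accuracy = (1 - error probability) * 100."""
    return (1 - error_probability(q)) * 100

# Print the error probability and accuracy for the three example scores.
for q in (10, 20, 30):
    print(f"Q{q}: error = {error_probability(q):g}, accuracy = {accuracy_percent(q):.1f}%")
```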

4
Q

Phred Quality Score

A
  • a prediction of the probability of an error in base calling
  • Q = -10 * log10(e)
  • e = base-calling error probability
  • scores generally range from 0 to 40
  • base calls are error-prone in regions with extreme GC bias, specific sequence patterns, or homopolymers
  • data with a score below Q20 is generally not considered valuable
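
Going from an error probability to a score can be sketched in Python as follows. This is an illustration only: `phred_score` and `is_valuable` are hypothetical names, and the default threshold of 20 simply mirrors the rule of thumb on this card.

```python
import math

def phred_score(e):
    """Q = -10 * log10(e), where e is the base-calling error probability."""
    return -10 * math.log10(e)

def is_valuable(q, threshold=20):
    """Apply the card's rule of thumb: scores below the threshold are discarded."""
    return q >= threshold

# An error probability of 1 in 100 corresponds to Q20.
q = phred_score(0.01)
print(q, is_valuable(q))
```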
5
Q

Phred

Analyzing process

A
  1. calculates several parameters related to peak shape and peak resolution at each base position
  2. uses these parameters to predict error probabilities, based on data generated from sequence traces where the correct sequence was known
  3. assigns each colored peak (base) a matching score (gray bar)

  • No score = N base assignment (the peak is not good enough for a base call)

6
Q

Quality Score

FASTQ format

A

4 lines per sequence read, each holding a separate field

  1. @Sequence_Identifier
  2. Raw sequence data (A, T, G, C)
  3. "+" optionally followed by the sequence ID again or a description; marks the end of the sequence
  4. Quality values, one ASCII character per base
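
The four-line layout can be read with a minimal Python sketch like the one below. It is not from the original card: `parse_fastq` and the example read are hypothetical, and the sketch assumes each record occupies exactly four non-blank lines.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality_string) from 4-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Step through the lines four at a time; an incomplete trailing record is ignored.
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record #{i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"record #{i // 4 + 1}: need one quality character per base")
        yield header[1:], seq, qual

# Made-up example read, for illustration only.
example = "@read1\nATGC\n+\nIIII\n"
reads = list(parse_fastq(example))
print(reads)
```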
7
Q

FASTQ file

A
  • text-based format for storing both the nucleotide sequence and its corresponding quality data
  • quality scores are encoded as ASCII characters, so each quality value takes only one byte
  • Quality score = ASCII character code - 33
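
Decoding a quality string with this rule can be sketched in Python as below. The function name `decode_quality` and its `offset` parameter are hypothetical; the offset of 33 comes from the formula on this card.

```python
def decode_quality(quality_string, offset=33):
    """Convert each quality character to its score: ASCII character code - 33."""
    return [ord(ch) - offset for ch in quality_string]

# "!" has ASCII code 33, "5" is 53, "?" is 63, and "I" is 73.
print(decode_quality("!5?I"))  # → [0, 20, 30, 40]
```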
8
Q

FASTQ file

Quality Score code

lowest to highest (1 point each)

A
  • 0-9: !"#$%&'()*
  • 10-19: +,-./01234
  • 20-30: 56789:;<=>?
  • 31-40: @ABCDEFGHI
  • 41-50: JKLMNOPQRS
  • 51-60: TUVWXYZ[\]
  • 61-70: ^_`abcdefg
  • 71-80: hijklmnopq
  • 81-93: rstuvwxyz{|}~
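
Under the rule that a quality score equals the ASCII character code minus 33, the characters for any score range can be regenerated rather than memorized. A minimal Python sketch follows; `quality_characters` is a hypothetical helper name, and score 0 is taken to be "!" (ASCII 33).

```python
def quality_characters(low, high, offset=33):
    """Return the characters that encode scores low..high (inclusive)."""
    return "".join(chr(q + offset) for q in range(low, high + 1))

# Print a few score ranges next to their encoding characters.
for low, high in [(0, 9), (10, 19), (20, 30), (31, 40)]:
    print(f"{low}-{high}: {quality_characters(low, high)}")
```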
9
Q

Non FASTA/FASTQ file formats

3

A
  1. SAM / BAM
  2. VCF
  3. GFF3, BED, & GenBank
10
Q

File formats

SAM / BAM

A

used to store sequence alignments (useful when you are aligning sequences with each other)

11
Q

File formats

VCF

A

used to store information on sequence variants (helpful if you are studying mutations that may cause disease)

12
Q

File formats

GFF3, BED, and GenBank formats

A

useful when you want to associate features with specific regions of a sequence in a file (e.g. gene annotations)
