Week 6 (QC and Alignment) Flashcards

1
Q

the ability to resolve a repetitive structure is dependent on the __________ of the molecules in your library

A

length

2
Q

Sanger sequencing produces accurate reads up to ~_______ bp long

A

1000

3
Q

short read sequencers

A

Illumina and Element

4
Q

short-read sequencers produce reads of <____ bp and have a _______ error rate

A

<150 bp; low error rate (≪1%)

5
Q

long read sequencers

A
  • PacBio
  • Oxford Nanopore
6
Q

long-read sequencers produce fewer reads, but the reads are much longer, ____ to _____ kb, with a _______ error rate of ~____%

A

10s to 100s of kb; higher error rate of ~1%

7
Q

what are the standard file formats?

A
  • FASTA
  • FASTQ
8
Q

FASTA has ___ parts

A

2

9
Q

what are the parts of FASTA

A
  1. > sequence name (always has >)
  2. sequence
10
Q

FASTQ has ___ parts

A

4

11
Q

what are the parts of FASTQ

A
  1. @ sequence name (and other info)
  2. sequence
  3. + separator line (sometimes other info)
  4. quality values (one character per base)
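The four lines of a FASTQ record can be pulled apart with a few lines of Python. This is only a minimal sketch, assuming every record occupies exactly four lines; the function name `parse_fastq_record` and the example read are made up for illustration.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into its parts.

    Assumes exactly four lines per record (an assumption of this sketch).
    """
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must be the same length")
    return {
        "name": header[1:],      # text after the '@'
        "sequence": sequence,    # bases, possibly including N
        "separator": separator,  # '+' line, sometimes repeats other info
        "quality": quality,      # one quality character per base
    }


# Hypothetical example record
record = parse_fastq_record(["@read1 sample=demo", "GATTACAN", "+", "IIIIIII#"])
print(record["name"], record["sequence"], record["quality"])
```

The length check reflects the fact that every base has exactly one quality character.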
12
Q

when is FASTA used?

A

when per base quality is not needed

13
Q

what does FASTA present?

A

presents only the sequence itself

14
Q

the FASTA sequence name always starts with ______ symbol

A

>

15
Q

when is FASTQ used?

A

when per base quality is needed

16
Q

what does FASTQ present?

A

presents the sequence and the estimated base quality

17
Q

FASTQ sequence name always starts with ____ symbol

A

@

18
Q

when reading the sequence in FASTQ, what does the letter “N” symbolize?

A

any of the 4 nucleotides: the sequencer could not determine which nucleotide it was, so the more Ns, the worse the quality

19
Q

the _______ the Q value the more accurate the sequence is

A

higher

20
Q

Every 10-point increase in quality score (e.g., Q10 to Q20) improves accuracy by a factor of _____

A

10

21
Q

Qphred equation

A

Qphred = -10 × log10(P(error))
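The Phred formula is easy to check numerically. The sketch below converts in both directions; the function names are illustrative, not taken from any particular library.

```python
import math


def phred_from_error(p_error):
    """Q = -10 * log10(P(error))."""
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Inverse of the formula: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Each 10-point step in Q divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {error_from_phred(q):g}")
```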

22
Q

At any given position in a sequence, the base present is one of A/C/T/G, but we cannot _________ observe the base. The base call produced by a DNA sequencer is an observation based on some biochemical/physical property that has an ________.

A

directly; error

23
Q

Q20 is _____ times more accurate than Q10

A

10

24
Q

when using fluorescence in Illumina sequencing, we notice a change in color distribution with each cycle. How does this affect our accuracy?

A

signal clarity decreases as more cycles are run, so accuracy drops. this occurs because there may have been a failure to cleave the fluorophores from previously incorporated nucleotides

25
Phred-scale quality values are typically ___ or less (e.g., ___, ___, ___)
40 or less (e.g., 30, 20, 10)
26
in a FASTQ file, each Phred-scale quality value is stored as a ______ character, saving a lot of space
single
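To illustrate the single-character encoding, the sketch below decodes a quality string back into Phred scores. It assumes the common "Phred+33" convention (score = ASCII code minus 33), which is not stated in these cards; some older files used an offset of 64. The function name is hypothetical.

```python
def decode_quality(quality_string, offset=33):
    """Convert each quality character to its Phred score.

    offset=33 is the "Phred+33" convention (an assumption of this sketch).
    """
    return [ord(ch) - offset for ch in quality_string]


# Four characters encode four quality values.
print(decode_quality("I?5+"))
```

One character per base keeps the quality line exactly as long as the sequence line.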
27
what is an error that can occur with Element sequencing? does this happen often?
a single-base error; however, this does not happen very often
28
what is FastQC?
  • QC = quality control
  • one of the many software tools used to evaluate sequence quality
29
FastQC does not actually do any filtering; instead it...?
provides summary metrics and visuals
30
In Base Quality - is this good data or poor data?
good data
31
In Base Quality - is this good data or poor data?
poor data
32
in per tile sequence quality, which is the good data and which is the poor data? how do you know?
good data on the left and poor data on the right. in the poor data, the lines show there must have been a problem between cycles 5 and 15 at this location in the flow cell.
33
in these per sequence quality scores, which one has good data and which has poor data? how do you know?
the good data on the left has high sequence quality almost to the end of the reads. the poor data on the right has sequence quality that begins to decrease earlier in the reads.
34
in the per base sequence content, which has good data and which has poor data? how do you know?
the good data on the left has a uniform base composition to the end of the reads. in the poor data on the right, the base composition diverges, indicating problems.
35
error probability: Q10
0.1 (1 in 10)
36
error probability: Q20
0.01 (1 in 100)
37
error probability: Q30
0.001 (1 in 1000)
38
error probability: Q40
0.0001 (1 in 10000)
39
k-mer
a substring of a fixed length that appears in a biological sequence
40
4^(k-mer length)
number of possible k-mers of that length (4 possible nucleotides at each position)
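As a small illustration of k-mers and the 4^k count, the sketch below lists the overlapping k-mers in a short sequence. The function names and the example sequence are made up for illustration.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def possible_kmers(k):
    """Four choices (A, C, G, T) at each of the k positions."""
    return 4 ** k


counts = count_kmers("GATTACA", 3)
print(counts)             # k-mers actually observed in this sequence
print(possible_kmers(3))  # size of the full 3-mer space
```

A sequence of length n contains n - k + 1 overlapping k-mers, which is usually far fewer than the 4^k that are possible.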
41
should the k-mer length (k) be an odd or even integer?
use odd integers (an odd-length k-mer can never be its own reverse complement)
42
sequence alignment
a string-matching problem: matching your sequenced reads to positions in a reference genome
43
two main algorithms for alignment (of short-read sequencing data)
  • Smith-Waterman
  • Burrows-Wheeler transform
44
Smith-Waterman
guaranteed to find the optimal local alignment with respect to the scoring matrices used
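The dynamic-programming idea behind Smith-Waterman can be sketched in a few lines. This version returns only the best local alignment score, not the alignment itself. The match, mismatch, and gap values are illustrative assumptions, and a simple linear gap penalty stands in for a full scoring matrix.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Every cell of the table is filled, so the work grows with the product of the two sequence lengths. That is why this approach is slow for high-throughput data.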
45
what is Smith-Waterman best used for?
good for aligning Sanger-length data; too slow for high-throughput data
46
Burrows-Wheeler transform
creates a suffix array of smaller k-mers to index the reference genome, matches the seed of each read to the reference, and extends the seed to a full alignment
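For intuition, the transform itself and a suffix array can be computed naively on a tiny string. This is only a toy sketch: the sorted-rotations approach and the `$` end marker are textbook conventions assumed here, and real aligners such as BWA use a compressed index (the FM-index) rather than anything this simple.

```python
def suffix_array(text, sentinel="$"):
    """Start positions of all suffixes of text + sentinel, in sorted order."""
    s = text + sentinel
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (tiny inputs only)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))
print(suffix_array("banana"))
```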
47
what are the most used aligners?
BWA (whole-genome sequencing, WGS) and STAR (RNA-seq)
48
what two algorithms for alignment are best used in short-read sequencing data?
  • Smith-Waterman
  • Burrows-Wheeler transform
49
what aligner is used for long-read alignment?
minimap2
50
two main sequence file formats
FASTA and FASTQ
51
high throughput sequence data needs _____
QC
52
what software tool provides summary metrics and visualizations to evaluate quality?
FastQC
53
sequence alignment
  • Smith-Waterman
  • Burrows-Wheeler transform (BWA and STAR)
54
what does "depth" mean when sequencing?
the number of times a given base (position) has been sequenced, i.e., how many reads cover it
55
what is the ability to resolve a repetitive structure dependent on?
the length of the molecules in your library
56
when analyzing a pileup, you notice that one of the positions has equal parts A and C. What does this mean? Is this an error in the sequence?
this is most likely not an error. it shows that the chromosome from one parent carries an A and the chromosome from the other parent carries a C, so the reads represent both chromosomes (a heterozygous site)
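The reasoning about a pileup column can be sketched as a simple allele-fraction check. This is a rough illustration, not a variant caller: the function names and the 30-70% window are assumptions chosen for the example.

```python
from collections import Counter


def allele_fractions(pileup_bases):
    """Fraction of reads supporting each base at one position."""
    counts = Counter(pileup_bases.upper())
    total = sum(counts.values())  # total is the depth at this position
    return {base: n / total for base, n in counts.items()}


def looks_heterozygous(pileup_bases, low=0.3, high=0.7):
    """True if exactly two bases each appear in roughly half of the reads.

    The low/high thresholds are illustrative assumptions.
    """
    fractions = allele_fractions(pileup_bases)
    balanced = [base for base, f in fractions.items() if low <= f <= high]
    return len(balanced) == 2


print(looks_heterozygous("AAAAACCCCC"))  # balanced A/C column
print(looks_heterozygous("AAAAAAAAAC"))  # one stray C among many A reads
```

A sequencing error tends to show up as a rare base in a few reads, while a heterozygous site shows two bases at roughly equal frequency.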