Week 6 (QC and Alignment) Flashcards by Emma Sellers

the ability to resolve a repetitive structure is dependent on the __________ of the molecules in your library

length

Sanger sequencing is accurate for reads up to ~_______ bp

1000

short read sequencers

Illumina and Element

short read sequencers produce reads of <____ bp and have a _______ error rate

<150 bp; low error rate (<<1%)

long read sequencers

PacBio
Oxford Nanopore

long read sequencers produce fewer reads, but the reads are much longer, ____ to _____ kb, with a _______ error rate of ~____%

10s to 100s of kb; higher error rate of ~1%

what are the standard file formats?

FASTA
FASTQ

FASTA has ___ parts

2

what are the parts of FASTA

> sequence name (always has >)
sequence
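
The two-part layout above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` and the toy records are made up for illustration, and it assumes a sequence may be wrapped over several lines.

```python
def parse_fasta(text):
    """Return a dict mapping sequence name -> sequence for FASTA text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a header line: everything after '>' is the sequence name
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # a sequence line: append it to the current record
            records[name] += line
    return records


# toy input (hypothetical records, not real data)
example = ">seq1 toy record\nACGT\nTTGA\n>seq2\nGGCC\n"
fasta_records = parse_fasta(example)
```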

FASTQ has ___ parts

4

what are the parts of FASTQ

@ sequence name (and other info)
sequence
+ (sometimes other info)
quality value
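
A small sketch of how one four-line FASTQ record could be split and checked. The function name `parse_fastq_record` and the toy read are hypothetical; the checks rely on the FASTQ convention that line 1 starts with `@`, line 3 starts with `+`, and the quality string has one character per base.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (name, sequence, quality string)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly 4 lines")
    header, sequence, separator, quality = (l.rstrip("\n") for l in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("need one quality character per base")
    return header[1:], sequence, quality


# toy record (hypothetical): the last base is an uncalled 'N' with low quality
record = ["@read1", "ACGTN", "+", "IIII#"]
name, seq, qual = parse_fastq_record(record)
```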

when is FASTA used?

when per base quality is not needed

what does FASTA present?

presents only the sequence itself

the FASTA sequence name always starts with ______ symbol

>

when is FASTQ used?

when per base quality is needed

what does FASTQ present?

presents the sequence and the estimated base quality

FASTQ sequence name always starts with ____ symbol

@

when reading the sequence in FASTQ, what does the letter “N” symbolize?

any of the 4 nucleotides; the sequencer could not determine which nucleotide it was, so the more N’s, the worse the quality

the _______ the Q value the more accurate the sequence is

higher

for every 10-point increase in the quality score, accuracy increases by a factor of _____

10

Qphred equation

Qphred = -10 * log10(P(error))
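
The equation can be checked numerically. The sketch below converts in both directions; the function names `phred_from_error` and `error_from_phred` are made up for illustration.

```python
import math


def phred_from_error(p_error):
    """Q = -10 * log10(P(error))."""
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Invert the equation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# each 10-point step in Q divides the error probability by 10
for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```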

At any given position in a sequence, the base present is either A/C/T/G but we cannot _________ observe the base. The base that is produced from a DNA sequencer is an observation based on some biochemical/physical property that has an ________.

directly; error

Q20 is _____ times more accurate than Q10

10

when using fluorescence in Illumina, we notice a change in color distribution with each cycle. How does this affect our accuracy?

clear signal intensity decreases as you do more cycles. this occurs because there may have been a failure to cleave previous fluors on nucleotides

Phred scale quality values of ___ or less (__, ___, ___, ___)

40 or less (30, 20, 10)

in the file, each Phred scale quality value is stored as a ______ character, saving a lot of space

single
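
A sketch of how the single-character encoding could be decoded. It assumes the common Phred+33 ASCII offset (the card does not say which offset the course used), and the function name `decode_qualities` is hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Turn each quality character back into its numeric Phred score.

    offset=33 is an assumption (the widely used Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


# 'I' -> 40, '?' -> 30, '5' -> 20, '+' -> 10 under Phred+33
scores = decode_qualities("I?5+")
```

One character per base is what keeps the quality line the same length as the sequence line.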

what is an error that can occur in Element sequencing? does this happen often?

single base error, however this does not happen very often

what is FastQC?

- QC = quality control
- one of the many software tools to evaluate quality

FastQC does not actually do any filtering; instead it...?

provides summary metrics and visuals

In Base Quality - is this good data or poor data?

good data

In Base Quality - is this good data or poor data?

poor data

in per tile sequence quality, which is the good data and which is the poor data? how do you know?

good data on the left and poor data on the right. in the poor data, the lines shown indicate that there must have been a problem between cycles 5 and 15 at this location in the flow cell.

in these per sequence quality scores, which one has good data and which has poor data? how do you know?

good data on the left has a high sequence quality almost to the end of the reads. the poor data is on the right with sequence quality that begins to decrease earlier in the reads.

in the per base sequence content, which has good data and which has poor data? how do you know?

good data on the left because it has a uniform base composition to the end of the reads. in the poor data on the right, the base composition diverges, indicating problems.

error probability: Q10

0.1 (1 in 10)

error probability: Q20

0.01 (1 in 100)

error probability: Q30

0.001 (1 in 1000)

error probability: Q40

0.0001 (1 in 10000)

k-mer

a substring of a fixed length that appears in a biological sequence

4^(kmer length)

number of possible k-mers of that length (4 possible nucleotides at each position)
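
As a quick illustration, 4^k counts the distinct k-mers that can be built from the four nucleotides, and the k-mers of a read are its overlapping windows of length k. The function names below are made up for this sketch.

```python
def count_possible_kmers(k):
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def kmers(sequence, k):
    """List every overlapping substring of length k, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


possible_3mers = count_possible_kmers(3)
toy_kmers = kmers("ACGTA", 3)
```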

should the k-mer length be an odd or even integer?

use odd integers (an odd-length k-mer cannot be its own reverse complement)

sequence alignment

a string matching problem: matching your sequences against a reference genome

two main algorithms for alignment (of short read sequencing data)

- Smith-Waterman
- Burrows-Wheeler transform

Smith-Waterman

guaranteed to find the optimal local alignment with respect to the scoring matrices used
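
A minimal sketch of the Smith-Waterman dynamic program, returning only the best local alignment score. The scoring values (match=2, mismatch=-1, gap=-2) and the linear gap penalty are arbitrary assumptions for illustration, not the scoring matrices from the course.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # the 0 lets an alignment restart anywhere, which makes it local
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


score = smith_waterman_score("TTACGTT", "ACG")
```

The full table has len(a) x len(b) cells, which is why this approach becomes slow when millions of reads are aligned to a whole genome.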

what is Smith-Waterman best used for?

good for aligning Sanger-length data, but too slow for high throughput data

Burrows-Wheeler transform

creates a suffix array of smaller k-mers, indexes the reference genome, matches the seed of the read to the reference, and extends the seed to a full alignment
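
To make the transform itself concrete, here is a naive sketch that builds it by sorting all rotations of the text. Real aligners build their index far more efficiently; the `$` end marker and the function name `bwt` are assumptions for this illustration.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # every rotation of the text, in lexicographic order
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # the transform is the last column of the sorted rotation table
    return "".join(rotation[-1] for rotation in rotations)


transformed = bwt("banana")
```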

what are the most used aligners?

BWA (WGS) and STAR (RNA)

what two algorithms for alignment are best used in short read sequencing data?

- Smith-Waterman
- Burrows-Wheeler transform

what algorithm is used for long reads (in alignment)?

minimap2

two main sequence file formats

FASTA and FASTQ

high throughput sequence data needs _____

what software tool provides summary and visualization to evaluate quality?

FastQC

sequence alignment

- Smith-Waterman
- Burrows-Wheeler transform (BWA and STAR)

what does "depth" mean when sequencing?

the number of times a given position has been sequenced (how many reads cover it)

what is the ability to resolve a repetitive structure dependent on?

the length of the molecules in your library

when analyzing a pileup, you notice that one of the lines has equal parts A and C. What does this mean? Is this an error in the sequence?

this is most likely not an error. it shows that the chromosome from one parent carries an A and the chromosome from the other parent carries a C, so the pileup represents both chromosomes (heterozygous)
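
A small sketch of the reasoning: count the bases that the reads show at one position of a pileup and look at the fractions. The function name `allele_fractions` and the toy column of bases are hypothetical.

```python
from collections import Counter


def allele_fractions(column_bases):
    """Return (depth, {base: fraction}) for one pileup position."""
    counts = Counter(column_bases.upper())
    depth = sum(counts.values())  # number of reads covering the position
    if depth == 0:
        return 0, {}
    return depth, {base: n / depth for base, n in counts.items()}


# toy column: 8 reads cover this position, half show A and half show C
depth, fractions = allele_fractions("AACCACCA")
```

Fractions near 0.5 for two bases fit a heterozygous site, while a single stray base among many reads looks more like a sequencing error.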

(56 cards)