Genomics: Reference Mapping Flashcards

Question

QC: What is per base sequence content?

Answer 1

A

Sequencing read formats are file types used to store DNA sequence data generated by sequencing technologies.

Answer 2

A

Standard Flowgram Format (SFF): A binary format used for sequencing data from older platforms like Roche 454.

Includes sequence data, quality scores, and flowgram information.

FASTQ Format: A text-based format widely used in modern sequencing technologies like Illumina.

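Contains four lines per read: an identifier line starting with @, the sequence, a separator line starting with +, and the quality string. It is human-readable, widely compatible with bioinformatics tools, and stores both sequence and quality information.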
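
The four-lines-per-read layout can be illustrated with a small reader. This is a minimal sketch, assuming an uncompressed FASTQ stream in which each record occupies exactly four lines; the function name read_fastq and the example read are hypothetical.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        # Line 1 starts with '@', line 3 starts with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        # Every base must have exactly one quality character.
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual

# Hypothetical single-read example.
example = StringIO("@read1\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
```

For real data, an established parser is usually a better choice than a hand-written one; the sketch is only meant to show the record structure.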

Answer 3

A

Quality scores indicate the accuracy of each base call in sequencing reads. They help in detecting errors, filtering low-quality data, improving confidence in variant calling, optimizing data usage, and ensuring consistency across experiments. A higher quality score means greater confidence in the sequencing data.

Answer 4

A

A Q30 score means there is a 1 in 1,000 chance of an incorrect base call, representing 99.9% accuracy for that base.
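
The Q30 figure follows from the Phred relationship Q = -10 * log10(P), where P is the probability that the base call is wrong. The sketch below works through that arithmetic; the function names are hypothetical, and the decoding helper assumes Phred+33 ASCII encoding, which should be confirmed for the data at hand.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of an incorrect base call."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

def decode_phred33(quality_string):
    """Decode a quality string, assuming Phred+33 ASCII encoding."""
    return [ord(c) - 33 for c in quality_string]

q30_error = phred_to_error_prob(30)   # 1 in 1,000
q20_error = phred_to_error_prob(20)   # 1 in 100
scores = decode_phred33("I?5")        # 'I' -> 40, '?' -> 30, '5' -> 20
```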

Answer 5

A

Quality scores allow researchers to remove low-quality bases or reads from data to ensure only high-confidence sequences are used in analysis, reducing errors in downstream processes.
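
Read-level filtering can be sketched as follows. This is an illustrative example only: it assumes Phred+33 encoding, uses the arithmetic mean of the Phred scores as a simple summary (which is not the same as the mean error probability), and the threshold of 20 and the function names are assumptions rather than recommended settings.

```python
def mean_quality(qual, offset=33):
    """Arithmetic mean of the Phred scores in a quality string (Phred+33 assumed)."""
    scores = [ord(c) - offset for c in qual]
    return sum(scores) / len(scores) if scores else 0.0

def filter_reads(reads, min_mean_q=20):
    """Keep only (name, seq, qual) reads whose mean quality meets the threshold."""
    return [
        (name, seq, qual)
        for name, seq, qual in reads
        if mean_quality(qual) >= min_mean_q
    ]

# Hypothetical reads: 'I' decodes to Q40, '#' decodes to Q2.
reads = [("good", "ACGT", "IIII"), ("bad", "ACGT", "####")]
kept = filter_reads(reads)
```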

Answer 6

A

Quality scores provide confidence in the base calls, helping distinguish between true genetic variants and sequencing errors, which is critical for accurate mutation detection.

Answer 7

A

By providing a standard measure of sequencing accuracy, quality scores ensure that data from different experiments or platforms can be compared and assessed consistently.

Answer 8

A

The per-base quality distribution shows the range of quality scores for all bases at each position in the sequencing reads. It helps evaluate the reliability of the sequencing data by visualizing how quality varies across the length of the read.
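
A per-position summary of this kind can be computed directly from the quality strings. The following is a minimal sketch, assuming Phred+33 encoding and reads of possibly different lengths; the function name and the example quality strings are hypothetical.

```python
from statistics import mean, median

def per_base_quality(quality_strings, offset=33):
    """Summarize quality scores at each read position across all reads."""
    max_len = max((len(q) for q in quality_strings), default=0)
    summary = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute.
        scores = [ord(q[pos]) - offset for q in quality_strings if pos < len(q)]
        summary.append({
            "position": pos + 1,
            "min": min(scores),
            "median": median(scores),
            "mean": mean(scores),
            "max": max(scores),
        })
    return summary

# Hypothetical reads whose quality drops toward the 3' end.
quals = ["IIII5", "II?5+", "I?55"]
summary = per_base_quality(quals)
medians = [row["median"] for row in summary]
```

In this toy input the median falls from position to position, which is the pattern a per-base quality plot makes visible.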

Answer 9

A

Poor quality distribution is typically observed:

At the ends of long reads: Quality often decreases due to instrument or chemistry limitations.
In degraded samples: DNA degradation or poor library preparation can reduce overall quality.
After long sequencing runs: General quality degradation over time during sequencing.
In reads with high error rates: Caused by factors like incorrect adapter ligation or contamination.

Answer 10

A

Poor quality can be mitigated by applying quality trimming, which removes low-quality bases (e.g., from the ends of reads) before downstream analysis.
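
One simple form of quality trimming removes bases from the 3' end until a base at or above a threshold is reached. The sketch below shows that idea only; it is not the algorithm of any particular trimming tool, and it assumes Phred+33 encoding and a hypothetical threshold of 20.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q (Phred+33 assumed)."""
    end = len(seq)
    # Walk backwards from the 3' end while the last kept base is low quality.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Hypothetical read: the last two bases have Q10 ('+') and Q2 ('#').
trimmed = trim_3prime("ACGTAC", "IIII+#")
all_bad = trim_3prime("ACGT", "####")
```

Dedicated trimmers typically use sliding-window or running-sum approaches, which are more robust to a single good base near the end of a poor-quality tail.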

Answer 11

A

It helps detect localized problems on the flow cell, such as:

Uneven illumination.
Contamination or damage.

This ensures problematic tiles are excluded from downstream analysis.

Answer 12

A

It is visualized using heat maps or graphs that show quality distribution. Uniform colour indicates consistent quality, while deviations highlight problem areas.

Answer 13

A

Quality scores help identify low-quality reads caused by poor imaging or technical issues during sequencing. These scores ensure only high-quality reads are used for downstream analysis.

Answer 14

A

Low-quality reads can introduce errors (e.g., wrongly called bases) that bias the results of downstream processes, such as alignment, assembly, or variant calling.

Answer 15

A

Low-quality reads are identified using quality-control tools (e.g., FastQC) that evaluate quality scores and flag sequences with poor overall quality values.

Answer 16

A

Poor-quality reads may result from:

Poor imaging during sequencing runs.

Technical issues with the sequencing machine.

Degradation of the DNA sample.

Answer 17

A

Quality scores are often displayed in a graph, such as a histogram, in which sequences with high quality scores cluster toward the upper end and low-quality sequences are flagged.
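
The data behind such a histogram can be sketched as a count of reads per mean quality score. This is an illustration under stated assumptions: Phred+33 encoding, the mean rounded to the nearest integer as the bin, and hypothetical function and variable names.

```python
from collections import Counter

def mean_quality_histogram(quality_strings, offset=33):
    """Count reads by their mean Phred score, rounded to the nearest integer."""
    counts = Counter()
    for qual in quality_strings:
        if not qual:
            continue  # skip empty reads
        mean_q = sum(ord(c) - offset for c in qual) / len(qual)
        counts[round(mean_q)] += 1
    return dict(sorted(counts.items()))

# Hypothetical reads: three of good quality and one poor read.
histogram = mean_quality_histogram(["IIII", "IIII", "??II", "####"])

# Print a simple text version of the graph.
for score, count in histogram.items():
    print(f"Q{score:>2} {'#' * count}")
```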

Answer 18

A

A sharp peak in the GC content distribution suggests:

Possible contamination, such as adapter dimers or non-target sequences.

Overrepresentation of specific sequences in the library.

Answer 19

A

It may indicate the library is:

Contaminated with specific sequences (e.g., from adapters).

Imbalanced, such as from amplification biases during library preparation.

Answer 20

A

Per base GC content measures GC content at each position across all reads, helping detect biases at specific regions.

Per sequence GC content evaluates the overall GC composition of each read and compares the resulting distribution to a normal distribution.
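
The difference between the two measures can be made concrete with a short sketch. The function names and the three example reads are hypothetical, and ambiguous bases such as N are simply counted as non-GC here, which is a simplification.

```python
def gc_fraction(seq):
    """Per sequence GC content: fraction of G and C bases in one read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def per_base_gc(seqs):
    """Per base GC content: fraction of G and C at each position across reads."""
    max_len = max((len(s) for s in seqs), default=0)
    result = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute.
        bases = [s[pos].upper() for s in seqs if pos < len(s)]
        result.append(sum(b in "GC" for b in bases) / len(bases))
    return result

# Hypothetical reads.
reads = ["GCAT", "GGAT", "ACAT"]
per_seq = [gc_fraction(r) for r in reads]   # one value per read
per_base = per_base_gc(reads)               # one value per position
```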

Answer 21

A

In a normal library, GC content follows a smooth, bell-shaped curve centered around the average GC percentage of the species being sequenced.

Answer 22

A

Adapter sequences need to be removed because:

They contaminate sequencing data.
They interfere with alignment and downstream analyses.
They introduce bias, affecting the accuracy of results.

Answer 23

A

FastQC flags overrepresented sequences (including adapters) and provides details about the type of adapters present (e.g., Illumina Universal Adapter). Users can then use trimming software to remove them.
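
The trimming step can be sketched as an exact-match search for the adapter, cutting the read from the point where the adapter begins. This is a deliberately simplified illustration: real trimming software also handles mismatches and partial adapters at the read end. The function name is hypothetical, and the adapter string below is the commonly cited start of the Illumina universal adapter, used here as an assumption that should be checked against the library kit documentation.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter sequence."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual  # adapter not found, read left unchanged
    # Trim sequence and quality together so they stay the same length.
    return seq[:idx], qual[:idx]

# Assumed adapter prefix; confirm against the kit documentation.
ADAPTER = "AGATCGGAAGAGC"

# Hypothetical read: 8 bases of insert followed by adapter read-through.
read_seq = "ACGTACGT" + ADAPTER + "TTT"
read_qual = "I" * len(read_seq)
trimmed_seq, trimmed_qual = trim_adapter(read_seq, read_qual, ADAPTER)
```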

quality of HGS Reads and SNP Analyses (40 cards)