Genomics: Reference Mapping Flashcards
Quality of HGS Reads and SNP Analyses (40 cards)
What are the different sequencing read formats?
Sequencing read formats are file types used to store DNA sequence data generated by sequencing technologies.
Sequence Reads: What are the 2 formats?
Standard Flowgram Format (SFF): A binary format used for sequencing data from older platforms like Roche 454.
Includes sequence data, quality scores, and flowgram information.
FASTQ Format: A text-based format widely used in modern sequencing technologies like Illumina.
Contains four lines per read (identifier, sequence, separator, and quality string). It is human-readable, widely compatible with bioinformatics tools, and stores both sequence and quality information.
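As a minimal sketch of the four-line FASTQ layout, the Python snippet below reads records from a text stream. The function name `read_fastq` and the two example reads are hypothetical, and the sketch assumes that sequences are not wrapped across multiple lines.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # empty string means end of input
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()       # line 3 is the '+' separator, ignored here
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'

# Two made-up reads, purely for illustration.
example = StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(read_fastq(example))
```

Each element of `records` is a tuple holding the identifier, the bases, and the encoded quality string for one read.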
Why is it important to have quality scores in sequencing reads?
Quality scores indicate the accuracy of each base call in sequencing reads. They help in detecting errors, filtering low-quality data, improving confidence in variant calling, optimizing data usage, and ensuring consistency across experiments. A higher quality score means greater confidence in the sequencing data.
What does a Q30 score mean in sequencing?
A Q30 score means there is a 1 in 1,000 chance of an incorrect base call, representing 99.9% accuracy for that base.
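The Q30 figure follows from the Phred definition Q = -10 * log10(P), where P is the probability that the base call is wrong. The sketch below shows the conversion in both directions, along with decoding of quality characters. It assumes the Phred+33 ASCII encoding used by current Illumina FASTQ files, and the function names are illustrative.

```python
import math

def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]

q30_error = phred_to_error_probability(30)   # 1 in 1,000
q20_error = phred_to_error_probability(20)   # 1 in 100
```

By the same formula, Q20 corresponds to a 1 in 100 chance of error (99% accuracy) and Q40 to 1 in 10,000 (99.99% accuracy).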
How are quality scores used in filtering sequencing data?
Quality scores allow researchers to remove low-quality bases or reads from data to ensure only high-confidence sequences are used in analysis, reducing errors in downstream processes.
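One simple way to filter is to discard reads whose mean quality falls below a cutoff. The sketch below assumes Phred+33 encoding and an arbitrary threshold of 20. The function names and example reads are hypothetical, and averaging raw Phred scores is a simplification, not the method of any specific tool.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of one read (Phred+33 encoding assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)

def filter_reads(reads, min_mean_quality=20):
    """Keep (read_id, sequence, quality) tuples whose mean quality meets the cutoff."""
    return [r for r in reads if r[2] and mean_quality(r[2]) >= min_mean_quality]

# Made-up reads: 'I' encodes Q40, '!' encodes Q0.
reads = [("good", "ACGT", "IIII"), ("bad", "ACGT", "!!!!")]
kept = filter_reads(reads)
```

Here only the read named "good" survives, because its mean quality (40) is above the cutoff while the other read's mean quality is 0.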
Why are quality scores important for variant calling?
Quality scores provide confidence in the base calls, helping distinguish between true genetic variants and sequencing errors, which is critical for accurate mutation detection.
How do quality scores improve comparability across experiments?
By providing a standard measure of sequencing accuracy, quality scores ensure that data from different experiments or platforms can be compared and assessed consistently.
QC: What is the per base quality distribution?
The per-base quality distribution shows the range of quality scores for all bases at each position in the sequencing reads. It helps evaluate the reliability of the sequencing data by visualizing how quality varies across the length of the read.
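A rough illustration of how such a distribution can be computed: collect the scores observed at each read position and summarize them. The sketch assumes Phred+33 encoding, and both the function name `per_base_quality` and the choice of summary statistics are illustrative. QC tools typically draw a box plot per position instead.

```python
from statistics import median

def per_base_quality(quality_strings, offset=33):
    """Summarize the quality scores seen at each read position (1-based)."""
    by_position = {}
    for qs in quality_strings:
        for i, ch in enumerate(qs):
            by_position.setdefault(i, []).append(ord(ch) - offset)
    return [
        {"position": i + 1, "min": min(s), "median": median(s), "max": max(s)}
        for i, s in sorted(by_position.items())
    ]

# Made-up quality strings; reads may differ in length.
summary = per_base_quality(["II5", "I5!", "II"])
```

In this toy input the median quality drops at the third position, which mirrors the typical decline toward the ends of reads.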
When would you see poor quality distribution?
Poor quality distribution is typically observed:
- At the ends of long reads: Quality often decreases due to instrument or chemistry limitations.
- In degraded samples: DNA degradation or poor library preparation can reduce overall quality.
- After long sequencing runs: Quality generally degrades over time as the run progresses.
- In reads with high error rates: Caused by factors like incorrect adapter ligation or contamination.
How can poor quality distribution be addressed?
Poor quality can be mitigated by applying quality trimming, which removes low-quality bases (e.g., from the ends of reads) before downstream analysis.
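The simplest form of quality trimming removes bases from the 3' end until a base at or above a threshold is reached. The sketch below assumes Phred+33 encoding and a threshold of 20, and the function name is hypothetical. Production trimmers usually apply more robust criteria, such as sliding-window averages.

```python
def trim_3prime(sequence, quality_string, threshold=20, offset=33):
    """Remove trailing bases whose quality is below the threshold."""
    end = len(sequence)
    while end > 0 and ord(quality_string[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality_string[:end]

# The last two bases have qualities 0 ('!') and 2 ('#') and are removed.
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII!#")
```

The sequence and its quality string are trimmed together so that they stay the same length, which downstream tools expect.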
QC: Why is per tile sequence quality useful?
It helps detect localized problems on the flow cell, such as:
- Uneven illumination.
- Contamination or damage.
Identifying these tiles allows them to be excluded from downstream analysis.
How is per tile quality assessed?
It is visualized using heat maps or graphs that show quality distribution. Uniform colour indicates consistent quality, while deviations highlight problem areas.
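As a sketch of the numbers behind such a heat map, the snippet below groups reads by tile and averages their quality. It assumes Illumina-style read identifiers of the form instrument:run:flowcell:lane:tile:x:y and Phred+33 encoding. The identifiers and function names are made up for illustration.

```python
from collections import defaultdict

def tile_of(read_id):
    """Extract the tile field (5th colon-separated field) from an Illumina-style ID."""
    return read_id.split(" ")[0].split(":")[4]

def mean_quality_per_tile(records, offset=33):
    """Average base quality per tile from (read_id, quality_string) pairs."""
    totals = defaultdict(lambda: [0, 0])  # tile -> [sum of scores, number of bases]
    for read_id, quality in records:
        entry = totals[tile_of(read_id)]
        entry[0] += sum(ord(c) - offset for c in quality)
        entry[1] += len(quality)
    return {tile: total / count for tile, (total, count) in totals.items()}

# Hypothetical identifiers and quality strings.
records_by_tile = [
    ("M1:1:FC:1:1101:10:20 1:N:0:1", "IIII"),
    ("M1:1:FC:1:1101:11:21 1:N:0:1", "II55"),
    ("M1:1:FC:1:1102:10:20 1:N:0:1", "!!55"),
]
tile_means = mean_quality_per_tile(records_by_tile)
```

A tile whose mean is well below the others, like tile 1102 in this toy input, would appear as a deviating cell in the heat map.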
What is the role of quality scores in quality control?
Quality scores help identify low-quality reads caused by poor imaging or technical issues during sequencing. Filtering on these scores ensures only high-quality reads are used for downstream analysis.
Why must low-quality reads be removed from downstream analysis?
Low-quality reads can introduce errors (e.g., wrongly called bases) that bias the results of downstream processes, such as alignment, assembly, or variant calling.
How are low-quality reads identified?
Low-quality reads are identified using quality-control software (e.g., FastQC), which evaluates quality scores and flags sequences with poor overall quality values.
What could cause poor-quality reads?
Poor-quality reads may result from:
- Poor imaging during sequencing runs.
- Technical issues with the sequencing machine.
- Degradation of the DNA sample.
How are quality scores typically visualized?
Quality scores are often displayed in a graph, such as a histogram, where sequences with high quality scores cluster toward the higher end of the graph and low-quality sequences are flagged.
What does a sharp peak in GC content indicate?
A sharp peak in the GC content distribution suggests:
- Possible contamination, such as adapter dimers or non-target sequences.
- Overrepresentation of specific sequences in the library.
What could overrepresentation of a single sequence in the GC content plot indicate?
It may indicate the library is:
- Contaminated with specific sequences (e.g., from adapters).
- Imbalanced, such as from amplification biases during library preparation.
How is per base GC content different from per sequence GC content?
Per base GC content measures GC content at each position across all reads, helping detect biases at specific regions.
Per sequence GC content evaluates the overall GC composition of each read and compares the resulting distribution to a normal distribution.
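The difference between the two measures can be seen in a small sketch: one function summarizes each read (rows), the other each position across reads (columns). The function names and example reads are hypothetical.

```python
def per_sequence_gc(reads):
    """GC percentage of each read, one value per read."""
    return [100 * sum(b in "GC" for b in r.upper()) / len(r) for r in reads if r]

def per_base_gc(reads):
    """GC percentage at each position across all reads, one value per position."""
    longest = max(len(r) for r in reads)
    result = []
    for i in range(longest):
        column = [r[i].upper() for r in reads if len(r) > i]
        result.append(100 * sum(b in "GC" for b in column) / len(column))
    return result

# Made-up reads for illustration.
gc_reads = ["GCAT", "GGAT", "ACAT"]
gc_by_read = per_sequence_gc(gc_reads)
gc_by_position = per_base_gc(gc_reads)
```

For these reads, `gc_by_read` holds one value per read, which would be plotted as a distribution, while `gc_by_position` shows that the GC bases are concentrated in the first two positions.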
What does a normal GC content distribution look like?
In a normal library, GC content follows a smooth, bell-shaped curve centered around the average GC percentage of the species being sequenced.
Why is it necessary to remove adapter sequences?
Adapter sequences need to be removed because:
- They contaminate sequencing data.
- They interfere with alignment and downstream analyses.
- They introduce bias, affecting the accuracy of results.
What is the role of FastQC in adapter sequence removal?
FastQC flags overrepresented sequences (including adapters) and provides details about the type of adapters present (e.g., Illumina Universal Adapter). Users can then use trimming software to remove them.
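To illustrate what the trimming step does, the sketch below removes a 3' adapter by exact matching, including adapters that are cut off at the end of the read. The function name, the `min_overlap` parameter, and the example read are hypothetical, and the adapter string should be checked against the library kit actually used. Dedicated trimming software also tolerates mismatches, which this sketch does not.

```python
def trim_adapter(sequence, adapter, min_overlap=3):
    """Cut the read at the first position where the rest of the read matches the adapter start."""
    for start in range(len(sequence)):
        tail = sequence[start:]
        overlap = min(len(tail), len(adapter))
        if overlap < min_overlap:
            break  # remaining tail is too short to call an adapter match
        if tail[:overlap] == adapter[:overlap]:
            return sequence[:start]
    return sequence

# Adapter prefix given here as an assumption; verify it for your own library prep.
ADAPTER = "AGATCGGAAGAGC"
full_hit = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER)
partial_hit = trim_adapter("ACGTACGTAGATC", ADAPTER)
no_hit = trim_adapter("ACGTACGT", ADAPTER)
```

In the first two calls the read is cut back to the insert sequence, and in the third the read is returned unchanged because no adapter is found.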