Flashcards
Main difference between first and second generation sequencing
First generation (Sanger sequencing) can only sequence one fragment at a time, while second generation sequences many fragments in parallel
Paired-end sequencing
Both ends of the same DNA fragment are sequenced
Main difference between second and third generation sequencing
In third generation sequencing, individual DNA molecules are sequenced directly. There is no amplification step, unlike in second generation sequencing
Duplicate (sequencing error)
Duplicates are caused by sequencing the same physical DNA fragment multiple times. The reads then all come from the same DNA molecule and do not describe the true diversity in the sample. Duplicates are often caused by biases in the amplification step
Factors contributing to errors in Illumina sequencing
- Read position: probability of error increases for each sequenced bp
- Base composition: T has a higher error rate than the other nucleotides, and GC-rich patterns have a high error rate
- Read number: the first read has a lower error rate than the second (for paired-end sequencing)
Site specific error (SSE)
Errors that depend on the sequence of the site where the error has occurred (example: GC-rich regions)
Steps involved in pre-processing of NGS data
Pre-processing is used to “clean” the data.
- Identifies erroneous reads and bp
- Cleans data by removing errors, using for example filtering or trimming
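A minimal sketch of trimming and filtering, assuming each read comes with one quality score per base. The function names, the threshold of 20 and the minimum length of 5 are illustrative assumptions, not values from a specific tool.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Remove bases from the end of the read while their quality is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def passes_length_filter(sequence, min_length=5):
    """Filtering step: keep only reads that are still long enough after trimming."""
    return len(sequence) >= min_length


# Hypothetical read whose last three bases have low quality
trimmed_seq, trimmed_qual = trim_low_quality_tail(
    "ACGTACGT", [30, 30, 30, 30, 30, 12, 8, 5]
)
keep_read = passes_length_filter(trimmed_seq)
```

Here trimming removes the low-quality tail, and the length filter then decides whether the remaining read is kept.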
Coverage
The number of times a nucleotide in the reference is “covered” by reads
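A small sketch of how coverage could be counted, assuming each mapped read is given as a (start, end) interval on the reference with 0-based, end-exclusive coordinates. The interval convention and the function name are assumptions made for the example.

```python
def per_base_coverage(reference_length, reads):
    """Count how many reads overlap each position of the reference."""
    coverage = [0] * reference_length
    for start, end in reads:
        # Clip the read to the reference so out-of-range coordinates are ignored
        for pos in range(max(start, 0), min(end, reference_length)):
            coverage[pos] += 1
    return coverage


# Three hypothetical reads mapped to a reference of length 10
reads = [(0, 4), (2, 6), (3, 8)]
coverage = per_base_coverage(10, reads)
mean_coverage = sum(coverage) / len(coverage)
```

Position 3 is overlapped by all three reads, so its coverage is 3, while the last two positions are not covered at all.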
Purpose of a variant caller of SNPs
Variant calling aims to identify SNPs in the sequenced genome compared to the reference, and then to distinguish between true mutations and sequencing errors. A good caller should have a high sensitivity (find all true mutations) and a high specificity (avoid calling sequencing errors as mutations)
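Sensitivity and specificity can be computed from the four outcome counts of a caller. The sketch below uses the standard definitions; the counts are made-up numbers for illustration only.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of true mutations that the caller finds."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-mutated sites that the caller correctly leaves uncalled."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical evaluation: 100 true mutations, 10000 non-mutated sites
sens = sensitivity(true_positives=90, false_negatives=10)
spec = specificity(true_negatives=9900, false_positives=100)
```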
GATK
Stands for “Genome Analysis Toolkit” and contains the UnifiedGenotyper, which is an advanced mutation caller
Post-processing (genome sequencing)
After SNP variant calling, extra filtering might be needed. Example: removing calls that are only supported by bases at the ends of reads, or only by reads in one direction, since these are likely sequencing errors
Global alignment
Two sequences are aligned over their full length. Can use the Needleman-Wunsch algorithm
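A minimal sketch of the Needleman-Wunsch dynamic programming step, returning only the optimal score and not the alignment itself. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs only gaps
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)
    # Global alignment: the answer is in the bottom-right cell
    return score[-1][-1]


global_score = needleman_wunsch_score("ACGT", "ACT")
```

With these scores, ACGT against ACT aligns as three matches and one gap, giving a score of 2.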
Local alignment
Two sequences (often of substantially different lengths) are aligned based on their best matching subsequences. Can use the Smith-Waterman algorithm (modifies NW)
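A minimal sketch of the Smith-Waterman score computation. Compared with Needleman-Wunsch, cell values are never allowed to drop below 0 and the result is the highest value anywhere in the matrix. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any subsequences of a and b."""
    best = 0
    previous_row = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current_row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous_row[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous_row[j] + gap
            left = current_row[j - 1] + gap
            # The 0 lets a local alignment restart at any cell
            current_row[j] = max(0, diagonal, up, left)
            best = max(best, current_row[j])
        previous_row = current_row
    return best


local_score = smith_waterman_score("TTTACGTTT", "GGACGGG")
```

The two sequences only share the subsequence ACG, so the best local score is three matches.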
Steps in analysis of genome sequencing data
Pre-processing, read mapping, quality refinement, variant calling
Quality refinement (genome sequencing)
Quality refinement is the step that comes after read mapping but before variant calling in genome sequencing. The quality refinement step aims to remove errors in the data and errors introduced in the read mapping.
Three main steps in analysis of RNA seq data
- Quantification of the gene expression
- Normalization
- Identification of differentially expressed genes
Splice-aware mapper
When mapping RNA-seq reads to a genome, the mapper needs to be able to handle splicing. In other words, the mapper should be allowed to make large gaps, corresponding to introns
Multiple matches (RNA-seq)
One read matches two or more different regions. Can be explained by multiple similar regions in the genome, but also by errors.
Semiquantitative
The quantitative data are relative and are therefore influenced by, for example, one gene being substantially more expressed than the others
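A small worked example of this effect, using made-up read counts for three hypothetical genes. Only geneA changes between the two samples, but the relative share of geneB drops as well.

```python
def relative_abundance(counts):
    """Convert raw counts per gene to the fraction of all reads in the sample."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}


# Hypothetical counts: geneB and geneC are unchanged, geneA is much higher in sample 2
sample_1 = {"geneA": 100, "geneB": 100, "geneC": 100}
sample_2 = {"geneA": 800, "geneB": 100, "geneC": 100}

share_1 = relative_abundance(sample_1)
share_2 = relative_abundance(sample_2)
```

geneB makes up one third of the reads in sample 1 but only one tenth in sample 2, even though its count is identical.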
Which are the three statistical approaches to identify DEGs?
- Methods based on the assumption of normally distributed data
- Non-parametric methods
- Methods based on count distributions
Family-wise error rate
The FWER is the probability of at least one false positive. For m independent tests, each performed at significance level α, it is equal to 1 - (1 - α)^m
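A short numeric illustration of the formula 1 - (1 - α)^m, assuming independent tests. The function name is chosen for the example.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive among m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


fwer_one_test = family_wise_error_rate(0.05, 1)
fwer_twenty_tests = family_wise_error_rate(0.05, 20)
```

With α = 0.05, a single test has a FWER of 0.05, while 20 tests give a FWER of about 0.64.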
Bonferroni correction
Divide the significance level α by the number of performed tests m. Then use the cut-off α/m instead.
A Bonferroni adjusted p-value can be calculated by multiplying each p-value by m (values above 1 are set to 1).
Bonferroni corrected p-values always control the FWER
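A minimal sketch of the Bonferroni adjustment, where m is taken to be the number of p-values in the list. The p-values are made up for illustration.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests m and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


bonferroni_adjusted = bonferroni_adjust([0.01, 0.02, 0.03, 0.5])
```

Comparing the adjusted p-values to α is equivalent to comparing the original p-values to the cut-off α/m.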
False discovery rate
FDR is the expected proportion of false positives among all rejected null hypotheses (significant tests)
Benjamini-Hochberg correction
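Order the p-values from smallest to largest, then multiply each p-value by the number of tests m and divide by its position (rank) in the ordered list. An adjusted p-value is never allowed to be larger than the adjusted value of the next p-value in the list, or larger than 1.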
Benjamini-Hochberg correction controls the FDR
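A minimal sketch of Benjamini-Hochberg adjusted p-values, assuming the usual step-up formulation where the adjusted values are kept monotone by taking a running minimum from the largest p-value downwards. The p-values are made up for illustration.

```python
def benjamini_hochberg_adjust(p_values):
    """Return BH adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps the adjusted values at 1
    # Walk from the largest p-value (rank m) down to the smallest (rank 1)
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


bh_adjusted = benjamini_hochberg_adjust([0.01, 0.04, 0.03, 0.5])
```

In this example the p-value 0.03 (rank 2) would get 0.03 * 4 / 2 = 0.06, but it is lowered to 0.0533, the adjusted value of 0.04 (rank 3), to keep the ordering consistent.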