5. Genome annotation Flashcards

(35 cards)

1
Q

Repeats & mobile elements in genome annotation - how to deal with them?

A

identify repeats to mask them (or to study them)

Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations.

approaches:
* library-based approaches: compare the genome sequence to a library of known repeats
* signature-based approaches: search for signatures of transposable elements (LTRs, key structural proteins or enzymes, etc.)
* (de novo approaches: compare a genome sequence with itself; search for multiple occurrences of k-mers)
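
The de novo idea (multiple occurrences of k-mers within the genome itself) can be sketched in a few lines of Python. This is only a toy illustration: the function names, the value of `k` and the `min_count` threshold are assumptions chosen for the example, not the behaviour of any real repeat-masking tool.

```python
from collections import Counter

def repeated_kmers(seq, k=8, min_count=3):
    """Return the set of k-mers that occur at least min_count times in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_count}

def soft_mask(seq, k=8, min_count=3):
    """Lower-case every position covered by a repeated k-mer (soft masking)."""
    seq = seq.upper()
    repeats = repeated_kmers(seq, k, min_count)
    masked = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in repeats:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))

# Hypothetical toy genome: unique flanks around a tandem repeat.
genome = "ACGTTGCA" + "GATTACAGATTACA" * 3 + "TTGACCGT"
print(soft_mask(genome, k=7, min_count=3))
```

Masked (lower-case) regions would then be skipped or down-weighted when collecting alignment evidence for gene models.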

2
Q

Two types of genome annotation?

A

structural, functional

3
Q

What is structural genome annotation?

A

structural: where are the protein-coding genes?
‘Structural’ - process of identifying genes and their intron–exon structures. (paper)

what is the predicted phenotypic effect of the variant?

4
Q

What is functional genome annotation?

A

(less important for our course)

functional: what is the function of the predicted genes and genetic elements?
‘Functional’ - the process of attaching metadata such as Gene Ontology terms to structural annotations (paper)

5
Q

What are the different approaches to structural genome annotation?

A

approaches
* intrinsic, ab initio, de novo (only use query)
* extrinsic, homology/evidence-based (use other sequences)
* hybrid / combined / pipelines

6
Q

What are the three steps for an intrinsic approach to structural genome annotation?

A
  1. collect appropriate training data
  2. build statistical model based on the training data
  3. apply model to the newly assembled genome to predict locations of protein-coding genes
7
Q

Intrinsic approach to genome annotation:

What is appropriate training data?

A
  • genes from the species to be annotated
  • easy for “first generation” (eukaryotic) genomes
  • much more difficult for “second generation” (eukaryotic) genomes
8
Q

What statistical models can be used for intrinsic approaches?

A
  • Hidden Markov Models - important for our course
  • (Bayesian approaches)
  • (Machine Learning)
9
Q

Hidden Markov Models

How are they used in genome annotation?
Under what grouping of approaches does this belong?

A

given a DNA sequence, we want to know:
* where does it most likely contain genes?
* what probability is associated with this result?

Training data (genes from the species to be annotated) are used to build an HMM, which is then applied to a newly assembled genome to predict the locations of protein-coding genes

an intrinsic approach
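
A minimal sketch of how an HMM answers "where are the genes, and with what probability?" is the Viterbi algorithm on a two-state model (non-coding `N` vs coding `C`). All probabilities below are invented for illustration; in practice they would be estimated from the training data, and real gene finders use far richer state sets.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding
# Assumed toy parameters, not trained on any real genome.
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich background
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},  # GC-rich coding
}

def viterbi(seq):
    """Return (most likely state path, log probability of that path)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for nt in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][nt])
            pointers[s] = best
        back.append(pointers)
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):  # trace back from the final state
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]

path, logp = viterbi("AT" * 6 + "GC" * 10 + "AT" * 6)
print(path, logp)
```

The returned path labels each nucleotide as coding or non-coding, and the log probability is the score of that single best path under the assumed model.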

10
Q

Prokaryotic gene prediction (intrinsic)

properties of prokaryotic genome?

A

properties
* mostly intron-less genes
* average of 1000 nt per gene (ORF)
* translation start and stop codons for each gene
* some nt biases in coding vs. non-coding regions
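
Because prokaryotic genes are mostly intron-less and delimited by a start and a stop codon, a naive predictor can simply scan for long open reading frames. The sketch below is an assumption-laden toy: it uses only ATG as start codon, only the forward strand, and a hypothetical `min_len` cut-off.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and half-open; end includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_len=21))   # the short ORF is reported
print(find_orfs(toy, min_len=300))  # with a realistic cut-off it is missed
```

The second call shows why a length threshold fails for small genes: anything shorter than the cut-off is silently dropped.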

11
Q

Prokaryotic gene prediction (intrinsic)

using these properties (ORF length, start/stop codons, nt biases) as a basis for prediction will not work for …?

A
  • small genes
  • partial sequences, incomplete genes
  • sequencing errors
12
Q

Types of HMMs for gene prediction

A

standard HMMs
* each hidden state emits one nt

generalized HMMs (important for course)
(HMMs with duration)
* each hidden state emits a string of nucleotides
can include
* one strand, both strands
* typical & atypical genes

13
Q

challenges of gene finding for eukaryotic genomes?

A
  • definition of a gene (gene signals)
  • overlapping genes
  • very long or very short genes / exons / introns
  • alternative biological processing
    • alternative splicing
    • alternative polyadenylation
    • alternative initiation of transcription
    • alternative initiation of translation
  • propagation of (annotation) errors in databases
  • sequencing errors
  • incomplete genes on short contigs/scaffolds
  • contamination
14
Q

How is splicing relevant to genome annotation?

A

Nearly all multi-exon human genes are alternatively spliced
* basic alternative splicing patterns vs
* complex alternative splicing patterns

15
Q

Challenges of using eukaryotic gene signals in intrinsic approaches to genome annotation

A

Eukaryotic gene signals show a lot of variance:
* not all genes contain the described signals
* the signals can occur outside of a gene context

16
Q

What are some gene signals for Eukaryotes?

A
  • intron-exon structure
    • not all exons contain start/stop codons
    • exons are usually smaller, introns can be quite large
  • splice sites
    • donor site: GT (5' end of the intron)
    • acceptor site: AG (3' end of the intron)
  • transcription signals (CAP, TATA box, termination)
  • translational signals (Kozak signal, termination)
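
A quick way to see why splice-site signals alone are weak evidence is to list every GT and AG dinucleotide in a sequence. The function below is a hypothetical helper written for this illustration; real predictors score a wider sequence context around each candidate site.

```python
def candidate_splice_sites(seq):
    """Return (donor_like, acceptor_like) positions of all GT and AG dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors

donors, acceptors = candidate_splice_sites("CAGGTAAGTCTTAGGA")
print(donors, acceptors)
```

Even this 16-nt toy sequence contains several candidates of each type, so most GT/AG occurrences in a genome cannot be true splice sites.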
17
Q

challenges of gene finding for eukaryotic genomes

Prediction requirements for exons?

A
  • exons cannot overlap
  • adjacent exons must maintain an open reading frame (ORF)
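
The two requirements can be checked mechanically for a proposed exon chain. This is a sketch under stated assumptions: exons are 0-based half-open `(start, end)` pairs on the forward strand, the first exon starts in frame 0, and "maintaining an ORF" is read as "no in-frame stop codon before the last codon of the spliced sequence". The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def valid_exon_chain(seq, exons):
    """Check that exons do not overlap and that the spliced CDS stays open."""
    exons = sorted(exons)
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        if start2 < end1:  # overlapping exons are not allowed
            return False
    cds = "".join(seq[s:e] for s, e in exons).upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # A stop is only acceptable as the very last codon.
    return not any(c in STOP_CODONS for c in codons[:-1])

# Toy gene: exon 1, a 12-nt intron (GT...AG), exon 2.
toy_gene = "ATGGCT" + "GTAAGTTTTCAG" + "CTGACCTAA"
print(valid_exon_chain(toy_gene, [(0, 6), (18, 27)]))  # frame preserved
print(valid_exon_chain(toy_gene, [(0, 5), (18, 27)]))  # shifted frame hits TGA
```

Moving the first exon boundary by a single nucleotide shifts the frame of the second exon and exposes a premature stop, which is why exon boundaries cannot be predicted independently of each other.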
18
Q

What are some Eukaryotic content/compositional features?

A

nucleotide composition
* biases: GC, nts, dinucleotides, hexamers, …
* different in different lineages / species
* different in introns vs exons vs intergenic regions
* different in highly expressed genes
* …

example: codon usage
* codon bias: codons are not used randomly; varies by lineage & species
- e.g., Arginine: CGT, CGC, CGA, CGG, AGA, AGG
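
Codon bias can be measured by counting how often each synonymous codon is used. The sketch below computes relative usage of the six arginine codons in an in-frame coding sequence; the toy CDS and the function name are made up for the example.

```python
from collections import Counter

ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")

def codon_fractions(cds, codons=ARG_CODONS):
    """Relative usage of the given synonymous codons in an in-frame CDS."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    return {c: (counts[c] / total if total else 0.0) for c in codons}

# Hypothetical CDS: ATG CGT CGT AGA CGT TAA
print(codon_fractions("ATGCGTCGTAGACGTTAA"))
```

Computed over many known genes of one species, such fractions become content statistics that help a model distinguish coding from non-coding sequence in that species.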

19
Q

Some challenges for intrinsic gene finding approaches in eukaryotic genes?

A

Intrinsic gene finding approaches rely on:
* structure of eukaryotic genes (e.g., intron-exon structure)
* signals in the sequences (e.g., splice sites, transcriptional and translational signals)
* content statistics and sensors (e.g., nucleotide composition, hexamers, codon usage)

20
Q

What is not available for non-model organisms that makes intrinsic approaches to genome annotation difficult?

A

Intrinsic approaches require species- or lineage-specific training data
- needed in sufficient volume & variety
- not available for many non-model organisms

21
Q

What is the name of one HMM approach to genome annotation?

What are the states? What do they emit?

A

Genscan (1997)

States: components (lengths, composition, signals) of a gene

states emit a sequence of variable length according to the state’s sequence composition

22
Q

What is sensitivity in the context of gene prediction?

A

Sensitivity (Sn): ability to include correct predictions

what fraction of the true nucleotides/exons/genes does the method predict correctly?

23
Q

What is specificity in the context of gene prediction?

A

Specificity (Sp): ability to exclude incorrect predictions

how much of the predicted nucleotides/exons/genes is true?
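
At the nucleotide level, both measures can be computed from the overlap between annotated and predicted coding positions. The sketch assumes the convention common in gene-prediction benchmarks, Sn = TP / (TP + FN) and Sp = TP / (TP + FP) (the latter is called precision in other fields); the interval format and function names are assumptions for the example.

```python
def covered_positions(intervals):
    """Expand 0-based half-open (start, end) intervals into a set of positions."""
    return {p for start, end in intervals for p in range(start, end)}

def nucleotide_sn_sp(true_exons, predicted_exons):
    """Return (Sn, Sp) at the nucleotide level."""
    truth = covered_positions(true_exons)
    pred = covered_positions(predicted_exons)
    tp = len(truth & pred)   # coding and predicted coding
    fn = len(truth - pred)   # coding but missed
    fp = len(pred - truth)   # predicted coding but not coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

print(nucleotide_sn_sp([(100, 200)], [(150, 250)]))
print(nucleotide_sn_sp([(100, 200)], [(100, 150)]))
```

The second call illustrates the trade-off: a cautious prediction that covers only half of the true exon contains no false positives, yet still misses half of the coding nucleotides.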

24
Q

Explain extrinsic gene prediction in eukaryotes

What is extrinsic?

What can these approaches not identify?

A

Genes are found based on

similarity with transcripts (RNA-seq, long-read RNA-seq)
* can be used to identify exons and splicing patterns
* problems: paralogs, placing short reads, only a snapshot of expression (sampled tissue and time point)
* preferred & very common approach
similarity with proteins
* close relative’s proteome (e.g., from UniProtKB)
* advantage: may provide information about function
* problems: domains, UTRs, lineage-specific genes?

cannot identify new genes

25
Q

Can you use general-purpose sequence similarity tools for extrinsic prediction of eukaryotic genes?

A

No. General-purpose sequence similarity tools have no awareness of splice sites, start/stop codons, reading frames, etc. Specific software is required.
26
Q

What are some dynamic approaches for gene prediction?

A
  • hybrid methods
  • choosers
  • combiners
  • pipelines
27
Q

What do hybrid methods for gene prediction do?

A

use both intrinsic and extrinsic methods to predict genes
28
Q

What do combiners do for gene prediction? Give an example.

A

combine independent predictions (e.g., EVidenceModeler)
29
Q

Example of a pipeline method for gene prediction?

A

e.g., MAKER
30
Q

MAKER pipeline method for gene prediction: what data is used?

A
  • intrinsic: newly sequenced genome (repeats masked)
  • extrinsic evidence:
    • RNA-seq and/or
    • protein sequences from selected lineages
31
Q

MAKER pipeline method for gene prediction: What are the initial gene predictions based on?

A

homology
32
Q

MAKER pipeline method for gene prediction: steps

EXAM QUESTION (2022): steps of a procedure combining intrinsic and extrinsic annotation for a non-model organism’s newly assembled genome without available RNA-seq data

A
  1. initial gene predictions (extrinsic)
  2. extraction of species-specific content statistics (intrinsic)
  3. generation of species-specific HMMs (intrinsic)
  4. (refined) gene predictions (intrinsic)
     repeat steps 2-4 ca. 2 times
  5. final gene predictions
33
Q

How are predicted genes evaluated?

A
  • quantify expected gene content for a given lineage
  • lineage-specific near-universal single-copy orthologs (BUSCO, https://busco.ezlab.org/)
  • how many are complete, partial, absent, duplicated?
34
Q

What is BUSCO?

A

BUSCO (Benchmarking Universal Single-Copy Orthologs) scores
  • look for presence/absence of highly conserved genes in the assembly
  • aim: highest percentage of genes identified in the assembly
  • a BUSCO complete score above 95% is considered good
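
The percentages behind a BUSCO-style summary are simple ratios over the number of ortholog groups searched. The sketch below only illustrates the arithmetic: the counts are invented, and the function is a hypothetical helper, not part of the BUSCO software. BUSCO itself labels the "partial" and "absent" categories as fragmented and missing.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return category percentages from BUSCO-style counts."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no ortholog groups counted")

    def pct(n):
        return 100 * n / total

    return {
        "complete": pct(single + duplicated),  # complete = single-copy + duplicated
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }

# Hypothetical counts for a lineage set of 1000 ortholog groups.
summary = busco_summary(single=930, duplicated=30, fragmented=20, missing=20)
print(summary["complete"] > 95, summary)
```

A high duplicated fraction alongside a good complete score can still point to assembly problems, so the categories are worth reading together rather than reducing them to one number.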
35
Q

EXAM QUESTION: Describe the main points of intrinsic and extrinsic genome annotation, and a disadvantage for each one of them. (2019)

What do intrinsic and extrinsic mean in gene prediction? Explain/describe the methods and name disadvantages. (2020)

A

Intrinsic: just use the query data (genome/s); build a statistical model, e.g., an HMM

Extrinsic: comparative; use other data, e.g., RNA sequences or proteins from other species/lineages

Disadvantages
  • intrinsic: needs a lot of training data, which is not available for many non-model organisms
  • extrinsic: difficult to predict new genes