Finals Flashcards

(100 cards)

1
Q

Functional genomics

A

The functional annotation of genes is a large field that uses extensive experimentation to describe the functions and interactions of genes and gene products

2
Q

For functional annotation, what are BLAST and the InterPro software framework based on?

A

Sequence similarity

3
Q

Examples of functional classification schemes

A

Gene Ontology (GO)
Enzyme Commission (EC) Numbers
Kyoto Encyclopedia of Genes & Genomes (KEGG) BRITE

4
Q

How many classification schemes have been devised for protein structures and what are they?

A

Three:

SCOP (Structural Classification of Proteins)
CATH (Class, Architecture, Topology, Homologous superfamily)
FSSP (Families of structurally similar proteins)

5
Q

What is the success or reliability of functional prediction influenced by?

A

Accuracy of the alignment of homologous characters in two or more sequences

6
Q

What is the twilight zone?

A

The range in which the sequence similarity between two protein sequences is 15-25%; here the reliability of predicting that the two proteins are homologous (evolutionarily related) is only 10%

7
Q

What is the percent identity that might occur between two protein sequences of longer than 100 amino acids simply by chance?

A

10-20%

8
Q

What is the reliability of prediction that two protein sequences are homologous when the sequence identity is above 30%?

A

90%

9
Q

What percentage of the amino acids in a sequence determines the protein fold, and hence the general structure of a protein?

A

3-4%

10
Q

What is the likely sequence similarity of proteins with similar structure?

A

> 33%

11
Q

What is the midnight zone?

A

Sequence identity is very low (<15%): the sequences are so different that their relationship is nearly invisible at the sequence level, yet they may adopt very similar 3D structures
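
The identity cut-offs on cards 6-11 can be sketched as a small classifier. This is only an illustration: `percent_identity` and `homology_zone` are hypothetical helper names, the two sequences are assumed to be already aligned (equal length, `-` for gaps), and the thresholds are simply the ones quoted on these cards.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def homology_zone(identity: float) -> str:
    """Map a percent identity onto the zones quoted on the cards."""
    if identity < 15:
        return "midnight zone"
    if identity <= 25:
        return "twilight zone"
    if identity > 30:
        return "homology likely"
    return "between zones"  # 25-30% is not covered by the cards
```

For example, `homology_zone(percent_identity("ACDE", "ACDF"))` classifies a pair that is 75% identical.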

12
Q

What percentage of gene annotations in public databases are incorrect or misleading?

A

5-63%

13
Q

How are the errors in gene annotations in public databases propagated?

A

Via analyses of new genomes

14
Q

Where do the errors in gene annotations arise from?

A

They originate from various sources including genome assembly and gene prediction.

Genome assembly: erroneous or incomplete genome assembly, leading to truncated or chimeric genes
Genes and gene function prediction: Single nucleotide errors

15
Q

Which databases are the best-curated for protein functional annotations and why?

A

RefSeq
UniProt/SwissProt

They require multiple lines of experimentally derived evidence

16
Q

Which sequence databases are integrated in the InterPro framework?

A

HAMAP, Panther, PIRSF, TIGRFAM

17
Q

Which method for predicting signal peptides is integrated in the InterPro framework?

A

SignalP

18
Q

Which method for predicting transmembrane regions is integrated in the InterPro framework?

A

TMHMM

19
Q

Which fingerprint databases are integrated in the InterPro framework?

A

PRINTS

20
Q

Which motif databases are integrated in the InterPro framework?

A

ProSite

21
Q

Which domain databases are integrated in the InterPro framework?

A

Gene3D, Pfam, ProDom, ProSite (Profile), SMART, Superfamily

22
Q

The sensitivity of BLAST is comparable to what algorithm?

A

Smith-Waterman

23
Q

How can BLAST recognize distant homologues?

A

PSI-BLAST implements an iterative algorithm that uses a position-specific scoring matrix (PSSM). The matrix is rebuilt at each iteration from the sequences found in the previous iterations.
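
The matrix-building step can be sketched as follows. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes gap-free hits of equal length, a uniform background frequency, and a flat pseudocount, and the names `build_pssm` and `score_segment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from equal-length hits."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all hits must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_hits)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Score a candidate segment by summing its per-column log-odds."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

In an iterative search, the hits that score well against the current matrix would be added to `aligned_hits` and the matrix rebuilt, which is how residues conserved across the family come to outweigh the original query.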

24
Q

What could lead to an erroneous transfer of function in BLAST-based annotation methods?

A

Homologues may align only over a small portion of their overall lengths.

The homologue may have been wrongly annotated in the first place.

25
Issues with BLAST-based annotation methods
1. Distant homologues
2. Homologues may align over only a small portion of their overall lengths
3. Misannotated homologues
26
What is commonly used to predict orthologous proteins from KEGG databases?
BLAST
27
What can misannotation of homologous proteins also lead to in the case of orthologs?
Orthologs are predicted from KEGG databases, and misannotation may lead to erroneous predictions of metabolic pathways and protein families
28
A server that incorporates a curated domain-family database
PfamA
29
A server that incorporates a computationally-generated domain-family database
PfamB
30
How does Pfam generate clusters of domain families?
Defined by the program ADDA
31
How are clusters of domain families formed in ADDA?
From pairwise comparisons of profiles of domains, which are inferred by penalizing splits and partial overlaps in a pairwise, BLAST-aligned protein-similarity matrix
32
How does the SMART domain database work?
SMART (Simple Modular Architecture Research Tool) requires manual intervention during annotation and is linked to a database called STRING (Search Tool for the Retrieval of Interacting Genes)
33
How does the ProDom domain database work?
It compares PSI-BLAST results against the UniProtKB database and infers domain information from the resulting data
34
Which domain databases does ProDom complement?
Pfam, ProSite, SMART
35
How does the SUPERFAMILY resource for domains work?
It uses the SCOP classification scheme for inferred protein-domain superfamilies and assigns gene ontology (GO) terms to these families using Gene Ontology annotation
36
How does the Gene3D resource for domains work?
It combines structural (CATH classification scheme) and functional information to annotate domains found in sequences in the UniProtKB, RefSeq, and Ensembl databases. It clusters annotated superfamilies into functional subfamilies using GeMMA.
37
How does the ProSite motif database work?
Recognizes protein motifs using regular expressions and weight matrix profiles, augmented by the annotation rule database ProRule.
38
What does the implementation of ProRule in ProSite do?
It increases the reliability by imposing rules, such as essential amino acids in the active sites of enzymes.
39
What are some clustering methods and databases?
Methods: OrthoMCL, InParanoid, MultiParanoid
Databases: OrthoDB, Clusters of Orthologous Groups of Proteins
40
What are the clustering methods and databases based on?
They use all-versus-all similarity matrices, created from pairwise alignments of protein sequences using algorithms such as BLAST, FASTA, and Smith-Waterman
41
What is the largest, publicly available, all-versus-all protein sequence similarity score matrix called?
Similarity Matrix of Proteins (SIMAP)
42
What is SIMAP2 limited to?
Proteins encoded in complete genome sequences (Not publicly available)
43
In which database is SIMAP2 employed?
eggNOG: evolutionary genealogy of genes: Non-supervised Orthologous Groups
44
How are the proteins assembled in eggNOG?
They are assembled into in-paralogous (as opposed to out-paralogous and orthologous) groups by comparing sequence similarities within and among clades
45
How are orthologous groups found from eggNOG?
Orthologous groups amongst the in-paralogous groups in eggNOG are then identified by creating and merging reciprocal best hits among three species
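The reciprocal-best-hit step can be sketched for a single pair of species. This is a minimal illustration of the building block only: eggNOG's merging across three species is not reproduced, the function names are hypothetical, and the similarity scores in the usage example are invented.

```python
def best_hits(scores):
    """Return the highest-scoring subject for each query.

    scores maps (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores: a3's best hit is b1, but b1's best hit is a1.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
          ("a2", "b2"): 120, ("a3", "b1"): 90}
b_vs_a = {("b1", "a1"): 190, ("b2", "a2"): 110, ("b1", "a3"): 80}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a3` is excluded because its best hit is not reciprocated, which is the property that separates candidate orthologues from other homologues.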
46
How can clustering methods be improved to predict orthologues and paralogues?
Amino acid substitution models like BLOSUM can be replaced with models that better estimate phylogenetic distances, such as JTT and WAG, and the deduced phylogenetic tree of individual genes can be reconciled with the phylogenetic tree of species.
47
In which methods or databases is the reconciliation of a deduced phylogenetic tree of individual genes to the phylogenetic tree of species accounted for?
SYNERGY
PhIG (Phylogenetically Inferred Groups)
TreeFam
PANTHER
48
What are the problems with databases with phylogenomic annotation algorithms?
They provide a compelling option for rapidly detecting protein function, but they have several limitations:
1. Their coverage of species and proteins is limited.
2. They use sequence similarity searches to position the query sequence in phylogenetic trees that were constructed using substitution models and taxonomic information, which seems questionable.
3. Even perfect positioning does not guarantee accurate prediction of function for the query protein sequence, because homologous proteins do not always have the same function.
49
How do we annotate proteins based on structure?
We compare the predicted folds of the gene products against structurally similar proteins in databases such as protein data bank (PDB)
50
What are the limitations of annotating protein functions based on structure?
1. Only 60% of structurally similar proteins without significant sequence similarity share a binding site location, so the function inferred from this comparison may not always be correct.
2. Functional knowledge is lacking for many of the 3D structures in PDB, because structural genomics initiatives are directed only at determining 3D structures through high-throughput structure determination.
3. In convergent evolution, the same function is observed even with different folds, which prevents the use of structural homologues to infer a function.
51
What should be done to increase the accuracy of structure-based function prediction and why?
Conserved amino acids in active and binding sites need to be evaluated. That is because, for enzymes, catalytic residues and their locations within the protein and orientation within the active sites are usually conserved and are not associated with structural variation, thereby allowing the functional annotation of distantly related homologues.
52
How are conserved residues identified to improve the accuracy of structure-based function prediction?
Conserved residues in protein families are identified through multiple sequence alignment
53
Where can the functional classification of proteins be evaluated?
In the annual Critical Assessment of Function Annotation (CAFA) challenge
54
When is promising annotation achieved?
When using machine learning and supervised classification methods, and unsupervised clustering methods
55
In what databases are the results for experimentally evaluated and computationally predicted protein-protein interaction networks and protein-protein complexes found?
DIP
STRING
56
Where can machine learning and supervised classification methods, with unsupervised clustering methods be applied?
They can be applied to predict individual features of proteins (domain boundaries, subcellular location, conserved residues), to collectively predict a function from data integrated from different sources (structure, taxonomy, sequence, transcription, metabolic and protein-protein interaction networks), or to enhance an existing homology-based annotation.
57
What does gene prediction (also called structural annotation or gene finding) mean?
It aims to identify the structural elements in a genomic region that represent a gene.
58
What do extrinsic methods for gene prediction do?
They align transcriptomic, protein sequence, and/or other evidence datasets to the genomic sequence for gene prediction
59
What do intrinsic methods for gene prediction do?
They use statistical patterns to identify gene regions in a genomic sequence
60
What is the predicted gene element data typically represented by?
A unified general feature format (GFF)
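A feature line in GFF3 has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The sketch below parses one such line; the helper name `parse_gff3_line` and the example line are invented for illustration, and escaping rules and multi-value attributes are not handled.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line: str) -> GffFeature:
    """Parse a single GFF3 feature line into a GffFeature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has nine tab-separated columns")
    attributes = {}
    for pair in fields[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return GffFeature(fields[0], fields[1], fields[2],
                      int(fields[3]), int(fields[4]),
                      fields[5], fields[6], fields[7], attributes)


# Invented example line, not taken from a real annotation.
feature = parse_gff3_line(
    "scaffold_1\tAUGUSTUS\tgene\t1000\t2500\t0.95\t+\t.\tID=gene0001;Name=hypothetical"
)
gene_length = feature.end - feature.start + 1
```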
61
What is a general pipeline for gene prediction and functional annotation?
RNA-Seq reads -> Transcriptome assembly -> Transcript sequences (Protein sequences + Genome scaffolds) -> Gene prediction -> Gene annotation (InterPro: Domains, motifs, signal peptides) -> Post-processing
62
For extrinsic methods, how are genes predicted?
Based on the alignment success
63
For accurately predicting a gene structure with extrinsic methods, what sequences are preferred?
cDNA sequences
64
What is native alignment in the context of aligning an evidence dataset to a genomic sequence?
The evidence dataset consists of mRNA sequences, which are typically derived from the same species under investigation and match the genome sequence
65
What is trans-alignment in the context of aligning an evidence dataset to a genomic sequence?
The evidence dataset consists of protein sequences from closely related species, which are not expected to match the conceptually translated genomic sequences
66
What are the challenges for extrinsic methods?
1. Alignment inaccuracies
2. Fragmented nature of the evidence data (mRNA or protein sequences)
3. Splice variants of genes
67
Why is the Exonerate algorithm widely used for alignment in extrinsic methods of gene prediction?
1. It processes data relatively rapidly
2. It aligns both protein and nucleotide sequences
68
Which aligners align evidence data accurately across exons and introns?
Pair HMM aligners, such as Pairagon and GeneWise
69
What is the disadvantage of using Pair HMM aligners?
Long computation time
70
Examples of alignment algorithms that use BLAST to produce seed alignments which are then extended using different dynamic programming variants such as Needleman-Wunsch or Smith-Waterman algorithms
EST_GENOME, AAT, Exonerate
71
What are consensus based methods in intrinsic gene prediction?
Consensus-based methods, also known as signal sensors, predict known nucleotide patterns in gene elements. They look for specific, well-defined sequences that mark important functional sites in DNA, such as splice sites, start and stop codons, and the Kozak consensus sequence (related to the initiation of translation)
72
What sites do consensus based methods look for?
Well-known patterns in gene elements, such as the Kozak consensus sequence, start and stop codons, and splice sites
73
Which methods are used to recognize the signals in consensus based methods in intrinsic gene prediction?
Methods built on the Weight Matrix Method (WMM), such as the Position Weight Matrix (PWM), Weight Array Model (WAM), Maximal Dependence Decomposition (MDD), and Windowed Weight Array Model (WWAM)
74
How does the weight matrix method (WMM) work?
Calculates the signal probability and assumes that individual nucleotides are independent
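The independence assumption means the probability of a candidate site is just the product of per-position nucleotide probabilities. A minimal sketch, assuming equal-length example sites over `ACGT` and a flat pseudocount; `train_wmm` and `wmm_log_probability` are hypothetical names and the donor-like training sites are invented.

```python
import math

NUCLEOTIDES = "ACGT"


def train_wmm(signal_sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from example sites."""
    length = len(signal_sites[0])
    model = []
    for pos in range(length):
        counts = {n: pseudocount for n in NUCLEOTIDES}
        for site in signal_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        model.append({n: c / total for n, c in counts.items()})
    return model


def wmm_log_probability(model, candidate):
    """Sum of per-position log probabilities (positions are independent)."""
    return sum(math.log(column[n]) for column, n in zip(model, candidate))


# Invented training sites resembling a splice donor signal.
model = train_wmm(["GTAAGT", "GTGAGT", "GTAAGA"])
```

A weight array model would replace each column with probabilities conditioned on the preceding nucleotide, which is the dependency the next card describes.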
75
How does the weight array model (WAM) work?
Assumes dependencies between adjacent nucleotides
76
How does maximal dependence decomposition (MDD) work?
Implements a decision tree of weight matrix method (WMM) models and extends the dependency considerations to non-adjacent nucleotides
77
How does windowed weight array model (WWAM) work?
Assumes dependencies across three consecutive nucleotides and averages related conditional probabilities among five consecutive nucleotides
78
What are non-consensus based methods in intrinsic gene prediction?
Use nucleotide composition (content) to recognize gene elements and sequence areas (coding and non coding regions)
79
What is the most successful discriminator between coding and non-coding regions when predicting nucleotide by nucleotide in non-consensus intrinsic gene prediction?
Hidden Markov Models using hexamer sequence composition
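The hexamer composition statistic on its own can be sketched as a log-odds score. This is not an HMM and not any particular gene finder's method: it only compares hexamer frequencies estimated from coding and non-coding training sequences. The function names are hypothetical, input is assumed to be uppercase `ACGT`, and the training sequences in the usage example are invented toy data.

```python
import math
from collections import Counter
from itertools import product

ALL_HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]


def hexamer_frequencies(sequences, pseudocount=1.0):
    """Estimate smoothed hexamer frequencies from training sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    total = sum(counts[h] for h in ALL_HEXAMERS) + pseudocount * len(ALL_HEXAMERS)
    return {h: (counts[h] + pseudocount) / total for h in ALL_HEXAMERS}


def coding_score(seq, coding_freq, noncoding_freq):
    """Positive when the hexamer content looks more coding than non-coding."""
    return sum(
        math.log(coding_freq[seq[i:i + 6]] / noncoding_freq[seq[i:i + 6]])
        for i in range(len(seq) - 5)
    )


# Invented toy training data.
coding_freq = hexamer_frequencies(["ATGGCCGCTGCAGCCGCTGCCGCAGCCTGA"] * 3)
noncoding_freq = hexamer_frequencies(["ATATATTTTAAATATATTTAAATTTATATA"] * 3)
```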
80
To extend the prediction capability of the single-nucleotide approach (HMMs with hexamer sequence composition to discriminate between coding and non-coding regions) to versatile gene elements or even complete gene structures, how are the prediction algorithms enhanced?
Three-period, fifth-order generalized HMMs (GHMMs): hexamer sequences are used together with built-in knowledge of codon structure to ensure that the reading frame is preserved
80
Examples of programs using GHMM based three-period fifth-order Markov Chain model
GENSCAN, GeneMark-ES
81
Which Markov models are used to further improve predictions from GHMM based Markov Chain models?
Interpolated Markov Models (IMM) in which Markov models of different order are interpolated
82
Which gene finders implement interpolated Markov models?
AUGUSTUS, GlimmerHMM
83
How have ab initio prediction algorithms been enhanced?
They have been enhanced using information from syntenic (colocalized) regions among multiple genomes. It is advisable to employ genomes from taxonomically closely related species.
84
How are functional prediction models created for ab initio gene prediction?
Ab initio predictors have to be trained with reliable training datasets, which are specific to each genome
85
What can be done if training data are not available for a specific genome when creating a functional prediction model for ab initio gene prediction?
Parameter values for prediction models can be estimated by predicting genes first using suboptimal parameter values, and then by recalculating new values based on these predicted genes
86
What are suboptimal parameter values?
Copied from prediction models for closely related species
Inferred from the structure of core eukaryotic genes
Obtained from unsupervised gene prediction programs (such as GeneMark)
87
What was the first attempt to combine the prediction data from multiple sources?
Using the program COMBINER (Linear and statistical combinations of the prediction data from multiple sources)
88
What is the successor of COMBINER and what are its ab initio and evidence based on? How does it work?
JIGSAW.
Ab initio: internal support with GHMMs.
Evidence: external evidence of the structural elements of a gene is expressed using feature vectors. Feature vectors give a weighting coefficient to each prediction source, and dynamic programming (combined with decision trees) is used to establish optimal gene structures.
89
Ensembl
Combined gene prediction program
Prefers evidence-based over ab initio predictions
High-quality annotations at the cost of sensitivity
90
EVIGAN
Combined gene prediction program
Predicts gene structures using dynamic Bayesian networks
Estimated with maximum likelihood
91
GLEAN
Combined gene prediction program
Uses a latent class analysis (LCA) algorithm to give consensus predictions
Gene structures are predicted from gene structural elements
92
MAKER2
Combined gene prediction program
Uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction
93
What is the advantage of using MAKER2?
It can estimate the reliability of any prediction as it uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction
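AED can be sketched at the nucleotide level, assuming the commonly cited definition AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the prediction and SP is the fraction of predicted bases supported by evidence. The function names are hypothetical and intervals are taken as 1-based inclusive `(start, end)` pairs; MAKER2's own implementation is not reproduced here.

```python
def covered_positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(prediction, evidence):
    """0.0 means perfect agreement with the evidence; 1.0 means no support."""
    pred = covered_positions(prediction)
    evid = covered_positions(evidence)
    if not pred or not evid:
        return 1.0
    shared = len(pred & evid)
    sensitivity = shared / len(evid)   # evidence bases recovered
    specificity = shared / len(pred)   # predicted bases supported
    return 1.0 - (sensitivity + specificity) / 2.0
```

Under this definition a predicted exon at 1-100 with evidence at 51-150 shares half of each, giving an AED of 0.5.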
94
EVM
Combined gene prediction program
Accommodates a variety of gene prediction and evidence data
Allows manual adjustment of the weight of each data source
95
What are orthologues and paralogues inferred by in InParanoid and MultiParanoid?
Pairwise reciprocal
96
What pairwise similarity matrix does MultiParanoid use?
InParanoid
97
What pairwise similarity matrix does InParanoid use?
BLAST based
98
How are orthologues and paralogues inferred in OrthoMCL?
Markov Clustering Algorithm
99