Finals Flashcards

Question

Issues with BLAST-based annotation methods

Answer 1

1. Distant homologues
2. Homologues may only align over a small portion of the overall lengths
3. Misannotated homologues

Answer 2

Orthologs are predicted from KEGG databases, and misannotation may lead to erroneous predictions of metabolic pathways and protein families

Answer 3

Defined by the program ADDA

Answer 4

From pairwise comparisons of domain profiles, with the domains inferred by penalizing splits and partial overlaps in a pairwise, BLAST-aligned protein-similarity matrix

Answer 5

Simple Modular Architecture Research Tool (SMART) requires manual intervention during annotation and is linked to the STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins)

Answer 6

Comparing the results from PSI-BLAST against the UniProtKB database and inferring domain information from the resultant data

Answer 7

Pfam, ProSite, SMART

Answer 8

It uses the SCOP classification scheme for inferred protein-domain superfamilies and assigns gene ontology (GO) terms to these families using Gene Ontology annotation

Answer 9

It combines structural (CATH classification scheme) and functional information to annotate domains found in sequences in the databases UniProtKB, RefSeq, and Ensembl. It clusters annotated superfamilies into functional subfamilies using GeMMA.

Answer 10

Recognizes protein motifs using regular expressions and weight matrix profiles, augmented by the annotation rule database ProRule.
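The regular-expression side of this can be sketched as follows. This is a minimal illustration that assumes the common PROSITE pattern syntax (`-` between elements, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors). The function name `prosite_to_regex` and the test sequence are hypothetical, and the sketch does not cover weight matrix profiles or ProRule.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' pins the match to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        core, _, count = element.partition("(")
        count = "{" + count.rstrip(")") + "}" if count else ""
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        parts.append(prefix + core + count + suffix)
    return "".join(parts)

# Illustrative pattern written in PROSITE syntax.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
# Hypothetical sequence built to satisfy the pattern.
sequence = "GG" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
match = re.search(regex, sequence)
```

Here `match.start()` gives the position of the motif in the sequence, or `match` is `None` when the motif is absent.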

Answer 11

It increases the reliability by imposing rules, such as essential amino acids in the active sites of enzymes.

Answer 12

Methods: OrthoMCL, InParanoid, MultiParanoid
Databases: OrthoDB, Clusters of Orthologous Groups of Proteins

Answer 13

They use all-versus-all similarity matrices, built from pairwise alignments of protein sequences using algorithms such as BLAST, FASTA, and Smith-Waterman
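A minimal sketch of how such an all-versus-all matrix could be filled in, using a plain Smith-Waterman local alignment score. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions standing in for a real substitution matrix such as BLOSUM, and the protein names and sequences are hypothetical.

```python
from itertools import combinations

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

def all_versus_all(proteins):
    """Symmetric score matrix as a dict of dicts keyed by protein name."""
    names = sorted(proteins)
    matrix = {name: {} for name in names}
    for name in names:
        matrix[name][name] = smith_waterman_score(proteins[name], proteins[name])
    for x, y in combinations(names, 2):
        score = smith_waterman_score(proteins[x], proteins[y])
        matrix[x][y] = matrix[y][x] = score
    return matrix

# Hypothetical toy sequences.
proteins = {"p1": "MKTAYIAKQR", "p2": "MKTAYIAKQK", "p3": "GSHMLEDPVD"}
matrix = all_versus_all(proteins)
```

Each pair is aligned only once and mirrored, which halves the work compared with scoring both directions.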

Answer 14

Similarity Matrix of Proteins (SIMAP)

Answer 15

Proteins encoded in complete genome sequences (Not publicly available)

Answer 16

eggNOG: evolutionary genealogy of genes: Non-supervised Orthologous Groups

Answer 17

They are assembled into in-paralogous (as opposed to out-paralogous and orthologous) groups by comparing sequence similarities within and among clades

Answer 18

Orthologous groups amongst the in-paralogous groups in eggNOG are then identified by creating and merging reciprocal best hits among three species
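The reciprocal-best-hit step can be sketched as below. This is a toy illustration, not the eggNOG implementation: the `best_hit` table, the gene names, and the species labels are all hypothetical, and a real pipeline would derive best hits from alignment scores.

```python
from itertools import combinations

def rbh_pairs(best_hit, species):
    """Pairs of genes that are each other's best hit across two species."""
    pairs = set()
    for (query, _target_species), hit in best_hit.items():
        # Reciprocal if the hit's best match in the query's species is the query.
        if best_hit.get((hit, species[query])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs

def rbh_triangles(pairs):
    """Gene triples in which all three pairs are reciprocal best hits."""
    genes = sorted(set().union(*pairs))
    return [
        trio for trio in combinations(genes, 3)
        if all(frozenset(pair) in pairs for pair in combinations(trio, 2))
    ]

# Hypothetical genes from three species.
species = {"hsA": "human", "hsB": "human", "mmA": "mouse", "mmB": "mouse", "drA": "fish"}
# best_hit[(gene, target species)] = best-scoring gene in that species.
best_hit = {
    ("hsA", "mouse"): "mmA", ("hsA", "fish"): "drA",
    ("mmA", "human"): "hsA", ("mmA", "fish"): "drA",
    ("drA", "human"): "hsA", ("drA", "mouse"): "mmA",
    ("hsB", "mouse"): "mmB", ("mmB", "human"): "hsB",
    ("hsB", "fish"): "drA",  # not reciprocated: drA prefers hsA
}
pairs = rbh_pairs(best_hit, species)
triangles = rbh_triangles(pairs)
```

Triangles that share an edge could then be merged into larger groups, which is the merging step the card refers to.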

Answer 19

Amino acid substitution models like BLOSUM can be replaced with models that better estimate phylogenetic distances, such as JTT and WAG, and the deduced phylogenetic tree of individual genes can be reconciled with the phylogenetic tree of species.

Answer 20

SYNERGY, PhIG (Phylogenetically Inferred Groups), TreeFam, PANTHER

Answer 21

They provide a compelling option to rapidly detect protein function, but they are limited in several ways:
1. Their coverage of species and proteins is limited
2. Using sequence similarity searches to position the query sequence in the databases' phylogenetic trees, which were constructed using substitution models and taxonomic information, seems questionable
3. Even perfect positioning does not guarantee an accurate prediction of function for the query protein sequence, because homologous proteins do not always have the same function

Answer 22

We compare the predicted folds of the gene products against structurally similar proteins in databases such as the Protein Data Bank (PDB)

Answer 23

1. Only 60% of structurally similar proteins without significant sequence similarity share a binding site location, thus the function inferred from this comparison may not always be correct.
2. Moreover, functional knowledge about a lot of the 3D structures of proteins in PDB is lacking as structural genomics initiatives are only directed at determining the 3D structures through high-throughput structure determination efforts.
3. In convergent evolution, the same function is observed even with different folds, thus preventing the use of structural homologues to infer a function

Answer 24

Conserved amino acids in active and binding sites need to be evaluated. That is because, for enzymes, catalytic residues and their locations within the protein and orientation within the active sites are usually conserved and are not associated with structural variation, thereby allowing the functional annotation of distantly related homologues.

Answer 25

Conserved residues in protein families are identified through multiple sequence alignment
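As a rough sketch of the idea, the snippet below scans the columns of an already-aligned set of sequences and reports positions where one residue dominates. The alignment, the function name `conserved_columns`, and the threshold are hypothetical. Producing the alignment itself is left to a dedicated aligner.

```python
from collections import Counter

def conserved_columns(alignment, threshold=1.0):
    """Return (column index, residue) for columns where one residue reaches the threshold."""
    hits = []
    for index, column in enumerate(zip(*alignment)):
        residue, count = Counter(column).most_common(1)[0]
        # Gap-dominated columns are never reported as conserved.
        if residue != "-" and count / len(column) >= threshold:
            hits.append((index, residue))
    return hits

# Hypothetical three-sequence alignment ('-' marks a gap).
alignment = ["MKHDS", "MRHES", "M-HDT"]
strict = conserved_columns(alignment)
relaxed = conserved_columns(alignment, threshold=0.6)
```

With the default threshold only fully invariant columns are reported, while a lower threshold also admits columns with a majority residue.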

Answer 26

In the annual Critical Assessment of Function Annotation (CAFA) challenge

Answer 27

When using machine learning and supervised classification methods, and unsupervised clustering methods

Answer 28

DIP, STRING

Answer 29

It can be applied to predict individual features of proteins (domain boundaries, subcellular location, conserved residues), to collectively predict a function with data integrated from different sources (structure, taxonomy, sequence, transcription, metabolic and protein-protein interaction networks), or to enhance an existing homology-based annotation.

Answer 30

Aims to identify structural elements in a genomic region that represent a gene.

Answer 31

They align transcriptomic, protein sequence, and/or other evidence datasets to the genomic sequence for gene prediction

Answer 32

They use statistical patterns to identify gene regions in a genomic sequence

Answer 33

A unified general feature format (GFF)
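For orientation, here is a minimal sketch of reading one feature line, assuming the GFF3 flavor of the format with its nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The `Feature` type, the function name, and the example line are hypothetical. The sketch ignores comment lines, directives, and URL-escaped characters.

```python
from typing import Dict, NamedTuple

class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]

def parse_gff3_line(line: str) -> Feature:
    """Parse a single GFF3 feature line into a Feature record."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns")
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(pair.split("=", 1) for pair in columns[8].split(";") if pair)
    return Feature(columns[0], columns[1], columns[2], int(columns[3]),
                   int(columns[4]), columns[5], columns[6], columns[7], attributes)

# Hypothetical feature line.
feature = parse_gff3_line("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=demo\n")
```

Because every gene predictor writes the same nine columns, outputs from different tools can be compared or combined without tool-specific parsers.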

Answer 34

RNA-Seq reads -> Transcriptome assembly -> Transcript sequences (Protein sequences + Genome scaffolds) -> Gene prediction -> Gene annotation (InterPro: Domains, motifs, signal peptides) -> Post-processing

Answer 35

Based on the alignment success

Answer 36

cDNA sequences

Answer 37

mRNA sequences used as the evidence dataset are typically derived from the same species under investigation and match the genome sequence

Answer 38

Protein sequences used as the evidence dataset come from closely related species and are not expected to match the conceptually translated genomic sequences

Answer 39

1. Alignment inaccuracies
2. Fragmented nature of evidence (mRNA or protein sequences) data
3. Splice variants from genes

Answer 40

1. Process data relatively rapidly
2. Align both protein and nucleotide sequences

Answer 41

Pair HMM aligners, such as Pairagon and GeneWise

Answer 42

Large computational time

Answer 43

EST_GENOME, AAT, Exonerate

Answer 44

Consensus-based methods, also known as signal sensors, predict known nucleotide patterns in gene elements. These methods look for specific, well-defined sequences that indicate important functional sites in DNA, such as splice sites, start and stop codons, and the Kozak consensus sequence (related to the initiation of translation)

Answer 45

Well-known patterns in gene elements, such as the Kozak consensus sequence, start and stop codons, and splice sites

Answer 46

Methods utilizing the Weight Matrix Method (WMM), such as Position Weight Matrix (PWM), Weight Array Model (WAM), Maximal Dependence Decomposition (MDD), and Windowed Weight Array Model (WWAM)

Answer 47

Calculates the signal probability and assumes that individual nucleotides are independent
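A minimal sketch of this independence assumption: each position contributes its own log-odds term, and the terms are simply summed. The training sites, the uniform background of 0.25 per base, and the pseudocount are assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """One log-odds column per position, estimated independently of all other positions."""
    pwm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in BASES
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds; no term depends on a neighbouring base."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_window(pwm, sequence):
    """Start index of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(range(len(sequence) - width + 1),
               key=lambda start: score(pwm, sequence[start:start + width]))

# Hypothetical aligned signal sites used for training.
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(sites)
position = best_window(pwm, "CCCGTAAGTCCC")
```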

Answer 48

Assumes dependencies between adjacent nucleotides
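The adjacent-nucleotide dependency can be sketched with position-specific conditional probabilities, P(base at i | base at i-1). The training sites are hypothetical and were picked so that the second base depends entirely on the first, which a model with independent positions could not capture.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_wam(sites, pseudocount=1.0):
    """Return (first-position probabilities, per-position conditional probabilities)."""
    first_counts = Counter(site[0] for site in sites)
    first = {base: (first_counts[base] + pseudocount) / (len(sites) + 4 * pseudocount)
             for base in BASES}
    transitions = []
    for i in range(1, len(sites[0])):
        pair_counts = Counter((site[i - 1], site[i]) for site in sites)
        previous_counts = Counter(site[i - 1] for site in sites)
        transitions.append({
            (prev, cur): (pair_counts[(prev, cur)] + pseudocount)
                         / (previous_counts[prev] + 4 * pseudocount)
            for prev in BASES for cur in BASES
        })
    return first, transitions

def wam_log_prob(model, sequence):
    """Log probability of a sequence, chaining each base on its left neighbour."""
    first, transitions = model
    total = math.log(first[sequence[0]])
    for i in range(1, len(sequence)):
        total += math.log(transitions[i - 1][(sequence[i - 1], sequence[i])])
    return total

# Hypothetical sites: 'A' is always followed by 'C', 'G' always by 'T'.
model = train_wam(["AC", "AC", "GT", "GT"])
```

In this toy set C and T are equally frequent at the second position, so only the conditional term lets the model prefer "AC" over "AT".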

Answer 49

Implements a decision tree of weight matrix methods (WMMs) and extends the dependency considerations across non-adjacent nucleotides

Answer 50

Assumes dependencies across three consecutive nucleotides and averages related conditional probabilities among five consecutive nucleotides

Answer 51

Use nucleotide composition (content) to recognize gene elements and sequence areas (coding and non-coding regions)

Answer 52

Hidden Markov Models using hexamer sequence composition
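Hexamer composition corresponds to a fifth-order Markov chain: each base is predicted from the five bases before it. The sketch below covers only that composition score, as a log-likelihood ratio between a coding and a non-coding model. It is not a full HMM with states and transitions, and the training sequences are tiny hypothetical stand-ins for real training sets.

```python
import math
from collections import Counter

def train_fifth_order(sequences, pseudocount=1.0):
    """Return a function giving P(sixth base | preceding five bases)."""
    hexamers, contexts = Counter(), Counter()
    for sequence in sequences:
        for i in range(len(sequence) - 5):
            hexamers[sequence[i:i + 6]] += 1
            contexts[sequence[i:i + 5]] += 1

    def probability(hexamer):
        return (hexamers[hexamer] + pseudocount) / (contexts[hexamer[:5]] + 4 * pseudocount)

    return probability

def coding_score(sequence, coding, noncoding):
    """Sum of log2 likelihood ratios over all hexamers; positive favours coding."""
    return sum(
        math.log2(coding(sequence[i:i + 6]) / noncoding(sequence[i:i + 6]))
        for i in range(len(sequence) - 5)
    )

# Hypothetical training data.
coding_model = train_fifth_order(["ATGGCCGCCGCCGCCGCCGCC"])
noncoding_model = train_fifth_order(["TATATATATATATATATATAT"])
gc_rich = coding_score("GCCGCCGCCGCC", coding_model, noncoding_model)
at_rich = coding_score("ATATATATATAT", coding_model, noncoding_model)
```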

Answer 53

Three-period, fifth-order generalized HMMs (GHMMs): hexamer sequences are used together with built-in knowledge of codon structure to ensure that the reading frame is preserved

Answer 54

GENSCAN, GeneMark-ES

Answer 55

Interpolated Markov Models (IMMs), in which Markov models of different orders are interpolated

Answer 56

AUGUSTUS, GlimmerHMM

Answer 57

It has been enhanced using information from syntenic (=colocalized) regions among multiple genomes. It is advisable to employ genomes from taxonomically closely related species.

Answer 58

Ab initio predictors have to be trained with reliable training datasets, which are specific to each genome

Answer 59

Parameter values for prediction models can be estimated by predicting genes first using suboptimal parameter values, and then by recalculating new values based on these predicted genes

Answer 60

Copied from prediction models for closely related species; inferred from the structure of core eukaryotic genes; obtained from unsupervised gene prediction programs (such as GeneMark)

Answer 61

Using the program COMBINER (Linear and statistical combinations of the prediction data from multiple sources)

Answer 62

JIGSAW.
Ab initio: internal support with GHMMs.
Evidence: expresses external evidence of structural elements of a gene using feature vectors. Feature vectors give a weighting coefficient to each prediction source, and dynamic programming (combined with decision trees) is used to establish optimal gene structures.

Answer 63

Combined gene prediction program. Prefers evidence-based over ab initio predictions. Produces high-quality annotations at the cost of sensitivity.

Answer 64

Combined gene prediction program. Predicts gene structures using dynamic Bayesian networks, estimated with maximum likelihood.

Answer 65

Combined gene prediction program. Uses latent class analysis (LCA) algorithm to give consensus predictions. Gene structures are predicted from gene structural elements.

Answer 66

Combined gene prediction program. Uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction.

Answer 67

It can estimate the reliability of any prediction, because it uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction
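A minimal sketch of the AED calculation, assuming the usual nucleotide-level definition AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the prediction and SP is the fraction of predicted bases supported by evidence. Exons are given as hypothetical 1-based inclusive (start, end) pairs. Real tools may compute this differently in detail.

```python
def covered_positions(exons):
    """Set of genomic positions covered by a list of inclusive (start, end) exons."""
    return {position for start, end in exons for position in range(start, end + 1)}

def annotation_edit_distance(predicted_exons, evidence_exons):
    """0.0 means perfect agreement with the evidence, 1.0 means no overlap at all."""
    predicted = covered_positions(predicted_exons)
    evidence = covered_positions(evidence_exons)
    if not predicted or not evidence:
        raise ValueError("both annotations must cover at least one base")
    overlap = len(predicted & evidence)
    sensitivity = overlap / len(evidence)
    specificity = overlap / len(predicted)
    return 1.0 - (sensitivity + specificity) / 2.0

# Hypothetical gene models.
identical = annotation_edit_distance([(1, 100), (201, 300)], [(1, 100), (201, 300)])
half_overlap = annotation_edit_distance([(1, 100)], [(51, 150)])
disjoint = annotation_edit_distance([(1, 100)], [(501, 600)])
```

Lower values indicate predictions that agree more closely with the aligned evidence, which is why the score can serve as a per-gene reliability measure.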

Answer 68

Combined gene prediction program. Accommodates a variety of gene prediction and evidence data, and allows for manual weight adjustment of each data source.

Answer 69

Pairwise reciprocal

Answer 70

InParanoid

Answer 71

BLAST-based

Answer 72

Markov Clustering Algorithm
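The core loop of Markov clustering can be sketched in plain Python as alternating expansion (matrix squaring) and inflation (element-wise powering followed by column normalization). The inflation value, the iteration count, the added self-loops, and the toy similarity graph are assumptions. In an orthology setting the edge weights would come from BLAST-based similarity scores.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [[matrix[r][c] / sums[c] for c in range(n)] for r in range(n)]

def expand(matrix):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def inflate(matrix, power):
    """Inflation step: strengthen strong flows, weaken weak ones."""
    return normalize_columns([[value ** power for value in row] for row in matrix])

def markov_cluster(similarity, inflation=2.0, iterations=30):
    """Return clusters as sorted tuples of node indices."""
    n = len(similarity)
    # Self-loops keep every column sum positive.
    matrix = normalize_columns(
        [[similarity[r][c] + (1.0 if r == c else 0.0) for c in range(n)] for r in range(n)]
    )
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Each non-empty row lists the nodes attracted to that row's node.
    clusters = {tuple(c for c, value in enumerate(row) if value > 1e-6) for row in matrix}
    clusters.discard(())
    return sorted(clusters)

# Hypothetical graph: nodes 0-2 are mutually similar, as are nodes 3-4.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = markov_cluster(similarity)
```

Higher inflation values tend to give smaller, tighter clusters, so the parameter controls cluster granularity.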

(100 cards)