Lecture 5&6: Protein Sequence Alignment Flashcards
(19 cards)
What are Orthologs?
Homologous genes in different species that diverged from a common ancestral gene through a speciation event.
What are Paralogs?
Homologous genes that arose from a gene duplication event within the same genome.
What is the formula for percentage sequence identity?
(Number of identical residues / number of residues in the shorter protein) * 100
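A minimal sketch of this formula, assuming the two sequences are already aligned, have equal aligned length, and use '-' for gaps (the function name is illustrative, not from the lecture):

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage identity of two pre-aligned sequences ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (not a gap).
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    # Normalise by the length of the shorter protein, gaps excluded.
    shorter = min(
        len(aligned_a.replace("-", "")), len(aligned_b.replace("-", ""))
    )
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * identical / shorter


# 5 identical residues over a shorter protein of 6 residues
print(round(percent_identity("ACDQFGHK", "ACDEFG--"), 1))  # 83.3
```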
What are the general steps to determine the function of a protein?
- Do fast scans using approximate methods (BLAST).
- Align proteins using a dynamic programming method (Needleman & Wunsch, Smith & Waterman).
- Scan against sequence profiles or HMMs in secondary databases (Pfam, InterPro).
- Align the sequence against family relatives using ClustalW or Jalview.
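What is the difference between the Needleman & Wunsch and Smith & Waterman algorithms?
Needleman & Wunsch performs global alignment: the two sequences are aligned end to end over their full length.
Smith & Waterman performs local alignment: only the best-matching subregions of the two sequences are aligned.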
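The difference can be sketched with a single score-only dynamic programming routine. This is a simplified illustration, assuming a flat match/mismatch score and a linear gap penalty rather than a real substitution matrix; the function name and parameter values are hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score: global (Needleman-Wunsch style) by default,
    local (Smith-Waterman style) when local=True."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]


print(alignment_score("WWWACD", "ACD"))              # -3 (three matches, three gaps)
print(alignment_score("WWWACD", "ACD", local=True))  # 3  (only ACD is aligned)
```

The global score is penalised for the unmatched WWW prefix, while the local score ignores it and reports only the matching region.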
What are some rules of sequence homology?
Proteins longer than 150 residues are likely to be homologs if their sequence identity is > 25%.
For shorter fragments, 30% sequence identity is required.
Structure within families tends to be much more conserved compared to sequence.
Inheriting functional properties from a homolog requires around 60% sequence identity.
What are the different matrices used when comparing two proteins?
Identity Matrix (Binary)
Physicochemical properties matrix (range)
Evolutionary matrices (Dayhoff, BLOSUM matrices)
What is the Dayhoff matrix?
It is an evolutionary matrix.
It measures evolutionary distance in PAM units (point accepted mutations), where 1 PAM = 1 accepted point mutation per 100 residues.
A distance of more than 100 PAM means that multiple substitutions have occurred at the same site.
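To see why a distance above 100 PAM does not mean that every residue has changed, here is a toy calculation. It assumes a uniform model in which each of the 20 amino acids changes to any other with equal probability, which is not the real Dayhoff matrix; the function name is illustrative.

```python
def expected_identity(pam_distance, per_step_change=0.01, states=20):
    """Expected fraction of unchanged sites under a toy uniform model.

    Each PAM step changes a site with probability per_step_change,
    choosing uniformly among the other residues (an assumption, not
    the Dayhoff substitution frequencies).
    """
    # Per-step decay of the "memory" of the original residue.
    decay = 1.0 - per_step_change * states / (states - 1)
    # Sites converge towards the 1/states chance level, never to zero.
    return 1.0 / states + (states - 1) / states * decay ** pam_distance


for distance in (1, 100, 250):
    print(distance, round(expected_identity(distance), 3))
```

Under this toy model the identity after 1 PAM is 0.99, and it stays above the 1/20 chance level even at 250 PAM, because later substitutions increasingly hit sites that have already changed.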
What is the BLOSUM matrix?
It is an evolutionary matrix.
It is derived from substitution patterns observed in conserved blocks of aligned sequences, including more distant relatives than those used for the Dayhoff matrix.
What is the difference between a p-value and an e-value?
The p-value is the probability that a match at least this good was obtained by chance. It is converted to an e-value, which takes the size of the database into account and gives the number of such chance matches expected in a search of that database.
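A minimal sketch of the relationship, assuming each database comparison is independent with the same chance probability. The function names are illustrative; the second function uses the Poisson relation P = 1 - exp(-E) that BLAST statistics are based on.

```python
import math


def expected_chance_hits(p_single, database_size):
    """E-value: expected number of chance matches across the whole database,
    assuming each of database_size comparisons has chance probability p_single."""
    return p_single * database_size


def prob_at_least_one_hit(e_value):
    """Probability of one or more chance matches, given the expected count."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), stable for small E


e = expected_chance_hits(1e-6, 1_000_000)
print(e, round(prob_at_least_one_hit(e), 3))
```

For small values the two numbers are almost equal, so an e-value of 0.001 corresponds to a database-wide p-value of about 0.001.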
What types of residues are most conserved?
Catalytic residues are the most highly conserved residues. Others can include residues in the binding pocket or on the surface of the protein.
Highly conserved residues are usually associated with the function.
What is progressive alignment?
It is a heuristic approach that uses the idea that sequences are evolutionarily related and can be aligned using an underlying phylogenetic tree.
What are the features of the Clustal W algorithm?
It has position-specific gap opening and extension penalties (higher within strands and helices, lower between them).
It uses two different amino acid substitution matrices: one for close relatives, one for distant.
What are some alternatives to Clustal W?
MAFFT
T-Coffee
MUSCLE
Jalview (an alignment editor and viewer that can also run alignment programs)
How can conservation be measured?
While there are various methods to measure the magnitude of conservation, common ones use the frequency of a residue at a particular site.
Entropy scores are generated from these frequencies. With Shannon entropy, a lower score indicates a more conserved position and a higher score a more variable one.
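As a minimal sketch of an entropy-based score, assuming Shannon entropy over the residue frequencies in one alignment column with gaps ignored (the function name and the toy alignment are illustrative): under this definition a fully conserved column scores 0 and more variable columns score higher.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]  # ignore gap characters
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    freqs = [count / total for count in Counter(residues).values()]
    return sum(-p * math.log2(p) for p in freqs)


alignment = ["ACA", "ACC", "AED", "AEE"]  # toy alignment, one sequence per row
for position, column in enumerate(zip(*alignment), start=1):
    print(position, column_entropy(column))
# position 1 is fully conserved (0.0 bits), position 3 is fully variable (2.0 bits)
```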
What is PSI-BLAST and how is it used?
PSI-BLAST is an iterative, profile-based search used to detect distant relatives:
First, it constructs a multiple alignment of all the related sequences identified by BLAST.
Then it estimates the residue frequencies at each position of the multiple alignment to construct a score matrix: a Position-Specific Scoring Matrix (PSSM), also known as a weight matrix or 1D profile.
Then it uses the 1D profile to scan the database, aligns the newly matched sequences, and rebuilds the profile. This cycle is repeated.
The residue frequencies at each position are used to calculate the scores for aligning a query sequence against the profile.
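The frequency-to-score step can be sketched as follows. This is a simplified illustration, not the PSI-BLAST implementation: it assumes a uniform background frequency of 1/20, a fixed pseudocount, no sequence weighting, and ungapped scoring of a query with the same length as the profile. All names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds score per residue for each column of a multiple alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO_ACIDS]  # skip gaps
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from scoring minus infinity.
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, query):
    """Sum the position-specific scores of an ungapped query."""
    if len(query) != len(pssm):
        raise ValueError("query length must match the profile length")
    return sum(column[aa] for column, aa in zip(pssm, query))


profile = build_pssm(["ACD", "ACE", "ACD"])
print(score_sequence(profile, "ACD") > score_sequence(profile, "WWW"))  # True
```

Residues seen often at a position get positive scores and residues never seen there get negative ones, so a family member scores higher than an unrelated sequence.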
How are profile based sequence search methods used?
Comparing related sequences within a protein family identifies patterns of conserved residues.
Even the most distant members of the family should have these patterns of conserved residues.
A profile that encapsulates these patterns can be built and used to detect more distantly related sequences.
Highly conserved positions usually correspond to residues that are important for folding or packing in the buried core, or to functional residues within the active site.
What are some features of the RefSeq database?
RefSeq (Reference Sequence Database), curated by NCBI, provides:
Genomic context, gene structure (exons/introns)
Corresponding mRNA, protein, and genome assembly
Integrated information with NCBI Gene, Conserved Domain Database (CDD), and ClinVar
Emphasis on model organisms and standard reference transcripts
What are some features of the UniProt database?
Protein function, subcellular location, domains, motifs
Post-translational modifications (PTMs)
Isoforms and splice variants
Natural variants and links to disease
Cross-references to PDB, Pfam, Gene Ontology (GO), Reactome, and expression data
Experimental evidence vs computational prediction