Lecture 5&6: Protein Sequence Alignment Flashcards

Question 1

Q

What are Orthologs?

Answer

A

Homologous genes in different species that diverged from a common ancestral gene through a speciation event (not through gene duplication).

Question 2

Q

What are Paralogs?

Answer

A

Homologous genes that arise from a gene duplication event within the same genome.

Question 3

Q

What is the formula for percentage sequence identity?

Answer

A

(Number of identical residues / number of residues in the smaller protein) * 100
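
The formula can be sketched in Python as follows. This is a minimal illustration, assuming the two sequences are already aligned and that gaps are written as "-"; the function name `percent_identity` and the example sequences are made up for this card.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two pre-aligned sequences (gaps written as '-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue; gaps never count.
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    # Normalise by the ungapped length of the smaller protein, as in the formula.
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * identical / shorter


# Hypothetical example: 7 identical columns, both proteins 8 residues long.
print(percent_identity("MKT-AYIAK", "MKTQAY-AK"))  # 87.5
```

Other denominators (alignment length, mean length) are also in use, so the value depends on which convention a tool follows.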

Question 4

Q

What are the general steps to infer the function of a protein from its sequence?

Answer

A

Do fast scans using approximate methods (BLAST).
Align proteins using a dynamic programming method (Needleman & Wunsch, Smith & Waterman).
Scan against sequence profiles or HMMs in secondary databases (Pfam, InterPro).
Align the sequence against family relatives using ClustalW, and inspect the alignment in Jalview.

Question 5

Q

What is the difference between the Needleman & Wunsch and Smith & Waterman algorithms?

Answer

A

Needleman & Wunsch performs global alignment: the two sequences are aligned end to end.

Smith & Waterman performs local alignment: only the best-matching subsequences are aligned.
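
The difference shows up in only a few lines of the dynamic programming recurrence. The sketch below computes the alignment score only (no traceback) and assumes a simple match/mismatch score with a linear gap penalty; the function name `alignment_score` and the scoring values are illustrative choices, not the parameters used by any particular tool.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """Score of the best global (Needleman & Wunsch) or local (Smith & Waterman) alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised, so the first row and column accumulate gap costs.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local: a cell never drops below zero, so an alignment can restart anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global reads the bottom-right corner; local reads the best cell anywhere in the table.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))              # global score
print(alignment_score("ACGT", "AGT", local=True))  # local score
```

With these toy scores the global alignment has to pay for the gap, while the local alignment reports only the best-matching stretch.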

Question 6

Q

What are some rules of sequence homology?

Answer

A

Protein pairs longer than 150 residues are likely to be homologs if their sequence identity is > 25%.

For shorter protein fragments, a sequence identity of about 30% is required.

Structure within families tends to be much more conserved compared to sequence.

Inheriting functional properties from a homolog requires around 60% sequence identity.

Question 7

Q

What are the different matrices used when comparing two proteins?

Answer

A

Identity Matrix (Binary)

Physicochemical properties matrix (range)

Evolutionary matrices (Dayhoff, BLOSUM matrices)

Question 8

Q

What is the Dayhoff matrix?

Answer

A

It is an evolutionary matrix.

It measures evolutionary distance in PAM units (point accepted mutations), where 1 PAM = 1 accepted point mutation per 100 residues.

At distances of more than 100 PAM, multiple substitutions have occurred at the same site. This does not mean that every residue has changed.

Question 9

Q

What is the BLOSUM matrix?

Answer

A

It is an evolutionary matrix.

It is derived from substitution patterns observed in conserved, ungapped blocks of aligned sequences, including more distant relatives than those used for the Dayhoff matrix.

Question 10

Q

What is the difference between a p-value and an e-value?

Answer

A

The p-value is the probability that a match with at least this score was obtained by chance. It is converted to an e-value, the number of such chance matches expected in a search, which takes the size of the database into consideration.
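
As a rough illustration of how the two values relate in BLAST-style statistics: the sketch below uses the standard relations E = m * n * 2^(-S') for a bit score S' and P = 1 - exp(-E). It ignores the effective-length corrections that real tools apply, and the function names are made up for this card.

```python
import math


def bit_score_to_evalue(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score: E = m * n * 2^(-S')."""
    # The database length enters directly: a larger database gives more chance hits.
    return query_len * db_len * 2.0 ** (-bit_score)


def evalue_to_pvalue(e_value: float) -> float:
    """Probability of seeing at least one chance hit: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small, where P is approximately E.
    return -math.expm1(-e_value)


# Hypothetical search: the same bit score against a small and a ten-fold larger database.
for db_len in (1_000_000, 10_000_000):
    e = bit_score_to_evalue(40.0, query_len=300, db_len=db_len)
    print(db_len, e, evalue_to_pvalue(e))
```

The same score becomes less significant as the database grows, which is why the e-value rather than the raw p-value is normally reported.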

Question 11

Q

What types of residues are most conserved?

Answer

A

Catalytic residues are the most highly conserved residues. Others include residues in the binding pocket and at functional sites on the surface of a protein.

Highly conserved residues are usually associated with the function.

Question 12

Q

What is progressive alignment?

Answer

A

It is a heuristic approach that uses the idea that sequences are evolutionarily related and can be aligned using an underlying phylogenetic tree.

Question 13

Q

What are the features of the Clustal W algorithm?

Answer

A

It has position-specific gap opening and extension penalties (higher within strands and helices, lower between them).

It uses two different amino acid substitution matrices: one for close relatives, one for distant.

Question 14

Q

What are some alternatives to Clustal W?

Answer

A

MAFFT
T-Coffee
MUSCLE
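Jalview (an alignment editor and viewer that can run these aligners as web services, rather than an alignment algorithm itself)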

Question 15

Q

How can conservation be measured?

Answer

A

While there are various methods to measure the magnitude of conservation, common ones use the frequency of a residue at a particular site.

Entropy scores are generated for each alignment column. A lower entropy score indicates a more conserved position; a higher score indicates a more variable one.
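
One common frequency-based measure is the Shannon entropy of each alignment column. The sketch below assumes gaps are written as "-" and are simply ignored; the function names and the toy alignment are illustrative, and real tools often add sequence weighting or normalisation.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    # Sum p * log2(1/p) over the residue types seen in this column.
    return sum((n / total) * math.log2(total / n) for n in Counter(residues).values())


def alignment_entropies(alignment: list[str]) -> list[float]:
    """Entropy of every column of a multiple alignment given as equal-length rows."""
    return [column_entropy("".join(col)) for col in zip(*alignment)]


# Toy alignment: column 1 is fully conserved, column 3 is fully variable.
print(alignment_entropies(["ACD", "ACE", "AGF", "AGH"]))  # [0.0, 1.0, 2.0]
```

A fully conserved column scores 0 bits, and the score rises as more residue types share the position.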

Question 16

Q

What is PSI-BLAST and how is it used?

Answer

A

PSI-BLAST (Position-Specific Iterated BLAST) is an iterative profile search:

First, it constructs a multiple alignment of all the related sequences identified by BLAST.

Then it estimates the residue frequencies at each position of the multiple alignment to construct a score matrix: a Position-Specific Scoring Matrix (PSSM), also known as a weight matrix or 1D profile.

Then it uses the 1D profile to scan the database, aligns the newly matched sequences, and rebuilds the profile. The cycle is repeated.

The residue frequencies at each position are used to calculate the scores for aligning a query sequence against the pattern.
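
The frequency-to-score step can be sketched as below. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a flat pseudocount, and no sequence weighting, and the names `build_pssm`, `score_window` and `best_hit` are made up for this card.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Log-odds score per residue per column, from column residue frequencies."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for col in zip(*alignment):
        counts = Counter(r for r in col if r in AMINO_ACIDS)  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_window(pssm: list[dict[str, float]], window: str) -> float:
    """Score of an ungapped window against the profile (unknown residues score 0)."""
    return sum(col.get(res, 0.0) for col, res in zip(pssm, window))


def best_hit(pssm: list[dict[str, float]], sequence: str) -> tuple[float, int]:
    """Best-scoring ungapped window in the sequence, returned as (score, start index)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max((score_window(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Toy family alignment and a hypothetical database sequence to scan.
profile = build_pssm(["GKS", "GKT", "GKS", "GRT"])
print(best_hit(profile, "AAAGKSAAA"))
```

Residues that dominate a column receive positive scores and rare ones negative scores, so the scan favours windows that match the conserved pattern.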

Question 17

Q

How are profile-based sequence search methods used?

Answer

A

Comparing related sequences within a protein family identifies patterns of conserved residues.

Even the most distant members of the family should have these patterns of conserved residues.

A profile that encapsulates these patterns can be built and used to detect more distantly related sequences.

Highly conserved positions usually correspond to residues important for folding or packing in the buried core, or to functional residues within the active site.

Question 18

Q

What are some features of the RefSeq database?

Answer

A

RefSeq (Reference Sequence Database), curated by NCBI, provides:

Genomic context, gene structure (exons/introns)

Corresponding mRNA, protein, and genome assembly

Integrated information with NCBI Gene, Conserved Domain Database (CDD), and ClinVar

Emphasis on model organisms and standard reference transcripts

Question 19

Q

What are some features of the UniProt database?

Answer

A

Protein function, subcellular location, domains, motifs

Post-translational modifications (PTMs)

Isoforms and splice variants

Natural variants and links to disease

Cross-references to PDB, Pfam, Gene Ontology (GO), Reactome, and expression data

Experimental evidence vs computational prediction

(19 cards)