Protein structure, function, and evolution Flashcards
(25 cards)
Databases and the current sample space
Sequence sample space huge- UniProtKB has over 330mln seqs, allows accurate and detailed comparison and analysis. Average proteins are 25-35kDa. Multidomain proteins can have shared domains but different compositions, e.g., intracellular signalling proteins. Extreme multidomain protein e.g.= Titin- longest in human genome, >300 domains, spans half of muscle sarcomere length.
Amino acids and primary structure
Amino acids are left-handed, so when looking along the H-Calpha axis the groups spell CORN. In the main chain, the CO is a H bond acceptor, the NH a donor, and the main chain is restricted by peptide bond planarity and chirality. The main chain is uncharged but can H-bond in 2o structures.
Peptide bonds: C-O and C-N bonds have partial double bond character (resonance), forcing planarity over these bonds. Peptide bonds usually in trans, as cis would cause sidechain steric clash. Proline has similar steric penalty for both cis and trans due to its cyclised sidechain, so can sometimes be found in cis.
Ramachandran plots describe phi/psi angle combinations allowed by extent of rotational freedom. Can be used to check structures produced computationally. Gly has no sidechain so has much freer plot with symmetry (non-chiral). Proline and residues before it are much more restricted due to sidechain being cyclised to main chain.
Obtaining a primary sequence and analysis in silico
Obtaining a 1o seq: Edman degradation: low throughput, slow N-terminal sequencing method (developed by Pehr Edman), still used for identification of proteins (by searching partial seqs in database) but not for complete sequences; mass spec now preferred for protein seq. First complete protein seq (insulin) determined by Sanger using his own N-terminal labelling chemistry. Most seqs now derived from (c)DNA, as DNA seq faster and cheaper.
Analysing a 1o seq in silico: ID features similar to other proteins like homologous domains/proteins, similar functional sites (ligand binding, catalytic), secondary structure predictions. Databases/ BLAST, motif and domain composition analysis used iteratively to obtain many alignments with other sequences.
Seqs are compared computationally, considering conserved patterns, possibility of indels in one/both seqs, significance of alignment (statistical analysis needed for this). Similarities are quantified with a substitution matrix like Blosum62, though different matrices can be used depending on expected similarity. Matrices tend to weigh rare/special residue matches/mismatches highest/lowest. Matrices improve on alignment scores by taking into account conserved changes, improving significance and alignment quality.
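A minimal sketch of how a substitution matrix turns an alignment into a score. The matrix values and gap penalty are toy numbers chosen for illustration (assumptions, not real Blosum62 entries), and `pair_score` and `alignment_score` are hypothetical helper names.

```python
# Toy substitution scores (hypothetical values, NOT real Blosum62 entries).
TOY_MATRIX = {
    ("W", "W"): 11,  # rare residue: identical match scores highest
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("A", "A"): 4,
    ("L", "I"): 2,   # conservative change still scores positively
    ("L", "A"): -1,
    ("I", "A"): -1,
    ("W", "L"): -2,
    ("W", "I"): -3,
    ("W", "A"): -3,  # mismatch involving a rare residue is penalised most
}
GAP_PENALTY = -4     # assumed linear cost for each indel position


def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric."""
    if (a, b) in TOY_MATRIX:
        return TOY_MATRIX[(a, b)]
    return TOY_MATRIX[(b, a)]


def alignment_score(seq1, seq2):
    """Score two pre-aligned, equal-length sequences ('-' marks a gap)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total += GAP_PENALTY
        else:
            total += pair_score(a, b)
    return total


if __name__ == "__main__":
    print(alignment_score("WLA", "WIA"))  # conservative L->I substitution
    print(alignment_score("WLA", "W-A"))  # same position as an indel
```

With these toy numbers the conservative L->I change keeps most of the score, while an identity-only scheme would treat it like any other mismatch.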
Predicting structure and function (include definitions of orthologs and paralogs)
Predicting structure+ function: seqs conserved at 40%+ can have function and fold predicted reliably, and fold for enzymes can be predicted at 25%+ conservation- notably, low sequence alignment can still have structural and functional similarity (e.g., haemoglobin alpha chain, myoglobin+ leghaemoglobin) Context of a short seq can also impact its conformation. Info from the seq, patterning of residues, aa propensities, conservation of related seqs/structures help predict secondary structure. Signal peptides, transmembrane segments and intrinsically disordered seqs are generally easy to spot in a seq.
Orthologs= homologous proteins in different species (diverged by speciation), usually performing similar f(x).
Paralogs= homologous proteins in the same species (arisen by gene duplication), often performing different f(x).
AlphaFold2 predicts individual domains with high accuracy (less so for complexes/multidomain) w/ quality comparable to experimental structures but limited info on dynamics+ conformations. Can ID intrinsically disordered regions but often produces meaningless higher order structures for them. Still requires experimental validation.
Uses database to search for homologs, analyses multiple seq alignment to create contact map, improves analysis and iteratively refines predicted 3D model.
AlphaFold3 arrived in early 2024, uses atom-based language rather than protein-only, can model non-protein parts (post-translational mods, DNA/RNA, small molecules), recently open source. Some modelling, especially of intrinsically disordered parts, has gotten worse.
Secondary structure: alpha helices
Alpha helices: H bonding between residues i and (i+4) in backbone (short-range interactions). Helical pitch=3.6 residues x 1.5A per residue=5.4A. Directionality of helix and C=Os gives overall dipole moment, partially +ve at N terminus and partially -ve at C terminus. Often capped by -ve residues at N termini and sometimes by +ve residues at C termini which neutralise the dipole moment. Sidechains project in all directions, oriented toward the N terminus (Xmas tree).
Heptad repeats: 7 residue repeat (positions a-g) where the 1st and 4th (a+d) are hydrophobic (often aliphatic), forming sticky, hydrophobic surface on one side of helix; creates interaction surface within a protein+ between proteins (e.g., leu zipper)
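A small sketch of the helix arithmetic and of why a heptad repeat puts its hydrophobic residues on one face. It assumes an ideal helix (100 degrees of rotation per residue, which follows from 3.6 residues per turn) and assumes the hydrophobic residues sit at the 1st and 4th position of each heptad; the function names are hypothetical.

```python
RESIDUES_PER_TURN = 3.6
RISE_PER_RESIDUE_A = 1.5    # angstroms of rise along the axis per residue
DEGREES_PER_RESIDUE = 100   # 360 degrees / 3.6 residues per turn (ideal helix)


def helix_pitch():
    """Rise along the helix axis for one full turn, in angstroms."""
    return RESIDUES_PER_TURN * RISE_PER_RESIDUE_A


def helix_length(n_residues):
    """Approximate length of an ideal helix of n residues, in angstroms."""
    return n_residues * RISE_PER_RESIDUE_A


def wheel_angle(index):
    """Angular position of residue `index` on a helical wheel, in degrees."""
    return (index * DEGREES_PER_RESIDUE) % 360


def heptad_hydrophobic_positions(n_residues):
    """Indices of the 1st and 4th residue of each heptad (assumed a and d)."""
    return [i for i in range(n_residues) if i % 7 in (0, 3)]


if __name__ == "__main__":
    print("pitch (A):", round(helix_pitch(), 2))
    for i in heptad_hydrophobic_positions(14):
        print("residue", i, "at", wheel_angle(i), "degrees")
```

Over two heptads the a/d positions land at 0, 300, 340 and 280 degrees, i.e. within an 80 degree arc, which is the sticky stripe down one side of the helix.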
Rossmann fold: nucleotide binding facilitated by helical dipole moment (+ve end interacts w/Pi in nucleotide)
Beta strands, 3_10 helices and polyproline helices
Beta-strands: alternating H-bond acceptors and donors to neighbouring strands, parallel or anti-parallel orientation with respect to neighbours, pleated backbone conformation (often show significant twisting+ bending), sidechains project alternately to different sides. Alternating pattern of polar+ apolar residues. Drawn schematically as arrows.
Beta turns: short, tight, reverse direction of main chain. Typically, H bonding i+(i+3)
Beta loops: longer loops w/variable structure. Connect the secondary structure elements
3_10 helices: tight, w/distorted H-bonds. Typically, only short segments
Polyproline helices: no H-bonds by main chain (free to interact w/other residues), usually in proline-rich seqs (stabilisation of conformation), pseudo-symmetrical. E.g., collagen- intertwined triple helix, H-bonds between strands.
Super-secondary structures
Super-secondary structures: alpha-alpha hairpins (-> four-helical bundle), beta-alpha-beta (+alpha-beta makes Rossmann fold), beta-beta hairpin (-> beta meander (3beta) or Greek key (4beta))
Globular domain: basic folded evolutionary unit in many different classes (all alpha, all beta, mix…) can often be isolated from large protein by limited proteolysis (connecting seqs more accessible to proteases), typically defined by a molecular function.
Hydrophobic core of a domain: polar sidechains point to solvent (H-bonding needs to be satisfied if buried) and apolar face into core. Beta-sandwich has 2 polar slabs surrounding apolar centre.
Forces involved in tertiary structure formation
Forces: Covalent bonds in backbone, non-covalent interactions between residues (H-bonds)
* Disulfide links (require oxidising env. 2Cys=cystine. E.g., TGF-beta family growth factor has cysteine knot structure and further disulfide linking dimer)
* Metal co-factors e.g. Zn fingers coordinated by conserved Cys+His residues, Zn2+ plays structural role
* Co-factors e.g., haem in myoglobin. Part of globular structure (deep cleft filled w/co-factor)
* Post-translational modifications
Helical domains include myoglobin, 2 spectrin repeats (3 helix bundles, can form head-to-tail structures), Erythropoietin 4 helix bundle. Alpha/beta proteins typically have central beta sheets surrounded by helices.
All-beta proteins include beta-barrels like in GFP, beta sandwich C2 domain w/Ca2+ and phosphatidylserine.
Multidomain tertiary structures include many kinases like Src kinase.
Tertiary structure: structural repeats, functional domains and 3D structures
Structural repeats: small seq repeats always found in several copies (variable #). Include linear repeats (leucine-rich, ankyrin, RING and HEAT repeats) and closed structures (beta propellers). Leucine rich repeat (LRR) proteins form alpha/beta horseshoe folds, most often in extracellular receptors, w/20-30aa repeats, variable curvature and often capped @ both ends by disulfide linked domains.
Functional motifs shared by proteins: e.g., Rossmann fold in many NAD+ binding enzymes (lactate/alcohol dehydrogenases), P-loops often in ATP binding proteins (protein kinase, actin)
3D structures reveal distant relationships: e.g., conserved core of type II restriction enzymes including an active site with a conserved Asp coordinating Mg2+. Whole enzymes have divergent overall structures but conserved order of secondary structures. Structure tends to be more conserved than function (actin vs Hsp70 both bind ATP so have 1 similar fold but overall have different functions), but convergent evolution can lead to different structures having similar functions (e.g., chymotrypsin and subtilisin both serine proteases, using same residues for catalysis but overall v different structures)
Protein dynamics
Rotation of main chain bonds (Ramachandran) and sidechains (adopt multiple rotamers), concerted motions leading to larger movements, ligand induced changes, local and global unfolding can be investigated with dedicated experiments. Internal dynamics are on ps scale while overall tumbling on ns scale. Crystal structures give snapshots of certain forms from which movement is implied. NMR and molecular dynamics simulations needed to fully understand movement.
e.g., Conformational change upon ligand binding: adenylate kinase active site undergoes large movement on substrate binding. Residues at hinge change conformation, moving apo-enzyme to liganded enzyme, flapping the “lid” closed over the substrates.
Myoglobin and haemoglobin, cooperativity and the Bohr effect
Myoglobin vs Haemoglobin: conserved globin fold (myoglobin monomer and alpha+ beta chains of Hb share ancestry. Hb haems well separated). Haem accessory group binds O2, changing position of iron in porphyrin ring. Myoglobin has much higher affinity, ½ saturating at 2 Torr while Hb ½ saturating at 26 Torr (saturation curve sigmoidal, suggesting binding cooperativity).
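A rough sketch of the two saturation curves using the Hill equation, Y = pO2^n / (P50^n + pO2^n). The half-saturation pressures (2 and 26 Torr) come from the card; the Hill coefficient of 2.8 for Hb and the lung/tissue pressures of 100 and 30 Torr are illustrative assumptions, as are the function names.

```python
def fractional_saturation(p_o2, p50, hill_n=1.0):
    """Hill equation: fraction of O2 sites occupied at partial pressure p_o2 (Torr)."""
    if p_o2 < 0 or p50 <= 0:
        raise ValueError("pressures must be non-negative and p50 positive")
    return p_o2 ** hill_n / (p50 ** hill_n + p_o2 ** hill_n)


MB_P50 = 2.0        # Torr, from the card
HB_P50 = 26.0       # Torr, from the card
HB_HILL_N = 2.8     # assumed Hill coefficient for Hb (n = 1 means no cooperativity)
LUNG_PO2 = 100.0    # Torr, assumed
TISSUE_PO2 = 30.0   # Torr, assumed


def delivered_fraction(p50, hill_n):
    """Fraction of bound O2 released on moving from lung to tissue pressure."""
    loaded = fractional_saturation(LUNG_PO2, p50, hill_n)
    remaining = fractional_saturation(TISSUE_PO2, p50, hill_n)
    return loaded - remaining


if __name__ == "__main__":
    print("Mb delivers:", round(delivered_fraction(MB_P50, 1.0), 3))
    print("Hb delivers:", round(delivered_fraction(HB_P50, HB_HILL_N), 3))
```

Under these assumptions Hb unloads a large share of its O2 between lung and tissue, while the hyperbolic, high-affinity Mb curve stays nearly saturated at both pressures.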
Binding cooperativity: allows for saturated binding at high O2 partial pressure and efficient release of O2 at lower pressure, so Hb loads fully and unloads progressively in tissues (myoglobin couldn’t release O2 to tissues w/intermediate O2 partial pressure). Hb tense (T) state constrained by subunit interactions, less able to bind O2, while relaxed (R) state less constrained. Oxygen binding moves Fe in haem, pulling alpha helix carrying proximal His, affecting alphabeta-alphabeta interface. Models for T/R transition (reality somewhere in between):
Concerted: Hb w/3O mostly R, binds O w/20x affinity of deoxy-Hb
Sequential: Hb w/1O mostly T but binds O w/ 3x affinity of deoxy-Hb.
2,3-BPG (2,3-bisphosphoglycerate) in red blood cells at similar concentration to Hb, binds between beta subunits, stabilising T state, lowering O2 affinity, allowing it to function (pure Hb wouldn’t). His143 is replaced by Ser in the foetal gamma chain (which takes the place of the adult beta chain), reducing 2,3-BPG affinity so O2 flows from maternal oxy-Hb to foetal deoxy-Hb
Bohr+CO2 effect: rapidly metabolising tissue-> H+ + CO2-> lower pH, reducing Hb O2 affinity (faster release). In deoxy-Hb, His 146 of beta subunit salt-bridges to Asp94 in T state. Lower pH increases protonation, encourages salt bridge, stabilising T state. CO2 also directly interacts w/Hb terminal amino groups to stabilise T state (allostery).
Oligomeric proteins: open homomers
Homomers: very abundant, fall into closed or open symmetry classes. Both classes unstable at low concentrations+ have limited diffusion. May be undesirable where cooperativity is not wanted.
Open: can polymerise infinitely, energetically less favourable+ more specialised. Form filaments. Dynamic+ inherently heterogeneous-> limited structural data. Can have structural (tubulin, actin) or catalytic (RadA recombinases) role. Length can be controlled by additional proteins capping ends, rate of polymerisation, co-factors providing molecular rulers. Typically helical. Can be hard to study w/Xray (if helical symmetry doesn’t match crystallographic symmetry, can’t crystallise) so usually use Cryo-EM.
Actin polar (all protomers assemble same direction), self-assembling (pointed end slow growing while barbed end fast growing). Kd for a filament= critical concentration above which it grows+ below which it depolymerises- ATP actin grows 20x more readily, ADP actin depolymerises. In cells, stability+ length controlled by proteins capping, severing or crosslinking filaments or binding G-actin to inhibit growth. (G)lobular actin monomers-> (F)ilamentous oligomer. ATP binding sites regulate polymerisation+ stability. No high-res crystal structure but EM and fibre diffraction structures.
Myosin- limited proteolysis can cut off heavy heads. Relay helix changes conformation slightly, repositioning lever arm when bound/unbound to ADP-VO4(3-), contracting muscle.
Oligomeric proteins: closed homomers, genetic economy
Closed: more common, finite in space, e.g., virus capsid. Most have cyclic/dihedral (rotational symmetry axis and perpendicular axis of 2-fold symmetry) symmetry. Rotation produces identical structure. Specificity of protein-protein interfaces favours symmetrical complexes, with max # of subunit interactions in closed complex. Protomer= asymmetric subunit of oligomeric complex.
Closed symmetry oligomers may not be perfectly symmetrical, and symmetry can break down for f(x), e.g., homo-hexameric T7 helicase has breakdown of symmetry for progressive unwinding- strand pushed through rotating mechanism (each subunit has distinct, sequential role, breaking down symmetry)
Genetic economy: Homodimeric HIV protease- aspartic protease= homodimer w/99aa protomers. More economic genome-wise to have homodimer that completes active site by dimerization (and loses symmetry upon binding asymmetric substrate) than a monomeric aspartic protease like renin. (NB a good drug target). Another e.g.,=antibodies- all domains have similar beta sandwich Ig fold.
Oligomeric proteins: heteromers, subunits in isolation and pseudo-symmetry
Heteromers: more variation, no need for closed structures as surfaces non-equivalent. E.g., 1:2 hGH complex w/its receptor: identical receptor chains bind monomeric hormone (non-equivalent binding sites)
Pseudo-symmetry= structure apparently symmetrical, parts not identical e.g., archaeal proteasome.
Oligomer units often unstable in isolation: expression of individual components controlled to minimise waste+ accumulation of aggregated protein. Humans have 4 alpha-Hb encoding genes and only 2 for beta, so use alpha Hb stabilising protein (AHSP) specifically binding new alpha in the same site as the beta chain, stabilising it until another beta is available.
Membrane proteins
Mostly receptors, transporters, channels and enzymes. Basic principles of protein structure apply except at lipid interface. Both helices+ sheets can span membrane, interact w/ lipid via apolar sidechains. Lone beta strands not found in transmembrane segments as H-bonds in membrane can’t be satisfied. Single span helices often found in signalling receptors.
Beta-barrels/porins: continuous beta sheet forming barrel, similar topology to soluble barrels (GFP). Hydrophobic residues alternate (inside hydrophilic to transport molecules, outside more apolar). Gasdermin= large pore-forming complex of 33 subunits, has both sheets and helices.
Ion channels often have small channel in the middle of an oligomer (K+ tetrameric, e.g.) or can be 1 polypeptide. Selectivity filters can be by charge+ size.
Partially membrane embedded proteins e.g., prostaglandin synthase, quite difficult to study.
X-ray crystallography: purpose and overview
Analyse seq (domains, secondary structure)-> clone (often need multiple constructs and fusion partners/tags and always need strong promoter. Construct design mostly empirical)-> express in recombinant host-> purify protein-> crystallise-> collect diffraction data-> solve phase problem, fit model to electron density-> final structure. Came about in mid-20th century. Doesn’t show dynamics but high (up to~0.6A) resolution. Higher resolutions even enable visualisation of network of water molecules around the protein.
X-ray crystallography: protein purification, circular dichroism spectroscopy, protein stability and UV absorbance
Purification: Different orthogonal techniques maximise chance of purifying protein fully, with at least 2-3 steps used:
* By affinity tags,
* By hydrophobic surfaces on the protein in a hydrophobic interaction column,
* By charge in ion exchange chromatography (charge can be affected by post-translational mods like Pi/Acetylation and charged residues. Ion exchange chromatography uses ion exchanger beads. Sample is applied, washed and eluted with salt concentrate),
* By size and shape in size exclusion chromatography AKA gel filtration- diluting method (small sample applied, 5-10x as large when coming out of long SEC columns). Proteins of known molecular weight run first and elution volume graphed. Void volume (Vo) measured w/ large, coloured carbohydrate like blue Dextran, total column volume (Vc) w/small organic compound like acetone. Calibration curve produced. Graph plotted with log(Mr/kDa) on x and Kav (=(Ve-Vo)/(Vc-Vo)) on y, where Ve= elution volume.
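A minimal sketch of the SEC calibration described in the last bullet: compute Kav for each standard, fit a straight line of Kav against log10(Mr), then read an unknown off the line. The column volumes and standards below are made-up numbers for illustration, and the function names are hypothetical.

```python
import math


def kav(ve, vo, vc):
    """Partition coefficient Kav = (Ve - Vo) / (Vc - Vo)."""
    if vc <= vo:
        raise ValueError("total column volume must exceed void volume")
    return (ve - vo) / (vc - vo)


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_mr(ve, vo, vc, slope, intercept):
    """Invert the calibration line to estimate Mr (kDa) from an elution volume."""
    return 10 ** ((kav(ve, vo, vc) - intercept) / slope)


if __name__ == "__main__":
    VO, VC = 8.0, 24.0  # mL, hypothetical column
    # (Mr in kDa, Ve in mL): hypothetical standards, not measured data
    standards = [(13.7, 18.9), (44.0, 16.0), (158.0, 13.2), (440.0, 10.9)]
    xs = [math.log10(mr) for mr, _ in standards]
    ys = [kav(ve, VO, VC) for _, ve in standards]
    slope, intercept = fit_line(xs, ys)
    print("estimated Mr (kDa):", round(estimate_mr(15.0, VO, VC, slope, intercept), 1))
```

The estimate is only as good as the assumption that the unknown has a similar shape to the standards, since SEC separates by size and shape rather than by mass alone.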
Can analyse purity, homogeneity, activity, structure, stability before proceeding.
Circular dichroism spectroscopy measures difference in absorption of left and right handed circularly polarised light (Al-Ar=DeltaA), created by 2 perpendicular plane-polarised beams which differ in phase by 90o. Peptide bonds of chiral aas show DeltaA dependent on conformation, which can help assess 2o structure.
Generally, CD signal at 215nm shows sheet content, while 208 and 222nm signals used to find helical content. Spectrum is a sum of sheet, helical and random coil components. Results are expressed as molar ellipticity (deg cm2 dmol-1). Accurate concentration of protein in sample needed to estimate 2o structure.
Protein stability: 3o structure depends on many weak interactions, some internal to protein and some w/env. Free energy released when interactions form on folding, balanced by entropy loss. Typical protein free energy difference folded vs unfolded ~5-15kcal/mol, giving sharp transition between folded and unfolded. Denaturation w/chemicals/heat in stepwise increase can be used to analyse stability, measuring some structural feature (like CD signal) at each step. Can similarly follow ligand binding, etc.
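A sketch of why a stepwise denaturation series gives a sharp transition, assuming a simple two-state (folded/unfolded) model. The linear dependence of the unfolding free energy on denaturant concentration is a common modelling assumption that is not stated on the card, and the stability and m-value used are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal / (mol K)


def fraction_folded(delta_g_unfold, temp_k=298.15):
    """Two-state model: K_unfold = exp(-dG/RT), folded fraction = 1 / (1 + K)."""
    k_unfold = math.exp(-delta_g_unfold / (R_KCAL * temp_k))
    return 1.0 / (1.0 + k_unfold)


def delta_g_with_denaturant(delta_g_water, m_value, denaturant_molar):
    """Assumed linear model: dG = dG(water) - m * [denaturant]."""
    return delta_g_water - m_value * denaturant_molar


def denaturation_curve(delta_g_water, m_value, concentrations, temp_k=298.15):
    """Folded fraction at each denaturant concentration in the series."""
    return [
        fraction_folded(delta_g_with_denaturant(delta_g_water, m_value, c), temp_k)
        for c in concentrations
    ]


if __name__ == "__main__":
    # Hypothetical protein: 10 kcal/mol stable in water, m = 2 kcal/(mol M)
    for conc, folded in zip(range(0, 9), denaturation_curve(10.0, 2.0, range(0, 9))):
        print(conc, "M:", round(folded, 3))
```

In this model the midpoint sits where the free energy difference reaches zero (here 5 M), and the folded fraction drops from nearly 1 to nearly 0 within about 2 M either side.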
Protein UV absorbance- Trp+Tyr at 280nm, peptide bond at 214nm, disulfide bonds in UV range, helps determine concentration of pure protein. Metal ions and prosthetic groups often absorb in visible/UV range (haem (haemoglobin)-red, haem in cytochrome C- orange, Cu/plastocyanin- cyan etc). Calculated absorption coefficient of protein at 280nm= (#Tyr x 1250)+ (#Trp x 5680)+ (#half-cystines x 60). Also helps assess purity:
Absorbance ratio at 260:280 for protein=0.6, DNA=1.8, RNA=2.0ish.
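A short sketch of the 280nm calculation using the per-residue values from the card, followed by a Beer-Lambert concentration estimate. Units of M^-1 cm^-1 for the coefficient and a 1 cm path length are assumptions; the number of disulfide-bonded Cys has to be supplied by the user since it cannot be read from the sequence alone. Function names are hypothetical.

```python
TYR_280 = 1250          # per Tyr, value from the card
TRP_280 = 5680          # per Trp, value from the card
HALF_CYSTINE_280 = 60   # per disulfide-bonded Cys, value from the card


def extinction_coefficient_280(sequence, n_half_cystines=0):
    """Calculated absorption coefficient at 280nm from a one-letter sequence."""
    seq = sequence.upper()
    if n_half_cystines > seq.count("C"):
        raise ValueError("more half-cystines than Cys residues in the sequence")
    return (
        seq.count("Y") * TYR_280
        + seq.count("W") * TRP_280
        + n_half_cystines * HALF_CYSTINE_280
    )


def molar_concentration(a280, epsilon, path_cm=1.0):
    """Beer-Lambert law: c = A / (epsilon * l), assuming epsilon in M^-1 cm^-1."""
    if epsilon <= 0:
        raise ValueError("no Trp/Tyr/cystine: A280 cannot give a concentration")
    return a280 / (epsilon * path_cm)


if __name__ == "__main__":
    eps = extinction_coefficient_280("MWYYCKLC", n_half_cystines=2)
    print("epsilon:", eps)
    print("conc (M):", molar_concentration(0.83, eps))
```

A protein with no Trp, Tyr or cystine gives a coefficient of zero, in which case absorbance at 280nm is not a usable measure of concentration.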
X-ray crystallography: crystallization (vapour diffusion)
Crystallization: Pure protein mixed w/ligands/co-factors (may affect crystallization, so not always useful/possible), screened for initial crystal hits (96 wells, nL volumes, screen for chemicals that promote crystallization like salts/small organics), inspect screens microscopically for crystals, reproduce successful wells and refine conditions, aim for large, single crystals. Note that mobile loops can inhibit crystallization and may have to be removed (so can’t be modelled). Multidomain proteins are also hard to crystallise and often broken up for crystallization (reductionist approach).
Vapour diffusion crystallization: in a sealed chamber with precipitant solution (“mother liquor”- a mix of different salts, PEG or solvent, very refined pH- found by screening as above). Protein mixed with precipitant in a small drop hanging down from the lid of the sealed chamber. Equilibration by diffusion of water over time-> crystallization. Defined form w/sharp edges tends to indicate the crystal will diffract well. Shape can sometimes provide clues as to structure.
X-ray crystallography: getting data and solving the structure
Getting data: single crystals at 100K (to minimise radiation damage- additives prevent ice crystal formation) produce diffraction data by shooting focused+ intense Xray beam from many angles/orientations as crystal rotated (usually 1800 angles)
Merge all data for intensities of diffraction of each individual reflection (diffraction spots can be used to measure each reflection intensity, linked to diffracted X-ray amplitude).
Solve structure: by solving phase problem. To calculate electron density by Fourier synthesis, derive relative phase of each reflection by one of 3 principal methods:
* Isomorphous replacement: introduce heavy metal (like Au/Hg) into the crystal, changing diffraction intensities; differences between heavy-atom and native data locate the heavy atoms, which can then be used as a reference to place remaining atoms around.
* Anomalous scattering: use ability of certain elements to absorb X-rays and change diffraction intensities, e.g., seleno-methionine can be incorporated during protein expression.
* Molecular replacement: use closely related protein structure (or AlphaFold model) to derive calculated phases.
Phases give electron density map. Fit atomic model to map, refine computationally and analyse and interpret the structure. Phase problem limits size of protein possible to analyse by Xray diffraction (larger= harder to model) but technically no size limit.
X-ray sources for X-ray crystallography
Xray sources: In-house rotating anode X-ray in larger labs uses single wavelength, collects data in a few hours. Expensive to run and slow, now less common. More modern= synchrotron e.g., Diamond light source synchrotron in Oxfordshire- tuneable wavelength, higher intensity so data collection in 15-20sec, high automation and throughput, works 24/7. Size of synchrotron determines X ray intensity. Electrons accelerate in a small chamber, then spit out into larger circular (donut shaped) tube chamber where magnetic plates deflect electrons. Electron speed and magnet strength modulate wavelengths.
NMR in 1D and 2D
Labelling with N15, C13 or H1 in recombinant protein production. Requires pure, stable, homogeneous solution, can show protein dynamics, structural changes, folding, and info about structure before complete structure solved. Sample prepped by cloning and expression with isotope labelling followed by purification. 1D+ 2D experiments with HNMR take seconds to a day, for low res 2o/3o structure optimisation and dynamics. 2D experiments with N15 (10-60 min) for mono/polydispersity optimisation and dynamics. 3D N15 and C13 experiments (4 hours-1 week) for assignment and chemical mapping. Then further optimisation and structural work (3 months- 1 year). Atomic resolution, good for proteins under 50kDa+ dynamics/ligands. ~early 2000s.
1D spectra can show methyl groups, aromatic CHs and NHs (characterise sidechains), can be diagnostic of folding. Folded state sampled predominantly in 1 main conformation (chemical shift can inform 2/3o structure), while unfolded has statistical distribution of rapidly interconverting conformations (chemical shift averaged, random coil values).
2D spectra w/ H1 and N15 (HSQC) help correlate Hs with the Ns they are bound to and determine folding state. E.g., 51-residue protein in unfolded state will show ~51 peaks (the expected # for a 51-residue protein) and poor shift dispersion (shifts averaged over interconverting conformations), folded state w/multiple conformations will increase H shift dispersion and have >51 peaks, and a folded protein with 1 conformation will have a wide shift dispersion and ~51 peaks. 3D and higher dimensional spectra similarly possible and informative.
NMR: NOE, ligand binding dynamics and intrinsically disordered proteins/domains
Nuclear Overhauser effect (NOE): transfer of magnetisation from 1 nucleus to another through space (not through bonds). Can be detected on NOESY spectra as cross peaks between nuclei- gives info on 3D structure+ which atoms are close in space, provides constraints for molecular models. Computational molecular dynamics calculations satisfy constraints-> ensemble of structures best fitting constraints (poorly defined regions w/few constraints likely mobile), ensemble refined and integrated-> model.
Ligand binding and dynamics can be analysed- shifts report on interactions, can estimate rate and range of motion using relaxation rates (e.g., overall tumbling by NOE, aromatic ring flips by saturation transfer or chemical shift).
Intrinsically disordered proteins/domains- don’t have defined 3D structure, rich in Gly/Pro/charged/polar residues, can’t form hydrophobic core. Structure often assumed on binding other proteins/DNA etc. X-ray and cryo-EM can’t be used, AlphaFold also struggles so NMR best method for study.
Cryo-EM
Sample in solution applied to microscopic grid, stained w/uranyl acetate (-ve stain, high contrast, low res) or frozen w/liquid ethane, water vitrifies allowing imaging w/out stain. Obtain 2D images of a set of particles-> 2D averages-> initial 3D map refined to form final map-> 3D model. Atomic/2-3A resolution, easier now with better microscopes/detectors and freezing conditions, vitreous ice allows imaging of native conformations. Better for larger (>100kDa) proteins+ multimers. Get static views of many conformations-> clues to movement. 1980s.
Best for large objects w/internal symmetry, e.g., regular icosahedral virus. Images of helical filaments show projections from all angles, w/helical parameters derived mathematically, and helical projections used for averaging/minimising noise. Bending of helix presents biggest problem.
Cryo-ET
Multiple projection images from relatively thick, vitrified specimen-> back projection reconstruction-> 3D images, e.g., 10 digital slices ~2.5nm thick. Current res 4-5nm- only ~400kDa+ proteins seen, but res <1nm en route. If macromolecule structure available from other techniques, pattern recognition techs can ID copies of a particular molecule in the cell.