Bioinformatics Flashcards

Question

Scoring matrix

Answer 1

A tool to quantify how well a certain model is represented in the alignment of two sequences, and any result obtained by its application is meaningful exclusively in the context of that model. All subsequent results depend critically on just how this is done and what model lies at the basis for the construction of a specific scoring matrix.

Answer 2

Not used that much. Examples: identity matrix, BLAST matrix, transition/transversion matrix.

Answer 3

Mutation that conserves the ring number of the nucleotide

Answer 4

Mutation that does not conserve the ring number of the nucleotide
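The ring-number rule from the two cards above can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard grouping of DNA bases into two-ring purines (A, G) and one-ring pyrimidines (C, T); the function name `classify_substitution` is illustrative only.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base change by whether the ring number is conserved."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "identity"
    pair = {ref, alt}
    # Ring number is conserved when both bases belong to the same class.
    ring_number_conserved = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if ring_number_conserved else "transversion"
```

For example, `classify_substitution("A", "G")` conserves the ring number, while `classify_substitution("A", "C")` does not.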

Answer 5

Used to define the evolutionary distance between two aa by the minimal number of nucleotide changes required. The probability that an observed aa pair is related by chance rather than inheritance should depend on the number of point mutations needed to transform one codon into the other. The matrix shows that the genetic code appears to have evolved to minimize the effects of point mutations: mutations often give aa with similar properties.

Answer 6

Side chains consist of nonpolar methyl or methylene groups. They are usually located in the interior of the protein because of their hydrophobicity. All except alanine are bifurcated. For Val and Ile the bifurcation is close to the main chain and can therefore restrict the conformation of the polypeptide by steric hindrance.

Answer 7

Only phenylalanine is totally non-polar. Tyrosine’s phenolic side chain has a hydroxyl substituent and tryptophan has a nitrogen atom in its indole ring system. These residues are almost always found largely buried in the hydrophobic interior of proteins, which is predominantly non-polar. However, the polar atoms of tyrosine and tryptophan allow hydrogen bonding interactions with other residues or even solvent molecules.

Answer 8

Neither strongly charged nor nonpolar. They are intermediate in polarity and are typically hydrophilic, meaning they interact favorably with water.
Small aliphatic side chains with polar groups that cannot ionize readily. Serine and threonine possess hydroxyl groups in their side chains, and as these polar groups are close to the main chain they can form hydrogen bonds with it. This can influence the local conformation of the polypeptide. Residues such as serine and asparagine are known to adopt conformations which most other amino acids cannot. The amino acids asparagine and glutamine possess amide groups in their side chains, which are usually hydrogen-bonded whenever they occur in the interior of a protein. The substitution Ser <-> Thr is the most common in nature.

Answer 9

Aspartate and glutamate have carboxyl side chains and are therefore negatively charged at physiological pH. The strongly polar nature of these residues means they are often found on the surface of globular proteins, where they can interact with solvent molecules. They can also take part in electrostatic interactions with positively charged basic aa. Aspartate and glutamate can also take on catalytic roles in the active site of enzymes and are well known for their metal ion binding abilities.

Answer 10

Histidine has the lowest pKa (around 6) and is neutral at around physiological pH. It occurs often in enzyme active sites as it can function as a very efficient general acid-base catalyst. It also acts as a metal ion ligand in many cases. Lysine and arginine are more strongly basic and positively charged at physiological pH. They are generally solvated but occasionally occur inside proteins, involved in electrostatic interactions with negatively charged groups. Lys and Arg are important for anion-binding proteins because they can interact electrostatically with the ligand.

Answer 11

Glycine and proline - unique, appear to influence conformation of the polypeptide. Gly lacks a side chain and is very flexible in conformation. Occurs abundantly in certain fibrous proteins because of its flexibility and since small size allows adjacent polypeptide chains to pack together closely. Proline on the other hand is the most rigid aa because the side chain is covalently linked with main chain nitrogen.

Answer 12

If you want to predict which part of a protein is going through a membrane. An attempt to quantify some physical or chemical attribute of the residues and assign weights based on similarities of the residues in this chosen property

Answer 13

A family of matrices that scores aa pairs on the basis of the expected frequency of substitutions of one aa for the other during protein evolution.

Answer 14

Percent accepted mutation, one accepted point mutation on the path between two sequences per 100 residues

Answer 15

1. Find accepted mutations
2. Frequencies of occurrence
3. Relative mutabilities
4. Mutation probability matrix
5. The evolutionary distance
6. Relatedness odds
7. Log-odds matrix

Answer 16

Size; shape; local concentrations of electric charge; van der Waals surface; ability to form salt bridges; hydrophobic interactions; hydrogen bonds.

Answer 17

*Chance that a certain residue may have mutated, then reverted, hiding the effect of the mutation.
*Specific residues may have mutated more than once → the number of mutations is likely to be larger than the number of differences between the two sequences.

Answer 18

When the PAM distance value between two distantly related proteins nears the value 250 it becomes difficult to tell whether the two proteins are homologous, or if they are two randomly taken proteins that can be aligned by chance.

Answer 19

Closely related sequences. High scores for identity and low scores for substitutions, closer to the identity matrix.

Answer 20

Distant sequences. At PAM200 all information is degenerate except for cysteines.

Answer 21

*Many sequences depart from average composition.
*Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).
*Errors in 1PAM are magnified in the extrapolation to 250PAM.
*Distantly related sequences usually have islands (blocks) of conserved residues → replacement is not equally probable over the entire sequence.

Answer 22

Blocks substitution matrix. Scores aa pairs based on frequency of aa substitutions in aligned sequence motifs called blocks that are found in protein families. Comes to the same conclusion as PAM.

Answer 23

A. Observed pairs
B. Expected pairs
C. Summary (A/B)
High BLOSUM: closely related sequences. Low BLOSUM: distant sequences.
BLOSUM45 <-> PAM250, BLOSUM62 <-> PAM160.
BLOSUM62 is the most popular matrix.

Answer 24

High BLOSUM: Closely related sequences

Answer 25

Distant sequences

Answer 26

No single matrix is the complete answer for all sequence comparisons. It is probably best to complement the BLOSUM62 matrix with comparisons using 250PAMs and Overington structurally derived matrices.

Answer 27

Graphical representation using two orthogonal axes and “dots” for regions of similarity. In a bioinformatics context, two sequences are used on the axes and dots are plotted when a given threshold is met in a given window. Dot plotting is the best way to see all of the structures in common between two sequences or to visualize all of the repeated or inverted structures in one sequence.

Answer 28

Nucleic acids: 1 of 4 bases will match at random. Removing self alignments will reduce noise. Stringency: Window size is considered, percentage of bases matching in the window is set as threshold.
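The window and stringency idea can be illustrated with a small Python sketch. It is only an illustration under simple assumptions: identity matching, a fixed window size, and a threshold given as a minimum number of matching positions per window. The names `dot_plot`, `window` and `min_matches` are hypothetical.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 3, min_matches: int = 3):
    """Return the set of (i, j) window start positions that pass the threshold.

    A dot is placed at (i, j) when the window starting at position i in seq_a
    and the window starting at position j in seq_b share at least
    `min_matches` identical positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches >= min_matches:
                dots.add((i, j))
    return dots
```

Raising `window` or `min_matches` increases stringency and removes dots that arise from random single-base matches.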

Answer 29

Can be global or local. Local alignment looks at the portion that aligns optimally, while global alignment looks at everything (and we are allowed to insert gaps to make it fit). Works for basically every sequence. However, it cannot handle multiple sequences and does not scale with the size and number of sequences. Global: sequences are completely aligned. Local: only the best sub-regions are aligned. BLAST uses this.

Answer 30

Method or a process followed to solve a problem. A recipe. An algorithm takes the input to a problem (function) and transforms it to the output. A mapping of input to output. A problem can have many algorithms.

Answer 31

A process of aligning multiple sequences of nucleic acids or proteins to identify similarities and differences among them. The sequences being aligned can be DNA, RNA, or proteins, and they may come from different organisms. The goal of multiple sequence alignment is to identify conserved regions among the sequences, which can provide insight into their evolutionary relationships and functional significance. Used when we have more than 2 sequences. With three sequences a 3D matrix is formed, which requires more computational power.

Answer 32

How to find gcd(a,b), the greatest common divisor of a and b. Based on a single observation: if a = b*q + r, then any divisor of a and b is also a divisor of r, and any divisor of b and r is also a divisor of a, so gcd(a,b) = gcd(b,r). Use the division algorithm repeatedly to reduce the problem to one you can solve.
Example: gcd(55,35)
55 = 35*1 + 20, so gcd(55,35) = gcd(35,20)
35 = 20*1 + 15, so gcd(35,20) = gcd(20,15)
20 = 15*1 + 5, so gcd(20,15) = gcd(15,5)
15 = 5*3 + 0, done: gcd(55,35) = 5
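The repeated division step translates directly into a short loop. This is a minimal Python sketch of the procedure on this card; the function name `gcd` is chosen for illustration (Python also ships `math.gcd`).

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor by repeated division: gcd(a, b) = gcd(b, a mod b)."""
    while b != 0:
        # Replace (a, b) with (b, r), where r is the remainder of a divided by b.
        a, b = b, a % b
    return a
```

Calling `gcd(55, 35)` walks through the same remainders as the worked example: 20, 15, 5, 0.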

Answer 33

One of the simplest sorting algorithms. It proceeds by walking down the list, comparing adjacent elements and swapping them if they are in the wrong order. The process is continued until the list is sorted.
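The walk-and-swap procedure described here is usually called bubble sort. The sketch below is one straightforward way to write it in Python; the name `bubble_sort` and the choice to return a sorted copy rather than sort in place are assumptions made for the example.

```python
def bubble_sort(items):
    """Return a sorted copy of `items` using adjacent compare-and-swap passes."""
    result = list(items)
    n = len(result)
    swapped = True
    while swapped:
        swapped = False
        for i in range(n - 1):
            if result[i] > result[i + 1]:
                # Adjacent elements are in the wrong order: swap them.
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        # After each pass the largest remaining element sits at the end.
        n -= 1
    return result
```

The loop stops as soon as a full pass makes no swap, which means the list is sorted.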

Answer 34

1. It must be correct: compute the correct function.
2. It must be composed of a series of concrete steps: steps executable by the machine in question.
3. There can be no ambiguity as to which step will be performed next.
4. It must be composed of a finite number of steps.
5. It must terminate.

Answer 35

The best alignment is the one with the maximum total score

Answer 36

Reduce the problem: a large problem becomes simpler to solve if we first know the solution to a smaller problem that is a subset of the larger problem. Make a big problem into a small problem. Ask what the optimal next character is instead of what the optimal whole sequence is, then combine the partial solutions at the end.

Answer 37

Compare two sequences, filling the score matrix from top to bottom and left to right, one line at a time.
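The row-by-row fill can be sketched as follows. This is a minimal sketch of global alignment scoring in the style of Needleman-Wunsch, which is one common algorithm that fills a score matrix this way; the card itself does not name the algorithm. The function name and the scores (match +1, mismatch -1, linear gap -1) are illustrative assumptions.

```python
def global_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Fill the score matrix row by row and return the optimal global score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell depends only on its upper, left and upper-left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    # The bottom-right cell holds the score of the best full-length alignment.
    return score[-1][-1]
```

Recovering the alignment itself would need a traceback from the bottom-right cell, which is left out here to keep the sketch short.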

Answer 38

Sensitivity: ability to find true positives. Specificity: ability to minimize false positives. There is always a trade-off; you cannot have both 100% sensitivity and 100% specificity.
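In terms of counts, the two quantities are usually computed from true/false positives and negatives. The sketch below uses the standard definitions; the function and parameter names are illustrative.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real hits that were found: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-hits that were correctly rejected: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)
```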

Answer 39

Alignment between parts of the two sequences. With a global alignment we will have many matches in the high similarity section and a lot of mismatches and gaps outside this region. Therefore it makes sense to find the best local alignment instead.
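A local alignment score can be sketched with the same kind of matrix fill, with one change: a cell is never allowed to drop below zero, so an alignment can start and end anywhere. This is a minimal sketch in the style of Smith-Waterman, which the card does not name explicitly; the function name and the scores (match +2, mismatch -1, gap -1) are illustrative assumptions.

```python
def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Return the best local alignment score between any parts of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start at this cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows are kept in memory because the sketch reports the score alone; a traceback would need the full matrix.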

Answer 40

Most practical and widely used: hierarchical extensions of pairwise alignment methods. Works on the principle that multiple alignments are achieved by successive application of pairwise methods.

Answer 41

General purpose multiple alignment program for DNA or proteins. Improves the sensitivity of progressive sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Answer 42

BWT is very compact, about ½ byte/base. Can fit onto a standard computer with 2 GB of memory. Linear-time search algorithm. Indexing of the database is the basis of the technique. The algorithm then searches the index, which goes much faster than doing millions of pairwise alignments.

Answer 43

Two operations: rotation and sorting. First, take the input sequence and add a dummy character (dollar sign) that does not occur in the sequence; it is used to keep track of the rotation, e.g. acaacg$. Then make all possible rotations: if "ac" is moved to the end of acaacg$ we get aacg$ac, and any number of leading characters can be moved to the end in the same way, giving all possible rotations. Secondly, sort the rotations alphabetically; the dummy character always sorts first. After sorting, the $ ends up at a different position in each row. The sorting gives the outcome interesting properties, because we are sorting characters depending on their context. https://www.youtube.com/watch?v=gqM3j2IRQH4
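The rotate-then-sort construction can be written out in a few lines of Python. This is a naive sketch for illustration only: it builds every rotation explicitly, which real indexing tools avoid. It uses `$` as the dummy character and relies on `$` sorting before the letters in Python's default string ordering; the function name `bwt` is an assumption.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via explicit rotation and sorting."""
    if sentinel in text:
        raise ValueError("the dummy character must not occur in the sequence")
    s = text + sentinel
    # All rotations: move the first i characters to the end.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Alphabetical sorting; the row starting with the sentinel comes first.
    rotations.sort()
    # The transform is the last column of the sorted rotation matrix.
    return "".join(row[-1] for row in rotations)
```

For the sequence used on this card, `bwt("acaacg")` gives `gc$aaac`.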

Answer 44

T-ranking is a method of ranking the positions of a character within a string. It involves assigning a rank to each character based on its position in the sorted order of all the characters in the string. The T-ranking of a character in the BWT can be used to efficiently locate the character in the original string, which can be useful in various string search and pattern matching tasks.

Answer 45

The i-th occurrence of character c in L (last column) and the i-th occurrence of character c in F (first column) correspond to the same occurrence in T.
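This correspondence is what makes the transform reversible. The sketch below reconstructs T from L by repeatedly jumping from a character in the last column to the same occurrence in the first column. It is a minimal sketch that assumes the dummy character `$` occurs exactly once and sorts before every other character; the name `inverse_bwt` is illustrative.

```python
def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Rebuild the original text from the last column L."""
    # Rank of each character in L: 0 for its first occurrence, 1 for the second, ...
    seen = {}
    ranks = []
    for ch in last:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Row in F where each character's block starts (F is L sorted).
    first_start = {}
    total = 0
    for ch in sorted(seen):
        first_start[ch] = total
        total += seen[ch]
    # Row 0 starts with the sentinel, so its last character is the last character of T.
    row = 0
    reversed_text = []
    ch = last[row]
    while ch != sentinel:
        reversed_text.append(ch)
        # Jump to the row in F holding the same occurrence of ch.
        row = first_start[ch] + ranks[row]
        ch = last[row]
    return "".join(reversed(reversed_text))
```

The text is recovered back to front, which is why the collected characters are reversed at the end.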

Answer 46

SAM files are a type of text file format that contains the alignment information of various sequences that are mapped against reference sequences. These files can also contain unmapped sequences. Since SAM files are a text file format, they are more readable by humans

Answer 47

BAM files contain the same information as SAM files, except they are in binary file format which is not readable by humans. On the other hand, BAM files are smaller and more efficient for software to work with than SAM files, saving time and reducing costs of computation and storage. Alignment data is almost always stored in BAM files and most software that analyzes aligned reads expects to ingest data in BAM format.

Answer 48

The header section may contain information about the entire file and additional information for alignments. The alignments then associate themselves with specific header information. The alignment section contains the information for each sequence about where/how it aligns to the reference genome. Each alignment has:
*query name, QNAME (SAM)/read_name (BAM). It is used to group/identify alignments that are together, like paired alignments or a read that appears in multiple alignments.
*bitwise set of information describing the alignment, FLAG. Provides the following information:
  -are there multiple fragments?
  -are all fragments properly aligned?
  -is this fragment unmapped?
  -is the next fragment unmapped?
  -is this query the reverse strand?
  -is the next fragment the reverse strand?
  -is this the 1st fragment?
  -is this the last fragment?
  -is this a secondary alignment?
  -did this read fail quality controls?
  -is this read a PCR or optical duplicate?
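Because FLAG is a bitwise field, each question in the list corresponds to one bit. The sketch below decodes a FLAG integer using the bit values defined in the SAM specification (0x1 through 0x400); the description strings paraphrase this card, and the names `SAM_FLAG_BITS` and `decode_flag` are illustrative.

```python
SAM_FLAG_BITS = [
    (0x1, "multiple fragments"),
    (0x2, "all fragments properly aligned"),
    (0x4, "this fragment unmapped"),
    (0x8, "next fragment unmapped"),
    (0x10, "this fragment on reverse strand"),
    (0x20, "next fragment on reverse strand"),
    (0x40, "first fragment"),
    (0x80, "last fragment"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def decode_flag(flag: int) -> list:
    """Return the descriptions of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]
```

For instance, a FLAG of 99 is 64 + 32 + 2 + 1, so `decode_flag(99)` reports a paired, properly aligned first fragment whose mate is on the reverse strand.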

Answer 49

The sequence being aligned to a reference may have additional bases that are not in the reference or may be missing bases that are in the reference. The CIGAR string is a sequence of base lengths and the associated operation. They are used to indicate things like which bases align (either a match/mismatch) with the reference, are deleted from the reference, and are insertions that are not in the reference. https://genome.sph.umich.edu/wiki/SAM
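A CIGAR string such as `3M1I3M1D5M` can be split into (length, operation) pairs with a regular expression. The sketch below is a minimal illustration; the operation letters are those of the SAM specification, while the function names and the round-trip validity check are assumptions made for the example.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches stray characters the pattern skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume the reference)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar: str) -> int:
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")
```

An insertion adds to the read length but not to the reference span, and a deletion does the opposite.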

Answer 50

In FASTA, a hit is something in the database that is similar to the query. Similar: a short stretch of sequence is shared. There are different definitions of the stretch.

Answer 51

*For proteins, similar sequences do not have to share identical residues.
*For nucleic acids, due to codon “wobble”, DNA sequences may look like XXyXXyXXy, where X’s are conserved and y’s are not.

Answer 52

BLAST searches a large target set of sequences for hits to a query sequence and returns the alignments and scores from those hits. This process is fast. BLAST programs are designed for fast database searching with minimal sacrifice of sensitivity to distantly related sequences.

Answer 53

Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T. Key concept “neighborhood”: seems similar to FASTA, but we are searching for words which score above T rather than words that match exactly. Calculate the neighborhood (T) for substrings of the query (size w).

Answer 54

Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the result set.
High T: everything has to be very similar; very specific but not very sensitive.
Low T: more sensitive but less specific.
Typically start with a high T and lower it as you move forward.
Choosing a value for w:
- small w: many matches to expand
- big w: many words to be generated
- w=4 is a good compromise
Lowering the segment extension cutoff (S) returns longer extensions for each hit. Changing the minimum E-value changes the threshold for reporting a hit.

Answer 55

The proper value of T depends on both the values in the scoring matrix and the balance between speed and sensitivity. Higher values of T progressively remove more word hits and reduce the search space. A word size (W) of 1 will produce more hits than a word size of 10. In general, if T is scaled uniformly with W, smaller word sizes increase sensitivity and decrease speed. The interplay between W, T and the scoring matrix is critical, and choosing them wisely is the most effective way of controlling the speed and sensitivity of BLAST. For proteins, w=3 is the most common.

Answer 56

Doing a BLAST search is doing an experiment. A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T. This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability. The background distribution of scores must be turned into p-values: for example, what is the chance of seeing a score of 200 given the background distribution? The higher the score, the lower the p-value.

Answer 57

The Erdős-Rényi model, also known as the random graph model, is a statistical model for generating random graphs with a given number of nodes and edges. It is based on the idea of randomly connecting nodes with a certain probability, resulting in a graph that exhibits certain probabilistic properties. Example: p is the probability of "head" when tossing a coin, p = 0.5. For n throws, the expected length R of the longest run of heads is R = log_{1/p}(n). We want to model aa sequence alignment as coin tosses.
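The coin-toss formula is easy to evaluate numerically. The sketch below computes R for given n and p and, for comparison, measures the longest run in a concrete toss string; both function names are illustrative assumptions.

```python
import math
from itertools import groupby


def expected_longest_run(n: int, p: float = 0.5) -> float:
    """Expected longest run of heads in n tosses: log base 1/p of n."""
    return math.log(n) / math.log(1 / p)


def longest_run(tosses: str, symbol: str = "H") -> int:
    """Length of the longest unbroken run of `symbol` in a toss string."""
    return max(
        (len(list(group)) for key, group in groupby(tosses) if key == symbol),
        default=0,
    )
```

With a fair coin and 1024 tosses the formula gives a run of about 10 heads, since 2 to the power 10 is 1024.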

Answer 58

A set of mathematical formulas that are used to evaluate the statistical significance of sequence alignments obtained through the use of heuristics. Are used to calculate the probability that an alignment occurred by chance, allowing researchers to determine the likelihood that the alignment is biologically meaningful. Widely used in bioinformatics to assess the reliability of sequence alignments and to help identify significant matches in large databases.

Answer 59

Probability that the alignment is no better than random. P = 100E-100: perfect match. P > 10E-1: match probably insignificant.

Answer 60

Expected number of sequences that give the same Z-value or better if the database is probed with a random sequence. E = P multiplied by the size of the database probed.

Answer 61

A measure of the statistical significance of a particular match between a query sequence and a database of sequences. The z-score is calculated based on the alignment score and the distribution of scores for a large number of random alignments. A higher z-score indicates a more statistically significant match, and a z-score threshold can be used to determine which matches are considered significant and should be reported. Z-scores are commonly used in bioinformatics to evaluate the statistical significance of sequence alignments obtained through database searches.
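Given a real alignment score and a list of scores from random alignments, the z-score is the distance from the random mean in units of standard deviation. The sketch below is a minimal illustration; how the random scores are generated (for example by shuffling one sequence) is left open, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def z_score(score: float, random_scores) -> float:
    """How many standard deviations `score` lies above the random-alignment mean."""
    # stdev() is the sample standard deviation and needs at least two scores.
    return (score - mean(random_scores)) / stdev(random_scores)
```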

Answer 62

BLAST's major advantage is its speed: 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank. When both programs use their default settings, BLAST is usually more sensitive than FastA for detecting protein sequence similarity, since it doesn't require a perfect sequence match in the first stage of the search.

Answer 63

The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity. For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value. FastA is also better suited to the specialised task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether. In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time.

Answer 64

Position Specific Iterated Blast. The best algorithm to find distantly related sequences.

Answer 65

For each position in the derived pattern, every amino acid is assigned a score. (1) Highly conserved residue at a position: that residue is assigned a high positive score, and others are assigned high negative scores. (2) Weakly conserved positions: all residues receive scores near zero. (3) Position-specific scores can also be assigned to potential insertions and deletions.

Answer 66

Avoid sequences that are too close: risk of overfitting! You want a compromise between the PSSM and overfitting. Where you suspect overfitting, do not use a PSSM; use a normal score matrix instead, where you do not need to be position specific. False homologs can be included! Therefore check the matches carefully: include or exclude sequences based on biological knowledge. If you look for a family about which not much is known, there is a risk that you put too much emphasis on a database which you perhaps should not. The E-value reflects the significance of the match to the previous training set, not to the original sequence! Choose your query sequence carefully. Try the reverse experiment to confirm.

Answer 67

Pattern-Hit Initiated BLAST. Looks into the database; everything reported as a hit has to contain a certain conserved pattern and be homologous. Like doing a FASTA search inside a BLAST search.

Answer 68

BLAST-Like Alignment Tool. Aligns the input sequence to the Human Genome. Connected to several databases.

Answer 69

-more accurate
-500 times faster in mRNA/DNA alignment
-50 times faster in protein/protein alignment

Answer 70

Phylogenetic trees are about visualising evolutionary relationships with the purpose to illustrate how a group of objects are related to one another.

Answer 71

Set of species that include all of the species derived from a single common ancestor

Answer 72

Smallest group that is consistently and persistently distinct. Species recognized initially on appearance; individuals of one species look different from the individuals from another. For plant species.

Answer 73

A set of interbreeding or potentially interbreeding individuals that are separated from other species by reproductive barriers. Different species are unable to interbreed.

Answer 74

the boundary between reticulate (among interbreeding individuals) and divergent relationships (between lineages with no gene exchange). If a stable gene pool can be maintained.

Answer 75

Ability to transmit (and maintain) a (stable) gene pool. Addresses the Anopheles genome topology variations.

Answer 76

-solve crimes
-test product purity
-determine if endangered species have been smuggled or mislabeled
-Epidemiologists use phylogenetic methods to understand the development of pandemics, patterns of disease transmission and development of antimicrobial resistance or pathogenicity.
-Conservation biologists may use the techniques to determine which populations are in greatest need of protection, and other questions of population structure.
-Pharmaceutical researchers may use the methods to determine which species are most closely related to other medicinal species, thus perhaps sharing the medicinal qualities.

Answer 77

To infer relationships that span the diversity of known life, it is necessary to look at genes conserved through the billions of years of evolutionary divergence. The gene must display an appropriate level of sequence conservation for the divergences of interest. If there is too much change, then the sequences become randomized, and there is a limit to the depth of the divergences that can be accurately inferred. If there is too little change (if the gene is too conserved), then there may be little or no change between the evolutionary branchings of interest, and it will not be possible to infer close (genus or species level) relationships. An example of genes in this category are those that define the ribosomal RNAs (rRNAs). Most prokaryotes have three rRNAs, called the 5S, 16S and 23S rRNA.

Answer 78

Rate of evolution = rate of mutation. The rate of evolution for any macromolecule is approximately constant over time (Neutral Theory of evolution) (one amino acid substitution: 14.5 My; 1.3 × 10^-9 substitutions/nucleotide site/year). Proteins evolve at highly different rates, depending on the type of gene. The lowest rates are found for genes related to protein turnover (quite conserved), while pseudogenes (typically a gene with a premature stop, so no full protein is translated and there is no pressure to keep it) evolve fastest.

Answer 79

-Easy to perform
-Quick calculation
-Fit for sequences having high similarity scores

Answer 80

-Sequences not considered as such
-All sites equally treated (do not take differences in substitution rates into account)
-Not applicable to distantly divergent sequences

Answer 81

Able to keep mutations as status quo. The bases of all sequences at each site are considered separately, and the log-likelihood of having these bases is computed for a given topology by using a particular probability model. The log-likelihoods are added over all sites, and the sum is maximized to estimate the branch lengths of the tree. The procedure is repeated for all possible topologies, and the topology showing the highest likelihood is chosen as the final tree.

Answer 82

Needs a long computation time to construct a tree. You can get an enormous number of possible trees, so the method does not work for most problems.

Answer 83

Consists of determining the minimum number of changes (substitutions) required to transform a sequence into its nearest neighbour.

Answer 84

Searches for the minimum number of genetic events to infer the most parsimonious tree from a set of sequences. The best tree is the one that requires the fewest substitutions.

Answer 85

-If the evolutionary clock is not constant, the procedure generates results which can be misleading.
-Within practical computational limits, this often leads to the generation of tens or more "equally most parsimonious trees", which makes it difficult to justify the choice of a particular tree.
-Long computation time to construct a tree.

Answer 86

In an unrooted tree the direction of evolution is unknown. The root is the hypothesized ancestor of the sequences in the tree. The root can either be placed on a branch or at a node. You should start by viewing an unrooted tree. Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot). Sometimes two trees may look very different but, in fact, differ only in the position of the root. Rooting normally involves assumptions… BEWARE!

Answer 87

Bootstrapping is a statistical method that is used to assess the reliability of a phylogenetic tree, which is a tree showing the evolutionary relationships among a group of organisms. The basic idea behind bootstrapping is to create a large number of trees based on different samples of the data used to construct the original tree. To do this you take a random block of the alignment (including gaps and such) and copy it a number of times, then add a second block and copy it a number of times as well, and continue until this new ”alignment” has the same length as the original alignment. This process is done N times, and the tree-building method is applied to each of these resampled alignments. Based on the N trees thus generated you make a consensus tree. You should choose N to be at least 10x the length of the alignment.
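The resampling step can be sketched in Python. The sketch assumes the usual formulation in which single alignment columns are drawn at random with replacement until the pseudo-alignment has the original length, so some columns appear several times and others not at all. The function name `bootstrap_alignment` and the representation of an alignment as a list of equal-length strings are assumptions; tree building and the consensus step are not shown.

```python
import random


def bootstrap_alignment(alignment, rng: random.Random):
    """Return one bootstrap replicate of an alignment.

    `alignment` is a list of equal-length aligned sequences (gaps included).
    Columns are sampled with replacement, keeping each column intact.
    """
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates: int, seed: int = 0):
    """Generate N resampled alignments, one per tree to be built."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

Each replicate would then be passed to the same tree-building method as the original alignment.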

Answer 88

A bootstrap value is a measure of how often a particular branch appears in the bootstrap sample. For example, if a particular branch appears in 90% of the trees in the bootstrap sample, its bootstrap value would be 90. There is no simple mapping between bootstrap values and confidence intervals. There is no agreement about what constitutes a ‘good’ bootstrap value (> 70%, > 80%, > 85%?).

Answer 89

Jack-knifing is very similar to bootstrapping and differs only in the character resampling strategy. Jack-knifing is not as widely available or widely used as bootstrapping. It tends to produce broadly similar results.

Answer 90

This technique resamples half of the sequence sites considered and eliminates the rest. The final sample has half the initial number of sites, without duplication. Half-jackknife is almost never done. It is horizontal (whereas bootstrapping is vertical), so you take out some of the sequences instead of taking parts of the alignments out.
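The first two sentences of this card, keeping half of the sites without duplication, can be sketched as sampling without replacement. This is a minimal sketch under that reading only; it does not model the sequence-removal variant mentioned at the end of the card. The function names and the list-of-strings alignment representation are assumptions.

```python
import random


def half_jackknife_sites(n_sites: int, rng: random.Random):
    """Pick half of the site indices at random, each at most once."""
    return sorted(rng.sample(range(n_sites), n_sites // 2))


def half_jackknife(alignment, rng: random.Random):
    """Return a replicate that keeps only the sampled half of the sites."""
    sites = half_jackknife_sites(len(alignment[0]), rng)
    return ["".join(seq[i] for i in sites) for seq in alignment]
```

Unlike the bootstrap, no site can appear twice, so the replicate is shorter than the original alignment.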

Answer 91

0: Zeroth. Amino acid composition (proteomics, %cysteine, %glycine). Cysteine: cysteine bridges. Glycine: spacers, to make functional domains in the proteins.
1: Primary. This is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself.
2: Secondary. Local organization of the protein backbone: alpha-helix, beta-strand (which assemble into beta-sheets), turn and interconnecting loop.
3: Tertiary. Packing of secondary structure elements into a compact spatial unit (fold or domain). This is the level to which structure prediction is currently possible.
4: Quaternary. Assembly of homo- or heterodimeric protein chains. Hard to predict.

Answer 92

Able to see the psi and phi angles, which go from -180 to +180. Looking at known structures enables us to estimate the angles. Nature has a highly expressive alphabet for primary sequences, but due to the nature of the peptide bond, certain angles are observed preferentially.

Answer 93

Secondary structure prediction. The method uses a set of empirical rules that consider the aa sequence of a protein and the physical and chemical properties of individual aa. The rules are used to predict the likelihood that a particular aa will be part of an alpha helix, beta sheet or loop region. It is a widely used method in protein structure prediction, but is not as accurate as some recent methods. It is still useful for understanding the basic principles of protein structure and identifying potentially important parts of a protein. The method consists of assigning a set of prediction values to a residue, based on statistical analysis of 15 proteins, and applying a simple algorithm to those numbers.

Answer 94

A plot where the x-axis is the length of the alignment and the y-axis is the % identical residues. Naturally occurring sequences with >20% sequence identity over 80 or more residues always adopt the same basic structure. The line of the plot is important because it tells us that if the alignment is sufficiently long and we have 30% identical residues --> the structures are the same. A remarkably low percentage is needed to say that the structure is the same.

Answer 95

Compact folding unit of protein structure, usually associated with a function. Is usually a “fold” in the case of monomeric soluble proteins. Comprises normally only one protein chain. Domains can be shared between different proteins.

Answer 96

Membrane-bound receptors. A very large number of different domains, both to bind their ligand and to activate G proteins. Pharmaceutically the most important class.

Answer 97

X-ray crystallography is an experimental technique that exploits the fact that X-rays are diffracted by crystals. X-rays have the proper wavelength (in the Ångström range, ~10^-8 cm) to be scattered by the electron cloud of an atom of comparable size. Uses protein crystals.

Answer 98

NMR uses protein in solution
- Can look at the dynamic properties of the protein structure
- Can look at the interactions between the protein and ligands, substrates or other proteins
- Can look at protein folding
- Sample is not damaged in any way
- The maximum size of a protein for NMR structure determination is ~30 kDa. This eliminates ~50% of all proteins
- High solubility is a requirement

Answer 99

a) Finding a structural homologue
b) Extract “template” sequences and align with query
c) Input for model building
d) Methods
e) Model evaluation (how good is the prediction, how much can the algorithm rely on/extract from the provided templates)

Answer 100

CASP is a biennial experiment that aims to evaluate and compare the accuracy of different methods for predicting the 3D structure of proteins from their amino acid sequences. During the experiment, participating groups submit predictions for a set of proteins whose structures are not yet known (referred to as "targets"). The structures of these proteins are later determined experimentally and the predictions are evaluated for their accuracy. The results of the CASP experiment provide a benchmark for the current state of the art in protein structure prediction and help researchers identify areas for improvement in their methods.

Answer 101

Genome structure, gene organisation, known promoter regions, known critical amino acid residues.

Answer 102

All cells have "sidechains" or molecules hanging outside of them that recognize specific extracellular chemicals

Answer 103

Cells have receptive substances on them that can be affected by agonist molecules or blocked by antagonist molecules

Answer 104

Enzymes have an active site (LOCK) where the substrate (KEY) binds. The enzyme's action on the substrate makes the key ill-fitting, and the product leaves the active site.

Answer 105

a large collection of compounds with different chemical properties or shapes, generated either by combinatorial chemistry or some other process or by collecting samples with interesting biological properties.

Answer 106

the automated examination and testing of libraries of synthetic and/or organic compounds and extracts to identify potential drug leads, based on the compound's binding affinity for a target molecule.

Answer 107

Concentration at which 50% of the enzyme activity is inhibited. Activity can be saturated. You need to be sure that you have a single compound binding a single target, and not multiple compounds or multiple targets. Used to double-check that everything done up to this point is correct.
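As a rough illustration of reading the 50% point off a dose-response series, the sketch below interpolates on a log-concentration scale between the two measurements that bracket 50% remaining activity. It is an assumption-laden shortcut rather than a curve fit: activities are taken as percent of an uninhibited control, concentrations must be positive, and the function name `estimate_ic50` is hypothetical.

```python
import math


def estimate_ic50(concentrations, activities):
    """Estimate the concentration giving 50% activity by log-linear interpolation.

    concentrations: positive inhibitor concentrations (any consistent unit).
    activities: remaining enzyme activity in percent of the uninhibited control.
    Returns None if no pair of neighbouring points brackets 50%.
    """
    points = sorted(zip(concentrations, activities))
    for (c_lo, a_lo), (c_hi, a_hi) in zip(points, points[1:]):
        if a_lo >= 50.0 >= a_hi and a_lo != a_hi:
            # Fraction of the way from the lower to the higher concentration.
            frac = (a_lo - 50.0) / (a_lo - a_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None
```

If the series never crosses 50% (for example because inhibition saturates early), the function returns `None` instead of extrapolating.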

Answer 108

a potential drug candidate emerging from a screening process of a large library of compounds

Answer 109

-It specifically affects a biological process. Mechanism of activity (reversible/irreversible, kinetics) established -It is effective at a low concentration: usually nanomolar activity -It is not toxic to live cells -It has been shown to have some in vivo activity -It is chemically feasible. Specificity of key compound(s) from each lead series against a selected number of receptors/enzymes -Preliminary PK in vivo (rodent) to establish a benchmark for in vitro SAR -In vitro PK data are a good predictor for in vivo activity -It is of course new and original.

Answer 110

Poor absorption or permeation is more likely when: 1. There are > 5 H-bond donors (expressed as the sum of OHs and NHs); 2. The MWT is > 500; 3. The LogP is > 5 (or MLogP is > 4.15); 4. There are more than 10 H-bond acceptors (expressed as the sum of Ns and Os)
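The four criteria can be checked mechanically once the descriptors are known. The sketch below assumes the descriptors have already been computed elsewhere (for example by a cheminformatics toolkit) and uses the cutoffs as commonly stated for this rule: more than 5 donors, more than 10 acceptors, molecular weight above 500 and LogP above 5. The function name and the returned labels are hypothetical.

```python
def rule_of_five_violations(h_donors, h_acceptors, mol_weight, log_p):
    """Return a list naming each criterion the compound violates.

    h_donors: number of OH and NH groups.
    h_acceptors: number of N and O atoms.
    mol_weight: molecular weight in Da.
    log_p: octanol/water partition coefficient (LogP).
    """
    violations = []
    if h_donors > 5:
        violations.append("H-bond donors > 5")
    if h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    if mol_weight > 500:
        violations.append("molecular weight > 500")
    if log_p > 5:
        violations.append("LogP > 5")
    return violations
```

With hypothetical descriptor values, a small polar compound such as `rule_of_five_violations(1, 4, 180.2, 1.2)` returns an empty list, while a large lipophilic one trips several criteria at once.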

Answer 111

New concept. A trial where the participants are deliberately infected to fast-track data acquisition and get the vaccine faster. Of course not suitable for all diseases. Could work if the participants are young and healthy. The trial can be much smaller than normal when all participants get infected. Run as a 50/50 CT of participants with and without receiving the drug. If different companies run the same trial they should share control arms: is it necessary to let that many people go without the drug?

Answer 112

Personal genomics manifesto

Answer 113

Clusters of conserved residues. They carry out a particular function or form a particular structure that is important for the conserved protein.

Answer 114

For amino acids, a number representing the hydrophobic or hydrophilic properties of the side chain. The larger the number, the more hydrophobic the amino acid. The most hydrophobic amino acids are isoleucine (4.5) and valine (4.2). The most hydrophilic ones are arginine (-4.5) and lysine (-3.9). This is very important in protein structure: hydrophobic amino acids tend to be internal in the protein 3D structure, while hydrophilic amino acids are more commonly found towards the protein surface. In a Kyte-Doolittle plot, a window size of 19 with peaks >1.8 indicates a possible transmembrane region, whereas a window size of 9 is used to indicate possible surface regions of globular proteins.
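A minimal sketch of a sliding-window hydropathy profile, using the Kyte-Doolittle scale and the window size of 19 and cutoff of 1.8 mentioned on this card. The function names are hypothetical, and reporting every window above the cutoff (rather than merging neighbouring windows into segments) is a simplification.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return the mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, cutoff=1.8):
    """Return start indices of windows whose mean hydropathy exceeds the cutoff."""
    return [i for i, value in enumerate(hydropathy_profile(seq, window)) if value > cutoff]
```

Calling `hydropathy_profile(seq, window=9)` gives the shorter-window profile used for surface regions, where strongly negative stretches are the ones of interest.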

(138 cards)