Bioinformatics Exam 3 Review Flashcards

1
Q

Programming

A

helps in collecting and manipulating data, automating analysis workflows, minimizing human error, generating reproducible reports (so others can see exactly what you did), quickly processing large datasets and repetitive tasks, and visualizing and making sense of the data

2
Q

Programming language

A

letters/symbols form words according to rules; a language for humans to formulate instructions for computers to generate some desired output. Compiler and interpreter software translate instructions formulated in a programming language into executable machine-level operations.
Pathway: instruction (in your mind) → instruction in programming language → instruction in machine-level language → execution of instruction/computation → generated output

3
Q

Source Code

A

a set of instructions formulated in a programming language that is readable by humans

4
Q

Program

A

a set of instructions stored in a form that can be executed by a computer

5
Q

Compiler

A

software that translates source code into a machine-level program that is (usually) efficiently optimized for the machine it is compiled for
Time to translate → Slow
Time to execute → Fast

6
Q

Interpreter

A

translates source code scripts into machine level operations “on the fly” and executes them line by line
Time to translate → Fast
Time to execute → Slow

7
Q

1976

A

Chambers, Becker and Wilks develop the S statistical programming language at Bell Laboratories
Aim: facilitate quick transitions from idea to software
This interpreter-based language made modifying, testing and troubleshooting programs quick and convenient.

8
Q

1993

A

Ihaka and Gentleman re-implement S and name it the “R programming language”

9
Q

1995

A

it is decided that R will be made freely available under the GNU General Public License (but it is not yet officially released)

10
Q

1997

A

the R Core Group is founded and starts taking control of R’s further development; the Comprehensive R Archive Network (CRAN) is launched, enabling sharing and curation of user-developed components that extend R’s capabilities

11
Q

2000

A

R version 1.0.0 is released to the general public

12
Q

2009

A

New York Times article: “Data Analysts Captivated by R’s Power” by Ashlee Vance
Good description of how R makes a difference → Daryl Pregibon (Google): “it allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems”

13
Q

2017

A

a study found that R has shown extreme growth

14
Q

2019

A

Another study found that R is the most requested programming language

15
Q

Comprehensive R Archive Network (CRAN)

A

a network of ftp and web servers storing versions of code and documentation for R. This serves as the main general-purpose repository for R packages; for common problems, a pre-made package often already exists that can be used instead of solving the problem from scratch.

16
Q

R

A

language and environment for statistical computing and graphics, open source language that is free, provides tools for statisticians, data miners, data analysts, data scientists and academic researchers

17
Q

Bioconductor

A

another R package repository; free, dedicated to the analysis of genomic data and biological high-throughput assays, with a primary focus on serving the needs of bioinformaticians and biomedical researchers
Packages available: >1800
Mission: accessibility of powerful analysis and visualization tools, reproducible research, rapid development of software components that are both scalable and compatible with each other

18
Q

Commands in R

A

R’s interpreter can process 2 forms of these → expressions and assignments; these can be separated by line-breaks or the “;” character, and individual components within commands can be arbitrarily separated by spaces and tabs

19
Q

Expressions

A

commands that are evaluated, printed (optionally), and whose output is then lost; these take some input arguments or values and return some output values

20
Q

Operators

A

are generally expressed via 1 to 3 consecutive special characters and often handle fundamental, essential programming tasks; several other operators handle tasks such as logic or comparison
Example: ? opens a webpage with helpful documentation and explanations of a function

21
Q

Objects

A

individual pieces of data that have two major attributes:
Data type: what type of information it contains
Value: the actual information that it contains
NOTE: internally the value of an object is just a bunch of zeros and ones in the memory of the computer; the data type is what tells R how to interpret and display the value of the object.

22
Q

Scalar and multidimensional data types

A

the two fundamental classes of data types

23
Q

Character Objects

A

display letters, words and text, wrapped in quotation marks

24
Q

Logical Objects

A

only two possible values (yes and no / true and false, abbreviated T and F); used when you want to check or remember whether or not something is true or has happened while a program runs.

25
Numerical Objects
integers and decimals
26
Parenthesis
used to group arguments of expressions in conjunction with commas and can be used to control the order of operations in expressions; the expression enclosed within the innermost parentheses is always evaluated first
27
Assignment Commands
commands that evaluate an expression and store the result so that it can be accessed again in the future. These store objects in variables: the expression on the right-hand side can be an R object or any valid expression that creates an R object, and the left-hand side is a variable, which can be understood as a label or name attached to an object so that R remembers it.
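A minimal sketch of the expression/assignment distinction (the variable name x is made up for illustration):

```r
# An expression: evaluated, its value printed, and the result is then lost
1 + 2

# An assignment: the result is stored under a name (a variable)
x = 1 + 2

# The stored object can be accessed again in later commands
x * 10   # 30
```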
28
R Console
an interactive interface for the command line interpreter (it comes up when you open R); commands can be typed into the console and executed by hitting the Enter/Return key
29
Incomplete Commands
expressions that have not provided necessary right-hand side arguments or expressions that have not (yet) closed all their opened parentheses or brackets.
30
Negation/ "NOT" operator
done with an exclamation point, this turns true into false and false into true (opposites)
31
Logical "AND"
takes two logical expressions/variables and returns TRUE ONLY if BOTH are TRUE, otherwise it will return FALSE
32
Logical "OR"
returns TRUE if either the left or the right expression is true; otherwise it returns FALSE. Operator: |
33
"Equal to" Operator
takes R objects (or expressions generating an R object) and returns TRUE if both objects are identical to each other. Operator: == (Two consecutive equal signs)
34
"Not equal to" Operator
takes R objects (or expressions generating an R object) and returns TRUE if the objects are not identical to each other Operator: !=
35
Inequality Operators
compare numerical objects to each other | Operators: < (less than), <= (less than or equal to), > (Greater than), >= (greater than or equal to)
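A quick sketch of the logical and comparison operators from the cards above:

```r
# Negation / "NOT": turns TRUE into FALSE and vice versa
!TRUE          # FALSE

# Logical "AND": TRUE only if BOTH sides are TRUE
TRUE & FALSE   # FALSE

# Logical "OR": TRUE if either side is TRUE
TRUE | FALSE   # TRUE

# Equality, inequality and comparisons
3 == 3         # TRUE
3 != 4         # TRUE
2 <= 2         # TRUE
5 > 7          # FALSE
```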
36
If blocks
how to make logical expressions useful; help us execute pieces of code conditionally and react to different inputs/scenarios while the program is running Syntax: if (condition){ # conditional lines of code go here } where "condition" is a logical expression or variable. If the condition is met, the code block in curly brackets is executed; if it is not met, the block is skipped.
37
If...else statements
allow us to conveniently cover two mutually exclusive cases (i.e. if one is true the other is false) Syntax: if (condition){# if "condition" is TRUE then do...} else {# if "condition" is FALSE then do...}
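A minimal if...else sketch (the variable name and messages are made up for illustration):

```r
x = 10
if (x > 5) {
  msg = "x is greater than 5"   # runs when the condition is TRUE
} else {
  msg = "x is 5 or smaller"     # runs in the mutually exclusive case
}
msg   # "x is greater than 5"
```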
38
Vectors
creating and manipulating simple, ordered lists of a specific scalar data type Three components: scalar data type, ordered cells, values Created through the concatenation function c() All elements of a vector must have the same data type (you CANNOT create a vector that contains both characters and numbers); a mixed-type vector is not possible
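A small sketch of vector creation; note that when types are mixed, R silently coerces everything to a common type (here character) rather than raising an error:

```r
v = c(1, 5, 9)     # numeric vector built with the concatenation function c()
length(v)          # 3 ordered cells

mixed = c("a", 1)  # no error, but R coerces everything to character
class(mixed)       # "character"; the number 1 became the string "1"
```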
39
Multi-dimensional Data types
more complex data types that are able to contain and arrange multiple scalars (vectors are the simplest of these types)
40
Binary operators
require that either both arguments are vectors of the same length or one of the two arguments is a single object (i.e. a vector of length one) If the lengths differ → R will give a warning and "recycle" the shorter vector from the beginning to extend its length
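A small recycling sketch; note that R only emits the recycling warning when the longer length is not a multiple of the shorter one, so the example below recycles silently:

```r
# shorter vector c(10, 20) is recycled from the beginning
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24

# a single object (vector of length one) is applied to every element
c(1, 2, 3) * 2              # 2 4 6
```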
41
Subset operator
accepts vectors of logical expressions (the logical vector has to be the same length as the vector we want to subset); accesses specific elements in the vector and returns a new vector with said elements Symbol: [...]
42
Matrix
works just like a vector but has 2 dimensions (cells have an x- and y-position), instead of just one
43
Array
the abstraction of both vectors and matrices; a multidimensional matrix with an arbitrary number of user-specified dimensions A 1-dimensional R ______ is equivalent to an R vector and a 2-dimensional R _______ is equivalent to an R matrix
44
List
are an extension of the vector idea; a generic collection of R objects Each element in a ______ is an R object with an arbitrary data type and dimension (different lengths as well as complex objects are also allowed) Useful to group various types of data that belong together (where they do not conveniently fit into a single table) Can also use the assignment operator inside the ______ function to give elements names by which they can be accessed in the future Syntax of the _____ function: _____( arg1, arg2, arg3...)
45
Extraction operator
will access a specific R object in the list and return the said object directly. Symbol: [[...]] or $
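A small sketch of the extraction operators, using a made-up list:

```r
person = list(name = "Ada", scores = c(90, 85))

person$name          # "Ada": extract by name with $
person[["scores"]]   # the numeric vector itself, via [[...]]
person["scores"]     # note: SINGLE brackets return a list containing the element
```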
46
Wrappers
any entity that encapsulates another entity; an object that holds other objects or a "container" for other objects. Lists are generic wrappers for R objects; you probably won't work directly with lists in this course (lists allow us to form a general intuition about other generic _________) Generic __________: three types → S3, S4 and S5 objects
47
Loops
help us perform repetitive tasks, reduce redundancy of code and reduce the amount of code required to perform a task. Note that R only has "for each" loops and "while" loops. The index variable can help us execute the same piece of code for different inputs.
48
"For" loops
repeat something "N" times (we choose N) | Structure: for( i in 1:n){# repeat the following code...}
49
"For each" loops
repeat something for each element in a set Structure: for(i in elements){# repeat the following code...}
Header → defines an index variable (here named i) and an R vector of "elements" for each of which we want to repeat something
Body → the code block that will be repeated
Can be read as: for each "i" in a set of "elements", do this
What it does (sequence of events): 1) sets the index variable "i" to the 1st object in "elements" 2) executes the code inside the body 3) sets the index variable "i" to the 2nd object in "elements" 4) executes the code inside the body 5) repeats for all objects in "elements"
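A minimal "for each" sketch (the vector of elements is made up for illustration):

```r
total = 0
for (i in c(2, 4, 6)) {   # i takes the value 2, then 4, then 6
  total = total + i       # body is executed once per element
}
total   # 12
```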
50
Functions
important form of expression; take arbitrarily many arguments (0, 1, 2, 3, ...), usually perform more complex tasks and return some desired output. R has pre-defined functions and allows users to create their own; after a fxn has been created it can be used as a shorthand to run the code encapsulated inside of it. NOTE: fxns have "local scope" whereas code outside of functions has "global scope" (meaning variables inside of a fxn are created independently and separately from the R environment outside the function); R will discard any assignments made inside of a function after it has been executed. When a fxn has multiple input arguments they will be assigned in order when the fxn is executed (NOTE: the order can be arbitrarily changed if arguments are referred to by their names); default values for input arguments can be set by using the assignment operator next to input variables in the header line of the function. Syntax: nameOfFunction(arg1, arg2, arg3, ...) my_function = function(arg1, arg2, arg3, ...){ # function body goes here... } Header: defines the name of the function on the left of the assignment operator and the names of input variables/arguments of the function (these variable names are provided in the parentheses) Body: the code block that will be executed when "my_function" is used
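A minimal sketch of a user-defined function with a default argument (the function name raise is made up for illustration):

```r
# default value for 'power' set via "=" in the header line
raise = function(base, power = 2) {
  result = base ^ power   # local variable, discarded after the call
  return(result)
}

raise(3)                    # 9: default power = 2 used
raise(3, 3)                 # 27: arguments matched by position
raise(power = 3, base = 2)  # 8: order changed by naming the arguments
```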
51
Matrix function
my_matrix = matrix( data = ?, nrow = ?, ncol = ?, byrow = ? ) Input arguments: "data" → a vector of values or objects that will be put into the cells of the matrix; "nrow" → the number of rows in the matrix; "ncol" → the number of columns in the matrix; "byrow" → logical, if TRUE values populate the table in order row-by-row, if FALSE values populate the table in order column-by-column
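A small sketch contrasting byrow = TRUE and byrow = FALSE, plus a cbind example (object names are made up):

```r
m = matrix(data = 1:6, nrow = 2, byrow = TRUE)
m[1, 2]    # 2: values were filled row by row (row 1 is 1 2 3)

m2 = matrix(data = 1:6, nrow = 2, byrow = FALSE)
m2[1, 2]   # 3: values were filled column by column (column 1 is 1 2)

dim(cbind(m, m2))   # 2 6: columns of m2 attached to the right of m
```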
52
cbind(A,B)
aka column bind, stitches A and B together into a single matrix such that the columns of B are placed to the right of the columns of A
53
Subset
the ________ operator can also be used in combination with the assignment operator to modify specific elements inside of the matrix; a matrix can be subset and accessed using the ________ operator [rows,columns] in several different ways
54
Data Frames
the data frame is R's preferred data type to represent R × C data tables; like a matrix it has R rows and C columns and it supports all subset operations [A,B] that matrices have access to; in contrast, each column has its own associated data type and each column can have a different data type (this offers a lot of additional functionality) my_tab = data.frame( column1 = c(...), column2 = c(...), column3 = c(...), ... )
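A minimal data frame sketch with per-column data types (names and values are made up):

```r
my_tab = data.frame(
  name = c("Alice", "Bob"),   # character column
  age  = c(30, 25)            # numeric column: each column has its own type
)

my_tab$age                        # 30 25
my_tab[my_tab$age > 26, "name"]   # "Alice": matrix-style [rows, columns] subsetting
```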
55
Reading
the process of loading a file into memory Fxn: read.table; this fxn reads a simple text file containing table data into R and turns it into a data.frame object. It expects the file to contain plain text such that each line in the file represents one row of the table and columns are separated by a special character (usually a space " " or tab "\t" character); in the "people.txt" file columns are separated by the " " (space) character Syntax: my_tab = read.table( file = "...", header = TRUE, sep = "..." )
56
Writing
the process of saving data currently in a computer's memory to a specific file Fxn: write.table; can be used to write data frame objects to simple text files, e.g. to add a new column to a dataset and save the modified table into a separate file NOTE: omit quotes around objects (quote = FALSE) and do not include row names as a separate column (row.names = FALSE)
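A small round-trip sketch of write.table and read.table, using a temporary file so nothing permanent is written (the file name and table contents are made up):

```r
tab = data.frame(name = c("Alice", "Bob"), age = c(30, 25))

f = tempfile(fileext = ".txt")   # temporary file path for this demo

# write: no quotes around values, no row-name column
write.table(tab, file = f, sep = "\t", quote = FALSE, row.names = FALSE)

# read it back into a data.frame
tab2 = read.table(file = f, header = TRUE, sep = "\t")
tab2$age   # 30 25
```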
57
Installing and loading R packages
R packages that were uploaded to CRAN can be installed via the "install.packages" fxn, which accepts a package name in the form of a character object After the package is installed it can be loaded into R by using the "library" fxn Loading a package makes new commands available to the R user, which are most commonly implemented as fxns For CRAN packages, info on what's included in a package and how to use it can be found via: https://cran.r-project.org/web/packages/packagenamegoeshere/ For a specific example consult the lecture slides
58
Random Variable
a mathematical object that represents the random process behind a measurement; it can realize/assume/generate different values with different probabilities. When an experimental measurement is taken, the resulting value is considered a realization of a random variable. Parameters of a distribution represent truths/properties about the random process that generates the outcome data; this means that questions about outcome variables and the process that created them can generally be formulated as questions about distributional parameters. Events in the sample space are mapped to real numbers and then assigned probabilities. Random variables accomplish the following tasks:
Transfer questions about real-life events into questions about numbers (allow us to evaluate and manipulate any type of probabilistic event with mathematical language and powerful mathematical tools)
Provide a unified framework in which we can gain insights about properties and consequences of random processes
59
Probability Distribution
the probability distribution of an RV assigns probabilities to sets of events in the sample space (i.e. the set of all possible events); the exact probability distribution of a random variable is unknown (based on knowledge and assumptions about the sample space and the data-generating process we can assign a family of distributions to random variables)
60
Probability density function (pdf)
is a component of probability distribution, it can be loosely understood as the function that takes an individual event from a sample space and returns its associated probability
61
Families of distributions
have a pdf that contains some unknown parameters (choosing numbers for these parameters creates a valid example of a specific distribution from the respective family)
62
Bernoulli distribution family
any random process with only two possible outcomes, which are denoted X = 0 and X = 1 Form of the probability density: f(x) = p^x (1-p)^(1-x) p = parameter representing the probability that X = 1 occurs
63
Exponential distribution family
a random process in which we measure the time X between some events that occur independently but on average at the same constant rate Form of probability density: f(x) = 𝝀e^(-𝝀x) 𝝀 = parameter representing the constant rate at which events occur on average
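The two density formulas above can be checked numerically in R; the comparison with the built-in dexp is a sanity check, and the parameter values are made up for illustration:

```r
# Bernoulli: f(x) = p^x * (1-p)^(1-x)
p = 0.3
p^1 * (1 - p)^(1 - 1)    # 0.3: probability of X = 1 equals p

# Exponential: f(x) = lambda * e^(-lambda * x)
lambda = 2
x = 1.5
manual = lambda * exp(-lambda * x)

# matches R's built-in exponential density function
isTRUE(all.equal(manual, dexp(x, rate = lambda)))   # TRUE
```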
64
Distributions with discrete RVs
can assume a countable number of values
65
Distributions of continuous RVs
can assume values in a continuum of numbers (not countable)
66
Random sample
a collection of random variables that are independent but share the same distribution
67
Expected Value
the average of all possible values X, each weighted by their respective probability
68
Variance
measures the spread of X around its mean (i.e. central value); variance has a close connection to uncertainty (if it is large the values are widely scattered, if it is small they are not)
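A small sketch computing an expected value and variance for a made-up discrete distribution, directly from the two definitions above:

```r
vals  = c(0, 1, 2)          # possible values of X
probs = c(0.25, 0.5, 0.25)  # their probabilities (sum to 1)

# expected value: average of all values, weighted by their probabilities
EX = sum(vals * probs)            # 1

# variance: probability-weighted squared spread around the mean
VarX = sum((vals - EX)^2 * probs) # 0.5
```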
69
Negative Binomial distribution
Assumes: the response variables are discrete counts with overdispersion of outcomes (i.e. their variances tend to be larger than their expected values, which is predominantly the case in gene expression data) Allows: factoring sample signal intensities into the probability distribution of counts
70
Generalized Linear Model (glm)
a modelling framework suitable to investigate the effects of predictor variables on outcomes following, for example, a negative binomial distribution Able to perform estimation and hypothesis tests for commonly encountered probability distributions Can handle outcome variables that are discrete or continuous and that are constrained to specific intervals GLMs → express some fxn of the mean response as a linear combination of predictors: g(E[Y]) = alpha + (Beta1)(X1) + (Beta2)(X2) + ...
71
Linear models
express the mean of the response as a linear combination of predictors: E[Y] = alpha + (Beta1) (X1) + (Beta2)(X2)... Beta = coefficient X = predictor
72
Quantifying DNAm
quantify via beta values (can be obtained from DNAm microarrays), M = methylated/blue/high beta value, U = unmethylated/yellow/low beta value and black = in between
73
Deconvolution
estimating cell type proportions for each sample by using their methylation beta values at the reference set of CpGs and by using the reference signatures for each cell type Assumption → for each cell type the unknown beta values at the reference set of CpGs are approximately stable in our sample population, if reference signatures (beta values) for each cell type are already known from previous studies
74
Beta value approach
one of two strategies for analyzing the data; beta values are bounded between 0 and 1 and their variance tends to change in different sub-populations Popular approach: ignore these issues and fit a linear model assuming beta values follow an approximately normal distribution Debate → is this really the best statistical model to analyze the data? Outcome: Yj Outcome following normal distribution? → hard to justify, generally not satisfied Biological interpretation → easy to interpret, biologically meaningful Performance → linear model identifies substantial differences well but p-values are questionable Graph → skewed right
75
M-value approach
second of two strategies for analyzing the data; transforms beta values into so-called m-values, which behave more like a normal distribution, THEN fits the m-values in a linear model Outcome: mj = log2( Yj / (1 - Yj) ) Outcome follows normal distribution? → yes, approximates the outcome well Biological interpretation → difficult Performance → linear model performs well in most cases Graph → slightly skewed left
76
Operational Taxonomic Units (OTUs)
are groups of closely related species, based on taxonomy. They can be counted and analyzed at different levels of the hierarchy Higher level → includes more species, is more accurate and less specific Lower level → includes fewer species, is less accurate and more specific Common result → OTU count tables Compositional → total sample counts are arbitrarily fixed and differ Rows = genus Columns = samples # in cell = how often an OTU was observed in a given sample
77
Beta diversity
pairwise differences in diversity between samples Ex → patterns of OTU counts are more similar among samples within group A and within group B than when comparing a sample from group A to a sample from group B
78
Unweighted UniFrac
Measures the proportion of genetic change (in sample composition) that is unique to the evolution of either sample; only concerned with how the genetic content of two samples changes with respect to presence or absence of species/OTUs Do OTUs with non-zero counts in sample A tend to have very different genetic features than OTUs with non-zero counts in sample B? → large distance
79
Weighted UniFrac
How much is genetic change (in sample composition) associating with differences in relative abundance between two samples? Highly abundant OTUs in sample A have very different genetic features than highly abundant OTUs in sample B? → large distance
80
Rarefaction
treats each sample as a bag and each observed OTU count as a colored marble; we then randomly draw marbles from these bags to adjust for sequencing depth We choose a total ___________ count (R) that is smaller than the smallest observed total sample count → then, for each sample, we randomly draw R OTUs from the set of observed OTUs in the respective sample without replacement → as a result each OTU receives a new adjusted count such that the total sample count (R), and thus the underlying sequencing depth, is fixed to the same value in each sample
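A minimal rarefaction sketch for a single made-up sample, using sampling without replacement (OTU names, counts and the chosen depth are all made up):

```r
set.seed(1)                                   # reproducible draws

counts  = c(otuA = 5, otuB = 3, otuC = 2)     # observed OTU counts for one sample
marbles = rep(names(counts), times = counts)  # the "bag of marbles"

R_depth = 6                                   # chosen rarefied total count R
draw    = sample(marbles, size = R_depth, replace = FALSE)

# new adjusted count per OTU; total is now fixed to R for this sample
rarefied = table(factor(draw, levels = names(counts)))
sum(rarefied)   # 6
```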
81
PERMANOVA
tests for associations of 2-dimensional data matrices (samples are columns and rows are multiple outcome variables) with predictor variables based on dissimilarity scores Compares two quantities: Within-group variation (Sw) → sum of squared dissimilarity scores between subjects with the same predictor values Between-group variations (SB) → sum of squared dissimilarity scores between subjects with different predictor values
82
Read
reads are the primary output of most next-generation sequencing; a read is a short sequence of letters representing a subsequence of ribonucleotides which was observed in a given sample
83
Pairwise sequence alignments
the process of arranging two sequences of symbols in such a way that exposes their regions of similarity (usually a reference genome is used to ID where a read originated from)
84
Expression
they are expressed as instructions on how to modify sequence 1 symbol by symbol to turn it into sequence 2 and vice versa | = a match * = substitution _ = deletion/insertion
85
heuristic alignments
trade accuracy for speed and often rely on assumptions about genomic structure
86
exact alignments
always guaranteed to find the optimal alignment of 2 sequences, but can be very slow when one or both of the two sequences are very large Types of exact pairwise alignment → perfect sequence matching, global alignment, local alignment
87
Perfect sequence matching
only interested in finding out whether a shorter target sequence is a perfect subsequence of another, longer sequence (if the shorter sequence differs by even a single symbol from the corresponding part of the longer sequence we do not consider them a match)
88
Global Alignment
aka "End-to-end" alignments; impose that all symbols in both sequences must be incorporated into the alignment and do not allow subsequences of poor similarity to be omitted; this makes them a powerful tool to contrast differences between two sequences of roughly similar length. These struggle with identifying regions of high similarity originating from two sequences that, except for those regions, are otherwise very different from each other AND they struggle with identifying the optimal region in a very long sequence to which a small sequence maps or aligns with high similarity
89
Needleman-Wunsch algorithm
gold standard (for global alignments), guaranteed to find optimal global alignments while also achieving efficient run-times due to a dynamic programming strategy, still widely used in modern research, "high-tech" but "easy" to implement and understand
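A minimal sketch of the Needleman-Wunsch dynamic programming idea (not the course's exact implementation); it assumes a simple scoring scheme (match +1, mismatch -1, gap -1) and computes only the optimal global score, without the pointer matrix or traceback:

```r
nw_score = function(s1, s2, match = 1, mismatch = -1, gap = -1) {
  a = strsplit(s1, "")[[1]]
  b = strsplit(s2, "")[[1]]
  n = length(a)
  m = length(b)

  # dynamic programming matrix (dpm), one extra row/column for the empty prefix
  dpm = matrix(0, nrow = n + 1, ncol = m + 1)
  dpm[1, ] = (0:m) * gap   # first row: aligning a prefix of s2 against gaps
  dpm[, 1] = (0:n) * gap   # first column: aligning a prefix of s1 against gaps

  for (i in 1:n) {
    for (j in 1:m) {
      s = if (a[i] == b[j]) match else mismatch
      dpm[i + 1, j + 1] = max(dpm[i, j] + s,        # diagonal: (mis)match
                              dpm[i, j + 1] + gap,  # up: gap in s2
                              dpm[i + 1, j] + gap)  # left: gap in s1
    }
  }
  dpm[n + 1, m + 1]   # optimal global ("end-to-end") alignment score
}

nw_score("GATT", "GATT")   # 4: all matches
nw_score("GATT", "GAT")    # 2: 3 matches and 1 gap
```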
90
Local Alignment
aims to ID the most similar regions within sequences; dynamically decides how large subsequences within the two target sequences should be in order to maximize local similarity; zooms in on one part of the first sequence and one part of the second sequence to find subsequences that are similar enough
91
Smith-Waterman (SW) algorithm (1981)
gold standard for local alignments; guaranteed to ID the optimal local alignment and is a direct extension of the Needleman-Wunsch algorithm (only needs a few small adjustments)
92
Changes made to NW to make SW
1) Initialize the first row and first column of the dpm (dynamic programming matrix) with the value 0 in each cell 2) Initialize the first row and first column of the pointer matrix with the value 8 in each cell (the pointer value 8 = "no-previous-position"; if a pointer cell contains the value 8 then the optimal path traversing position (i,j) does not have a previous position, in other words any optimal path including position (i,j) has to start at said position) 3) When moving through the dpm step by step and calculating scores for dpm[i,j], a third option is added on top of coming from the diagonal, left and up directions 4) If the best possible similarity score of these 3 directions is less than 0, then we assign dpm[i,j] to be 0 and save the "no-previous-position" value 8 at coordinate (i,j) of the pointer matrix
93
Things the SW accomplishes
Terminate a path at any point if moving forward can only lead through regions with a large number of mismatches Allow optimal paths to start at any appropriate point in the dpm if doing so maximizes local similarity Allow multiple optimal paths with the same degree of similarity to be identified at the same time
94
differences in local vs global alignment
1) Any alignment can be represented as a path through the dpm 2) In contrast to global alignments, an alignment of two sub-sequences doesn't have to start at the top left corner (but can start at a different cell) and doesn't have to end at the bottom right corner (but can end at a different cell)
95
Scoring Systems
both global and local alignments can extend and adjust their scoring systems to match other research settings and ID different types of similarity patterns When aligning AA sequences to evaluate proteins it is no longer meaningful to just consider matches and mismatches Why → there are many ways to mismatch or substitute symbols for each other; certain substitutions correspond to more similar sequences than others (either evolutionarily or structurally) and certain matches correspond to higher similarity than others TAKE HOME MESSAGE: before performing pairwise alignments researchers should carefully consider the alignment algorithm and scoring system and base their choices on best-practice guidelines to make sure they are identifying the correct types of similarity patterns
96
Multiple Sequence Alignment (MSAs)
are very powerful (able to relate two sequences to each other that would otherwise appear to be very different, through their shared relatedness to other sequences); compare multiple (3 or more) sequences at the same time; are able to ID subsequences that are highly similar among multiple candidates in the set of aligned sequences; often try to ID a consensus sequence among a set of related sequences (the consensus sequence is constructed so that, on average, all aligned sequences show the smallest difference from it, which allows researchers to analyze variability of individual sequences from a shared group/family consensus) Uses: to compare DNA, RNA or protein sequences that are suspected to be evolutionarily related (help quantify their degree of relatedness), identify homologous genes that trace back to a common ancestor and identify conserved elements in genomic regions (for example: transcription factor binding sites)
97
Exact global MSAs
can be obtained through a dynamic programming strategy (analogous to the NW algorithm), but instead of a 2-dimensional dpm this strategy uses an M-dimensional dpm when aligning M sequences; as a consequence exact global alignments become very slow when M is moderately large or the aligned sequences are moderately long NOTE: in most research scenarios heuristic MSA methods are the only feasible option, and there are many different types of heuristic MSA methods (each relies on different assumptions and incorporates different pieces of external info)
98
Progressive MSA algorithms
most commonly used strategy in research Examples: ClustalW, Clustal Omega, T-Coffee, PSAlign Steps → they work by performing the following 5 steps 1) Start by performing all possible pairwise alignments between the sequences to be compared 2) Based on the pairwise degrees of similarity, sequences are organized into a tree that represents their relatedness (the closer two sequences are in the tree, the more related they are) 3) Starting with the alignment of the two most similar sequences, the MSA is then progressively updated by incorporating one sequence at a time 4) Which sequence is incorporated in each step is determined by a rule that systematically collapses the "relatedness tree" 5) After all sequences have been incorporated the MSA is finished
99
Database
a collection of electronically stored data Contains: The raw information itself (binary digits, numbers, text, etc), Info about the structure of different pieces of data, Info about the relationships between different pieces of data
100
Database Importance
Whenever data is accessed, generated or modified in one or more of the following ways, a DBS can become valuable: over long periods of time, by many different entities, with different permissions and ownerships, in large volumes
Without adequate care, the complexity arising from these scenarios can quickly start to cause problems such as data safety concerns, faulty data and physical disk space limitations
Since any user can only interact with the database through the DBMS, we can control operations on the data very carefully and minimize or prevent problems and unintended consequences
Frequently encountered in biomedical research when storing high data volumes (NGS data, large scale projects, ...), accessing experimental data from public databases (EMBL, GenBank, GEO, KEGG, ...), and accessing and recording PHI
Technical knowledge about databases can help researchers better interact with the various data sources, better organize their own experimental data, and design experiments and data input forms that meet study requirements and are sufficient to answer research questions
101
Database Management System (DBMS)
the primary software system through which users create, modify and request info from the database
102
Database system (DBS)
the union of the database, the DBMS and other associated software that interacts with the database NOTE: the DBS is often informally referred to as "the database"
103
Database language
a programming language that allows users to define operations that they would like the DBMS to perform NOTE: in order for a database language to be applicable to a DBS it has to be supported by the DBMS and assume the same structure as the database
104
Client-Server Architecture
a DBS can use one of several possible architectures (each with different strengths and weaknesses); many modern architectures follow the "client-server model"
Server side: a computer, computing cluster or cloud system that stores the database and runs the DBMS software
Client side: an often remotely located computer or terminal communicating with the server side through a network or the internet; a user uses client software to send requests to the server side
On the server, the DBMS handles requests, communicates with the users and performs operations on the database
105
Relational databases
the most common type of database, widely applied throughout all industries; they serve as a great model for understanding the challenges any type of modern database has to address and the benefits databases provide
106
Structured query language (SQL)
the most commonly used database language for relational databases Examples of relational DBMSs that primarily use SQL: "MS SQL Server", "IBM DB2", "Oracle", "MySQL" and "Microsoft Access"
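A minimal sketch of SQL in action, using the sqlite3 module from Python's standard library (SQLite speaks a close dialect of SQL; the table and column names here are made up for illustration):

```python
import sqlite3

# An in-memory database: nothing is written to disk
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Define a relation with a primary key, then add two tuples
cur.execute("CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO subjects VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])

# A query: SELECT chooses attributes, WHERE filters tuples
rows = cur.execute("SELECT name FROM subjects WHERE subject_id = 2").fetchall()
print(rows)  # [('Bob',)]
```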
107
Relational database model
Data is stored and presented as a collection of "relations"
This model enables the creation of a wide array of different data structures, including one-to-one, one-to-many and hierarchical data structures
Relation: a table with a unique name that stores a set of tuples (rows) that share the same types of attributes (columns)
Each relation has to contain a primary key → either an attribute or a combination of multiple attributes that uniquely identifies each tuple in the relation; the primary key of a specific tuple is often chosen to be a unique identification number or character string
Each attribute (column) satisfies the following: has a name that is unique within the relation, has a single well defined domain/data type (number, text, object, ...), and its position does not carry info
Each tuple (row) satisfies the following: carries info represented by values of attributes belonging to a specific entity, its position does not carry information, and each tuple is unique (there can be no duplicate tuples)
108
Relational Database Model Rules
Non-primary-key attribute cells may contain missing values (denoted NA, which is also used in R for missing data)
Data relationships are represented by foreign keys (attributes that refer to primary keys in other tables)
A relation may contain anywhere from none up to arbitrarily many foreign keys
If a foreign key value is not missing, it has to reference an existing primary key value
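These foreign-key rules can be demonstrated with sqlite3 from the Python standard library (note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; the table names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
con.execute("CREATE TABLE treatments (treatment_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    treatment_id INTEGER REFERENCES treatments(treatment_id))""")

con.execute("INSERT INTO treatments VALUES (1, 'placebo')")
con.execute("INSERT INTO orders VALUES (10, 1)")     # valid: FK references an existing PK
con.execute("INSERT INTO orders VALUES (11, NULL)")  # allowed: a FK value may be missing (NA)
try:
    con.execute("INSERT INTO orders VALUES (12, 99)")  # rejected: no treatment 99 exists
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```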
109
Database relations versus views
Databases are constructed to minimize problems, optimize performance and save storage space
Example: if the "orders" relation incorporated the full treatment and subject info for every single subject, there would be a large amount of redundancy that unnecessarily increases the required storage space; referring to treatments and subjects by foreign keys is a much more efficient use of physical disk space
Incorporating the full information about treatment and subject also makes it easy for inconsistencies to arise (ex: if a subject's name were misspelled in one of the tuples in the table)
However, the tables representing the relations stored in a database are often not of primary interest to end users; most users tend to be interested in combinations of subsets of database relations
110
View
a table generated from one or more relations in the database, that contains some or all of their content Users can request views of the database from the DBMS, which will fetch and assemble the respective pieces of information from the raw database
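A sketch of a view using sqlite3 (the tables and column names are hypothetical): the view is assembled from the underlying relations on request rather than being stored as duplicate data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     subject_id INTEGER, procedure_name TEXT);
INSERT INTO subjects VALUES (1, 'Alice');
INSERT INTO orders VALUES (10, 1, 'MRI');

-- A view combining two relations; nothing is duplicated on disk
CREATE VIEW order_details AS
  SELECT o.order_id, s.name, o.procedure_name
  FROM orders o JOIN subjects s ON o.subject_id = s.subject_id;
""")

# The DBMS fetches and assembles the pieces when the view is queried
print(con.execute("SELECT * FROM order_details").fetchall())
```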
111
Data Anomalies
problematic situations that can arise while a DBS is operating and info in the database is updated, these often occur due to flaws in the design of a database
112
Insertion Anomaly
a situation in which new data can only be added to the database by introducing missing values, or otherwise cannot be added at all Ex: if we want to add a new medical procedure that will be offered in the future, we have to do so by assigning missing values to the subject info attributes
113
Deletion Anomaly
a situation in which deleting data that represents a certain type of info necessarily also deletes data that represents a different type of info Ex: if we remove a medical procedure that is no longer offered, the info about all subjects that only received this procedure will also be deleted NOTE: we want to delete the procedure but not the subjects
114
Modification/Update anomaly
a situation in which identical pieces of info are expressed in multiple tuples and updating only some of them leads to inconsistencies, it's hard to verify the real value of the data
115
Database Design Goal
A well designed database is able to store the desired info and represent all relevant relationships between different pieces of data, and it prevents serious complications before they can occur
To achieve this, a database specialist first collaborates with a client or domain expert to create a draft of the database
The draft is then refined through database normalization, in which the structure of the unnormalized database is modified step by step to meet more and more stringent criteria
116
First normal Form (1NF)
Specifies the following: every attribute is atomic (i.e. it contains a single value of a specified data type and no group/list of values); this is already a formal requirement for relational databases To achieve: define a primary key (possibly consisting of two or more attributes) Undesirable properties remain: certain non-key attributes depend on only part of the primary key whereas others depend on all of it (a glaring source of redundancy and potential error)
117
Second Normal Form (2NF)
Specifies the following: all non-key attributes (in each relation) are functionally dependent on the entire primary key, and the database satisfies 1NF NOTE: attributes now depend only on the primary key, but transitive dependencies may remain: three attributes (A, B, C) have a transitive dependency if A depends on B and B depends on C (a source of redundancy and potential error)
118
Third Normal Form (3NF)
Specifies the following: there are no transitive dependencies among attributes, and the database satisfies 2NF The structure resembles the multiple tables presented in the previous lecture NOTE: a relational database is considered normalized if it satisfies 3NF; this is generally considered a well designed database, free of most anomalies There are also additional, more stringent normal forms Ex: BCNF (requires 3NF), 4NF (requires BCNF), 5NF (requires 4NF)
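As a sketch of what 3NF buys, consider a hypothetical visits table where city depends on zip and zip depends on the visit: moving (zip, city) into its own relation removes the transitive dependency (all table and column names here are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- 3NF: (zip -> city) lives in its own relation instead of being
-- repeated in every visit tuple
CREATE TABLE zip_codes (zip TEXT PRIMARY KEY, city TEXT);
CREATE TABLE visits (visit_id INTEGER PRIMARY KEY,
                     zip TEXT REFERENCES zip_codes(zip));
INSERT INTO zip_codes VALUES ('30303', 'Atlanta');
INSERT INTO visits VALUES (1, '30303'), (2, '30303');
""")

# city is stored exactly once, so updating it cannot create an
# inconsistency between visits (no modification anomaly)
print(con.execute(
    "SELECT v.visit_id, z.city FROM visits v "
    "JOIN zip_codes z ON v.zip = z.zip").fetchall())
```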
119
Benefits of normalization
- Faster turn-around for analysis personnel and collaborators (makes it easier to enter data into existing databases and to process data with analysis software)
- Makes it more difficult for inconsistencies to occur
- For tables that store raw data from your own experiment that will be submitted to a collaborator, do the following:
- Define/pick a column that serves as the primary key; the values in this column should uniquely identify each row
- If there is no obvious choice for a primary key, it might be a good idea to create unique ID numbers for each row in a table
- If a raw data table contains sets of columns that represent different types of info, it might be valuable to split the table apart according to these types, especially if they are referenced in other tables
- Organize tables so that no duplicated info appears in multiple tables
- Whether or not these objectives are worth the effort will depend on the number of affected tables and columns and also the total data volume
120
Public Bioinformatics Resources
A key factor leading to the birth of bioinformatics was the increased volume and complexity of data that biomedical researchers were faced with analyzing to answer their research questions since the late 20th century. Since its inception, bioinformatics has heavily utilized and benefitted from databases that catalogue and share experimental data Today large-scale public database systems containing "omics" data are a cornerstone of modern clinical and biological research Strengths: facilitate fast access to stored info, and provide many online tools specialized in performing comparisons and analyses by directly communicating with different databases
121
The National Center for Biotechnology Information (NCBI)
houses a wide array of frequently used online resources; all of the provided components can be accessed either through the search bar at the top or by navigating through the blue menu options on the left Examples:
Medline: widely known in the biomedical field for its primary search engine "PubMed"
The database of genotypes and phenotypes (dbGaP): contains "data and results from studies that have investigated the interaction of genotype and phenotype in humans"
dbSNP: contains "a broad collection of simple genetic polymorphisms" (SNPs, small indels, MNPs)
ClinVar: collects info about genetic variation and its relationship to human health
GenBank
RefSeq
The Gene Expression Omnibus database (GEO)
Etc.
122
Why is NCBI such a powerful tool?
Its Entrez search system
123
Entrez
a cross-database search system that allows users to simultaneously search all public NCBI databases, supports searching for both individual character strings and logical combinations of multiple character strings, offers filters and search refinement options to continuously narrow searches down to a desired target, and incorporates many data visualization features; search queries are accessible through both web interfaces and direct interfaces with programming languages
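Entrez queries can also be issued programmatically through NCBI's E-utilities; as a sketch, the snippet below only constructs an esearch URL with the standard library (no network request is made, and the query term is just an example):

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an Entrez esearch URL; logical operators (AND/OR/NOT)
    go directly into the term string."""
    return EUTILS_ESEARCH + "?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})

# A logical combination of two search terms against PubMed
print(esearch_url("pubmed", "BRCA1 AND human[organism]"))
```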
124
GenBank
an annotated collection of all publicly available DNA sequences; houses submissions made by various entities (including labs and sequencing projects); part of the International Nucleotide Sequence Database Collaboration (INSDC), which means that it daily receives data submitted directly to it as well as nucleotide data from the DNA Data Bank of Japan (DDBJ) and the European Nucleotide Archive (ENA); this is what the majority of searches will target when using the "nucleotide" search option of the Entrez search bar
125
RefSeq
owned by NCBI and not part of INSDC; short for Reference Sequence; aims to provide a non-redundant, curated set of data representing our current knowledge of known genes; houses well annotated reference sequences of genomes, transcripts and proteins
126
Gene Expression Omnibus (GEO)
originally designed to be a public data archive for high-throughput gene expression datasets obtained from microarray and RNA sequencing studies, but later started to accept additional types of experimental data that relate to gene expression and gene regulation (DNA methylation, genome binding, protein profiling, chromosome conformation, genome copy number variation) Follows the MIAME and MINSEQE guidelines, which distill a study down to the minimum info required to describe it (MIAME = minimum info about a microarray experiment, MINSEQE = minimum info about a next-generation sequencing experiment)
127
GEO2R
an interactive web tool that performs comparative analyses between groups of processed samples using the "GEOquery" and "limma" R packages from the Bioconductor project
128
GEO interfacing capabilities with R
GEOquery allows users to directly download and import GEO datasets into R with a single command In combination with other Bioconductor packages (such as limma, DESeq2, etc.) this allows for creating complete analysis workflows pooling data from various studies natively within R
129
European Molecular Biology Laboratory (EMBL)
"Europe's flagship laboratory for the life sciences"; an intergovernmental organization with 27 supporting countries, most European, though Australia and Argentina also hold associate member status
130
European Bioinformatics Institute (EMBL-EBI)
the part of EMBL that focuses on performing bioinformatics research, developing and providing bioinformatics services, and providing a large number of public databases and bioinformatics tools that can be accessed online via their website
131
Databases Associated with EBI
• UniProt: information about protein sequences & biological function of proteins
• PDBe: "the European resource for the collection, organization and dissemination of data on biological macromolecular structures"
• Expression Atlas: large database holding information about gene expression signatures and differential expression
• Ensembl
• and many more ...
132
Online tools provided on EBI website
* ClustalOmega * BLAST * HMMER: alignments geared towards identifying homologous protein or nucleotide sequences * Ensembl Genome Browser * and many more ...
133
Ensembl genome database project
a project of both EBI and the Wellcome Trust Sanger Institute
• Public DBS containing curated genetic consensus information
• Analogously to NCBI's RefSeq, it houses well annotated reference sequences of genomes, transcripts and proteins
• "Automatically annotates genomes" of specific model organisms and "integrates this annotation with other available biological data"
• High degree of transparency: all data and source code used to generate annotations is publicly available
• Known for high quality annotations and the Ensembl genome browser
134
Ensembl IDs
identifiers used by Ensembl to refer to different types of information; can be understood as primary keys for their database entries; used in many studies outside of Ensembl
• Homo sapiens format: "ENS objectType numberString.version"
• objectType: consists of multiple letters denoting different types of data
• numberString: a string of digits of various length
• version: a version number
NOTE: both Ensembl and RefSeq IDs are frequently used; the same gene could be referred to with either type of ID
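A sketch of splitting a human Ensembl ID into its parts with a regular expression (the exact pattern, the 11-digit number string, and the version number shown are assumptions for illustration):

```python
import re

# Assumed layout: "ENS" + objectType letters + 11-digit number + optional ".version"
ENSEMBL_ID = re.compile(
    r"^ENS(?P<type>[A-Z]+)(?P<number>\d{11})(?:\.(?P<version>\d+))?$")

# TP53's gene ID, with an illustrative version suffix
m = ENSEMBL_ID.match("ENSG00000141510.16")
print(m.group("type"), m.group("number"), m.group("version"))
```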
135
Genome Browsers
These browsers allow researchers to dynamically visualize genomes alongside annotated data; they display nucleobase positions of chromosomes on the X-axis and use the Y-axis to display information as a function of base position. Annotations may include:
• Genes (both validated and predicted) and transcript variants
• Structural and regulatory elements (both validated and predicted)
• Read counts, expression levels and methylation levels
• Information about sequence variation in different populations
• ...and more
136
Weakness of genome browsers
They can be fairly overwhelming when starting to work with them (more or less intuitive based on the user's preferences)
137
Three most extensive genome browsers
Ensembl, NCBI GDV and UCSC Genome Browser
138
UCSC Genome Browser
hosted by the UCSC Genomics Institute
• Available both as an online browser and as a downloadable desktop application
• Offers a wide variety of annotation tracks sourced from many different databases
• Allows uploading data and displaying it in "custom tracks" in various formats