Bioinformatics Exam 3 Review Flashcards

1
Q

Programming

A

helps in collecting and manipulating data, automating analysis workflows, minimizing human error, generating reproducible reports (so others can see exactly what you did), quickly processing large datasets and repetitive tasks, and visualizing and making sense of the data

2
Q

Programming language

A

letters/symbols form words according to rules; a language for humans to formulate instructions for computers to generate some desired output. Compiler and interpreter software translate instructions formulated in a programming language into executable machine-level operations.
Pathway: instruction (in your mind) → instruction in programming language → instruction in machine-level language → execution of instruction/computation → generated output

3
Q

Source Code

A

a set of instructions formulated in a programming language that is readable by humans

4
Q

Program

A

a set of instructions stored in a form that can be executed by a computer

5
Q

Compiler

A

software that translates source code into a machine-level program that is (usually) efficiently optimized for the machine it is compiled for
Time to translate → Slow
Time to execute → Fast

6
Q

Interpreter

A

translates source code scripts into machine level operations “on the fly” and executes them line by line
Time to translate → Fast
Time to execute → Slow

7
Q

1976

A

Chambers, Becker and Wilks develop the S statistical programming language at Bell Laboratories
Aim: facilitate quick transitions from idea to software
This interpreter-based language made modifying, testing and troubleshooting programs quick and convenient.

8
Q

1993

A

Ihaka and Gentleman re-implement S and name it the “R programming language”

9
Q

1995

A

it is decided that R will be made freely available under the GNU General Public License (but it is not yet officially released)

10
Q

1997

A

the R Core Group is founded and starts taking control of R’s further development; the Comprehensive R Archive Network (CRAN) is launched, enabling sharing and curation of user-developed components that extend R’s capabilities

11
Q

2000

A

R version 1.0.0 is released to the general public

12
Q

2009

A

New York Times article: “Data Analysts Captivated by R’s Power” by Ashlee Vance
Good description of how R makes a difference → Daryl Pregibon (Google): “it allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems”

13
Q

2017

A

a study found that R has shown extreme growth

14
Q

2019

A

Another study found that R is the most requested programming language

15
Q

Comprehensive R Archive Network (CRAN)

A

a network of ftp and web servers storing versions of code and documentation for R. This serves as the main general-purpose repository for R packages; for common problems, a pre-made package often already exists that can be used instead of solving the problem from scratch.

16
Q

R

A

language and environment for statistical computing and graphics, open source language that is free, provides tools for statisticians, data miners, data analysts, data scientists and academic researchers

17
Q

Bioconductor

A

another R package repository; free, dedicated to the analysis of genomic data and biological high-throughput assays, with a primary focus on serving the needs of bioinformaticians and biomedical researchers
Packages available: >1800
Mission: accessibility of powerful analysis and visualization tools, reproducible research, rapid development of software components that are both scalable and compatible with each other

18
Q

Commands in R

A

R’s interpreter can process 2 forms of these → expressions and assignments; these can be separated by line-breaks or the “;” character, and individual components within commands can be arbitrarily separated by spaces and tabs

19
Q

Expressions

A

commands that are evaluated, printed (optionally), and whose output is then lost; these take some input arguments or values and return some output values

20
Q

Operators

A

are generally expressed via 1 to 3 consecutive special characters and often handle fundamental, essential programming tasks; several other operators handle tasks such as logic or comparison
Example: ? opens a webpage with helpful documentation and explanations of a function

21
Q

Objects

A

individual pieces of data that have two major attributes:
Data type: what type of information it contains
Value: the actual information that it contains
NOTE: internally the value of an object is just a bunch of zeros and ones in the memory of the computer; the data type is what tells R how to interpret and display the value of the object.

22
Q

Scalar and multidimensional data types

A

the two fundamental classes of data types

23
Q

Character Objects

A

display letters, words and text, wrapped in quotation marks

24
Q

Logical Objects

A

only two possible values (yes and no / true and false, abbreviated T and F); used when you want to check or remember whether or not something is true or has happened while a program runs.

25
Numerical Objects
integers and decimals
26
Parenthesis
used to group arguments of expressions in conjunction with commas and can be used to control the order of operations in expressions; the expression enclosed within the innermost parentheses is always evaluated first
27
Assignment Commands
commands that evaluate an expression and store the result so that it can be accessed again in the future. These store objects in variables: the expression on the right-hand side can be an R object or any valid expression that creates an R object, and the left-hand side is a variable, which can be understood as a label or name attached to an object so that R remembers it.
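A minimal sketch of the expression/assignment distinction (the variable name x is made up for illustration):

```r
# An expression: evaluated, its value printed, and the result is then lost
1 + 2

# An assignment: the result is stored under a name (a variable)
x = 1 + 2

# The stored object can be accessed again in later commands
x * 10   # 30
```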
28
R Console
an interactive interface for the command line interpreter (it comes up when you open R); commands can be typed into the console and executed by hitting the Enter/Return key
29
Incomplete Commands
expressions that have not provided necessary right-hand side arguments or expressions that have not (yet) closed all their opened parentheses or brackets.
30
Negation/ "NOT" operator
done with an exclamation point, this turns true into false and false into true (opposites)
31
Logical "AND"
takes two logical expressions/variables and returns TRUE ONLY if BOTH are TRUE, otherwise it will return FALSE
32
Logical "OR"
returns TRUE if either the left or the right expression is true; otherwise it returns FALSE. Operator: |
33
"Equal to" Operator
takes R objects (or expressions generating an R object) and returns TRUE if both objects are identical to each other. Operator: == (Two consecutive equal signs)
34
"Not equal to" Operator
takes R objects (or expressions generating an R object) and returns TRUE if the objects are not identical to each other Operator: !=
35
Inequality Operators
compare numerical objects to each other | Operators: < (less than), <= (less than or equal to), > (Greater than), >= (greater than or equal to)
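A quick sketch of the logical and comparison operators from the cards above:

```r
# Negation / "NOT": turns TRUE into FALSE and vice versa
!TRUE          # FALSE

# Logical "AND": TRUE only if BOTH sides are TRUE
TRUE & FALSE   # FALSE

# Logical "OR": TRUE if either side is TRUE
TRUE | FALSE   # TRUE

# Equality, inequality and comparisons
3 == 3         # TRUE
3 != 4         # TRUE
2 <= 2         # TRUE
5 > 7          # FALSE
```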
36
If blocks
how to make logical expressions useful; help us execute pieces of code conditionally and react to different inputs/scenarios while the program is running Syntax: if (condition){ # conditional lines of code go here } where "condition" is a logical expression or variable. If the condition is met, the code block in curly brackets is executed; if it is not met, the block is skipped.
37
If...else statements
allow us to conveniently cover two mutually exclusive cases (i.e. if one is true the other is false) Syntax: if (condition){# if "condition" is TRUE then do...} else {# if "condition" is FALSE then do...}
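A minimal if...else sketch (the variable name and messages are made up for illustration):

```r
x = 10
if (x > 5) {
  msg = "x is greater than 5"   # runs when the condition is TRUE
} else {
  msg = "x is 5 or smaller"     # runs in the mutually exclusive case
}
msg   # "x is greater than 5"
```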
38
Vectors
creating and manipulating simple, ordered lists of a specific scalar data type Three components: scalar data type, ordered cells, values Created through the concatenation function c() All elements of a vector must have the same data type (you CANNOT create a vector that contains both characters and numbers); a mixed-type vector is not possible
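A small sketch of vector creation; note that when types are mixed, R silently coerces everything to a common type (here character) rather than raising an error:

```r
v = c(1, 5, 9)     # numeric vector built with the concatenation function c()
length(v)          # 3 ordered cells

mixed = c("a", 1)  # no error, but R coerces everything to character
class(mixed)       # "character"; the number 1 became the string "1"
```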
39
Multi-dimensional Data types
more complex data types that are able to contain and arrange multiple scalars (vectors are the simplest of these types)
40
Binary operators
require that either both arguments are vectors of the same length or one of the two arguments is a single object (i.e. a vector of length one) If the lengths differ → R will give a warning and "recycle" the shorter vector from the beginning to extend its length
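A small recycling sketch; note that R only emits the recycling warning when the longer length is not a multiple of the shorter one, so the example below recycles silently:

```r
# shorter vector c(10, 20) is recycled from the beginning
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24

# a single object (vector of length one) is applied to every element
c(1, 2, 3) * 2              # 2 4 6
```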
41
Subset operator
accepts vectors of logical expressions (the logical vector has to be the same length as the vector we want to subset); accesses specific elements in the vector and returns a new vector with said elements Symbol: [...]
42
Matrix
works just like a vector but has 2 dimensions (cells have an x- and y-position), instead of just one
43
Array
the abstraction of both vectors and matrices; a multidimensional matrix with an arbitrary number of user-specified dimensions A 1-dimensional R ______ is equivalent to an R vector and a 2-dimensional R _______ is equivalent to an R matrix
44
List
are an extension of the vector idea; a generic collection of R objects Each element in a ______ is an R object with an arbitrary data type and dimension (different lengths as well as complex objects are also allowed) Useful to group various types of data that belong together (where they do not conveniently fit into a single table) Can also use the assignment operator inside the ______ function to give elements names by which they can be accessed in the future Syntax of the _____ function: _____( arg1, arg2, arg3...)
45
Extraction operator
will access a specific R object in the list and return the said object directly. Symbol: [[...]] or $
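A small sketch of the extraction operators, using a made-up list:

```r
person = list(name = "Ada", scores = c(90, 85))

person$name          # "Ada": extract by name with $
person[["scores"]]   # the numeric vector itself, via [[...]]
person["scores"]     # note: SINGLE brackets return a list containing the element
```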
46
Wrappers
any entity that encapsulates another entity; an object that holds other objects or a "container" for other objects. Lists are generic wrappers for R objects; you probably won't work directly with lists in this course (lists allow us to form a general intuition about other generic _________) Generic __________: three types → S3, S4 and S5 objects
47
Loops
help us perform repetitive tasks, reduce redundancy of code and reduce the amount of code required to perform a task. Note that R only has "for each" loops and "while" loops. The index variable can help us execute the same piece of code for different inputs.
48
"For" loops
repeat something "N" times (we choose N) | Structure: for( i in 1:n){# repeat the following code...}
49
"For each" loops
repeat something for each element in a set Structure: for(i in elements){# repeat the following code...}
Header → defines an index variable (here named i) and an R vector of "elements" for each of which we want to repeat something
Body → the code block that will be repeated
Can be read as: for each "i" in a set of "elements", do this
What it does (sequence of events): 1) sets the index variable "i" to the 1st object in "elements" 2) executes the code inside the body 3) sets the index variable "i" to the 2nd object in "elements" 4) executes the code inside the body 5) repeats for all objects in "elements"
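A minimal "for each" sketch (the vector of elements is made up for illustration):

```r
total = 0
for (i in c(2, 4, 6)) {   # i takes the value 2, then 4, then 6
  total = total + i       # body is executed once per element
}
total   # 12
```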
50
Functions
important form of expression; take arbitrarily many arguments (0, 1, 2, 3, ...), usually perform more complex tasks and return some desired output. R has pre-defined functions and allows users to create their own; after a fxn has been created it can be used as a shorthand to run the code encapsulated inside of it. NOTE: fxns have "local scope" whereas code outside of functions has "global scope" (meaning variables inside of a fxn are created independently and separately from the R environment outside the function); R will discard any assignments made inside of a function after it has been executed. When a fxn has multiple input arguments they will be assigned in order when the fxn is executed (NOTE: the order can be arbitrarily changed if arguments are referred to by their names); default values for input arguments can be set by using the assignment operator next to input variables in the header line of the function. Syntax: nameOfFunction(arg1, arg2, arg3, ...) my_function = function(arg1, arg2, arg3, ...){ # function body goes here... } Header: defines the name of the function on the left of the assignment operator and the names of input variables/arguments of the function (these variable names are provided in the parentheses) Body: the code block that will be executed when "my_function" is used
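A minimal sketch of a user-defined function with a default argument (the function name raise is made up for illustration):

```r
# default value for 'power' set via "=" in the header line
raise = function(base, power = 2) {
  result = base ^ power   # local variable, discarded after the call
  return(result)
}

raise(3)                    # 9: default power = 2 used
raise(3, 3)                 # 27: arguments matched by position
raise(power = 3, base = 2)  # 8: order changed by naming the arguments
```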
51
Matrix function
my_matrix = matrix( data = ?, nrow = ?, ncol = ?, byrow = ? ) Input arguments: "data" → a vector of values or objects that will be put into the cells of the matrix; "nrow" → the number of rows in the matrix; "ncol" → the number of columns in the matrix; "byrow" → logical, if TRUE values populate the table in order row-by-row, if FALSE values populate the table in order column-by-column
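A small sketch contrasting byrow = TRUE and byrow = FALSE, plus a cbind example (object names are made up):

```r
m = matrix(data = 1:6, nrow = 2, byrow = TRUE)
m[1, 2]    # 2: values were filled row by row (row 1 is 1 2 3)

m2 = matrix(data = 1:6, nrow = 2, byrow = FALSE)
m2[1, 2]   # 3: values were filled column by column (column 1 is 1 2)

dim(cbind(m, m2))   # 2 6: columns of m2 attached to the right of m
```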
52
cbind(A,B)
aka column bind, stitches A and B together into a single matrix such that the columns of B are placed to the right of the columns of A
53
Subset
the ________ operator can also be used in combination with the assignment operator to modify specific elements inside of the matrix; a matrix can be subset and accessed using the ________ operator [rows,columns] in several different ways
54
Data Frames
the data frame is R's preferred data type to represent R × C data tables; like a matrix it has R rows and C columns and it supports all subset operations [A,B] that matrices have access to; in contrast, each column has its own associated data type and each column can have a different data type (this offers a lot of additional functionality) my_tab = data.frame( column1 = c(...), column2 = c(...), column3 = c(...), ... )
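A minimal data frame sketch with per-column data types (names and values are made up):

```r
my_tab = data.frame(
  name = c("Alice", "Bob"),   # character column
  age  = c(30, 25)            # numeric column: each column has its own type
)

my_tab$age                        # 30 25
my_tab[my_tab$age > 26, "name"]   # "Alice": matrix-style [rows, columns] subsetting
```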
55
Reading
the process of loading a file into memory Fxn: read.table; this fxn reads a simple text file containing table data into R and turns it into a data.frame object. It expects the file to contain plain text such that each line in the file represents one row of the table and columns are separated by a special character (usually a space " " or tab "\t" character); in the "people.txt" file columns are separated by the " " (space) character Syntax: my_tab = read.table( file = "...", header = TRUE, sep = "..." )
56
Writing
the process of saving data currently in a computer's memory to a specific file Fxn: write.table; can be used to write data frame objects to simple text files, e.g. to add a new column to a dataset and save the modified table into a separate file NOTE: omit quotes around objects (quote = FALSE) and do not include row names as a separate column (row.names = FALSE)
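A small round-trip sketch of write.table and read.table, using a temporary file so nothing permanent is written (the file name and table contents are made up):

```r
tab = data.frame(name = c("Alice", "Bob"), age = c(30, 25))

f = tempfile(fileext = ".txt")   # temporary file path for this demo

# write: no quotes around values, no row-name column
write.table(tab, file = f, sep = "\t", quote = FALSE, row.names = FALSE)

# read it back into a data.frame
tab2 = read.table(file = f, header = TRUE, sep = "\t")
tab2$age   # 30 25
```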
57
Installing and loading R packages
R packages that were uploaded to CRAN can be installed via the "install.packages" fxn, which accepts a package name in the form of a character object After the package is installed it can be loaded into R by using the "library" fxn Loading a package makes new commands available to the R user, which are most commonly implemented as fxns For CRAN packages, info on what's included in a package and how to use it can be found via: https://cran.r-project.org/web/packages/packagenamegoeshere/ For a specific example consult the lecture slides
58
Random Variable
a mathematical object that represents the random process behind a measurement; it can realize/assume/generate different values with different probabilities. When an experimental measurement is taken, the resulting value is considered a realization of a random variable. Parameters of a distribution represent truths/properties about the random process that generates the outcome data; this means that questions about outcome variables and the process that created them can generally be formulated as questions about distributional parameters. Events in the sample space are mapped to real numbers and then assigned probabilities. Random variables accomplish the following tasks:
Transfer questions about real-life events into questions about numbers (allow us to evaluate and manipulate any type of probabilistic event with mathematical language and powerful mathematical tools)
Provide a unified framework in which we can gain insights about properties and consequences of random processes
59
Probability Distribution
the probability distribution of an RV assigns probabilities to sets of events in the sample space (i.e. the set of all possible events); the exact probability distribution of a random variable is unknown (based on knowledge and assumptions about the sample space and the data-generating process we can assign a family of distributions to random variables)
60
Probability density function (pdf)
is a component of probability distribution, it can be loosely understood as the function that takes an individual event from a sample space and returns its associated probability
61
Families of distributions
have a pdf that contains some unknown parameters (choosing numbers for these parameters creates a valid example of a specific distribution from the respective family)
62
Bernoulli distribution family
any random process with only two possible outcomes, which are denoted X = 0 and X = 1 Form of the probability density: f(x) = p^x (1-p)^(1-x) p = parameter representing the probability that X = 1 occurs
63
Exponential distribution family
a random process in which we measure the time X between some events that occur independently but on average at the same constant rate Form of probability density: f(x) = 𝝀e^(-𝝀x) 𝝀 = parameter representing the constant rate at which events occur on average
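The two density formulas above can be checked numerically in R; the comparison with the built-in dexp is a sanity check, and the parameter values are made up for illustration:

```r
# Bernoulli: f(x) = p^x * (1-p)^(1-x)
p = 0.3
p^1 * (1 - p)^(1 - 1)    # 0.3: probability of X = 1 equals p

# Exponential: f(x) = lambda * e^(-lambda * x)
lambda = 2
x = 1.5
manual = lambda * exp(-lambda * x)

# matches R's built-in exponential density function
isTRUE(all.equal(manual, dexp(x, rate = lambda)))   # TRUE
```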
64
Distributions with discrete RVs
can assume a countable number of values
65
Distributions of continuous RVs
can assume values in a continuum of numbers (not countable)
66
Random sample
a collection of random variables that are independent but share the same distribution
67
Expected Value
the average of all possible values X, each weighted by their respective probability
68
Variance
measures the spread of X around its mean (i.e. central value); variance has a close connection to uncertainty (if it is large the values are widely scattered, if it is small they are not)
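A small sketch computing an expected value and variance for a made-up discrete distribution, directly from the two definitions above:

```r
vals  = c(0, 1, 2)          # possible values of X
probs = c(0.25, 0.5, 0.25)  # their probabilities (sum to 1)

# expected value: average of all values, weighted by their probabilities
EX = sum(vals * probs)            # 1

# variance: probability-weighted squared spread around the mean
VarX = sum((vals - EX)^2 * probs) # 0.5
```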
69
Negative Binomial distribution
Assumes: the response variables are discrete counts with overdispersion of outcomes (i.e. their variances tend to be larger than their expected values, which is predominantly the case in gene expression data) Allows: factoring sample signal intensities into the probability distribution of counts
70
Generalized Linear Model (glm)
a modelling framework suitable to investigate the effects of predictor variables on outcomes following, for example, a negative binomial distribution Able to perform estimation and hypothesis tests for commonly encountered probability distributions Can handle outcome variables that are discrete or continuous and that are constrained to specific intervals GLMs → express some fxn of the mean response as a linear combination of predictors: g(E[Y]) = alpha + (Beta1)(X1) + (Beta2)(X2) + ...
71
Linear models
express the mean of the response as a linear combination of predictors: E[Y] = alpha + (Beta1) (X1) + (Beta2)(X2)... Beta = coefficient X = predictor
72
Quantifying DNAm
quantify via beta values (can be obtained from DNAm microarrays), M = methylated/blue/high beta value, U = unmethylated/yellow/low beta value and black = in between
73
Deconvolution
estimating cell type proportions for each sample by using their methylation beta values at the reference set of CpGs and by using the reference signatures for each cell type Assumption → for each cell type the unknown beta values at the reference set of CpGs are approximately stable in our sample population, if reference signatures (beta values) for each cell type are already known from previous studies
74
Beta value approach
one of two strategies for analyzing the data; beta values are bounded between 0 and 1 and their variance tends to change in different sub-populations Popular approach: ignore these issues and fit a linear model assuming beta values follow an approximately normal distribution Debate → is this really the best statistical model to analyze the data? Outcome: Yj Outcome following normal distribution? → hard to justify, generally not satisfied Biological interpretation → easy to interpret, biologically meaningful Performance → linear model identifies substantial differences well but p-values are questionable Graph → skewed right
75
M-value approach
second of two strategies for analyzing the data; transforms beta values into so-called m-values, which behave more like a normal distribution, THEN fits the m-values in a linear model Outcome: mj = log2( Yj / (1 - Yj) ) Outcome follows normal distribution? → yes, approximates the outcome well Biological interpretation → difficult Performance → linear model performs well in most cases Graph → slightly skewed left
76
Operational Taxonomic Units (OTUs)
are groups of closely related species, based on taxonomy. They can be counted and analyzed at different levels of the hierarchy Higher level → includes more species, is more accurate and less specific Lower level → includes fewer species, is less accurate and more specific Common result → OTU count tables Compositional → total sample counts are arbitrarily fixed and differ Rows = genus Columns = samples # in cell = how often an OTU was observed in a given sample
77
Beta diversity
pairwise differences in diversity between samples Ex → patterns of OTU counts are more similar among samples within group A and within group B than when comparing a sample from group A to a sample from group B
78
Unweighted UniFrac
Measures the proportion of genetic change (in sample composition) that is unique to the evolution of either sample; only concerned with how the genetic content of two samples changes with respect to presence or absence of species/OTUs Do OTUs with non-zero counts in sample A tend to have very different genetic features than OTUs with non-zero counts in sample B? → large distance
79
Weighted UniFrac
How much is genetic change (in sample composition) associating with differences in relative abundance between two samples? Highly abundant OTUs in sample A have very different genetic features than highly abundant OTUs in sample B? → large distance
80
Rarefaction
treats each sample as a bag and each observed OTU count as a colored marble; we then randomly draw marbles from these bags to adjust for sequencing depth We choose a total ___________ count (R) that is smaller than the smallest observed total sample count → then, for each sample, we randomly draw R OTUs from the set of observed OTUs in the respective sample without replacement → as a result each OTU receives a new adjusted count such that the total sample count (R), and thus the underlying sequencing depth, is fixed to the same value in each sample
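A minimal rarefaction sketch for a single made-up sample, using sampling without replacement (OTU names, counts and the chosen depth are all made up):

```r
set.seed(1)                                   # reproducible draws

counts  = c(otuA = 5, otuB = 3, otuC = 2)     # observed OTU counts for one sample
marbles = rep(names(counts), times = counts)  # the "bag of marbles"

R_depth = 6                                   # chosen rarefied total count R
draw    = sample(marbles, size = R_depth, replace = FALSE)

# new adjusted count per OTU; total is now fixed to R for this sample
rarefied = table(factor(draw, levels = names(counts)))
sum(rarefied)   # 6
```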
81
PERMANOVA
tests for associations of 2-dimensional data matrices (samples are columns and rows are multiple outcome variables) with predictor variables based on dissimilarity scores Compares two quantities: Within-group variation (Sw) → sum of squared dissimilarity scores between subjects with the same predictor values Between-group variations (SB) → sum of squared dissimilarity scores between subjects with different predictor values
82
Read
reads are the primary output of most next-generation sequencing; a read is a short sequence of letters representing a subsequence of ribonucleotides which was observed in a given sample
83
Pairwise sequence alignments
the process of arranging two sequences of symbols in such a way that exposes their regions of similarity (usually a reference genome is used to ID where a read originated from)
84
Expression
they are expressed as instructions on how to modify sequence 1 symbol by symbol to turn it into sequence 2 and vice versa | = a match * = substitution _ = deletion/insertion
85
heuristic alignments
trade accuracy for speed and often rely on assumptions about genomic structure
86
exact alignments
always guaranteed to find the optimal alignment of 2 sequences, but can be very slow when one or both of the two sequences are very large Types of exact pairwise alignment → perfect sequence matching, global alignment, local alignment
87
Perfect sequence matching
only interested in finding out whether a shorter target sequence is a perfect subsequence of another, longer sequence (if the shorter sequence differs by even a single symbol from the corresponding part of the longer sequence we do not consider them a match)
88
Global Alignment
aka "End-to-end" alignments; impose that all symbols in both sequences must be incorporated into the alignment and do not allow subsequences of poor similarity to be omitted; this makes them a powerful tool to contrast differences between two sequences of roughly similar length. These struggle with identifying regions of high similarity originating from two sequences that, except for those regions, are otherwise very different from each other AND they struggle with identifying the optimal region in a very long sequence to which a small sequence maps or aligns with high similarity
89
Needleman-Wunsch algorithm
gold standard (for global alignments), guaranteed to find optimal global alignments while also achieving efficient run-times due to a dynamic programming strategy, still widely used in modern research, "high-tech" but "easy" to implement and understand
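A minimal sketch of the Needleman-Wunsch dynamic programming idea (not the course's exact implementation); it assumes a simple scoring scheme (match +1, mismatch -1, gap -1) and computes only the optimal global score, without the pointer matrix or traceback:

```r
nw_score = function(s1, s2, match = 1, mismatch = -1, gap = -1) {
  a = strsplit(s1, "")[[1]]
  b = strsplit(s2, "")[[1]]
  n = length(a)
  m = length(b)

  # dynamic programming matrix (dpm), one extra row/column for the empty prefix
  dpm = matrix(0, nrow = n + 1, ncol = m + 1)
  dpm[1, ] = (0:m) * gap   # first row: aligning a prefix of s2 against gaps
  dpm[, 1] = (0:n) * gap   # first column: aligning a prefix of s1 against gaps

  for (i in 1:n) {
    for (j in 1:m) {
      s = if (a[i] == b[j]) match else mismatch
      dpm[i + 1, j + 1] = max(dpm[i, j] + s,        # diagonal: (mis)match
                              dpm[i, j + 1] + gap,  # up: gap in s2
                              dpm[i + 1, j] + gap)  # left: gap in s1
    }
  }
  dpm[n + 1, m + 1]   # optimal global ("end-to-end") alignment score
}

nw_score("GATT", "GATT")   # 4: all matches
nw_score("GATT", "GAT")    # 2: 3 matches and 1 gap
```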
90
Local Alignment
aims to ID the most similar regions within sequences; dynamically decides how large subsequences within the two target sequences should be in order to maximize local similarity; zooms in on one part of the first sequence and one part of the second sequence to find subsequences that are similar enough
91
Smith-Waterman (SW) algorithm (1981)
gold standard for local alignments; guaranteed to ID the optimal local alignment and is a direct extension of the Needleman-Wunsch algorithm (only needs a few small adjustments)
92
Changes made to NW to make SW
1) Initialize the first row and first column of the dpm (dynamic programming matrix) with the value 0 in each cell 2) Initialize the first row and first column of the pointer matrix with the value 8 in each cell (the pointer value 8 = "no-previous-position"; if a pointer cell contains the value 8 then the optimal path traversing position (i,j) does not have a previous position, in other words any optimal path including position (i,j) has to start at said position) 3) When moving through the dpm step by step and calculating scores for dpm[i,j], a third option is added on top of coming from the diagonal, left and up directions 4) If the best possible similarity score of these 3 directions is less than 0, then we assign dpm[i,j] to be 0 and save the "no-previous-position" value 8 at coordinate (i,j) of the pointer matrix
93
Things the SW accomplishes
Terminate a path at any point if moving forward can only lead through regions with a large number of mismatches Allow optimal paths to start at any appropriate point in the dpm if doing so maximizes local similarity Allow multiple optimal paths with the same degree of similarity to be identified at the same time
94
differences in local vs global alignment
1) Any alignment can be represented as a path through the dpm 2) In contrast to global alignments, an alignment of two sub-sequences doesn't have to start at the top left corner (but can start at a different cell) and doesn't have to end at the bottom right corner (but can end at a different cell)
95
Scoring Systems
both global and local alignments can extend and adjust their scoring systems to match other research settings and ID different types of similarity patterns When aligning AA sequences to evaluate proteins it is no longer meaningful to just consider matches and mismatches Why → there are many ways to mismatch or substitute symbols for each other; certain substitutions correspond to more similar sequences than others (either evolutionarily or structurally) and certain matches correspond to higher similarity than others TAKE HOME MESSAGE: before performing pairwise alignments researchers should carefully consider the alignment algorithm and scoring system and base their choices on best-practice guidelines to make sure they are identifying the correct types of similarity patterns
96
Multiple Sequence Alignment (MSAs)
are very powerful (able to relate two sequences to each other that would otherwise appear to be very different, through their shared relatedness to other sequences); compare multiple (3 or more) sequences at the same time; are able to ID subsequences that are highly similar among multiple candidates in the set of aligned sequences; often try to ID a consensus sequence among a set of related sequences (the consensus sequence is constructed so that, on average, all aligned sequences show the smallest difference from it, which allows researchers to analyze variability of individual sequences from a shared group/family consensus) Uses: to compare DNA, RNA or protein sequences that are suspected to be evolutionarily related (help quantify their degree of relatedness), identify homologous genes that trace back to a common ancestor and identify conserved elements in genomic regions (for example: transcription factor binding sites)
97
Exact global MSAs
can be obtained through a dynamic programming strategy (analogous to the NW algorithm), but instead of a 2-dimensional dpm this strategy uses an M-dimensional dpm when aligning M sequences; as a consequence exact global alignments become very slow when M is moderately large or the aligned sequences are moderately long NOTE: in most research scenarios heuristic MSA methods are the only feasible option, and there are many different types of heuristic MSA methods (each relies on different assumptions and incorporates different pieces of external info)
98
Progressive MSA algorithms
most commonly used strategy in research Examples: ClustalW, Clustal Omega, T-Coffee, PSAlign Steps → they work by performing the following 5 steps 1) Start by performing all possible pairwise alignments between the sequences to be compared 2) Based on the pairwise degrees of similarity, sequences are organized into a tree that represents their relatedness (the closer two sequences are in the tree, the more related they are) 3) Starting with the alignment of the two most similar sequences, the MSA is then progressively updated by incorporating one sequence at a time 4) Which sequence is incorporated in each step is determined by a rule that systematically collapses the "relatedness tree" 5) After all sequences have been incorporated the MSA is finished
99
Database
a collection of electronically stored data Contains: The raw information itself (binary digits, numbers, text, etc), Info about the structure of different pieces of data, Info about the relationships between different pieces of data
100
Database Importance
Whenever data is accessed, generated or modified in one or more of the following ways, a DBS can become valuable: over long periods of time, by many different entities, with different permissions and ownerships, in large volumes
Without adequate care, the complexity arising from these scenarios can quickly start to cause problems such as data safety concerns, faulty data and physical disk space limitations
Since any user can only interact with the database through the DBMS, we can control operations on the data very carefully and minimize or prevent problems and unintended consequences
Frequently encountered in biomedical research when storing high data volumes (NGS data, large scale projects, ...), accessing experimental data from public databases (EMBL, GenBank, GEO, KEGG, ...), and accessing and recording PHI
Technical knowledge about databases can help researchers better interact with the various data sources, better organize their own experimental data, and design experiments and data input forms that meet study requirements and are sufficient to answer research questions
101
Database Management System (DBMS)
the primary software system through which users create, modify and request info from the database
102
Database system (DBS)
the union of the database, the DBMS and other associated software that interacts with the database NOTE: the DBS is often informally referred to as "the database"
103
Database language
a programming language that allows users to define operations that they would like the DBMS to perform NOTE: in order for a database language to be applicable to a DBS it has to be supported by the DBMS and assume the same structure as the database
104
Client-Server Architecture
a DBS can use one of several possible architectures (each with different strengths and weaknesses); many modern architectures follow the "client-server model"
Server side: a computer, computing cluster or cloud system that stores the database and runs the DBMS software
Client side: an often remotely located computer or terminal communicating with the server side through a network or the internet; a user uses client software to send requests to the server side
On the server, the DBMS handles requests, communicates with the users and performs operations on the database
105
Relational databases
the most common type of database, widely applied throughout all industries; they serve as a great model for understanding the challenges any type of modern database has to address and the benefits databases provide
106
Structured query language (SQL)
the most commonly used database language for relational databases Examples of relational DBMSs that primarily use SQL: "MS SQL Server", "IBM DB2", "Oracle", "MySQL" and "Microsoft Access"
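A minimal sketch of SQL in action, using the sqlite3 module from Python's standard library (SQLite speaks a close dialect of SQL; the table and column names here are made up for illustration):

```python
import sqlite3

# An in-memory database: nothing is written to disk
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Define a relation with a primary key, then add two tuples
cur.execute("CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO subjects VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])

# A query: SELECT chooses attributes, WHERE filters tuples
rows = cur.execute("SELECT name FROM subjects WHERE subject_id = 2").fetchall()
print(rows)  # [('Bob',)]
```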
107
Relational database model
Data is stored and presented as a collection of "relations"
This model enables the creation of a wide array of different data structures, including one-to-one, one-to-many and hierarchical data structures
Relation: a table with a unique name that stores a set of tuples (rows) that share the same types of attributes (columns)
Each relation has to contain a primary key → either an attribute or a combination of multiple attributes that uniquely identifies each tuple in the relation; the primary key of a specific tuple is often chosen to be a unique identification number or character string
Each attribute (column) satisfies the following: has a name that is unique within the relation, has a single well defined domain/data type (number, text, object, ...), and its position does not carry info
Each tuple (row) satisfies the following: carries info represented by values of attributes belonging to a specific entity, its position does not carry information, and each tuple is unique (there can be no duplicate tuples)
108
Relational Database Model Rules
Non-primary-key attribute cells may contain missing values (denoted NA, which is also used in R for missing data)
Data relationships are represented by foreign keys (attributes that refer to primary keys in other tables)
A relation may contain anywhere from none up to arbitrarily many foreign keys
If a foreign key value is not missing, it has to reference an existing primary key value
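These foreign-key rules can be demonstrated with sqlite3 from the Python standard library (note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; the table names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
con.execute("CREATE TABLE treatments (treatment_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    treatment_id INTEGER REFERENCES treatments(treatment_id))""")

con.execute("INSERT INTO treatments VALUES (1, 'placebo')")
con.execute("INSERT INTO orders VALUES (10, 1)")     # valid: FK references an existing PK
con.execute("INSERT INTO orders VALUES (11, NULL)")  # allowed: a FK value may be missing (NA)
try:
    con.execute("INSERT INTO orders VALUES (12, 99)")  # rejected: no treatment 99 exists
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```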
109
Database relations versus views
Databases are constructed to minimize problems, optimize performance and save storage space
Example: if the "orders" relation incorporated the full treatment and subject info for every single subject, there would be a large amount of redundancy that unnecessarily increases the required storage space; referring to treatments and subjects by foreign keys is a much more efficient use of physical disk space
Incorporating the full information about treatment and subject also makes it easy for inconsistencies to arise (ex: if a subject's name were misspelled in one of the tuples in the table)
However, the tables representing the relations stored in a database are often not of primary interest to end users; most users tend to be interested in combinations of subsets of database relations
110
View
a table generated from one or more relations in the database, that contains some or all of their content Users can request views of the database from the DBMS, which will fetch and assemble the respective pieces of information from the raw database
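A sketch of a view using sqlite3 (the tables and column names are hypothetical): the view is assembled from the underlying relations on request rather than being stored as duplicate data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     subject_id INTEGER, procedure_name TEXT);
INSERT INTO subjects VALUES (1, 'Alice');
INSERT INTO orders VALUES (10, 1, 'MRI');

-- A view combining two relations; nothing is duplicated on disk
CREATE VIEW order_details AS
  SELECT o.order_id, s.name, o.procedure_name
  FROM orders o JOIN subjects s ON o.subject_id = s.subject_id;
""")

# The DBMS fetches and assembles the pieces when the view is queried
print(con.execute("SELECT * FROM order_details").fetchall())
```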
111
Data Anomalies
problematic situations that can arise while a DBS is operating and info in the database is updated, these often occur due to flaws in the design of a database
112
Insertion Anomaly
a situation in which new data can only be added to the database by introducing missing values, or otherwise cannot be added at all Ex: if we want to add a new medical procedure that will be offered in the future, we have to do so by assigning missing values to the subject info attributes
113
Deletion Anomaly
a situation in which deleting data that represents a certain type of info necessarily also deletes data that represents a different type of info Ex: if we remove a medical procedure that is no longer offered, the info about all subjects that only received this procedure will also be deleted NOTE: we want to delete the procedure but not the subjects
114
Modification/Update anomaly
a situation in which identical pieces of info are expressed in multiple tuples and updating only some of them leads to inconsistencies, it's hard to verify the real value of the data
115
Database Design Goal
A well designed database is able to store the desired info and represent all relevant relationships between different pieces of data, and it prevents serious complications before they can occur
To achieve this, a database specialist first collaborates with a client or domain expert to create a draft of the database
The draft is then refined through database normalization, in which the structure of the unnormalized database is modified step by step to meet more and more stringent criteria
116
First normal Form (1NF)
Specifies the following: every attribute is atomic (i.e. it contains a single value of a specified data type and no group/list of values); this is already a formal requirement for relational databases To achieve: define a primary key (possibly consisting of two or more attributes) Undesirable properties remain: certain non-key attributes depend on only part of the primary key whereas others depend on all of it (a glaring source of redundancy and potential error)
117
Second Normal Form (2NF)
Specifies the following: all non-key attributes (in each relation) are functionally dependent on the entire primary key, and the database satisfies 1NF NOTE: attributes now depend only on the primary key, but transitive dependencies may remain: three attributes (A, B, C) have a transitive dependency if A depends on B and B depends on C (a source of redundancy and potential error)
118
Third Normal Form (3NF)
Specifies the following: there are no transitive dependencies among attributes, and the database satisfies 2NF The structure resembles the multiple tables presented in the previous lecture NOTE: a relational database is considered normalized if it satisfies 3NF; this is generally considered a well designed database, free of most anomalies There are also additional, more stringent normal forms Ex: BCNF (requires 3NF), 4NF (requires BCNF), 5NF (requires 4NF)
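As a sketch of what 3NF buys, consider a hypothetical visits table where city depends on zip and zip depends on the visit: moving (zip, city) into its own relation removes the transitive dependency (all table and column names here are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- 3NF: (zip -> city) lives in its own relation instead of being
-- repeated in every visit tuple
CREATE TABLE zip_codes (zip TEXT PRIMARY KEY, city TEXT);
CREATE TABLE visits (visit_id INTEGER PRIMARY KEY,
                     zip TEXT REFERENCES zip_codes(zip));
INSERT INTO zip_codes VALUES ('30303', 'Atlanta');
INSERT INTO visits VALUES (1, '30303'), (2, '30303');
""")

# city is stored exactly once, so updating it cannot create an
# inconsistency between visits (no modification anomaly)
print(con.execute(
    "SELECT v.visit_id, z.city FROM visits v "
    "JOIN zip_codes z ON v.zip = z.zip").fetchall())
```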
119
Benefits of normalization
- Faster turn-around for analysis personnel and collaborators (makes it easier to enter data into existing databases and to process data with analysis software)
- Makes it more difficult for inconsistencies to occur
- For tables that store raw data from your own experiment that will be submitted to a collaborator, do the following:
- Define/pick a column that serves as the primary key; the values in this column should uniquely identify each row
- If there is no obvious choice for a primary key, it might be a good idea to create unique ID numbers for each row in a table
- If a raw data table contains sets of columns that represent different types of info, it might be valuable to split the table apart according to these types, especially if they are referenced in other tables
- Organize tables so that no duplicated info appears in multiple tables
- Whether or not these objectives are worth the effort will depend on the number of affected tables and columns and also the total data volume
120
Public Bioinformatics Resources
A key factor leading to the birth of bioinformatics was the increased volume and complexity of data that biomedical researchers were faced with analyzing to answer their research questions since the late 20th century. Since its inception, bioinformatics has heavily utilized and benefitted from databases that catalogue and share experimental data Today large-scale public database systems containing "omics" data are a cornerstone of modern clinical and biological research Strengths: facilitate fast access to stored info, and provide many online tools specialized in performing comparisons and analyses by directly communicating with different databases
121
The National Center for Biotechnology Information (NCBI)
houses a wide array of frequently used online resources; all of the provided components can be accessed either through the search bar at the top or by navigating through the blue menu options on the left Examples:
Medline: widely known in the biomedical field for its primary search engine "PubMed"
The database of genotypes and phenotypes (dbGaP): contains "data and results from studies that have investigated the interaction of genotype and phenotype in humans"
dbSNP: contains "a broad collection of simple genetic polymorphisms" (SNPs, small indels, MNPs)
ClinVar: collects info about genetic variation and its relationship to human health
GenBank
RefSeq
The Gene Expression Omnibus database (GEO)
Etc.
122
Why is NCBI such a powerful tool?
Its Entrez search system
123
Entrez
a cross-database search system that allows users to simultaneously search all public NCBI databases, supports searching for both individual character strings and logical combinations of multiple character strings, offers filters and search refinement options to continuously narrow searches down to a desired target, and incorporates many data visualization features; search queries are accessible through both web interfaces and direct interfaces with programming languages
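Entrez queries can also be issued programmatically through NCBI's E-utilities; as a sketch, the snippet below only constructs an esearch URL with the standard library (no network request is made, and the query term is just an example):

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an Entrez esearch URL; logical operators (AND/OR/NOT)
    go directly into the term string."""
    return EUTILS_ESEARCH + "?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})

# A logical combination of two search terms against PubMed
print(esearch_url("pubmed", "BRCA1 AND human[organism]"))
```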
124
GenBank
an annotated collection of all publicly available DNA sequences; houses submissions made by various entities (including labs and sequencing projects); part of the International Nucleotide Sequence Database Collaboration (INSDC), which means that it daily receives data submitted directly to it as well as nucleotide data from the DNA Data Bank of Japan (DDBJ) and the European Nucleotide Archive (ENA); this is what the majority of searches will target when using the "nucleotide" search option of the Entrez search bar
125
RefSeq
owned by NCBI and not part of INSDC; short for Reference Sequence; aims to provide a non-redundant, curated set of data representing our current knowledge of known genes; houses well annotated reference sequences of genomes, transcripts and proteins
126
Gene Expression Omnibus (GEO)
originally designed to be a public data archive for high-throughput gene expression datasets obtained from microarray and RNA sequencing studies, but later started to accept additional types of experimental data that relate to gene expression and gene regulation (DNA methylation, genome binding, protein profiling, chromosome conformation, genome copy number variation) Follows the MIAME and MINSEQE guidelines, which distill a study down to the minimum info required to describe it (MIAME = minimum info about a microarray experiment, MINSEQE = minimum info about a next-generation sequencing experiment)
127
GEO2R
an interactive web tool that performs comparative analyses between groups of processed samples using the "GEOquery" and "limma" R packages from the Bioconductor project
128
GEO interfacing capabilities with R
GEOquery allows users to directly download and import GEO datasets into R with a single command In combination with other Bioconductor packages (such as limma, DESeq2, etc.) this allows for creating complete analysis workflows pooling data from various studies natively within R
129
European Molecular Biology Laboratory (EMBL)
"Europe's flagship laboratory for the life sciences"; an intergovernmental organization with 27 supporting countries, most European, though Australia and Argentina also hold associate member status
130
European Bioinformatics Institute (EMBL-EBI)
the part of EMBL that focuses on performing bioinformatics research, developing and providing bioinformatics services, and providing a large number of public databases and bioinformatics tools that can be accessed online via their website
131
Databases Associated with EBI
• UniProt: information about protein sequences & biological function of proteins
• PDBe: "the European resource for the collection, organization and dissemination of data on biological macromolecular structures"
• Expression Atlas: large database holding information about gene expression signatures and differential expression
• Ensembl
• and many more ...
132
Online tools provided on EBI website
* ClustalOmega * BLAST * HMMER: alignments geared towards identifying homologous protein or nucleotide sequences * Ensembl Genome Browser * and many more ...
133
Ensembl genome database project
a project of both EBI and the Wellcome Trust Sanger Institute
• Public DBS containing curated genetic consensus information
• Analogously to NCBI's RefSeq, it houses well annotated reference sequences of genomes, transcripts and proteins
• "Automatically annotates genomes" of specific model organisms and "integrates this annotation with other available biological data"
• High degree of transparency: all data and source code used to generate annotations is publicly available
• Known for high quality annotations and the Ensembl genome browser
134
Ensembl IDs
identifiers used by Ensembl to refer to different types of information; can be understood as primary keys for their database entries; used in many studies outside of Ensembl
• Homo sapiens format: "ENS objectType numberString.version"
• objectType: consists of multiple letters denoting different types of data
• numberString: a string of digits of various length
• version: a version number
NOTE: both Ensembl and RefSeq IDs are frequently used; the same gene could be referred to with either type of ID
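A sketch of splitting a human Ensembl ID into its parts with a regular expression (the exact pattern, the 11-digit number string, and the version number shown are assumptions for illustration):

```python
import re

# Assumed layout: "ENS" + objectType letters + 11-digit number + optional ".version"
ENSEMBL_ID = re.compile(
    r"^ENS(?P<type>[A-Z]+)(?P<number>\d{11})(?:\.(?P<version>\d+))?$")

# TP53's gene ID, with an illustrative version suffix
m = ENSEMBL_ID.match("ENSG00000141510.16")
print(m.group("type"), m.group("number"), m.group("version"))
```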
135
Genome Browsers
These browsers allow researchers to dynamically visualize genomes alongside annotated data; they display nucleobase positions of chromosomes on the X-axis and use the Y-axis to display information as a function of base position. Annotations may include:
• Genes (both validated and predicted) and transcript variants
• Structural and regulatory elements (both validated and predicted)
• Read counts, expression levels and methylation levels
• Information about sequence variation in different populations
• ...and more
136
Weakness of genome browsers
They can be fairly overwhelming when starting to work with them (more or less intuitive based on the user's preferences)
137
Three most extensive genome browsers
Ensembl, NCBI GDV and UCSC Genome Browser
138
UCSC Genome Browser
hosted by the UCSC Genomics Institute
• Available both as an online browser and as a downloadable desktop application
• Offers a wide variety of annotation tracks sourced from many different databases
• Allows uploading data and displaying it in "custom tracks" in various formats