Bioinformatics Exam 3 Review Flashcards Preview

Flashcards in Bioinformatics Exam 3 Review Deck (138)
1
Q

Programming

A

helps in collecting and manipulating data, automating analysis workflows (to show people what you did), minimizing human error and generating reproducible reports, quick processing of large datasets and repetitive tasks, visualizing and making sense of the data

2
Q

Programming language

A

letters/symbols create words according to rules, language for humans to formulate instructions for computers to generate some desired output, compiler and interpreter software allow an instruction formulated in a programming language to be translated into executable machine level operations.
Pathway: Instruction (in your mind) → instruction in programming language → instruction in machine level language → execution of instruction/computation → generated output

3
Q

Source Code

A

a set of instructions formulated in a programming language that is readable by humans

4
Q

Program

A

a set of instructions stored in a form that can be executed by a computer

5
Q

Compiler

A

software that translates source code into a machine-level program that is (usually) optimized for the machine it is compiled for
Time to translate → Slow
Time to execute → Fast

6
Q

Interpreter

A

translates source code scripts into machine level operations “on the fly” and executes them line by line
Time to translate → Fast
Time to execute → Slow

7
Q

1976

A

Chambers, Becker and Wilks develop the S statistical programming language at Bell laboratories
Aim: facilitate quick transitions from idea to software
This interpreter-based language made modifying, testing and troubleshooting programs quick and convenient.

8
Q

1993

A

Ihaka and Gentleman re-implement S and name it the “R programming language”

9
Q

1995

A

The decision is made to make R freely available under the GNU General Public License (but it is not yet officially released)

10
Q

1997

A

R Core Group is founded and starts taking control of R’s further development; the Comprehensive R Archive Network (CRAN) is launched, enabling sharing and curation of user-developed components that extend R’s capabilities

11
Q

2000

A

R version 1.0.0 is released to the general public

12
Q

2009

A

New York Times article: “Data Analysts Captivated by R’s Power”, Ashlee Vance
Good description of how R makes a difference → Daryl Pregibon (Google): “it allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems”

13
Q

2017

A

a study found that R has shown extreme growth

14
Q

2019

A

Another study found that R is the most requested programming language

15
Q

Comprehensive R Archive Network (CRAN)

A

a network of ftp and web servers storing versions of code and documentation for R. This serves as the main general-purpose repository for R packages: if you run into a common problem, there is often a pre-made package you can use to solve it.

16
Q

R

A

language and environment for statistical computing and graphics, open source language that is free, provides tools for statisticians, data miners, data analysts, data scientists and academic researchers

17
Q

Bioconductor

A

Another free R package repository, dedicated to the analysis of genomic data and biological high-throughput assays; its primary focus is serving the needs of bioinformaticians and biomedical researchers
Packages available: >1800
Mission: accessibility of powerful analysis and visualization tools, reproducible research, rapid development of software components that are both scalable and compatible with each other

18
Q

Commands in R

A

R’s interpreter can process 2 forms of these → expressions and assignments; they can be separated by line-breaks or the “;” character, and individual components within commands can be arbitrarily separated by spaces and tabs

19
Q

Expressions

A

commands that are evaluated and (optionally) printed, after which their output is lost; these take some input arguments or values and return some output values

20
Q

Operators

A

are generally expressed via 1 to 3 consecutive special characters and often handle fundamental, essential programming tasks, there are several other operators that handle tasks such as logic or comparison
Examples: ? opens a webpage with helpful documentation and explanations of a function

21
Q

Objects

A

individual pieces of data that have two major attributes:
Data type: what type of information it contains
Value: the actual information that it contains
NOTE: internally the value of an object is just a bunch of zeros and ones in the computer’s memory; the data type is what tells R how to interpret and display the value of the object.

22
Q

Scalar and multidimensional data types

A

the two fundamental classes of data types

23
Q

Character Objects

A

display letters, words and text, wrapped in quotation marks

24
Q

Logical Objects

A

only two possible values, TRUE and FALSE (abbreviated T and F; think yes/no), used when you want to check or remember whether or not something is true or has happened when you run a program.

25
Q

Numerical Objects

A

integers and decimals

26
Q

Parentheses

A

used to group arguments of expressions in conjunction with commas and can be used to control the order of operations in expressions, the expression enclosed within the innermost parentheses will always be evaluated first

27
Q

Assignment Commands

A

commands that evaluate an expression and store it, so that it can be accessed again in the future. These store objects in variables where the expression on the right-hand side can either be an R object or any type of valid expression that creates an R object and the left hand side is a variable that can be understood as a label or name that is attached to an object in order for R to remember it.
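A minimal sketch of the difference between an expression and an assignment. The variable names are made up for illustration; note that `<-` and `=` both act as assignment operators at the top level.

```r
# An expression is evaluated and printed, but the result is not kept.
3 + 4

# An assignment evaluates the right-hand side and stores it under a name.
total <- 3 + 4          # could also be written as: total = 3 + 4
doubled <- total * 2    # stored objects can be reused in later commands

# Two commands on one line, separated by ";"
a <- 1; b <- 2

print(doubled)
```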

28
Q

R Console

A

an interactive interface for the command line interpreter (it comes up when you open R), commands can be typed into the console and executed by hitting the Enter key

29
Q

Incomplete Commands

A

expressions that have not provided necessary right-hand side arguments or expressions that have not (yet) closed all their opened parentheses or brackets.

30
Q

Negation/ “NOT” operator

A

done with an exclamation point, this turns true into false and false into true (opposites)

31
Q

Logical “AND”

A

takes two logical expressions/variables and returns TRUE ONLY if BOTH are TRUE, otherwise it will return FALSE

32
Q

Logical “OR”

A

returns the expression as true if either the left or right is true. Otherwise it is returned as false
Operator: |

33
Q

“Equal to” Operator

A

takes R objects (or expressions generating an R object) and returns TRUE if both objects are identical to each other.
Operator: == (Two consecutive equal signs)

34
Q

“Not equal to” Operator

A

takes R objects (or expressions generating an R object) and returns TRUE, if both objects are not identical to each other
Operator: !=

35
Q

Inequality Operators

A

compare numerical objects to each other

Operators: < (less than), <= (less than or equal to), > (Greater than), >= (greater than or equal to)
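The logical and comparison operators from the last few cards can be tried out as follows. This is only a sketch with hypothetical values for `x` and `y`; `&` is R's element-wise AND operator.

```r
x <- 5
y <- 8

!TRUE                 # NOT: flips TRUE to FALSE
(x < y) & (y < 10)    # AND: TRUE only if both sides are TRUE
(x > y) | (y == 8)    # OR: TRUE if at least one side is TRUE
x != y                # not equal to
x <= 5                # less than or equal to

# The result of a comparison is a logical object and can be stored.
is_small <- x < 3
print(is_small)
```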

36
Q

If blocks

A

how to make logical expressions useful, help us to execute pieces of code conditionally and react to different inputs/scenarios while the program is running
Syntax: if (condition){# conditional lines of code goes here}
Where “condition” is a logical expression or variable. If the condition is met, the code block in curly brackets is executed; otherwise it is skipped.

37
Q

If…else statements

A

allow us to conveniently cover two mutually exclusive cases (i.e. if one is true the other is false)
Syntax: if (condition){# if “condition” is TRUE then do…} else {# if “condition” is FALSE then do…}
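A small if...else sketch. The variable names and the threshold are hypothetical and only serve to show which branch runs.

```r
# Hypothetical input value; change it to see the other branch run.
expression_level <- 12.5
threshold <- 10

if (expression_level > threshold) {
  label <- "high"   # runs when the condition is TRUE
} else {
  label <- "low"    # runs when the condition is FALSE
}

print(label)
```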

38
Q

Vectors

A

a simple, ordered list of values that all share one specific scalar data type
Three components:
Scalar data type
Ordered cells
Values
Created through the concatenation function c()
All elements of a vector have to have the same data type; a mixed-type vector is not possible (if you pass characters and numbers to c(), R coerces all elements to a single common type)

39
Q

Multi-dimensional Data types

A

more complex data types that are able to contain and arrange multiple scalars (vectors are the simplest of these types)

40
Q

Binary operators

A

require that either both arguments are vectors of the same length or one of the two arguments is a single object (i.e. a vector of length one)
If different length → R will “recycle” the shorter vector from the beginning to extend its length, and gives a warning if the longer length is not a multiple of the shorter one

41
Q

Subset operator

A

accepts vectors of logical expressions, logical vector has to be the same length as the vector we want to subset
accesses a specific element in the list and returns a new list with said element
Symbol: […]
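The vector, binary operator and subset cards can be illustrated together. This is a sketch with made-up numbers, not course data.

```r
counts <- c(4, 10, 7, 1)       # numeric vector built with c()

counts * 2                     # a single value is recycled across all cells
counts + c(100, 200)           # shorter vector recycled: 104 210 107 201

keep <- counts > 5             # logical vector, same length as counts
counts[keep]                   # subset with a logical vector
counts[c(1, 3)]                # subset by position

mixed <- c("a", 1)             # mixed input is coerced to one type
class(mixed)
```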

42
Q

Matrix

A

works just like a vector but has 2 dimensions (cells have an x- and y-position), instead of just one

43
Q

Array

A

the abstraction of both vectors and matrices; it is a multidimensional matrix with an arbitrary, user-specified number of dimensions
A 1-dimensional R ______ is equivalent to an R vector and a 2-dimensional R _______ is equivalent to an R matrix

44
Q

List

A

are an extension of the vector idea, this is a generic collection of R objects
Each element in ______ is an R object with an arbitrary data type and dimension (also allows different lengths as well as complex objects)
Useful to group various types of data that belong together (where they do not conveniently fit into a single table)
Can also use the assignment operator inside the ______ function to give elements names by which they can be accessed in the future
Syntax of ____ function:
_____( arg1, arg2, arg3…)

45
Q

Extraction operator

A

will access a specific R object in the list and return the said object directly.
Symbol: [[…]] or $
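A short sketch contrasting the subset operator with the extraction operator on a list. The list contents and names are hypothetical.

```r
# Elements are named inside list() with the assignment operator "=".
sample_info <- list(id = "S01", reads = c(120, 95, 300), paired = TRUE)

sample_info["reads"]      # subset operator: returns a list of length 1
sample_info[["reads"]]    # extraction operator: returns the numeric vector itself
sample_info$id            # $ extracts an element by name
```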

46
Q

Wrappers

A

any entity that encapsulates another entity, an object that holds other objects or a “container” for other objects; lists are generic wrappers for R objects; we probably won’t work directly with lists in this course (lists allow us to form a general intuition about other generic _________)
Generic __________: Three types → S3, S4 and reference class (informally “R5”) objects

47
Q

Loops

A

help us perform repetitive tasks, reduce redundancy of code and reduce the amount of code required to perform a task. Note that R only has “for each” loops and “while” loops. The index variable can help us execute the same piece of code for different inputs.

48
Q

“For” loops

A

repeat something “N” times (we choose N)

Structure: for( i in 1:n){# repeat the following code…}

49
Q

“For each” loops

A

repeat something for each element in a set
Structure: for(i in elements){# repeat the following code…}
Header → defines an index variable (here named i) and an R vector of “elements” for each of which we want to repeat something
Body→ the code block that will be repeated
Can be read as for each “i” in a set of “elements”, do this
What it does (sequence of events):
Sets the index variable “i” to the 1st object in “elements”
Executes the code inside the body
Sets the index variable “i” to the 2nd object in “elements”
Executes the code inside the body
Repeats for all objects in “elements”
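Both loop styles can be sketched in a few lines. The element names below are placeholders chosen for illustration.

```r
# "For each" loop: the index variable g takes each element in turn.
genes <- c("TP53", "BRCA1", "EGFR")
name_lengths <- c()
for (g in genes) {
  name_lengths <- c(name_lengths, nchar(g))
}
print(name_lengths)

# "For" loop: repeat something N times by looping over 1:n.
n <- 4
running_total <- 0
for (i in 1:n) {
  running_total <- running_total + i
}
print(running_total)
```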

50
Q

Functions

A
an important form of expression; fxns take arbitrarily many arguments (0, 1, 2, 3, ...), usually perform more complex tasks and return some desired output
R has pre-defined fxns and allows users to create their own; after a fxn has been created it can be used as a shorthand to run the code encapsulated inside of it
NOTE: fxns have "local scope" whereas code outside of fxns has "global scope" (meaning variables inside of a fxn are created independently and separately from the R environment outside of the fxn); R will discard any assignments made inside of a fxn after it has been executed
When a fxn has multiple input arguments they will be assigned in order when the fxn is executed (NOTE: the order can be changed arbitrarily if arguments are referred to by their names)
Default values for input arguments can be set by using the assignment operator next to input variables in the header line of the fxn
Syntax: nameOfFunction(arg1,arg2,arg3...)
my_function = function(arg1,arg2,arg3,...){
# function goes here...
}
Header: Defines the name of the function on the left of the assignment operator and the names of input variables/arguments of the function (these variable names are provided in the parenthesis)
Body: the code block that will be executed when "my_function" is used
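A sketch of a user-defined fxn showing default values, argument matching and local scope. The fxn name `scale_value` and its arguments are hypothetical.

```r
# Header: name on the left, input arguments (one with a default) in parentheses.
scale_value <- function(x, factor = 2) {
  result <- x * factor   # "result" exists only inside the function
  result                 # the last evaluated expression is returned
}

scale_value(5)                    # uses the default factor
scale_value(5, 10)                # arguments matched by position
scale_value(factor = 3, x = 4)    # arguments matched by name, in any order

# The local variable was discarded after the call.
exists("result", inherits = FALSE)
```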
51
Q

Matrix function

A

my_matrix = matrix(
data = ?, nrow = ?, ncol = ?, byrow = ?
)
Input arguments:
“data” → a vector of values or objects that will be put into the cells of the matrix
“nrow” → the number of rows in the matrix
“ncol” → the number of columns in the matrix
“byrow” → logical; if TRUE the values fill the matrix row by row, if FALSE (the default) they fill it column by column

52
Q

cbind(A,B)

A

aka column bind, stitches A and B together into a single matrix such that the columns of B follow to the right of the columns of A

53
Q

Subset

A

the ________ operator can also be used in combination with the assignment operator to modify specific elements inside of the matrix, a matrix can be subset and accessed using the _________ operator [rows,columns] several different ways
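The matrix, cbind and subset cards can be tried together. This is a sketch with small made-up matrices.

```r
m <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
# byrow = TRUE fills row by row:
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

m[2, 3]          # single cell: row 2, column 3
m[1, ]           # the whole first row
m[, c(1, 3)]     # columns 1 and 3
m[1, 2] <- 0L    # subset + assignment modifies one cell

extra <- matrix(data = c(7L, 8L), nrow = 2, ncol = 1)
wide <- cbind(m, extra)   # columns of "extra" are appended on the right
dim(wide)
```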

54
Q

Data Frames

A
the data frame is R's preferred data type to represent R X C data tables, like a matrix it has R rows and C columns and it supports all subset operations [A,B] that matrices have access to, in contrast each column has its own associated data type and each column can have a different data type (this offers a lot of additional functionality)
my_tab = data.frame(
column1 = c(...),
column2 = c(...),
column3 = c(...),
...
)
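A filled-in version of the template above, as a sketch. The table contents are invented for illustration.

```r
people <- data.frame(
  name   = c("Ana", "Ben", "Cy"),
  age    = c(34, 27, 45),
  smoker = c(FALSE, TRUE, FALSE)
)

people[2, "age"]             # matrix-style [row, column] subsetting
people$name                  # a single column by name
people[people$age > 30, ]    # rows that meet a condition
sapply(people, class)        # each column has its own data type
```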
55
Q

Reading

A

the process of loading a file into memory
Fxn: read.table, this fxn reads a simple text file containing table data into R and turns it into a data.frame object; it expects a plain text file in which each line represents one row of a table and columns are separated by a special character (usually a space “ ” or tab “\t” character), in the “people.txt” file columns are separated by the “” character
Syntax: my_tab = read.table(
file = “…”, header = TRUE, sep = “…”
)

56
Q

Writing

A

the process of saving data currently in a computer’s memory to a specific file
Fxn: write.table, can be used to write data frame objects to simple text files and to add a new column in a dataset and saves the modified table into a separate file
NOTE: omit quotes around objects and do not include row names as a separate column
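A round-trip sketch of writing and reading a table. It uses a temporary file instead of a real path, and the table contents are made up; `quote = FALSE` and `row.names = FALSE` correspond to the NOTE above.

```r
tab <- data.frame(name = c("Ana", "Ben"), age = c(34, 27))
tab$age_months <- tab$age * 12          # add a new column

path <- tempfile(fileext = ".txt")      # placeholder location for the output file
write.table(tab, file = path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back into a data.frame.
tab2 <- read.table(file = path, header = TRUE, sep = "\t")
print(tab2)
```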

57
Q

Installing and loading R packages

A

R packages that were uploaded to CRAN can be installed via the “install.packages” fxn which accepts a package name in the form of a character object
After the package is installed it can be loaded into R by using the “library” fxn which accepts a package name in the form of a character object
Loading a package makes new commands available to the R user which are most commonly implemented as fxns
For CRAN packages, info on what’s included in a package and how to use it can be found via: https://cran.r-project.org/web/packages/packagenamegoeshere/
For a specific example consult the lecture slides

58
Q

Random Variable

A

a mathematical object that represents the random process behind a measurement, it can realize/assume/generate different values with different probabilities, when an experimental measurement is taken the resulting value is considered a realization of a random variable, parameters of a distribution represent truths/properties about the random process that generates the outcome data, this means that questions about outcome variables and the process that created them can generally be formulated in terms of questions about distributional parameters
Events in sample space are mapped to real numbers and then assigned probabilities
accomplish the following tasks:
Transfer questions about real life events into questions about numbers (allow us to evaluate and manipulate any type of probabilistic event with mathematical language and powerful mathematical tools)
Provide a unified framework in which we can gain insights about properties and consequences of random processes

59
Q

Probability Distribution

A

the probability distribution of an RV assigns probabilities to sets of events in the sample space (i.e. the set of all possible events), exact probability distribution of a random variable is unknown (based on knowledge and assumptions about sample space and the data generating process we can assign a family of distributions to random variables)

60
Q

Probability density function (pdf)

A

is a component of probability distribution, it can be loosely understood as the function that takes an individual event from a sample space and returns its associated probability

61
Q

Families of distributions

A

have a pdf that contains some unknown parameters (choosing numbers for these parameters creates a specific distribution that is a member of the respective family)

62
Q

Bernoulli distribution family

A

any random process with only two possible outcomes which are denoted as X = 0 and X = 1
Form of the probability density: f(x) = p^x (1-p)^(1-x), for x = 0 or x = 1
p = parameter representing the probability that X = 1 occurs

63
Q

Exponential distribution family

A

a random process in which we measure the time X between some events that occur independently but on average at the same constant rate
Form of probability density: f(x) = λ e^(-λx), for x ≥ 0
λ = parameter representing the constant rate at which events occur on average
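As a quick check of the exponential density, the formula can be compared against R's built-in `dexp`. The rate value below is an arbitrary choice for illustration.

```r
lambda <- 2                      # assumed rate, chosen for illustration
x <- c(0, 0.5, 1)

manual  <- lambda * exp(-lambda * x)   # density written out by hand
builtin <- dexp(x, rate = lambda)      # R's exponential density
all.equal(manual, builtin)

# Simulated waiting times; their average should be close to 1 / lambda.
set.seed(42)
waits <- rexp(5000, rate = lambda)
mean(waits)
```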

64
Q

Distributions with discrete RVs

A

can assume a countable number of values

65
Q

Distributions of continuous RVs

A

can assume values in a continuum of numbers (not countable)

66
Q

Random sample

A

a collection of random variables that are independent but share the same distribution

67
Q

Expected Value

A

the average of all possible values of X, each weighted by its respective probability

68
Q

Variance

A

measures the spread of X around its mean (i.e. central value); variance has a close connection to uncertainty (if it is large the values are widely scattered, if it is small they are not)
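Expected value and variance can be computed directly from the definitions for a Bernoulli random variable. This is a sketch; the value of `p` is an arbitrary assumption.

```r
p <- 0.3                                    # assumed success probability
values <- c(0, 1)
probs <- p^values * (1 - p)^(1 - values)    # Bernoulli probabilities for x = 0, 1

expected <- sum(values * probs)                   # probability-weighted average
variance <- sum((values - expected)^2 * probs)    # weighted squared spread

print(c(expected = expected, variance = variance))

# A random sample of realizations gives similar empirical values.
set.seed(1)
draws <- rbinom(10000, size = 1, prob = p)
mean(draws)
var(draws)
```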

69
Q

Negative Binomial distribution

A

Assumes: the response variables are discrete counts, overdispersion of outcomes (i.e. their variances tend to be larger than their expected values, predominantly the case in gene expression data)
Allows: sample signal intensities to be factored into the probability distribution of counts

70
Q

Generalized Linear Model (glm)

A

a modelling framework suitable to investigate the effects of predictor variables on outcomes following a negative binomial distribution
Are able to perform estimation and hypothesis test for commonly encountered probability distributions
Can handle outcome variables that are discrete or continuous and that are constrained to specific intervals
GLMs → express some fxn of the mean response to be a linear combination of predictors
g(E[Y]) = alpha + (Beta1) (X1) + (Beta2)(X2)…

71
Q

Linear models

A

express the mean of the response as a linear combination of predictors:
E[Y] = alpha + (Beta1) (X1) + (Beta2)(X2)…
Beta = coefficient
X = predictor
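The two model cards can be compared on simulated data. This is only a sketch: the data are made up, and a Poisson GLM with a log link from base R stands in for the negative binomial model named on the GLM card, which needs an add-on package.

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)

# Linear model: E[Y] = alpha + beta1 * x1
y_lin  <- 1 + 2 * x1 + rnorm(n, sd = 0.5)
fit_lm <- lm(y_lin ~ x1)
coef(fit_lm)

# GLM: g(E[Y]) = alpha + beta1 * x1, here with g = log and count outcomes
y_cnt   <- rpois(n, lambda = exp(0.5 + 0.8 * x1))
fit_glm <- glm(y_cnt ~ x1, family = poisson(link = "log"))
coef(fit_glm)
```

The estimated coefficients should land near the values used to simulate the data (2 for the linear model, 0.8 for the GLM).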

72
Q

Quantifying DNAm

A

quantify via beta values (can be obtained from DNAm microarrays), M = methylated/blue/high beta value, U = unmethylated/yellow/low beta value and black = in between

73
Q

Deconvolution

A

estimating cell type proportions for each sample by using their methylation beta values in the reference set and by using the reference signatures for each cell type
Assumption → for each cell type the unknown beta values of the reference set of CpGs are approximately stable in our sample population if reference signatures (beta values) for each cell type are already known from previous studies

74
Q

Beta value approach

A

one of two strategies for analyzing DNAm data; beta values are bounded between 0 and 1 and their variance tends to change in different sub-populations; the popular approach is to ignore these issues and fit a linear model assuming beta values follow an approximately normal distribution
Debate → is this really the best statistical model to analyze the data?
Outcome: Yj
Outcome following normal distribution? → hard to justify, generally not satisfied
Biological interpretation → easy to interpret, biologically meaningful
Performance → linear model identifies substantial differences well but p-values are questionable
Graph → skewed right

75
Q

M-value approach

A

second of two strategies for analyzing data, transforms beta values to so called m-values which makes them behave more like a normal distribution THEN they fit the m-values into a linear model
Outcome: m_j = log2( Y_j / (1 - Y_j) )
Outcome follows normal distribution? → yes, approximates outcome well
Biological interpretation → difficult
Performance → linear model performs well in most cases
Graph → slightly skewed left
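The beta-to-M-value transformation is a one-liner in R. The beta values below are invented for illustration.

```r
beta <- c(0.05, 0.5, 0.9)          # hypothetical methylation beta values

m <- log2(beta / (1 - beta))       # M-values: unbounded, centered at 0
print(m)

# The transformation can be inverted to get back to beta values.
back <- 2^m / (2^m + 1)
all.equal(back, beta)
```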

76
Q

Operational Taxonomic Units (OTUs)

A

are groups of closely related species and are based on taxonomy. They can be counted and analyzed at different levels of hierarchy
Higher level → includes more species, is more accurate and less specific
Lower level → includes fewer species, is less accurate and more specific
Common result → OTU count tables
Compositional → total sample counts are arbitrarily fixed and differ
Rows = genus
Columns = samples
# in cell = how often an OTU was observed in a given sample

77
Q

Beta diversity

A

pairwise differences in diversity between samples
Ex → patterns of OTU counts are more similar among samples within group A and within group B than when comparing a sample from group A to a sample from group B

78
Q

Unweighted UniFrac

A

Measures the proportion of genetic change (in sample composition) that is unique to the evolution of either sample
Is only concerned with how the genetic content of two samples changes with respect to presence or absence of species/OTUs
OTUs with non-zero counts in sample A tend to have a variety of different genetic features than OTUs with non-zero counts in sample B? → large distance

79
Q

Weighted UniFrac

A

How much is genetic change (in sample composition) associating with differences in relative abundance between two samples?
Highly abundant OTUs in sample A have very different genetic features than highly abundant OTUs in sample B? → large distance

80
Q

Rarefaction

A

treats each sample as a bag and each observed OTU count as a colored marble then we randomly draw marbles from these bags to adjust for sequencing depth
We choose a total ___________count (R) that is smaller than the smallest observed total sample count → then, for each sample, we randomly draw R OTUs from the set of observed OTUs in that sample without replacement → as a result each OTU receives a new adjusted count, so that the total sample count (R), and thus the underlying sequencing depth, is fixed to the same value in every sample
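The marble-drawing idea can be sketched for a single sample. The OTU names, counts and the depth `R` are all hypothetical.

```r
set.seed(123)

# Hypothetical OTU counts for one sample (total = 100).
otu_counts <- c(OTU_A = 50, OTU_B = 30, OTU_C = 20)
R <- 40                                     # chosen depth

# One "marble" per observed read, labeled with its OTU.
marbles <- rep(names(otu_counts), times = otu_counts)

# Draw R marbles without replacement and recount per OTU.
drawn <- sample(marbles, size = R, replace = FALSE)
rarefied <- table(factor(drawn, levels = names(otu_counts)))

print(rarefied)
sum(rarefied)                               # equals R by construction
```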

81
Q

PERMANOVA

A

tests for associations of 2-dimensional data matrices (samples are columns and rows are multiple outcome variables) with predictor variables based on dissimilarity scores
Compares two quantities:
Within-group variation (Sw) → sum of squared dissimilarity scores between subjects with the same predictor values
Between-group variations (SB) → sum of squared dissimilarity scores between subjects with different predictor values

82
Q

Read

A

reads are the primary output of most next-generation sequencing (NGS) experiments; a read is a short sequence of letters representing a subsequence of ribonucleotides that was observed in a given sample.

83
Q

Pairwise sequence alignments

A

the process of arranging two sequences of symbols in such a way that exposes their regions of similarity (usually a reference genome is used to ID where a read originated from)

84
Q

Expression

A

they are expressed as instructions on how to modify sequence 1 symbol by symbol to turn it into sequence 2 and vice versa
| = a match
* = substitution
_ = deletion/insertion

85
Q

heuristic alignments

A

trade accuracy for speed and often rely on assumptions about genomic structure

86
Q

exact alignments

A

always guaranteed to find the optimal alignment of 2 sequences, but can be very slow when one or both of the two sequences are very large
Types of exact pairwise alignment → perfect sequence matching, global alignment, local alignment

87
Q

Perfect sequence matching

A

only interested in finding out whether a shorter target sequence is a perfect subsequence of another, longer sequence (if the shorter sequence differs by even a single symbol from the longer sequence we do not consider them a match)
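In R, perfect sequence matching can be sketched with fixed-string search. The sequences below are made up.

```r
reference <- "ACGTTGACCTGA"   # hypothetical longer sequence
read_ok   <- "TGACC"          # occurs exactly in the reference
read_bad  <- "TGATC"          # differs by one symbol

grepl(read_ok, reference, fixed = TRUE)    # TRUE: perfect match found
grepl(read_bad, reference, fixed = TRUE)   # FALSE: one mismatch is enough to fail

# Start position of the match in the reference.
regexpr(read_ok, reference, fixed = TRUE)[1]
```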

88
Q

Global Alignment

A

aka “End-to-end” alignments, impose that all symbols in both sequences must be incorporated into the alignment and do not allow subsequences of poor similarity to be omitted; this is a powerful tool to contrast differences between two sequences of roughly similar length. They struggle with identifying regions of high similarity in two sequences that are otherwise very different from each other, AND they struggle with identifying the optimal region in a very long sequence to which a small sequence maps or aligns with high similarity

89
Q

Needleman-Wunsch algorithm

A

gold standard (for global alignments), guaranteed to find optimal global alignments while also achieving efficient run-times due to a dynamic programming strategy, still widely used in modern research, “high-tech” but “easy” to implement and understand
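A compact sketch of the dynamic programming idea behind Needleman-Wunsch. The function name `nw_align` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not the scoring used in the lecture.

```r
nw_align <- function(s1, s2, match = 1, mismatch = -1, gap = -1) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- length(a)
  m <- length(b)
  score <- function(i, j) if (a[i] == b[j]) match else mismatch

  # Dynamic programming matrix (dpm); first row/column hold pure gap penalties.
  dpm <- matrix(0, nrow = n + 1, ncol = m + 1)
  dpm[, 1] <- (0:n) * gap
  dpm[1, ] <- (0:m) * gap
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      dpm[i + 1, j + 1] <- max(dpm[i, j] + score(i, j),  # diagonal
                               dpm[i, j + 1] + gap,      # from above
                               dpm[i + 1, j] + gap)      # from the left
    }
  }

  # Trace back from the bottom-right corner to the top-left corner.
  i <- n
  j <- m
  top <- character(0)
  bottom <- character(0)
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && dpm[i + 1, j + 1] == dpm[i, j] + score(i, j)) {
      top <- c(a[i], top); bottom <- c(b[j], bottom)
      i <- i - 1; j <- j - 1
    } else if (i > 0 && dpm[i + 1, j + 1] == dpm[i, j + 1] + gap) {
      top <- c(a[i], top); bottom <- c("-", bottom)
      i <- i - 1
    } else {
      top <- c("-", top); bottom <- c(b[j], bottom)
      j <- j - 1
    }
  }

  list(score = dpm[n + 1, m + 1],
       aligned1 = paste(top, collapse = ""),
       aligned2 = paste(bottom, collapse = ""))
}

res <- nw_align("GCATGCG", "GATTACA")
print(res)
```

When several alignments share the optimal score, this sketch returns only one of them (it prefers the diagonal move during the traceback).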

90
Q

Local Alignment

A

aim to ID the most similar regions within sequences; they dynamically decide how large subsequences within the two target sequences should be in order to maximize local similarity; they zoom in to one part of the first sequence and one part of the 2nd sequence to find subsequences that are similar enough.

91
Q

Smith-Waterman (SW) algorithm (1981)

A

gold standard for local alignments, guaranteed to ID the optimal local alignment and is a direct extension of the Needleman-Wunsch algorithm (only needs a few small adjustments)

92
Q

Changes made to NW to make SW

A

1) Initialize the first row and first column of the dpm with the value 0 in each cell
2) Initialize the first row and first column of the pointer matrix with the value 8 in each cell (the pointer value 8 = “no-previous-position” and if the pointer value contains the value 8 then the optimal path traversing position (i,j) does not have a previous position, in other words any optimal path including position (i,j) has to start at said position)
3) When moving through the dpm step by step and calculating scores for dpm[i,j], a fourth option (starting a new path at this position) is added on top of coming from the diagonal, left and up directions
4) If the best possible similarity score of these 3 directions is less than 0 then we assign dpm [i,j] to be 0 and save the value 9 at coordinate (i,j) of the pointer matrix
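The adjustments above can be sketched as follows. The function name `sw_align` and the scoring values (match +2, mismatch -1, gap -1) are assumptions for illustration. Instead of storing pointer codes, this sketch recovers the path by re-checking scores and stops at the first cell with score 0, which plays the role of the “no-previous-position” pointer.

```r
sw_align <- function(s1, s2, match = 2, mismatch = -1, gap = -1) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- length(a)
  m <- length(b)
  score <- function(i, j) if (a[i] == b[j]) match else mismatch

  # First row and first column stay 0; no cell is allowed to drop below 0.
  dpm <- matrix(0, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      dpm[i + 1, j + 1] <- max(0,                         # start a new path here
                               dpm[i, j] + score(i, j),   # diagonal
                               dpm[i, j + 1] + gap,       # from above
                               dpm[i + 1, j] + gap)       # from the left
    }
  }

  # Start the traceback at a highest-scoring cell (the first one found).
  best <- which(dpm == max(dpm), arr.ind = TRUE)[1, ]
  r <- best[["row"]]
  k <- best[["col"]]
  top <- character(0)
  bottom <- character(0)
  while (dpm[r, k] > 0) {
    if (dpm[r, k] == dpm[r - 1, k - 1] + score(r - 1, k - 1)) {
      top <- c(a[r - 1], top); bottom <- c(b[k - 1], bottom)
      r <- r - 1; k <- k - 1
    } else if (dpm[r, k] == dpm[r - 1, k] + gap) {
      top <- c(a[r - 1], top); bottom <- c("-", bottom)
      r <- r - 1
    } else {
      top <- c("-", top); bottom <- c(b[k - 1], bottom)
      k <- k - 1
    }
  }

  list(score = max(dpm),
       aligned1 = paste(top, collapse = ""),
       aligned2 = paste(bottom, collapse = ""))
}

# The two sequences only share the region "ACGT".
res <- sw_align("TTTACGTAAA", "GGGACGTCCC")
print(res)
```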

93
Q

Things the SW accomplishes

A

Terminate a path at any point if moving forward can only lead through regions with a large number of mismatches
Allow optimal paths to start at any appropriate point in the dpm if doing so maximizes local similarity
Allow multiple optimal paths with the same degree of similarity to be identified at the same time

94
Q

differences in local vs global alignment

A

1) Any alignment can be represented as a path through the dpm
2) In contrast to global alignments, an alignment of two sub-sequences (a local alignment) doesn’t have to start at the top left corner (it can start at a different cell) and doesn’t have to end at the bottom right corner (it can end at a different cell)

95
Q

Scoring Systems

A

both global and local alignments can extend and adjust their scoring systems to match other research settings and ID different types of similarity patterns
In aligning AA sequences to evaluate proteins it is no longer meaningful to just consider matches and mismatches
Why → there are a lot of ways to mismatch or substitute symbols for each other, certain substitutions correspond to more similar sequences than others (either evolutionarily or structurally) and certain matches correspond to higher similarity than others
TAKE HOME MESSAGE: before performing pairwise alignments researchers should carefully consider the alignment algorithm and scoring system and base their choices on best practice guidelines to make sure they are identifying the correct types of similarity patterns

96
Q

Multiple Sequence Alignment (MSAs)

A

are very powerful (are able to relate two sequences to each other that would otherwise appear to be very different through their shared relatedness to other sequences), compares multiple (3 or more) sequences at the same time, are able to ID subsequences that are highly similar among multiple candidates in the set of aligned sequences and often try to ID a consensus sequence among a set of related sequences (the consensus sequence is constructed so that on average all aligned sequences show the smallest difference from it which allows researchers to analyze variability of individual sequences from a shared group/family consensus)
Uses: to compare DNA, RNA or protein sequences that are suspected to be evolutionarily related (help quantify their degree of relatedness), identify homologous genes that trace back to a common ancestor and identify conserved elements in genomic regions (for example: transcription factor binding sites)

97
Q

Exact global MSAs

A

can be obtained through a dynamic programming strategy (analogously to the NW algorithm), but instead of using a 2-dimensional dpm this strategy uses an M-dimensional dpm when aligning M sequences; as a consequence exact global alignments become very slow when M is moderately large or the aligned sequences are moderately long
NOTE: in most research scenarios heuristic MSA methods are the only feasible option and there are many different types of heuristic MSA methods (each relies on different assumptions and incorporates different pieces of external info)

98
Q

Progressive MSA algorithms

A

most commonly used strategy in research
Examples: ClustalW, Clustal Omega, T-Coffee, PSAlign
Steps → they work by performing the following 5 steps
1) Starts by performing all possible pairwise alignments between the sequences to be compared
2) Based on the pairwise degrees of similarity sequences are organized into a tree that represents their relatedness (where the closer they are the more related they are)
3) Starting with the alignment of the two most similar sequences, the MSA is then progressively updated by incorporating one sequence at a time
4) Which sequence is incorporated in each step is determined by a rule that systematically collapses the “relatedness tree”
5) After all sequences have been incorporated the MSA is finished

99
Q

Database

A

a collection of electronically stored data
Contains: The raw information itself (binary digits, numbers, text, etc), Info about the structure of different pieces of data, Info about the relationships between different pieces of data

100
Q

Database Importance

A

Whenever data is accessed, generated or modified in one or more of the following ways a DBS can become valuable: over long periods of time, by many different entities, with different permissions and ownerships, in large volumes
Without adequate care the complexity arising from these scenarios can quickly start to cause problems such as data safety concerns, faulty data and physical disk space limitations
Since any user can only interact with the database through the DBMS we can control operations on the data very carefully and minimize or prevent problems and unintended consequences
Frequently encountered in biomedical research when storing high data volumes (NGS data, large scale projects,…), accessing experimental data from public databases (EMBL, GenBank, GEO, KEGG,…), accessing and recording PHI
Technical knowledge about databases can help researchers to better interact with the various data sources, to better organize their own experimental data, and to design experiments and data input forms that meet study requirements and are sufficient to answer research questions

101
Q

Database Management System (DBMS)

A

the primary software system through which users create, modify and request info from the database

102
Q

Database system (DBS)

A

the union of the database, the DBMS and other associated software that is interacting with the database
NOTE: The DBS is often informally referred to as “the database”

103
Q

Database language

A

a programming language that allows users to define operations that they would like the DBMS to perform
NOTE: in order for a database language to be applicable to a DBS it has to be supported by the DBMS and assume the same structure as the database

104
Q

Client-Server Architecture

A

there are several possible architectures that a DBS can use (each with different strengths and weaknesses)
Many modern architectures follow the “client-server model”
Server side: a computer, computing cluster or cloud system that stores the database and runs the DBMS software
Client side: an often remotely located computer or terminal communicating with the server side through a network or the internet
A user uses client-side software to send requests to the server side. On the server the DBMS handles requests, communicates with the users and performs operations on the database

105
Q

Relational databases

A

most common type of database, widely applied throughout all industries, serve as a great model to understand the challenges any type of modern database has to be able to address and the benefits they provide

106
Q

Structured query language (SQL)

A

the most commonly used database language for relational databases
Examples of relational DBMS that primarily use SQL: “MS SQL Server”, “IBM DB2”, “Oracle”, “MySQL”, and “Microsoft Access”

107
Q

Relational database model

A
Data is stored and presented as a collection of "relations"
This model makes it possible to create a wide array of different data structures including one-to-one, one-to-many and hierarchical data structures
Relation: a table with a unique name that stores a set of tuples (rows) that share the same types of attributes (columns)
Each relation has to contain a primary key → this is either an attribute or a combination of multiple attributes that uniquely identifies each tuple in the relation; the primary key of a specific tuple is often chosen to be a unique identification number or character string
Each attribute (column) satisfies the following: has a name that is unique within the relation, has a single well defined domain/data type (number, text, object...), the position of the attribute does not carry info
Each tuple (row) satisfies the following: carries info represented by values of attributes belonging to a specific entity, the position of the tuple does not carry information, each tuple is unique; there can be no duplicate tuples
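Sketch (not from the lecture): a relation with a primary key, using SQLite through Python's built-in sqlite3 module as a stand-in DBMS. The table and column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.execute("""
    CREATE TABLE subjects (
        subject_id INTEGER PRIMARY KEY,  -- uniquely identifies each tuple
        name       TEXT NOT NULL,        -- each attribute has one declared type
        birth_year INTEGER
    )
""")
con.execute("INSERT INTO subjects VALUES (1, 'Ada', 1980)")
con.execute("INSERT INTO subjects VALUES (2, 'Ben', 1975)")

# A second tuple with an existing primary key value is refused by the DBMS
try:
    con.execute("INSERT INTO subjects VALUES (1, 'Cy', 1990)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = con.execute("SELECT name FROM subjects ORDER BY subject_id").fetchall()
print(duplicate_rejected, rows)
```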
108
Q

Relational Database Model Rules

A

Non-primary key attribute cells may contain missing values (can denote this as NA which is also used in R to show missing data)
Data relationships are represented by foreign keys (attributes that refer to primary keys in other tables)
A relation may contain anywhere from none up to arbitrarily many foreign keys
If a foreign key value is not missing it has to reference an existing primary key value
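Sketch (not from the lecture): the foreign key rules above, again with SQLite via sqlite3 as a stand-in DBMS. Table and column names are hypothetical. SQLite only enforces foreign keys when the `foreign_keys` pragma is switched on.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
con.execute("CREATE TABLE treatments (treatment_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        treatment_id INTEGER REFERENCES treatments(treatment_id)  -- foreign key
    )
""")
con.execute("INSERT INTO treatments VALUES (10, 'MRI')")
con.execute("INSERT INTO orders VALUES (1, 10)")    # references an existing key
con.execute("INSERT INTO orders VALUES (2, NULL)")  # a missing value is allowed

# A non-missing foreign key must reference an existing primary key value
try:
    con.execute("INSERT INTO orders VALUES (3, 99)")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

n_orders = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(dangling_rejected, n_orders)
```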

109
Q

Database relations versus views

A

Databases are constructed to minimize problems, optimize performance and save storage space
This solves the following problem: if the “orders” relation incorporated the full treatment and subject info for every single subject, there would be a large amount of redundancy that unnecessarily increases the required storage space
Referring to treatments and subjects by foreign keys is a much more efficient use of physical disk space
In addition, incorporating the full information about treatment and subject makes it easy for inconsistencies to arise (ex: if a subject’s name was misspelled in one of the tuples in the table)
However, often the corresponding tables representing the relations stored in a database are not of primary interest to end-users
Most users tend to be interested in combinations of subsets of database relations

110
Q

View

A

a table generated from one or more relations in the database, that contains some or all of their content
Users can request views of the database from the DBMS, which will fetch and assemble the respective pieces of information from the raw database
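Sketch (not from the lecture): a view assembled from three relations, using SQLite via sqlite3. The “orders”, subject and treatment tables follow the example mentioned in the previous card, but their columns and contents here are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE subjects   (subject_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE treatments (treatment_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        subject_id   INTEGER REFERENCES subjects(subject_id),
        treatment_id INTEGER REFERENCES treatments(treatment_id)
    );
    INSERT INTO subjects   VALUES (1, 'Ada'), (2, 'Ben');
    INSERT INTO treatments VALUES (10, 'MRI'), (11, 'X-ray');
    INSERT INTO orders     VALUES (100, 1, 10), (101, 2, 10), (102, 1, 11);

    -- The view stores no data of its own; it is assembled on request
    CREATE VIEW order_details AS
        SELECT o.order_id, s.name AS subject, t.name AS treatment
        FROM orders o
        JOIN subjects s   ON s.subject_id = o.subject_id
        JOIN treatments t ON t.treatment_id = o.treatment_id;
""")
view_rows = con.execute("SELECT * FROM order_details ORDER BY order_id").fetchall()
for row in view_rows:
    print(row)
```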

111
Q

Data Anomalies

A

problematic situations that can arise while a DBS is operating and info in the database is updated, these often occur due to flaws in the design of a database

112
Q

Insertion Anomaly

A

a situation in which new data can only be added to the database by introducing missing values or otherwise it cannot be added at all
Ex: if we want to add a new medical procedure that will be offered in the future to the database, we have to do so by assigning missing values to the subject info attributes
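Sketch (not from the lecture): an insertion anomaly in a single flat table, using SQLite via sqlite3. The table design is a deliberately poor, hypothetical one in which the subject is part of the primary key, so a procedure without any subject cannot be stored at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE flat (
        subject        TEXT NOT NULL,
        procedure_name TEXT NOT NULL,
        PRIMARY KEY (subject, procedure_name)
    )
""")
con.execute("INSERT INTO flat VALUES ('Ada', 'MRI')")

# A future procedure has no subjects yet, so the subject value would be missing
try:
    con.execute("INSERT INTO flat VALUES (NULL, 'PET')")
    new_procedure_stored = True
except sqlite3.IntegrityError:
    new_procedure_stored = False

print(new_procedure_stored)
```

Dropping the NOT NULL requirement would let the row in, but only by introducing a missing value, which is the other half of the definition.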

113
Q

Deletion Anomaly

A

a situation in which deleting data that represents a certain type of info necessarily will also delete data that represents a different type of info
Ex: if we want to remove a medical procedure that is no longer offered, then info about all subjects that only receive this procedure will also be deleted
NOTE: we want to delete the procedure but not the subjects
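Sketch (not from the lecture): a deletion anomaly in a hypothetical flat table that stores subjects and procedures together (SQLite via sqlite3).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flat (subject TEXT, procedure_name TEXT);
    INSERT INTO flat VALUES ('Ada', 'MRI'), ('Ada', 'X-ray'), ('Ben', 'X-ray');
""")

# Removing the procedure also removes every trace of subjects who only had it
con.execute("DELETE FROM flat WHERE procedure_name = 'X-ray'")
remaining_subjects = {row[0] for row in con.execute("SELECT subject FROM flat")}
print(remaining_subjects)
```

Ben only received the deleted procedure, so he disappears from the database together with it.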

114
Q

Modification/Update anomaly

A

a situation in which identical pieces of info are expressed in multiple tuples and updating only some of them leads to inconsistencies, making it hard to verify the real value of the data
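Sketch (not from the lecture): an update anomaly in a hypothetical flat table where a subject's name is repeated in every order (SQLite via sqlite3).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flat (order_id INTEGER PRIMARY KEY, subject_id INTEGER, subject_name TEXT);
    INSERT INTO flat VALUES (1, 7, 'Ada Smith'), (2, 7, 'Ada Smith');
""")

# Only one of the two copies of the name is updated
con.execute("UPDATE flat SET subject_name = 'Ada Jones' WHERE order_id = 1")
names_for_subject_7 = {
    row[0] for row in con.execute("SELECT subject_name FROM flat WHERE subject_id = 7")
}
print(names_for_subject_7)
```

The same subject now carries two different names and the table alone cannot tell which one is correct.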

115
Q

Database Design Goal

A

It is able to store the desired info and represent all relevant relationships between different pieces of data
To achieve this a database specialist first collaborates with a client or domain expert to create a draft of the database
It prevents serious complications before they can occur
Achieved through database normalization in which the structure of the unnormalized database is modified step by step to meet more and more stringent criteria

116
Q

First normal Form (1NF)

A

Specifies the following: every attribute is atomic (i.e. it contains a single value of a specified data type and no group/list of values); this is already a formal requirement for relational databases
To achieve: define a primary key consisting of two attributes
Undesirable properties: certain attributes depend on others whereas others do not (this is a glaring source of redundancy and potential error)

117
Q

Second Normal Form (2NF)

A

Specifies the following: all non-key attributes (in each relation) are functionally dependent on the entire primary key, the database satisfies 1NF
NOTE:
Attributes only depend on the primary key
Three attributes (A, B, C) have a transitive dependency if A depends on B and B depends on C (source of redundancy and potential error)

118
Q

Third Normal Form (3NF)

A

Specifies the following: there are no transitive dependencies among attributes, the database satisfies 2NF
Structure resembles the multiple tables presented in previous lecture
NOTE:
A relational database is considered normalised if it satisfies 3NF
This is generally considered a well designed database
Free of most anomalies
There are also additional, more stringent normal forms
Ex: BCNF (requires 3NF), 4NF (requires BCNF), 5NF (requires 4NF)
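Sketch (not from the lecture): spotting a transitive dependency in a small flat table and removing it by splitting the table in two. The helper `holds` and all data are hypothetical, and the check only shows that a dependency is consistent with these sample rows, not that it is true of the data in general.

```python
def holds(rows, determinant, dependent):
    """True if each value of `determinant` maps to exactly one value of `dependent`."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        if seen.setdefault(key, value) != value:
            return False
    return True


flat = [
    {"order_id": 1, "subject_id": 7, "subject_name": "Ada"},
    {"order_id": 2, "subject_id": 7, "subject_name": "Ada"},
    {"order_id": 3, "subject_id": 8, "subject_name": "Ben"},
]

# subject_name depends on subject_id, which depends on the key order_id
transitive = holds(flat, "order_id", "subject_id") and holds(flat, "subject_id", "subject_name")

# Splitting the table removes the transitive dependency:
# orders keeps only a foreign key, subjects stores each name exactly once
orders = [{"order_id": r["order_id"], "subject_id": r["subject_id"]} for r in flat]
subjects = sorted({(r["subject_id"], r["subject_name"]) for r in flat})
print(transitive, subjects)
```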

119
Q

Benefits of normalization

A
  • Faster turn-around for analysis personnel and collaborators (makes it easier to enter data into existing databases and to process data with analysis software)
  • Makes it more difficult for inconsistencies to occur
  • For tables that store raw data from your own experiment that will be submitted to a collaborator, do the following:
  • Define/pick a column that serves as the primary key; the values in this column should be able to uniquely identify each row
  • If there is no obvious choice for a primary key, it might be a good idea to create unique ID numbers for each row in a table
  • If a raw data table contains sets of columns that represent different types of info, it might be valuable to split the table apart according to these types, especially if they are referenced in other tables
  • Organize tables so there is no duplicated info appearing in multiple tables
  • Whether or not these objectives are worth the effort will depend on the number of affected tables and columns and also the total data volume
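Sketch (not from the lecture): checking whether a column could serve as a primary key for a raw data table, and creating unique ID numbers when there is no obvious choice. The helper names and the sample table are hypothetical.

```python
def is_unique_key(rows, column):
    """True if `column` has no missing values and no repeated values."""
    values = [row[column] for row in rows]
    return None not in values and len(set(values)) == len(values)


def add_row_ids(rows, id_column="row_id"):
    """Return copies of the rows with a running ID number as an extra column."""
    return [{id_column: i, **row} for i, row in enumerate(rows, start=1)]


samples = [
    {"sample": "liver", "replicate": 1, "reading": 0.42},
    {"sample": "liver", "replicate": 2, "reading": 0.40},
    {"sample": "heart", "replicate": 1, "reading": 0.42},
]

print(is_unique_key(samples, "sample"))   # no single column identifies a row here
with_ids = add_row_ids(samples)
print(is_unique_key(with_ids, "row_id"))
```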
120
Q

Public Bioinformatics Resources

A

A key factor leading to the birth of bioinformatics was the increased volume and complexity of data that biomedical researchers have had to analyze to answer their research questions since the late 20th century.
Since its inception, bioinformatics has heavily utilized and benefitted from databases that catalogue and share experimental data
Today large-scale public database systems containing info about “omics data” are a cornerstone of modern clinical and biological research
Strengths:
Facilitate fast access to stored info
Provide many online tools specialized in performing comparisons and analyses by directly communicating with different databases

121
Q

The National Center for Biotechnology Information (NCBI)

A

houses a wide array of frequently used online resources. All of the provided components can either be accessed through the search bar at the top or by navigating through the blue menu options on the left
Examples:
Medline: widely known in the biomedical field for its primary search engine “PubMed”
The database of Genotypes and Phenotypes (dbGaP): contains “data and results from studies that have investigated the interaction of genotype and phenotype in humans”
dbSNP: contains “a broad collection of simple genetic polymorphisms (SNPs, small indels, MNPs)”
ClinVar: collects info about genetic variation and its relationship to human health
GenBank
RefSeq
The Gene Expression Omnibus database (GEO)
Etc.

122
Q

Why is NCBI such a powerful tool?

A

Its Entrez search system

123
Q

Entrez

A

cross-database search system that allows searching all public NCBI databases simultaneously, allows searching for both individual character strings and logical combinations of multiple character strings, offers filters and search refinement options to continuously narrow searches down to a desired target, incorporates many data visualization features, search queries are accessible through both web interfaces and direct interfaces with programming languages
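Sketch (not from the lecture): one way a programming language can talk to Entrez is NCBI's E-utilities web service, which is assumed here to be the kind of direct interface meant above. The code only builds an ESearch request URL for a logical combination of search terms and sends nothing over the network. The function name and the example query are made up.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database and one search term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Logical combination of two character strings, each limited to a search field
url = esearch_url("nucleotide", "BRCA2[Gene] AND Homo sapiens[Organism]")
print(url)
```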

124
Q

GenBank

A

an annotated collection of all publicly available DNA sequences, houses submissions made by various entities (including labs and sequencing projects), part of the International Nucleotide Sequence Database Collaboration (INSDC), which means that every day it receives both data submitted directly to it and nucleotide data from the DNA Data Bank of Japan (DDBJ) and the European Nucleotide Archive (ENA), this is what the majority of searches will target when using the “nucleotide” search option of the Entrez search bar

125
Q

RefSeq

A

owned by NCBI and not part of INSDC, short for reference sequence, aims to provide non-redundant, curated data representing our current knowledge of known genes, houses well annotated reference sequences of genomes, transcripts and proteins

126
Q

Gene Expression Omnibus (GEO)

A

originally designed to be a public data archive for high-throughput gene expression datasets obtained from microarray and RNA sequencing studies but later started to accept additional types of experimental data that relate to gene expression and gene regulation (DNA methylation, genome binding, protein profiling, chromosome conformation, genome copy number variation)
Follows the MIAME and MINSEQE guidelines which distil a study down to the minimum info required to describe it (MIAME = minimum info about a microarray experiment, MINSEQE = minimum info about a high-throughput nucleotide sequencing experiment)

127
Q

GEO2R

A

an interactive web tool that performs comparative analyses between groups of processed samples using the “GEOquery” and “limma” R packages from the Bioconductor project

128
Q

GEO interfacing capabilities with R

A

GEOquery allows users to directly download and import GEO datasets into R with a single command
In combination with other Bioconductor packages (such as limma, DESeq2, etc.) this allows for creating complete analysis workflows pooling data from various studies natively within R

129
Q

European Molecular Biology Laboratory (EMBL)

A

“Europe’s flagship laboratory for the life sciences”
intergovernmental organization
27 supporting countries, most European, but Australia and Argentina also hold associate member status

130
Q

European Bioinformatics Institute (EMBL-EBI)

A

part of EMBL that focuses on performing bioinformatics research
develops and provides bioinformatics services, including a large number of public databases and bioinformatics tools that can be accessed online via their website

131
Q

Databases Associated with EBI

A

•UniProt: information about protein sequences & biological function of proteins
•PDBe: “the European resource for the collection, organization and dissemination of data on biological macromolecular structures”
•Expression Atlas: large database holding information about gene expression signatures and differential expression
•Ensembl
•and many more …

132
Q

Online tools provided on EBI website

A
  • ClustalOmega
  • BLAST
  • HMMER: alignments geared towards identifying homologous protein or nucleotide sequences
  • Ensembl Genome Browser
  • and many more …
133
Q

Ensembl genome database project

A

Project of both EBI and the Wellcome Trust Sanger Institute
•Public DBS containing curated genetic consensus information
•Analogously to NCBI’s RefSeq, it houses well annotated reference sequences of genomes, transcripts and proteins
•“Automatically annotates genomes” of specific model organisms and “integrates this annotation with other available biological data”
•High degree of transparency: all data and source code used to generate annotations is publicly available
•Known for high quality annotations and the ENSEMBL genome browser

134
Q

Ensembl IDs

A

identifiers used by Ensembl to refer to different types of information; can be understood as primary keys for their database entries; used in many studies outside of Ensembl
•Homo sapiens format: “ENS objectType numberString . version” (written without spaces)
•objectType: consists of one or more letters denoting different types of data
•numberString: a string of numeric characters
•version: a version number
NOTE: Both Ensembl and RefSeq IDs are frequently used. The same gene could be referred to with either type of ID
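Sketch (not from the lecture): splitting an ID in the Homo sapiens format described above into its parts with a regular expression. The pattern is an assumption based only on that description (IDs of other species carry extra letters and are not handled), and the function name and example strings are for illustration only.

```python
import re

# "ENS", then letters for the object type, then digits, then an optional ".version"
HUMAN_ENSEMBL_ID = re.compile(
    r"^ENS(?P<object_type>[A-Z]+)(?P<number>\d+)(?:\.(?P<version>\d+))?$"
)


def parse_ensembl_id(text):
    """Return the parts of a human-format Ensembl ID, or None if it does not match."""
    match = HUMAN_ENSEMBL_ID.match(text)
    if match is None:
        return None
    version = match.group("version")
    return {
        "object_type": match.group("object_type"),
        "number": match.group("number"),
        "version": int(version) if version else None,
    }


parsed = parse_ensembl_id("ENSG00000139618.15")
print(parsed)
print(parse_ensembl_id("NM_000059.4"))  # a RefSeq-style ID does not match
```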

135
Q

Genome Browsers

A

These browsers allow researchers to dynamically visualize genomes alongside annotated data; they display nucleobase positions of chromosomes on the X-axis and use the Y-axis to display information as a function of base position.

Annotations may include:
•Genes (both validated and predicted) and transcript variants
•Structural and regulatory elements (both validated and predicted)
•Read counts, expression levels and methylation levels
•Information about sequence variation in different populations
•…and more

136
Q

Weakness of genome browsers

A

They can be fairly overwhelming when starting to work with them (more or less intuitive based on the user’s preferences)

137
Q

Three most extensive genome browsers

A

Ensembl, NCBI GDV and UCSC Genome Browser

138
Q

UCSC Genome Browser

A

is also hosted by the UCSC Genomics Institute
•Both available as an online browser and a downloadable desktop application
•Offers a wide variety of annotation tracks sourced from many different databases
•Allows uploading data and displaying it in “custom tracks” in various formats