HC 9 - Information Management: Public Biological Databases Flashcards

Lecture 9 (54 cards)

1
Q

Why is it important to visualize and store knowledge and data in a structured way?

A

-Reuse
-New hypotheses and experiments

2
Q

Types of databases for omics and clinical data

A

In-house and public

3
Q

Why are databases good value for money, but difficult to establish and fund?

A

-Databases ensure that expensive data are not lost (repeating experiments is costly)
> but they are hard to fund, because a database by itself yields no new information or insights
-Database maintenance costs money

4
Q

Information management

A

-Collection and storage
-Management > organisation, annotation, curation, integration
-Standardization > minimum information standards, FAIR principle
-Distribution and sharing > human data cannot be shared freely (privacy)
-GDPR

5
Q

What is a database?

A

-Computational archive to store and organize data > to easily query/retrieve data
-Consists of hardware and software
-To organize data in structured records
-Allows discovery of new information (data mining, machine learning, artificial intelligence)

6
Q

FAIR principle

A

-Findable
-Accessible
-Interoperable
-Reusable

7
Q

Query

A

A question (request for data) posed to a database

8
Q

For structure, what is important for a database?

A

Each column holds only one kind of information

9
Q

Why are commas important in databases?

A

In comma-separated (CSV) files they separate the columns

10
Q

Each field contains a maximum of … categories of data

A

1
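
The three structure cards above (one kind of information per column, commas as column separators, one category per field) can be illustrated with a small sketch. The table content and column names below are made up for illustration only.

```python
import csv
import io

# Hypothetical comma-separated table: every column holds one kind of
# information and every field holds exactly one value.
raw = (
    "gene_symbol,chromosome,length_bp\n"
    "geneA,1,1200\n"
    "geneB,7,3400\n"
)

# The comma tells the parser where one column ends and the next begins.
rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())

# Because length_bp only ever contains a number, it can be converted safely.
lengths = [int(row["length_bp"]) for row in rows]

print(columns)
print(lengths)
```

If a field mixed two categories (for example a chromosome and a length in the same cell), the conversion to a number would fail, which is why one category per field matters.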

11
Q

Relational database

A

Linked tables
> search for labels (keys) that appear in several tables
> tables are linked through these shared labels
> more complex databases like Gene Ontology
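
A minimal sketch of linked tables, using Python's built-in sqlite3 module. The table names, column names and rows are hypothetical; the point is only that the shared label (here gene_id) links the two tables.

```python
import sqlite3

# In-memory database with two tables that share the label gene_id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT);
    CREATE TABLE annotation (gene_id TEXT, term TEXT);
    INSERT INTO gene VALUES ('G1', 'geneA'), ('G2', 'geneB');
    INSERT INTO annotation VALUES
        ('G1', 'DNA binding'),
        ('G1', 'nucleus'),
        ('G2', 'ATPase activity');
    """
)

# The query follows the shared gene_id from one table to the other.
rows = conn.execute(
    "SELECT gene.symbol, annotation.term "
    "FROM gene JOIN annotation ON gene.gene_id = annotation.gene_id "
    "WHERE gene.symbol = ? ORDER BY annotation.term",
    ("geneA",),
).fetchall()

print(rows)
conn.close()
```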

12
Q

Types of databases

A

Primary
-Consist of raw data
-Sequences, structures
Secondary (composite)
-Data from analysis or treatment of the primary data
-Protein families, metabolic pathways
Literature database
-Not biological data
-PubMed (NCBI)
Online book, like GeneReviews
-expert authored
-peer-reviewed disease descriptions

13
Q

Non-redundant database

A

All sequences uploaded for a gene are reduced to one consensus sequence
> like RefSeq: one sequence per gene
> search on subset
> duplicate entries are removed
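
A rough sketch of the simplest step towards a non-redundant set: removing exact duplicate entries. It does not build a consensus sequence, and the gene names and sequences are made up.

```python
# Hypothetical redundant set: several uploaded entries for the same gene.
entries = [
    ("geneA", "ATGC"),
    ("geneA", "ATGC"),  # exact duplicate of the first entry
    ("geneA", "ATGA"),  # different sequence for the same gene
    ("geneB", "GGCC"),
]


def remove_duplicates(records):
    """Keep the first occurrence of every (gene, sequence) pair."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


unique_entries = remove_duplicates(entries)
print(len(entries), "entries reduced to", len(unique_entries))
```

Note that geneA still has two entries afterwards; a resource with one sequence per gene would need an extra curation step to choose or derive a single reference sequence.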

14
Q

Redundant database

A

Multiple sequences for the same gene

15
Q

The amount of data stored in databases doubles every … and therefore:

A

Every 18 months > queries need to be repeated, because earlier results become outdated
-Therefore, check databases regularly
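
As a worked example of the doubling time: assuming steady exponential growth (a simplification), the amount of data after t months is 2^(t/18) times the starting amount. The function name is hypothetical.

```python
def growth_factor(months, doubling_time_months=18.0):
    """Factor by which the data volume has grown after `months` months,
    assuming it doubles every `doubling_time_months` months."""
    return 2 ** (months / doubling_time_months)


# After one doubling time the volume is 2x, after two doubling times 4x.
print(growth_factor(18))
print(growth_factor(36))
```

So a query that was run three years ago searched only about a quarter of the data that is available now.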

16
Q

Most searched organisms in GenBank are ..

A

Mammals, model organisms, fish, plants, microbes

17
Q

Types of queries in GenBank and results

A

-Search sequence by name
-Search sequence by similarity: enter sequence
Results
-Accession code
-Version
-Unique identifier
-Comment (free text)
-Sequence features
-Links to other databases

18
Q

The primary accession code is …

A

A unique, unchanging identifier assigned to each GenBank sequence record; use it when citing information from GenBank
> can also be used in other databases
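
GenBank identifiers are commonly written as "accession.version": the accession stays the same, while the version number increases when the sequence changes. A small sketch of splitting the two parts; the identifier used here is made up and the function name is hypothetical.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier has no '.version' suffix.
    """
    accession, separator, version = identifier.partition(".")
    return accession, (int(version) if separator else None)


# Made-up identifier in the accession.version style.
print(split_accession("AB123456.2"))
print(split_accession("AB123456"))
```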

19
Q

Publication rules for scientific journals

A

-Describe where the data can be found in the database
> Data deposit (the sequence)
Because
-The analysis can be validated by other researchers, possibly with new methods
-But sometimes the description is incomplete, or the data are already processed rather than raw

20
Q

FTP client

A

Program for downloading files from a database server (FTP = File Transfer Protocol)

21
Q

E-utilities

A

-Set of 8 server-side programs
-Interface to the Entrez query and database system of NCBI
> can be used by typing a URL in a fixed format
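
A sketch of what such a URL looks like, built with the standard library only. It assumes the ESearch program with its `db` and `term` parameters; the search term is just an example, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(program, **params):
    """Build the URL for one E-utility program, e.g. 'esearch'."""
    return f"{BASE}{program}.fcgi?{urlencode(params)}"


# Example: search the PubMed database for a free-text term.
url = eutils_url("esearch", db="pubmed", term="gene ontology")
print(url)
```

Pasting the printed URL into a browser (or fetching it from a script) would return the search result as XML.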

22
Q

How to find public databases?

A

-Large database providers: NCBI, EBI, SIB
-Nucleic Acids Research (NAR) database issue
-GeneCards
-Wikipedia list of biological databases

23
Q

Criteria for inclusion in NAR

A

-Thoroughly curated
-Of interest to a wide variety of biologists
-Comprehensiveness of coverage
-Degree of added value (because of manual curation)
-Maintained for long period of time

24
Q

XML output

A

Structured output format that can be processed by a computer to make the data readable
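
A minimal sketch of processing XML output with Python's standard library. The XML below is a hand-written, simplified sample in the style of a search result; the identifiers in it are made up.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample (not real database output).
xml_text = """
<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>
"""

root = ET.fromstring(xml_text)

# The tags name each piece of data, so a program can pick out exactly
# the values it needs.
count = int(root.findtext("Count"))
ids = [element.text for element in root.findall("./IdList/Id")]

print(count, ids)
```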

25
What is important to check with a database for reliability?
When was it last updated?
26
Parts of databases
-Raw data
> human genome (Ensembl)
> gene expression (Gene Expression Omnibus)
> protein sequences (UniProt)
> protein structures (PDB)
> compounds (ChEBI)
-High-level databases
> pathways (Reactome)
> protein interactions (STRING)
> Mendelian inheritance (OMIM)
27
GeneCards is a ...
Hub to other databases: easily retrieve information about specific genes and proteins
28
Minimal Information Standards: what is MIBBI?
Minimum Information for Biological and Biomedical Investigations
29
Six most critical elements contributing towards MIAME
>Data
-Raw data for each hybridization
-Final processed (normalized) data for the set of hybridizations in the study
>Metadata
-Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
-Experimental design, including sample-data relationships
-Sufficient annotation of the array
-Essential laboratory and data processing protocols
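
The six elements above can be treated as a checklist. A small sketch, assuming a submission is represented as a dictionary; the key names are hypothetical and not part of any official format.

```python
# Hypothetical key names for the six MIAME elements listed above.
MIAME_ELEMENTS = [
    "raw_data",
    "processed_data",
    "sample_annotation",
    "experimental_design",
    "array_annotation",
    "protocols",
]


def missing_miame_elements(submission):
    """Return the MIAME elements that are absent or empty in a submission."""
    return [key for key in MIAME_ELEMENTS if not submission.get(key)]


# Made-up, incomplete submission.
submission = {
    "raw_data": "scan_files.zip",
    "processed_data": "normalized.txt",
    "sample_annotation": "compound X, dose 10 uM",
    "protocols": "",
}

print(missing_miame_elements(submission))
```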
30
Which well known database encourages submitters to supply MIAME compliant data?
Gene Expression Omnibus (GEO)
31
GEO submission:
-Platform description (manufacturer)
-Fill in the raw data
-Describe how the samples relate to each other > GEO groups experiments
32
Why the FAIR data principles?
To enhance the reusability of data > specific emphasis on enhancing the ability of computers to automatically find and use data
33
FAIR data principles: Findable
Data should be easy to find for humans and computers, based on the metadata description
> (meta)data are assigned a globally unique identifier
> data are described with rich metadata
> metadata clearly and explicitly include the identifier of the data they describe
> (meta)data are registered or indexed in a searchable resource
34
FAIR: accessible
-(Meta)data are retrievable by their identifier using a standardized communication protocol
> the protocol is open, free and universally implementable
> the protocol allows for authentication and authorisation when required
-Metadata should remain accessible even when the data are no longer available > ask the authors for the data by email
35
Human data should be made accessible with the use of:
Contracts that protect privacy > this makes research harder
36
FAIR: interoperable
-Data can be combined with other datasets and computer systems
-Data interoperation is a non-trivial problem; the "I" will require the most creative effort
-(Meta)data use a formal, accessible, shared and broadly applicable language for knowledge representation
37
FAIR: reusable
-For use in future research and further processing
-Good annotation of the processing steps that were applied to the data to reach the conclusions
38
Challenge of nomenclature
How do you assign and maintain correct names of biological objects across databases?
> a gene may have several alternative names/symbols
> gene names are not always used consistently across organisms
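
A sketch of one common way to deal with alternative names: map every known alias to a single official symbol before comparing records. The lookup table is hand-made for illustration; "GENEX" and its aliases are hypothetical.

```python
# Hand-made alias table: official symbol -> alternative names.
OFFICIAL_TO_ALIASES = {
    "TP53": ["p53"],
    "GENEX": ["gx1", "GX-1"],  # hypothetical gene and aliases
}


def build_lookup(official_to_aliases):
    """Map every name (official or alias, case-insensitive) to the official symbol."""
    lookup = {}
    for official, aliases in official_to_aliases.items():
        for name in [official, *aliases]:
            lookup[name.upper()] = official
    return lookup


def normalize(name, lookup):
    """Return the official symbol for a name, or None if it is unknown."""
    return lookup.get(name.strip().upper())


lookup = build_lookup(OFFICIAL_TO_ALIASES)
print(normalize("p53", lookup))
print(normalize("gx-1", lookup))
print(normalize("unknown", lookup))
```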
39
Example of curation, annotation and provenance: Biocuration
Manually checking and correcting data
-Interpretation and integration of information relevant for biology
-Goals
> accurate and reliable representation of biological knowledge
> easily accessible
> basis for computational analysis
40
UniProt consists of....
-Swiss-Prot (manually reviewed)
-TrEMBL (automatically annotated, not manually reviewed)
41
Gene Ontology
Common language for annotation of genes
42
GO objectives
-Represent the categories used to classify specific parts of our biological knowledge
> biological process (network/pathway)
> molecular function (activity, function)
> cellular component (location, in which complex)
-Common knowledge applicable to any organism
-GO terms for gene annotation of any species > comparison across species
43
Examples of the three Gene Ontologies
-Molecular function: carbohydrate binding, ATPase activity
-Biological process: mitosis, purine metabolism
-Cellular component: nucleus, telomere, RNA pol II holoenzyme
44
GO-terms, GO-id, definitions
-Term: short name, like DNA binding transcription factor activity
-GO-id: unique identifier, shown together with the ontology it belongs to (e.g. molecular function)
-Definition: description as text
45
Ontology
-Vocabulary of terms
-Definitions
-Defined logical relationships between the terms
46
Ontology (network) structure
-Nodes: terms in the ontology
-Edges: relationships between concepts
47
Kinds of relationships in GO structure: hierarchical directed acyclic graph (arrows from specific to less specific)
-is a
-part of
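
A sketch of how such a directed acyclic graph can be walked from a specific term to its less specific ancestors. The terms and relationships below form a toy graph for illustration and are not copied from the real Gene Ontology.

```python
# Toy graph: each term points to its less specific parents,
# labelled with the kind of relationship.
PARENTS = {
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
    "chromosome": [("is_a", "organelle")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following is_a / part_of arrows upwards."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nuclear chromosome")))
```

Because a term can have more than one parent (here one via is_a and one via part_of), the structure is a graph rather than a simple tree.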
48
Ontology is the representation of something …
we know about
49
TCGA
The Cancer Genome Atlas
-Public omics data
-Contracts must be signed to use the data
50
OMIM
Knowledgebase of human genes and phenotypes
> Mendelian inheritance
> consequences of types of mutations
51
Database consistency: differences between databases
-Overlapping / complementary content
-Databases reflect the expertise and interest of the groups that maintain them
-Developed for different applications
52
Inconsistencies between databases
-In data
-In metadata
-In links between databases
53
Error propagation
Database errors can propagate to other databases or to the scientific literature
> this happens when erroneous information is used to annotate another biological entity (in another database)
54
Privacy (GDPR)
Not much public clinical data is available > regulation for data protection (contracts etc.)