Big Data Flashcards

Question 1

Q

what is big data?

Answer

A

refers to data sets too large or complex to process using traditional data processing methods
- large volumes of data, often comprising multiple data types
- there is substantial variation within the data which is complex to analyse
- integrative analysis of different types of big data reveals interactions between variables

Question 2

Q

who analyses big data?

Answer

A

computational methods and advanced statistics are used by bioinformaticians to analyse data

Question 3

Q

are big data experiments hypothesis-based or hypothesis-generating?

Answer

A

they are unbiased and hypothesis-generating
- they have huge power for discovery
- no need to choose and exclude markers in advance

Question 4

Q

where can big data be generated from?

Answer

A

DNA, RNA, protein molecules
cells, tissues
organisms

Question 5

Q

what are OMICs in big data?

Answer

A

genomics (DNA) and transcriptomics (RNA) - rely on sequencing of nucleic acids
- short read (Illumina) and long read (PacBio, Nanopore) sequencing
- RNA-seq
proteomics and metabolomics
- mass spectrometry
epigenomics
- ChIP-Seq, chromatin conformation

Question 6

Q

how can microscopy be used to generate big data?

Answer

A

high throughput imaging
fluorescent tagging in live cells
fixed cell staining
automated image analysis (machine learning/AI)

Question 7

Q

what big data can microscopy generate?

Answer

A

cell shape/cell type
subcellular protein localisation
cell differentiation
cell contractility and migration -> wound healing, sclerosis, metastasis
infection status
response to drugs

Question 8

Q

how can big data on human physiology/health be generated?

Answer

A

activity tracking
questionnaires
blood samples
whole body imaging
electronic health records

Question 9

Q

what knowledge does big data contribute to biology?

Answer

A

Development
Physiology
Drug safety and efficacy
Epidemiology – identifies relationships between environmental exposures / genetic predispositions and disease risk -> reduce exposures
Disease pathobiology – understand how interactions between exposures and predispositions affect health -> more effective diagnosis and treatment
Understanding of past events and prediction of future risks

Question 10

Q

what is transcriptomics?

Answer

A

studies gene expression and mRNA
- to determine the functional consequences of a change (e.g. a mutation or treatment) on the expression of every gene in the tissue, organ, cell type or developmental stage of interest

comparisons may be:
- wildtype vs mutant
- treated vs untreated
- untreated vs environmental change

Question 11

Q

what is an experimental strategy in transcriptomics? what steps does it involve?

Answer

A

Extract mRNA from whole tissue or cell population, convert to cDNA
Prepare a sequencing ‘library’ containing all cDNA molecules in each biological sample
Sequence on an Illumina Next Generation Sequencing (NGS) machine.
Run a series of computational steps (‘pipeline’ = quality control and normalising/standardising the data) and make statistical comparisons
cDNA counts reflect mRNA expression level

identify genes exhibiting differential expression in the compared cell types
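
The normalising and comparison steps can be sketched in a few lines. This is a minimal illustration only: the gene names and counts are made up, counts-per-million is just one simple normalisation, and the pseudocount of 1 is an assumption. Real pipelines use more sophisticated normalisation and statistics that account for biological replicates.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}

def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) per gene; the pseudocount avoids log(0)."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }

# Hypothetical cDNA counts for three made-up genes.
wildtype = {"geneA": 500, "geneB": 300, "geneC": 200}
# Same relative expression, but this sample was sequenced twice as deeply.
mutant = {"geneA": 1000, "geneB": 600, "geneC": 400}

wt_cpm = counts_per_million(wildtype)
mut_cpm = counts_per_million(mutant)
lfc = log2_fold_change(wt_cpm, mut_cpm)
```

Here the raw counts differ two-fold, but after normalising for sequencing depth every fold change is zero, which is why counts must be standardised before expression levels are compared.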

Question 12

Q

what plot can be used to display big data on transcriptomics?

Answer

A

volcano plot
- each dot represents a gene
- fold-change on the x-axis shows how much gene expression increases/decreases
- significance on the y-axis shows the statistical significance of the difference in gene expression
- red dots = downregulated genes
- green dots = upregulated genes
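
A rough sketch of how each gene is placed on a volcano plot. The cutoffs (2-fold change, p < 0.05), the gene names and the values are assumptions for illustration, not from the notes; real analyses usually apply a multiple-testing correction to the p-values first.

```python
import math

def volcano_point(log2_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot.

    x = log2 fold change, y = -log10(p-value), so highly significant
    genes sit near the top and strongly changed genes at the far left/right.
    """
    x = log2_fc
    y = -math.log10(p_value)
    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        label = "up"
    elif p_value < p_cutoff and log2_fc <= -fc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return x, y, label

# Hypothetical (log2 fold change, p-value) pairs for made-up genes.
genes = {"geneA": (2.5, 0.001), "geneB": (-1.8, 0.01), "geneC": (0.2, 0.6)}
points = {name: volcano_point(fc, p) for name, (fc, p) in genes.items()}
```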

Question 13

Q

what methods can help to interpret the consequences of gene expression changes?

Answer

A

gene ontology and biological pathway algorithms:
- These algorithms can be run on the data to interpret consequences of gene expression changes
- Differentially expressed genes are fed into algorithms which extract information from databases about the functions of those genes and summarise it
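
One common way such tools summarise function is an over-representation test: is a functional category found among the differentially expressed genes more often than expected by chance? The sketch below uses a hypergeometric test, which is an assumption here since the notes do not name the statistic; all numbers are made up.

```python
from math import comb

def enrichment_p_value(n_genome, n_annotated, n_selected, n_overlap):
    """Probability of seeing at least n_overlap annotated genes when
    n_selected genes are drawn at random from n_genome genes, of which
    n_annotated carry the annotation (upper tail of the hypergeometric)."""
    total = comb(n_genome, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_genome - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical toy numbers: 20 genes in total, 5 annotated with a pathway,
# 5 differentially expressed, 4 of which carry the annotation.
p = enrichment_p_value(n_genome=20, n_annotated=5, n_selected=5, n_overlap=4)
```

A small p-value suggests the pathway is over-represented among the differentially expressed genes. Real tools test many categories at once and therefore also correct for multiple testing.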

Question 14

Q

how can the transcriptome of 100s to 10,000s of individual cells be collected?

Answer

A

single cell RNA-seq:
1. Dissect tissue, treat with enzymes
2. Single cell suspension – contains a mixture of cell types from tissue
3. Prepare libraries and sequence the transcriptome of every cell

Question 15

Q

what plot can be used to display the transcriptome of thousands of individual cells? what do these plots give insights into?

Answer

A

UMAP plots:
- Each dot is a cell
- Close = similar, far away = more different
- Each colour marks ‘clusters’ of similar cells

Potential insights into:
- Which genes are expressed by particular cells
- Cell type-specific gene expression changes
- Cell lineage/differentiation trajectories
- Tissue composition changes

Question 16

Q

how can genetic causes of disease/disease-associated genes be identified?

Answer

A

Genome-Wide Association Studies (GWAS) can identify genes affecting disease risk:
- humans have ~3x10^7 single nucleotide polymorphisms (SNPs) distributed randomly across the genome
- some people may have a different nucleotide in a certain position compared to others

GWAS identify SNP alleles that are found more frequently in patients (cases) compared to healthy individuals (controls)
- high scoring SNPs are thus associated with the disease and may play causative roles in the disease process
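
The case/control comparison for a single SNP can be sketched as a chi-square test on allele counts. This is a simplified illustration with made-up counts; the choice of an allelic chi-square test is an assumption, and real GWAS typically use regression models that adjust for ancestry and other covariates.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table
    of allele counts in cases and controls."""
    n = case_alt + case_ref + control_alt + control_ref
    numerator = n * (case_alt * control_ref - case_ref * control_alt) ** 2
    denominator = (
        (case_alt + case_ref)
        * (control_alt + control_ref)
        * (case_alt + control_alt)
        * (case_ref + control_ref)
    )
    return numerator / denominator

def chi_square_p_value_1df(statistic):
    """Upper-tail p-value of a chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(statistic / 2))

# Hypothetical allele counts for one SNP.
chi2 = allelic_chi_square(case_alt=60, case_ref=40, control_alt=40, control_ref=60)
p_value = chi_square_p_value_1df(chi2)
odds_ratio = (60 * 60) / (40 * 40)  # odds of the alternative allele, cases vs controls
```

In a real study this test is repeated for every SNP, so the significance threshold has to be far stricter than 0.05.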

Question 17

Q

how are GWAS results presented?

Answer

A

Manhattan plots:
- these map DNA sequence variants associated with a disease at genome-scale
- strong disease-associated SNPs are outliers
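
A sketch of how the points on a Manhattan plot are derived: SNPs are ordered by chromosome and position, and the height of each point is -log10(p). The 5e-8 threshold is the conventional genome-wide significance cutoff, added here as an assumption since the notes do not state one; the SNP identifiers and values are made up.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # conventional cutoff; an assumption, not from the notes

def manhattan_points(results):
    """results: iterable of (snp_id, chromosome, position, p_value).

    Returns the points in genome order with their plot height and
    whether they pass the genome-wide threshold.
    """
    points = []
    for snp_id, chrom, pos, p in sorted(results, key=lambda r: (r[1], r[2])):
        points.append(
            {
                "snp": snp_id,
                "chrom": chrom,
                "pos": pos,
                "y": -math.log10(p),
                "hit": p < GENOME_WIDE_THRESHOLD,
            }
        )
    return points

# Hypothetical association results.
results = [
    ("rs_hypothetical_1", 1, 1500, 0.2),
    ("rs_hypothetical_2", 1, 900, 3e-9),
    ("rs_hypothetical_3", 2, 400, 1e-4),
]
points = manhattan_points(results)
hits = [pt["snp"] for pt in points if pt["hit"]]
```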

Question 18

Q

why must we be careful when interpreting disease-associated SNPs?

Answer

A

The SNP does not necessarily affect the closest gene; it may affect a regulatory gene instead
The disease-associated SNP is not always responsible for the increased/decreased disease risk. The SNP identified may in fact be close to the actual SNP that causes the altered disease risk
- this is due to linkage disequilibrium (nearby variants tend to be inherited together)

further investigation is required to understand SNP disease-association
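
Linkage disequilibrium between two SNPs is often quantified as r², computed from haplotype frequencies. The sketch below uses made-up haplotype counts, and the allele labels A/a and B/b are placeholders. An r² near 1 means the two SNPs are almost always inherited together, so association data alone cannot tell which one is causal.

```python
def ld_r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic SNPs from counts of the four haplotypes.

    D = p(AB) - p(A) * p(B); r^2 = D^2 / (p(A) p(a) p(B) p(b)).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B
    return d ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

# Hypothetical haplotype counts.
partial_ld = ld_r_squared(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
perfect_ld = ld_r_squared(n_AB=50, n_Ab=0, n_aB=0, n_ab=50)
no_ld = ld_r_squared(n_AB=25, n_Ab=25, n_aB=25, n_ab=25)
```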

Question 19

Q

what can combining GWAS results with gene expression data achieve?

Answer

A

Identify the gene(s) whose expression levels are linked to the SNP allele
Identify the cell type(s) in which the genetic variant(s) have functional consequences
Reveal how those variants might regulate gene expression

Big data integration reveals and refines insights into the biological process
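
Linking a SNP allele to expression levels is commonly done by regressing a gene's expression on the number of copies of the allele each person carries (an expression quantitative trait locus, or eQTL, analysis; the term and the simple linear model are assumptions here, not from the notes). A minimal sketch with made-up data:

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on allele dosage (0, 1 or 2 copies).

    A slope far from zero suggests the allele is linked to the
    expression level of the gene.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    variance = sum((x - mean_x) ** 2 for x in dosages)
    return covariance / variance

# Hypothetical individuals: copies of the risk allele and expression of one gene.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope = eqtl_slope(dosages, expression)
```

Repeating this per cell type (for example with single cell RNA-seq data) indicates where the variant has functional consequences.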

Question 20

Q

what is an example of a population-scale big data project?

Answer

A

The 100,000 genomes project:
- Whole genome sequencing (WGS) to improve diagnosis of rare diseases and cancer care in the NHS through personalised medicine
- Data available to researchers
- 100,000 Genomes: 16.1% of rare disease patients received a molecular diagnosis

Question 21

Q

what is the UK biobank?

Answer

A

a prospective cohort study of 500,000 UK adults aged 40-69 at recruitment:
- Monitored over time: years/decades
- An integrated database for population-scale studies of health and disease, combining genetics, deep phenotyping, and electronic medical records:
- Demographic / socioeconomic
- Electronic health records (NHS)
- Physical activity monitoring
- Anatomical, Physiological, Biochemical, Genomic

Doctors can then use these past records to help with diagnosis – identify biomarkers of disease

Question 22

Q

why is big data important in the social gradient of health?

Answer

A

There is a social gradient in health, affecting Total and Healthy Life Expectancy:
- In England, poor neighbourhoods have a greater burden of ill-health than wealthy ones
- COVID-19 has had a proportionally higher impact on the most deprived areas of England

Big data is essential to understand how genetic predispositions, environmental exposures and social factors lead to disease
