5. Genomic Data Analysis Flashcards

1
Q

how can cell types be identified

A
  • physical appearance
  • presence/absence of surface proteins
  • isolation of cells and profiling of individual characteristics using sequencing technologies
2
Q

how is single-cell RNA sequencing conducted

A

quality control (outlier removal) → normalisation → feature selection → dim. reduction → cell-cell distance → unsupervised clustering
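The pipeline above can be sketched on a toy count matrix. This is a minimal illustration with numpy and scikit-learn, not a substitute for a dedicated single-cell toolkit; the thresholds, parameter values, and data are all invented:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(200, 1000)).astype(float)  # cells x genes (toy data)

# quality control: drop cells with outlying total counts (crude rule of thumb)
totals = counts.sum(axis=1)
keep = np.abs(totals - totals.mean()) < 3 * totals.std()
counts = counts[keep]

# normalisation: scale each cell to the same library size, then log-transform
counts = counts / counts.sum(axis=1, keepdims=True) * 1e4
logged = np.log1p(counts)

# feature selection: keep the most variable genes
top = np.argsort(logged.var(axis=0))[-200:]
selected = logged[:, top]

# dimensionality reduction
embedding = PCA(n_components=10).fit_transform(selected)

# cell-cell distance + unsupervised clustering (k-means for brevity)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
```

In practice each step would use purpose-built methods (e.g. doublet detection at QC, highly-variable-gene selection, graph-based clustering), but the order of operations is the same.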

3
Q

what is the purpose of quality control in single-cell rna sequencing

A

find unreliable cells, possible doublets

4
Q

what is the purpose of feature selection & dim reduction in single-cell rna sequencing

A

find the most informative genes, separate the strongest signals from background noise, and decrease processing time

5
Q

what is the purpose of cell-cell distance in single-cell rna sequencing

A

to quantify how similar cells are to one another, as input for clustering algos

6
Q

what is the process of clustering

A

feature selection/dim reduction (optional) → clustering algo design/selection → cluster validation → results interpretation

7
Q

how are clusters validated

A

adjusted rand index - measures the similarity between two clusterings (e.g. found clusters vs known labels)

visual inspection

cluster magnitude/cardinality

downstream performance
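The adjusted Rand index is available in scikit-learn; a toy illustration (the label vectors are invented):

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]   # known cell types (toy labels)
found = [1, 1, 1, 0, 0, 0]   # same partition, different label names
mixed = [0, 1, 0, 1, 0, 1]   # disagrees with the truth

print(adjusted_rand_score(truth, found))  # identical partitions score 1.0
print(adjusted_rand_score(truth, mixed))  # near 0 (or negative) for poor agreement
```

Note the index is invariant to label names: `found` scores 1.0 against `truth` even though the cluster ids are swapped.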

8
Q

how can we improve clustering

A
  • try scaling data
  • try another similarity measure
  • check that the algo’s assumptions match the data distribution
9
Q

how can we check the similarity measure

A

manually select known distant examples and known similar examples, and check that the distance metric ranks the similar examples as closer than the distant ones
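A minimal sketch of this sanity check in numpy (the example pairs are invented):

```python
import numpy as np

def euclid(x, y):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum())

# hand-picked pairs we believe to be similar / dissimilar (hypothetical values)
similar_pair = ([1.0, 2.0], [1.1, 2.1])
dissimilar_pair = ([1.0, 2.0], [9.0, -4.0])

# a usable metric should rank the known-similar pair as closer
assert euclid(*similar_pair) < euclid(*dissimilar_pair)
```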

10
Q

how can we check that we have the optimum number of clusters

A

plot loss v number of clusters (the elbow method)
trial & error
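The loss-vs-k plot (elbow method) can be sketched with scikit-learn, whose KMeans exposes the within-cluster sum of squares as `inertia_`; the toy data here is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs (toy data, true cluster count = 3)
data = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 5, 10)])

# loss (inertia: within-cluster sum of squares) for a range of k
losses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
          for k in range(1, 8)]
# the curve drops sharply until k=3, then flattens: the "elbow" marks the pick
```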

11
Q

what are some clustering algo categories

A
  • centroid based (kmeans)
  • connectivity based
  • density based (DBSCAN)
  • hierarchical
  • distribution based (GMM)
12
Q

what is centroid based

A
  • fast and efficient
  • separate data points into clusters based on their squared distance from multiple centroids
13
Q

what is the kmeans algo

A
  • pick k, number of clusters
  • pick centroids (random points in space)
  • assign each data point to its nearest centroid
  • update centroids by mean loc of their points
  • stop when centroids don’t move much else repeat
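The steps above can be sketched as a plain-numpy implementation (a toy version; real use would prefer a library implementation with better initialisation such as k-means++):

```python
import numpy as np

def kmeans(data, k, iters=100, seed=0):
    """k-means following the steps above (toy implementation)."""
    rng = np.random.default_rng(seed)
    # pick k initial centroids (random data points here, rather than random space)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # assign each data point to its nearest centroid (squared distance)
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update centroids to the mean location of their points
        # (an empty cluster keeps its old centroid)
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # stop when centroids don't move much
            return labels, new
        centroids = new
    return labels, centroids

# usage on three toy blobs
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in (0, 5, 10)])
labels, centres = kmeans(blobs, k=3)
```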
14
Q

what are the clustering distances

A

euclidean, manhattan, minkowski, hamming

15
Q

what is euclidean

A

straight line, numeric data only

16
Q

what is manhattan

A

distance along each axis/feature e.g. walking around city blocks, numeric data only

17
Q

what is minkowski

A

generalised form of manhattan & euclidean
higher orders weight large per-feature differences more heavily, exacerbating dissimilarities
only for numeric data

18
Q

what is hamming

A

count the number of features that differ
accommodates categorical
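The four distances side by side in numpy (the example vectors are invented):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())        # straight-line distance
manhattan = np.abs(a - b).sum()                  # distance along each axis
minkowski = (np.abs(a - b) ** 3).sum() ** (1/3)  # order p=3; p=1 is Manhattan, p=2 Euclidean

u = np.array(["red", "small", "round"])          # categorical features
v = np.array(["red", "large", "round"])
hamming = (u != v).sum()                         # count of differing features
```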

19
Q

what are the pros & cons of k-means

A
  • can get stuck in local optimum
  • clusters are spherical
  • can be tripped by outliers
  • numerical only
  • simple
  • quick
20
Q

what are the two approaches to hierarchical clustering and how do they work

A

agglomerative - bottom up: start with each point as its own cluster & merge the closest pairs together into a tree-based hierarchy

divisive - top down: start with everything in one cluster and recursively split into smaller clusters, building the same tree-based hierarchy
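A minimal agglomerative sketch using scipy (the linkage method and toy points are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two well-separated blobs (toy data)
points = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in (0, 5)])

# agglomerative: each point starts as its own cluster; nearest pairs merge upward
tree = linkage(points, method="average")  # the merge tree (dendrogram)

# cut the tree to recover a flat clustering, e.g. 2 clusters
labels = fcluster(tree, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(tree)` would draw the full hierarchy, which is where the "meaningful taxonomy" interpretation comes from.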

21
Q

what are the pros and cons of hierarchical clustering

A
pros:
  • no assumptions on cluster number
  • can correspond to meaningful taxonomies
cons:
  • once a decision is made to combine two clusters, it can’t be reversed
  • slow on large datasets
22
Q

what is DBSCAN

A

density based spatial clustering of applications with noise

the radius around points and the minimum neighbour count that define core points (playing a role similar to centroids) are user defined

23
Q

what is the DBSCAN algorithm

A
  • pick the radius (and minimum number of neighbours) that will define core points
  • pick a random point; if enough points fall within its radius, it is a core point and starts the first cluster
  • continue to iterate across the data points, adding core points that fall within the radius of core points already in the cluster
  • then add non-core points which are close to core points, but don’t use non-core points to further extend the cluster
  • first cluster is finished. repeat for other clusters
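A sketch with scikit-learn's DBSCAN, where `eps` is the user-defined radius and `min_samples` the neighbour count that makes a core point (the toy data is an assumption):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered points (toy data)
dense = np.vstack([rng.normal(c, 0.1, size=(40, 2)) for c in (0, 5)])
scatter = rng.uniform(-2, 7, size=(5, 2))
points = np.vstack([dense, scatter])

# eps = radius; min_samples = neighbours needed within eps to be a core point
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
# label -1 marks points DBSCAN leaves as noise (non-core, not near any core point)
```

Unlike k-means, the number of clusters is not specified up front: it emerges from the density parameters.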
24
Q

what is the silhouette coefficient

A

an alternative to the elbow method

pick a range of values for K (e.g. 2-10)

for each point i calculate the silhouette coefficient:
a(i) = mean distance from point i to every other point in its own cluster
b(i) = mean distance from point i to the points in the nearest neighbouring cluster

the goal is a(i) < b(i)
s(i) = (b(i) - a(i)) / max(a(i), b(i))

worst silhouette coefficient = -1, best = 1

every cluster has its own silhouette plot, and the average silhouette coefficient is taken over each cluster
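This computation is available in scikit-learn as `silhouette_score` (the mean s(i) over all points); a sketch on assumed toy blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated blobs (toy data, true cluster count = 3)
data = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0, 5, 10)])

# mean silhouette coefficient for several candidate k; pick the k that maximises it
scores = {k: silhouette_score(data, KMeans(n_clusters=k, n_init=10,
                                           random_state=0).fit_predict(data))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)  # highest mean silhouette -> chosen k
```

Note the range starts at k=2: the silhouette coefficient is undefined for a single cluster.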