5. Genomic Data Analysis Flashcards

1
Q

how can cell types be identified

A
  • physical appearance
  • presence/absence of surface proteins
  • isolation of cells and profiling of individual characteristics using sequencing technologies
2
Q

how is single-cell RNA sequencing conducted

A

quality control (outlier removal) → normalisation → feature selection → dim. reduction → cell-cell distance → unsupervised clustering
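The pipeline above can be sketched on a toy count matrix. This is a minimal illustration with numpy and scikit-learn, not a substitute for a dedicated single-cell toolkit; the thresholds, parameter values, and data are all invented:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(200, 1000)).astype(float)  # cells x genes (toy data)

# quality control: drop cells with outlying total counts (crude rule of thumb)
totals = counts.sum(axis=1)
keep = np.abs(totals - totals.mean()) < 3 * totals.std()
counts = counts[keep]

# normalisation: scale each cell to the same library size, then log-transform
counts = counts / counts.sum(axis=1, keepdims=True) * 1e4
logged = np.log1p(counts)

# feature selection: keep the most variable genes
top = np.argsort(logged.var(axis=0))[-200:]
selected = logged[:, top]

# dimensionality reduction
embedding = PCA(n_components=10).fit_transform(selected)

# cell-cell distance + unsupervised clustering (k-means for brevity)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
```

In practice each step would use purpose-built methods (e.g. doublet detection at QC, highly-variable-gene selection, graph-based clustering), but the order of operations is the same.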

3
Q

what is the purpose of quality control in single-cell rna sequencing

A

find unreliable cells, possible doublets

4
Q

what is the purpose of feature selection & dim reduction in single-cell rna sequencing

A

find the most informative genes, separate the strongest signals from background noise, and decrease processing time

5
Q

what is the purpose of cell-cell distance in single-cell rna sequencing

A

to quantify how similar cells are to one another, as input for clustering algos

6
Q

what is the process of clustering

A

feature selection/dim reduction (optional) → clustering algo design/selection → cluster validation → results interpretation

7
Q

how are clusters validated

A

adjusted rand index - measures the similarity between two clusterings (e.g. found clusters vs known labels)

visual inspection

cluster magnitude/cardinality

downstream performance
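The adjusted Rand index is available in scikit-learn; a toy illustration (the label vectors are invented):

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]   # known cell types (toy labels)
found = [1, 1, 1, 0, 0, 0]   # same partition, different label names
mixed = [0, 1, 0, 1, 0, 1]   # disagrees with the truth

print(adjusted_rand_score(truth, found))  # identical partitions score 1.0
print(adjusted_rand_score(truth, mixed))  # near 0 (or negative) for poor agreement
```

Note the index is invariant to label names: `found` scores 1.0 against `truth` even though the cluster ids are swapped.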

8
Q

how can we improve clustering

A
  • try scaling data
  • try another similarity measure
  • check that the algo’s assumptions match the data distribution
9
Q

how can we check the similarity measure

A

manually select known distant examples and known similar examples, and check that the distance metric ranks the similar examples as closer than the distant ones
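A minimal sketch of this sanity check in numpy (the example pairs are invented):

```python
import numpy as np

def euclid(x, y):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum())

# hand-picked pairs we believe to be similar / dissimilar (hypothetical values)
similar_pair = ([1.0, 2.0], [1.1, 2.1])
dissimilar_pair = ([1.0, 2.0], [9.0, -4.0])

# a usable metric should rank the known-similar pair as closer
assert euclid(*similar_pair) < euclid(*dissimilar_pair)
```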

10
Q

how can we check that we have the optimum number of clusters

A

plot loss v number of clusters (the elbow method)
trial & error
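The loss-vs-k plot (elbow method) can be sketched with scikit-learn, whose KMeans exposes the within-cluster sum of squares as `inertia_`; the toy data here is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs (toy data, true cluster count = 3)
data = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 5, 10)])

# loss (inertia: within-cluster sum of squares) for a range of k
losses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
          for k in range(1, 8)]
# the curve drops sharply until k=3, then flattens: the "elbow" marks the pick
```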

11
Q

what are some clustering algo categories

A
  • centroid based (kmeans)
  • connectivity based
  • density based (DBSCAN)
  • hierarchical
  • distribution based (GMM)
12
Q

what is centroid based

A
  • fast and efficient
  • separate data points into clusters based on their squared distance from multiple centroids
13
Q

what is the kmeans algo

A
  • pick k, number of clusters
  • pick centroids (random points in space)
  • assign each data point to its nearest centroid
  • update centroids by mean loc of their points
  • stop when centroids don’t move much else repeat
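The steps above can be sketched as a plain-numpy implementation (a toy version; real use would prefer a library implementation with better initialisation such as k-means++):

```python
import numpy as np

def kmeans(data, k, iters=100, seed=0):
    """k-means following the steps above (toy implementation)."""
    rng = np.random.default_rng(seed)
    # pick k initial centroids (random data points here, rather than random space)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # assign each data point to its nearest centroid (squared distance)
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update centroids to the mean location of their points
        # (an empty cluster keeps its old centroid)
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # stop when centroids don't move much
            return labels, new
        centroids = new
    return labels, centroids

# usage on three toy blobs
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in (0, 5, 10)])
labels, centres = kmeans(blobs, k=3)
```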
14
Q

what are the clustering distances

A

euclidean, manhattan, minkowski, hamming

15
Q

what is euclidean

A

straight line, numeric data only

16
Q

what is manhattan

A

distance along each axis/feature e.g. walking around city blocks, numeric data only

17
Q

what is minkowski

A

generalised form of manhattan & euclidean
higher orders weight large per-feature differences more heavily, exacerbating dissimilarities
only for numeric data

18
Q

what is hamming

A

count the number of features that differ
accommodates categorical
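The four distances side by side in numpy (the example vectors are invented):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())        # straight-line distance
manhattan = np.abs(a - b).sum()                  # distance along each axis
minkowski = (np.abs(a - b) ** 3).sum() ** (1/3)  # order p=3; p=1 is Manhattan, p=2 Euclidean

u = np.array(["red", "small", "round"])          # categorical features
v = np.array(["red", "large", "round"])
hamming = (u != v).sum()                         # count of differing features
```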

19
Q

what are the pros & cons of k-means

A
  • can get stuck in local optimum
  • clusters are spherical
  • can be tripped by outliers
  • numerical only
  • simple
  • quick
20
Q

what are the two approaches to hierarchical clustering and how do they work

A

agglomerative - bottom up: start with each point as its own cluster & merge the closest pairs together into a tree-based hierarchy

divisive - top down: start with everything in one cluster and recursively split into smaller clusters, building the same tree-based hierarchy
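A minimal agglomerative sketch using scipy (the linkage method and toy points are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two well-separated blobs (toy data)
points = np.vstack([rng.normal(c, 0.2, size=(10, 2)) for c in (0, 5)])

# agglomerative: each point starts as its own cluster; nearest pairs merge upward
tree = linkage(points, method="average")  # the merge tree (dendrogram)

# cut the tree to recover a flat clustering, e.g. 2 clusters
labels = fcluster(tree, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(tree)` would draw the full hierarchy, which is where the "meaningful taxonomy" interpretation comes from.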

21
Q

what are the pros and cons of hierarchical clustering

A
pros:
  • no assumptions on cluster number
  • can correspond to meaningful taxonomies
cons:
  • once a decision is made to combine two clusters, it can’t be reversed
  • slow on large datasets
22
Q

what is DBSCAN

A

density based spatial clustering of applications with noise

the radius around points and the minimum neighbour count that define core points (playing a role similar to centroids) are user defined

23
Q

what is the DBSCAN algorithm

A
  • pick the radius (and minimum number of neighbours) that will define core points
  • pick a random point; if enough points fall within its radius, it is a core point and starts the first cluster
  • continue to iterate across the data points, adding core points that fall within the radius of core points already in the cluster
  • then add non-core points which are close to core points, but don’t use non-core points to further extend the cluster
  • first cluster is finished. repeat for other clusters
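A sketch with scikit-learn's DBSCAN, where `eps` is the user-defined radius and `min_samples` the neighbour count that makes a core point (the toy data is an assumption):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered points (toy data)
dense = np.vstack([rng.normal(c, 0.1, size=(40, 2)) for c in (0, 5)])
scatter = rng.uniform(-2, 7, size=(5, 2))
points = np.vstack([dense, scatter])

# eps = radius; min_samples = neighbours needed within eps to be a core point
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
# label -1 marks points DBSCAN leaves as noise (non-core, not near any core point)
```

Unlike k-means, the number of clusters is not specified up front: it emerges from the density parameters.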
24
Q

what is the silhouette coefficient

A

an alternative to the elbow method

pick a range of values for K (e.g. 2-10)

for each point i calculate the silhouette coefficient:
a(i) = mean distance from point i to every other point in its own cluster
b(i) = mean distance from point i to the points in the nearest neighbouring cluster

the goal is a(i) < b(i)
s(i) = (b(i) - a(i)) / max(a(i), b(i))

worst silhouette coefficient = -1, best = 1

every cluster has its own silhouette plot, and the average silhouette coefficient is taken over each cluster
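This computation is available in scikit-learn as `silhouette_score` (the mean s(i) over all points); a sketch on assumed toy blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated blobs (toy data, true cluster count = 3)
data = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0, 5, 10)])

# mean silhouette coefficient for several candidate k; pick the k that maximises it
scores = {k: silhouette_score(data, KMeans(n_clusters=k, n_init=10,
                                           random_state=0).fit_predict(data))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)  # highest mean silhouette -> chosen k
```

Note the range starts at k=2: the silhouette coefficient is undefined for a single cluster.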