Week 7: Missing Data and Clustering Flashcards

1
Q

REVERSED

One for which the within-cluster variation W(C_k), summed over all K clusters, is as small as possible
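One standard formalisation (ISLR's, taking squared Euclidean distance as the dissimilarity): find the partition that solves

\min_{C_1, \ldots, C_K} \sum_{k=1}^{K} W(C_k), \qquad W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2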

A

What is a “good” k-means clustering?

2
Q

REVERSED

imp = mice(data, seed=1, m=20, print=FALSE) # imputes the missing data 20 times
fit = with(imp, lm(formula)) # fits the specified model to each imputed dataset, giving m sets of model results
summary(pool(fit)) # pools the 20 sets of estimated parameters

A

How do you perform multiple imputation and fit a model in R?

3
Q

REVERSED

O(n^3)

A

What is the time complexity of hierarchical clustering?

4
Q

REVERSED

Applies a summarise transformation to every variable (column) of the data frame at once

A

What does summarise_all() do?

5
Q

REVERSED

  • Scaling to standard deviation 1 gives equal importance to each variable in the clustering
  • Useful when variables are measured on different scales
A

When and why should variables be scaled before computing dissimilarity?

6
Q

REVERSED

library("mice")
imp = mice(data, method = "mean", m=1, maxit=1)

# another way, with tidyr:
replace_na(data, list(variable = mean(data$variable, na.rm=TRUE)))
A

How do you perform mean imputation in R? (2 ways)

7
Q

REVERSED

Only mean is unbiased under NDD
Standard error is too small
Disturbs relations between variables

A

Under what conditions is mean imputation unbiased? What happens to the standard error?

8
Q

REVERSED

  • Use only the unsupervised information (the data and the clustering itself) to quantify how successful the clustering is
  • Popular measures: average silhouette width (ASW), which measures how well each point fits its own cluster relative to other clusters, and the gap statistic (see the sketch below)
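A minimal sketch of the gap statistic using the cluster package, assuming a numeric data matrix data (silhouette code appears on a later card):

library(cluster)
gap = clusGap(data, FUNcluster = kmeans, K.max = 8, B = 50) # compares observed within-cluster dispersion to B bootstrap reference sets
plot(gap) # choose the k where the gap statistic peaks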
A

What are internal validation indices and what are some popular methods?

9
Q

REVERSED

LOCF is always biased
Standard error is too small

A

Under what conditions is LOCF imputation unbiased? What happens to the standard error?

10
Q

REVERSED

data %>% group_by(variable) %>% summarise_all(function(x) sum(is.na(x)))

A

What is the code for getting the number of NAs in each variable, grouped by another variable?

11
Q

REVERSED

Mean, regression weight and correlation are unbiased only under NDD.

A

Under what conditions is pairwise deletion unbiased?

12
Q

REVERSED

Replace missing values with the mean (or, for categorical data, the mode)

A

What is mean imputation?

13
Q

REVERSED

Mean, regression weights and correlation are unbiased under SDD
Standard error is too small

A

Under what conditions is stochastic regression imputation unbiased? What happens to the standard error?

14
Q

REVERSED

A refinement of regression imputation that attempts to address correlation bias by adding noise to the predictions. This method first estimates the intercept, slope and residual variance under the linear model, then calculates the predicted value for each missing value and adds a random draw from the residual to the prediction
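A hand-rolled sketch of the idea for a single incomplete variable y predicted from a complete covariate x (hypothetical names; card 34 shows the mice shortcut):

fit = lm(y ~ x, data)                       # model estimated from the observed cases
miss = is.na(data$y)
pred = predict(fit, newdata = data[miss, ]) # predicted values for the missing y
data$y[miss] = pred + rnorm(sum(miss), mean = 0, sd = sigma(fit)) # add a draw from the residual distribution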

A

What is stochastic regression imputation?

15
Q

REVERSED

It is not suitable for clustering non-spherical groups of objects

A

What is the disadvantage of k-medoids clustering?

16
Q

REVERSED

options(na.action = na.omit)

A

How do you change the settings to always omit NAs?

17
Q

REVERSED

  • Retains the full dataset and allows for systematic differences between the observed and unobserved data by the inclusion of the response indicator
  • Can be useful to estimate the treatment effect in randomised trials when a baseline covariate is partially observed
A

What are the advantages of the indicator method? (2)

18
Q

REVERSED

  • Complete: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
  • Single: Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.
  • Average: Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
  • Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B.
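In R, these four types correspond to the method argument of hclust(); a minimal sketch, assuming a dissimilarity object d from dist():

hclust(d, method = "complete") # maximal intercluster dissimilarity
hclust(d, method = "single")   # minimal intercluster dissimilarity
hclust(d, method = "average")  # mean intercluster dissimilarity
hclust(d, method = "centroid") # centroid dissimilarity (meant for squared Euclidean distances)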
A

What are the 4 types of linkage and how they work?

19
Q

REVERSED

library(patchwork) #allows you to display ggplots together using plot1 + plot2

A

What is a way to display ggplots together in R?

20
Q

REVERSED

  1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations
  2. Iterate until the cluster assignments stop changing:
    a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster
    b) Assign each observation to the cluster whose centroid is closest (using Euclidean distance)
    * When the result no longer changes, the local optimum has been reached
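Because the algorithm only reaches a local optimum, it is commonly run from several random starts and the best run kept; in R's kmeans() this is the nstart argument:

kmeans(data, centers = 3, nstart = 20) # 20 random starts; the run with the lowest total within-cluster variation is returned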
A

What is the algorithm for k-means clustering?

21
Q

REVERSED

library(fpc)
clusterboot(data, clustermethod = hclustCBI, method = "complete", k = 3) # for k-means use kmeansCBI

Gives the Jaccard bootstrap mean for each cluster. Generally, stability less than 0.6 is considered unstable; clusters with stability above 0.85 are highly stable (likely to be real clusters)

A

How do you perform stability assessment in R?

22
Q

REVERSED

  • Bottom-up or agglomerative clustering: the dendrogram is built starting from the leaves, combining clusters up to the trunk
  • Top-down or divisive clustering: start with one cluster and repeatedly split the most heterogeneous cluster
A

What are two ways of hierarchical clustering?

23
Q

REVERSED

mean(y, na.rm=TRUE)

A

How do you do a mean calculation, removing missing values first?

24
Q

REVERSED

na.action(model)

A

How do you show the indices of NAs in a model?

25
Q

REVERSED

Are the clusters associated with an external feature Y? Find data for Y to evaluate

A

How do you use external information to evaluate clustering results?

26
Q

REVERSED

  • K-means clustering applied to images
  • Goal is image compression -> less storage. Cluster pixels and replace them by their cluster centroid
  • File size increases with number of clusters
  • Image loss decreases with number of clusters
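A minimal sketch, assuming img is an h x w x 3 array with values in [0, 1] (e.g. as returned by png::readPNG):

pixels = matrix(img, ncol = 3)        # one row per pixel, columns = R, G, B
km = kmeans(pixels, centers = 16)     # compress to a 16-colour palette
compressed = km$centers[km$cluster, ] # replace each pixel by its cluster centroid
img_new = array(compressed, dim = dim(img))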
A

What is vector quantisation?

27
Q

REVERSED

distances = dist(data, method = "euclidean")
result = hclust(distances, method = "average")

library(ggdendro)
ggdendrogram(result)

# select the number of clusters using a cutoff point: h= gives the height, k= gives the number of clusters
cutree(result, h=2)
# returns a vector with the cluster number for each observation. Use as.factor() on it when plotting by colour
A

How do you perform hierarchical clustering in R, plot the dendrogram, and select the number of clusters?

28
Q

REVERSED

Every data point has some likelihood of being missing. The process that governs these probabilities is called the missing data mechanism or response mechanism. There are three categories of missing data: MCAR, MAR and MNAR

A

What is Rubin's theory of classifying missing data?

29
Q

REVERSED

colSums(is.na(data))

A

How do you display the number of NAs in each variable of a dataset?

30
Q

REVERSED

  • Reduce variables into 2D “manifold” for visualisation
  • Popular techniques: UMAP, t-SNE, MDS, Discriminant coordinates, PCA
A

How do you use visual exploration to evaluate clustering results when there are many variables?

31
Q

REVERSED

library(cluster)
distances = dist(data)
result = hclust(distances)
clusters = cutree(result, 2)
silhouette_scores = silhouette(clusters, distances)

plot(silhouette_scores)

Silhouette widths range from -1 to 1: values near 1 mean an observation sits well inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may be assigned to the wrong cluster. The average silhouette width summarises the overall clustering quality.

A

How do you perform average silhouette width analysis in R? What does the result tell us?

32
Q

REVERSED

Find more data about the causes of the missingness, or perform what-if analyses to see how sensitive the results are under various scenarios.

A

What are strategies to handle MNAR data?

33
Q

REVERSED

First builds a model from the observed data
Predictions for the incomplete cases are then calculated under the fitted model and serve as replacements for the missing data

A

What is regression imputation?

34
Q

REVERSED

imp = mice(data, method = "norm.nob", m=1, maxit=1, seed=1, print=FALSE)
# method norm.nob requests a plain, non-Bayesian stochastic regression method
A

How do you perform stochastic regression imputation in R?

35
Q

REVERSED

Specify the desired number of clusters K, then the K-means algorithm will assign each observation to exactly one cluster

A

What is k-means clustering?

36
Q

REVERSED

library(purrr)
map_dbl(data, mean) # use map_int() if the data is integer, etc.

summarise_all(data, mean)

A

What is the code for returning the mean of each variable in a dataset? (2 ways)

37
Q

REVERSED

tidyr::fill(data, variable)

A

How do you perform LOCF imputation in R?

38
Q

REVERSED

Mean, regression weight and correlation are unbiased only under NDD. Standard error is too large

A

Under what conditions is listwise deletion unbiased? What happens to the standard error?

39
Q

REVERSED

2^(n-1), since at each of the n-1 fusions the two branches can be swapped independently

A

How many possible re-orderings of the dendrogram are there without changing its meaning?

40
Q

REVERSED

  1. Initialise: select k random points as the medoids
  2. Assign each data point to the closest medoid using any distance method (e.g. Euclidean)
  3. For each data point of cluster i, compute and sum its distances to all other data points in the cluster. The point of the ith cluster whose sum of distances to the other points is minimal becomes the medoid for that cluster
  4. Repeat steps 2 and 3 until the medoids stop moving
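A ready-made implementation is pam() (partitioning around medoids) from the cluster package; a minimal sketch:

library(cluster)
result = pam(data, k = 3) # k-medoids with k = 3
result$medoids            # the medoids: actual data points
result$clustering         # cluster assignment for each observation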
A

What is the k-medoids clustering algorithm?

41
Q

REVERSED

It can be defined in many ways; the most common approach uses (squared) Euclidean distance

A

What is within cluster variation?

42
Q

REVERSED

Last observation carried forward (LOCF) and baseline observation carried forward (BOCF) are ad-hoc imputation methods for longitudinal data
Take the previous observed value as a replacement for the missing data

A

What are LOCF and BOCF imputation methods?

43
Q

REVERSED

Eliminate all cases with one or more missing values

A

What is listwise deletion/complete case analysis?

44
Q

REVERSED

Missing completely at random (MCAR) / Not data dependent (NDD): The probability of being missing is the same for all cases. The causes of the missing data are unrelated to the data.

Missing at random (MAR) / Seen data dependence (SDD): the probability of being missing is the same only within groups defined by the observed data

Missing not at random (MNAR) / Unseen data dependence (UDD): the probability of being missing varies for reasons that are unknown to us. It is missing because of the value you would have obtained.

A

Describe the three categories of missing data

45
Q

REVERSED

Considers two observations to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance

A

What is correlation-based distance?

46
Q

REVERSED

  • Create several (m) complete versions of the data (imputed datasets) by replacing the missing values with plausible values using stochastic imputation
  • Estimate the parameters of interest from each imputed dataset, typically by applying the method we would have used had the data been complete
  • Finally, pool the m parameter estimates into one estimate (their average) and estimate its variance (see the formulas below)
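The pooling step follows Rubin's rules. With estimates \hat{Q}_1, \ldots, \hat{Q}_m from the m completed datasets:

\bar{Q} = \frac{1}{m} \sum_{l=1}^{m} \hat{Q}_l, \qquad T = \bar{U} + \left(1 + \frac{1}{m}\right) B

where \bar{U} is the average within-imputation variance and B = \frac{1}{m-1} \sum_{l=1}^{m} (\hat{Q}_l - \bar{Q})^2 the between-imputation variance.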
A

What is the procedure for multiple imputation?

47
Q

REVERSED

How much does the clustering change when: 1. Changing some hyperparameters, 2. Changing some observations (bootstrapping), 3. Changing some features
Check if observations are classified into same cluster across choices

A

What is stability assessment of clustering results?

48
Q

REVERSED

You can’t

A

How do you differentiate between SDD and UDD?

49
Q

REVERSED

  • Single linkage can result in extended, trailing clusters in which single observations are fused one at a time. It can't separate clusters properly if there is noise between clusters
  • Centroid linkage can result in undesirable inversions, where two clusters are fused at a height below either of the individual clusters in the dendrogram.
  • Complete linkage tends to break large clusters.
A

What are the disadvantages of single, centroid and complete linkage?

50
Q

REVERSED

  • Single linkage can differentiate between non-elliptical clusters
  • Complete linkage gives well-separated clusters if there is noise between the clusters
A

What are the advantages of single and complete linkage?

51
Q

REVERSED

Deduce the missing data from the data you have, e.g. BMI can be calculated from height and weight
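A minimal dplyr sketch of the BMI example, assuming height in metres and weight in kilograms:

library(dplyr)
data = mutate(data, bmi = coalesce(bmi, weight / height^2)) # keep observed BMI, deduce the rest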

A

What is deductive imputation?

52
Q

REVERSED

When the data doesn't have a hierarchical structure, e.g. when the best division into 2 groups is by gender but the best division into 3 groups is by nationality

A

When does hierarchical clustering give worse results than k-means clustering?

53
Q

REVERSED

The idea of k-medoids is to require the final centroids to be actual data points (medoids), making them interpretable

A

What is k-medoids clustering?

54
Q

REVERSED

NA

A

What does mean(y) return when y has missing values?

55
Q

REVERSED

  • Small decisions such as k and dissimilarity measure have big impacts on the clusters
  • Validating the clusters obtained: clustering will always result in clusters, but do they represent true subgroups in the data or are they simply clustering the noise
  • Since k-means and hierarchical clustering force every observation into a cluster, the clusters found may be heavily distorted due to the presence of outliers that do not belong to any cluster
A

What are three practical issues with clustering?

56
Q

REVERSED

Mean and regression weights are unbiased under SDD (for regression weights, only if the factors that influence the missingness are part of the regression model)
Standard error is too small

A

Under what conditions is regression imputation unbiased? What happens to the standard error?

57
Q

REVERSED

Use na.omit()

A

How do you perform listwise deletion in R?

58
Q

REVERSED

  1. Begin with n observations and a measure (such as Euclidean distance) of all the n(n-1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
  2. For i=n, n-1, … 2:
    a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are the least dissimilar. Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed
    b) Compute the new pair-wise inter-cluster dissimilarities among the i-1 remaining clusters
A

What is the hierarchical clustering algorithm?

59
Q

REVERSED

means_cluster = kmeans(data, 3) # performs k-means clustering with k=3

The output includes $cluster (the cluster assignment of each observation), $centers (the cluster centroids) and $tot.withinss (the total within-cluster sum of squares)

A

How do you perform k-means clustering in R, and what does the output consist of?

60
Q

REVERSED

imp$data #shows original data
imp$imp #shows imputed data
complete(imp, 3) #extracts the 3rd completed dataset of the m imputations

A

Once you have performed multiple imputation and stored it as “imp”, what codes are there to access different parts of the imputation?

61
Q

REVERSED

lm(y~x, data, na.action = na.omit)

A

How do you fit a linear model, removing missing values first?

62
Q

REVERSED

  • Within-dataset variance: the conventional sampling variance caused by taking a sample rather than the entire population; the uncorrected standard error
  • Between-dataset variance: the extra variance caused by the missing data
  • Simulation error: the extra variance caused by the estimator being based on a finite number of datasets m. Less of a problem with modern computing, since m can be large (see the formula below)
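In formula form, these three components add up to the total variance of the pooled estimate:

T = \bar{U} + B + \frac{B}{m}

where \bar{U} is the within-dataset variance, B the between-dataset variance, and B/m the simulation error (equivalently, T = \bar{U} + (1 + 1/m) B).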
A

What does the variance of the parameter estimates from multiple imputation consist of?

63
Q

REVERSED

imp = mice(data) # imputes the dataset; the method argument selects the imputation method per variable
complete(imp) # returns the completed (imputed) dataset
fit_imp = with(imp, lm(a~b)) # fits the specified model to each imputed dataset

A

What is the general code for imputing data and fitting a model with mice in R?

64
Q

REVERSED

Clustering looks to find homogeneous subgroups among observations

A

What is the goal of clustering?

65
Q

REVERSED

imp = mice(data, method = "norm.predict", seed=1, m=1, print=FALSE)

A

How do you perform regression imputation in R? (2 ways)

66
Q

REVERSED

md.pattern(data)

A

What is the code for creating a table display of the missingness patterns?

67
Q

REVERSED

SDD assumption (or NDD)

A

What type of data does multiple imputation assume?

68
Q

REVERSED

Lower cut = more clusters

A

Should you cut the dendrogram higher or lower for more clusters?

69
Q

REVERSED

NDD: Pr(M=1 | var1, var2) = Pr(M=1)

SDD: Pr(M=1 | var1, var2) = Pr(M=1 | var1)

UDD: Pr(M=1 | var1, var2) can’t be reduced

A

What are the formulas for NDD, SDD and UDD, where M indicates whether variable 2 is missing (1) or not (0)?

70
Q

REVERSED

Look at the {0,1} missingness indicator M versus the other features. If you can classify M from the other features, then you do not have NDD (see the sketch below)
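A minimal sketch of this check, where y is a hypothetical incomplete variable: regress the missingness indicator on the remaining features.

data$M = as.numeric(is.na(data$y))                    # missingness indicator
fit = glm(M ~ . - y, data = data, family = binomial)  # logistic regression of M on the other features
summary(fit) # features that predict M point to SDD rather than NDD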

A

How do you differentiate between NDD and SDD?

71
Q

REVERSED

For any two observations, look at the point in the tree where the branches containing those two observations are first fused. The height of this fusion, measured on the vertical axis, indicates how different the observations are: higher = less similar

A

What does the vertical axis/height of a dendrogram show?

72
Q

REVERSED

  • Can increase correlation between variables
  • Correlations are biased upwards
  • P-values are too optimistic
  • Variability systematically underestimated
A

What are the disadvantages of regression imputation? (4)

73
Q

REVERSED

  • Use of external information
  • Visual exploration
  • Stability assessment
  • Internal validation indices
A

List the 4 ways of evaluating clustering results

74
Q

REVERSED

  • Solves the problem of too small standard errors
  • Our level of confidence in a particular imputed value is expressed as the variation across the m completed datasets
  • Under the right conditions, the pooled estimates are unbiased and have the correct statistical properties
A

What are the advantages of multiple imputation? (3)

75
Q

REVERSED

Don’t impute, deal with missing values in the prediction model itself

A

What are embedded or model based methods for missing data?

76
Q

REVERSED

O(k*(n-k)^2)

A

What is the time complexity of k-medoids clustering?

77
Q

REVERSED

The dissimilarity between two clusters if one or both contains multiple observations

A

What is linkage?

78
Q

REVERSED

Calculates the mean and covariances of all available data. The matrix summary of statistics is then used for analysis and modelling
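In base R this corresponds to the use = "pairwise.complete.obs" option; a sketch, assuming numeric data:

cov(data, use = "pairwise.complete.obs") # each covariance computed from all available pairs
cor(data, use = "pairwise.complete.obs") # note: each entry may come from a different subsample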

A

What is pairwise deletion/available case analysis?

79
Q

REVERSED

  • The covariance matrix may not be positive-definite
  • Problems are more severe for highly correlated variables
  • Requires numerical data that follows approximate normal distribution
A

What are the disadvantages of pairwise deletion?

80
Q

REVERSED

The indicator method replaces each missing value by a zero and extends the regression model by the response indicator. This is applied to each incomplete variable. Then analyse the extended model
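A minimal sketch for one incomplete covariate x in a regression on outcome y (hypothetical names):

data$r_x = as.numeric(is.na(data$x)) # response indicator for x
data$x[is.na(data$x)] = 0            # replace missing values by zero
lm(y ~ x + r_x, data = data)         # analyse the extended model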

A

What is the indicator method of imputation?

81
Q

REVERSED

naprint(na.action(model))

A

How do you print the number of missing values in a model?

82
Q

REVERSED

Replacing missing values with guessed values

A

What is imputation?

83
Q

REVERSED

  • Large loss of information
  • Hopeless with many features
  • Inconsistencies in reporting as analysis on the same data often uses different sub-samples
  • Can lead to non-sensical sub-samples e.g. deleting data in time series analysis
A

What are the disadvantages of listwise deletion? (4)

84
Q

REVERSED

  • In k-means, squared Euclidean distance places the highest influence on the largest distances
  • K-means lacks robustness against outliers that produce very large distances
  • K-medoids is less sensitive to outliers
A

What are the advantages of k-medoids over k-means clustering?