Quick Fire Flashcards
(40 cards)
Batch Size
Batch size controls the accuracy of the estimate of the error gradient when training neural networks
The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
Batch Gradient Descent. Batch Size = Size of Training Set
Stochastic Gradient Descent. Batch Size = 1
Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
The error gradient is a statistical estimate. The more training examples used in the estimate, the more accurate this estimate will be and the more likely that the weights of the network will be adjusted in a way that will improve the performance of the model. The improved estimate of the error gradient comes at the cost of having to use the model to make many more predictions before the estimate can be calculated, and in turn, the weights updated.
Alternately, using fewer examples results in a less accurate estimate of the error gradient that is highly dependent on the specific training examples used.
This results in a noisy estimate that, in turn, results in noisy updates to the model weights, e.g. many updates with perhaps quite different estimates of the error gradient. Nevertheless, these noisy updates can result in faster learning and sometimes a more robust model.
Smaller batch sizes are used for two main reasons:
Smaller batch sizes are noisy, offering a regularizing effect and lower generalization error.
Smaller batch sizes make it easier to fit one batch worth of training data in memory (i.e. when using a GPU).
Epoch
A hyperparameter that defines the number of times the learning algorithm will work through the entire training dataset.
Trade off between batch size and learning rate
There are three regimes of batch size: b = 1 (stochastic), 1 < b < size of training set (mini-batch), and b = size of training set (batch).
Let’s say batch size = b
Let’s say epoch = e
And size of Training data = t
Weight updates are made at the end of every batch's error calculation. With t/b batches per epoch, there are (t/b) × e weight updates to the model in total.
If b is 1, for the same set of other parameters, there are t × e updates. If b is t, then there are only e weight updates (one per epoch). Any mini-batch size gives something in between.
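A quick sanity check of those counts with made-up numbers for t and e (Python):

t, e = 1000, 10                    # hypothetical training-set size and epoch count
for b in (1, 32, 1000):            # SGD, a mini-batch size, full batch
    print(b, (t // b) * e)         # -> 10000, 310, 10 weight updates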
With batch-style training, the model converges at 0.81 and the test set does slightly better; the loss curve isn't too noisy. If we set b = 1 with the same learning rate, model performance is poor and does not converge, and the loss curve zigzags in a W-shaped pattern.
This indicates that making the learning rate smaller would help with model training. Doing so shows model performance hitting 0.81 in a quarter of the number of epochs we had originally, which makes sense based on the update counts above, although the curve still looks noisy and hasn't converged as tightly.
We can do something similar by holding the learning rate constant instead and searching for a mini-batch size b that makes the model learn quickly and well.
More noisy updates to the model require a smaller learning rate, whereas less noisy more accurate estimates of the error gradient may be applied to the model more liberally. We can summarize this as follows:
Batch Gradient Descent: Use a relatively larger learning rate and more training epochs.
Stochastic Gradient Descent: Use a relatively smaller learning rate and fewer training epochs.
Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the cost of noise in the training process.
Large values give a learning process that converges slowly with accurate estimates of the error gradient.
Batch Gradient Descent
Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.
Upsides
Fewer updates to the model means this variant of gradient descent is more computationally efficient than stochastic gradient descent.
The decreased update frequency results in a more stable error gradient and may result in a more stable convergence on some problems.
The separation of the calculation of prediction errors and the model update lends the algorithm to parallel processing based implementations.
Downsides
The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.
The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all training examples.
Commonly, batch gradient descent is implemented in such a way that it requires the entire training dataset in memory and available to the algorithm.
Model updates, and in turn training speed, may become very slow for large datasets.
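As a rough illustration (not the canonical implementation), a minimal NumPy sketch of batch gradient descent for a one-variable linear model, using made-up toy data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X + rng.normal(scale=0.1, size=100)   # toy data, true slope = 2

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(200):
    err = (w * X + b) - y
    grad_w = (err * X).mean()      # gradient averaged over the FULL training set
    grad_b = err.mean()
    w -= lr * grad_w               # one weight update per epoch
    b -= lr * grad_b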
Stochastic Gradient Descent
Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.
The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.
Upsides
The frequent updates immediately give an insight into the performance of the model and the rate of improvement.
This variant of gradient descent may be the simplest to understand and implement, especially for beginners.
The increased model update frequency can result in faster learning on some problems.
The noisy update process can allow the model to avoid local minima (e.g. premature convergence).
Downsides
Updating the model so frequently is more computationally expensive than other configurations of gradient descent, taking significantly longer to train models on large datasets.
The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the model error to jump around (have a higher variance over training epochs).
The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error minimum for the model.
Mini-Batch Gradient Descent
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.
Upsides
The model update frequency is higher than batch gradient descent which allows for a more robust convergence, avoiding local minima.
The batched updates provide a computationally more efficient process than stochastic gradient descent.
The batching gives both the efficiency of not needing all training data in memory and the efficiency of batched algorithm implementations.
Downsides
Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.
Error information must be accumulated across mini-batches of training examples like batch gradient descent.
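Continuing the same toy setup as the batch gradient descent sketch above, a rough mini-batch version; setting batch_size = 1 recovers stochastic gradient descent and batch_size = len(X) recovers batch gradient descent:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X + rng.normal(scale=0.1, size=100)   # same toy data as above

w, b, lr, batch_size = 0.0, 0.0, 0.05, 10       # arbitrary mini-batch size
for epoch in range(50):
    order = rng.permutation(len(X))             # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = (w * X[idx] + b) - y[idx]
        w -= lr * (err * X[idx]).mean()         # one update per mini-batch
        b -= lr * err.mean()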
Gradient Descent
Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression.
It works by having the model make predictions on training data and using the error on the predictions to update the model in such a way as to reduce the error.
The goal of the algorithm is to find model parameters (e.g. coefficients or weights) that minimize the error of the model on the training dataset. It does this by making changes to the model that move it along a gradient or slope of errors down toward a minimum error value. This gives the algorithm its name of “gradient descent.”
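The update rule, with learning rate η, is w ← w − η ∇L(w). As a made-up worked example, minimizing L(w) = (w − 3)² with η = 0.25 starting from w = 0: the gradient is 2(w − 3), so w goes 0 → 1.5 → 2.25 → 2.625 → …, converging toward the minimum at w = 3.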
K means clustering
Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), serving as a prototype of the cluster.
Given a set of observations (x_1, x_2, …, x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S_1, S_2, …, S_k} so as to minimize the within-cluster sum of squares (WCSS), i.e. the variance.
This is the same as maximizing the between-cluster sum of squares because total variance is constant
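In symbols, the standard WCSS objective is:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,  where \mu_i is the mean (centroid) of the points in S_i.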
1. k initial "means" (in this case k = 3) are randomly generated within the data domain.
2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3. The centroid of each of the k clusters becomes the new mean.
4. Repeat steps 2 and 3 until convergence or until the maximum number of iterations has been reached.
algorithm does not guarantee convergence to the global optimum. The result may depend on the initial clusters.
Look at formulas here…
https://en.m.wikipedia.org/wiki/K-means_clustering#:~:text=k%2Dmeans%20clustering%20is%20a,a%20prototype%20of%20the%20cluster.
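A minimal scikit-learn sketch on toy data (parameter values are arbitrary):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the k means
print(km.inertia_)           # within-cluster sum of squares (WCSS)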
K means initialization
The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Forgy method tends to spread the initial means out.
The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster’s randomly assigned points. Random Partition places all of them close to the center of the data set.
Kmeans++
The first cluster center is chosen uniformly at random from the data points being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.
The exact algorithm is as follows:
Choose one center uniformly at random among the data points.
For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
Repeat steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.
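A rough NumPy sketch of this seeding procedure (illustrative only; kmeans_pp_init is a made-up helper name and X is assumed to be an (n, d) array):

import numpy as np

def kmeans_pp_init(X, k, rng):
    centers = [X[rng.integers(len(X))]]                                   # step 1: first center uniformly at random
    for _ in range(k - 1):
        C = np.asarray(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)   # D(x)^2 to nearest chosen center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])            # step 3: draw proportional to D(x)^2
    return np.asarray(centers)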
Drawbacks kmeans
-Choosing k manually. Elbow method, information criterion, silhouette score.
-Being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding). For a full discussion of k-means seeding see "A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm" by M. Emre Celebi, Hassan A. Kingravi, Patricio A. Vela.
-Clustering data of varying sizes and density.
k-means has trouble clustering data where clusters are of varying sizes and density; it assumes a spherical distribution. To cluster such data, you need to generalize k-means as described in the Advantages section of the linked guide.
Generalized variants can allow different cluster widths (giving more intuitive clusters of different sizes), and even different widths per dimension, resulting in elliptical instead of spherical clusters and improving the result.
http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/clustering.pdf
k-means which assumes that clusters are convex shaped.
-Clustering outliers.
Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
-Scaling with number of dimensions.
As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using "spectral clustering" to modify the clustering algorithm as explained below.
https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages#:~:text=Disadvantages%20of%20k%2Dmeans&text=As%20increases%2C%20you%20need%20advanced,Means%20Clustering%20Algorithm%20by%20M.
Spectral clustering
Spectral clusteringavoids the curse of dimensionality by adding a pre-clustering step to your algorithm:
Reduce the dimensionality of feature data by using PCA.
Project all data points into the lower-dimensional subspace.
Cluster the data in this subspace by using your chosen algorithm.
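A sketch of that pre-clustering recipe with scikit-learn (the number of components and k are chosen arbitrarily):

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=50, centers=4, random_state=0)   # toy high-dimensional data
X_low = PCA(n_components=10).fit_transform(X)                    # steps 1-2: reduce and project
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X_low) # step 3: cluster in the subspace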
Mini batch k means
Its main idea is to use small random batches of examples of a fixed size so they can be stored in memory. Each iteration, a new random sample from the dataset is obtained and used to update the clusters, and this is repeated until convergence. Each mini-batch updates the clusters using a convex combination of the values of the prototypes and the examples, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of examples assigned to a cluster during the process. As the number of iterations increases, the effect of new examples is reduced, so convergence can be detected when no changes in the clusters occur in several consecutive iterations.
Algorithm 1: Mini-Batch K-Means
Given: k, mini-batch size b, iterations t, data set X
Initialize each center c ∈ C with an x picked randomly from X
v ← 0                                      (per-center example counts)
for i ← 1 to t do
    M ← b examples picked randomly from X
    for x ∈ M do
        d[x] ← nearest center in C to x    (cache the assignment for this batch)
    end
    for x ∈ M do
        c ← d[x]                           (cached nearest center for x)
        v[c] ← v[c] + 1                    (update that center's count)
        η ← 1/v[c]                         (learning rate is the inverse of the count, so early samples weigh more)
        c ← (1 − η)c + ηx                  (convex update moving the center toward x)
    end
end
https://www.google.com/url?sa=t&source=web&rct=j&url=https://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf&ved=2ahUKEwi91Zy298zvAhWSFlkFHQZ8BGIQFjANegQIDBAC&usg=AOvVaw3bEHzPtTrItiyfmfv-VCWm&cshid=1616733583017
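scikit-learn ships this as MiniBatchKMeans; a minimal sketch (parameter values are arbitrary):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3, random_state=0).fit(X)
print(mbk.cluster_centers_.shape)   # (5, 2)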
Dbscan
Density based clustering
The algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), plus a set of non-core samples that are close to a core sample (but are not themselves core samples).
Letεbe a parameter specifying the radius of a neighborhood with respect to some point. MinPts is the minimum number of points that must be in that radius for point p to be considered as a core point (minimum number of points required to form a dense region).
A point p is a core point if at least minPts points are within distance ε of it (including p).
A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.
A point q is reachable from p if there is a path p_1, …, p_n with p_1 = p and p_n = q, where each p_{i+1} is directly reachable from p_i. Note that this implies that the initial point and all points on the path must be core points, with the possible exception of q.
All points not reachable from any other point are outliers or noise points.
Now if p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points. See the diagram here: https://en.m.wikipedia.org/wiki/DBSCAN
Only core points can reach non-core points. Looking at the diagram: if a point lies within radius ε of a yellow (non-core) point but does not fall within the radius of any core point, it is not reachable by the definition above, so it is an outlier too. In other words, a non-core point may be reachable, but nothing can be reached from it.
DBSCAN algorithm can be abstracted into the following steps:[4]
Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors.
Find the connected components of core points on the neighbor graph, ignoring all non-core points.
Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor; otherwise assign it to noise.
A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm does not require this by performing these steps for one point at a time.
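A minimal scikit-learn sketch; eps and min_samples here are arbitrary and normally need tuning (see the parameter-estimation card below):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # non-convex toy data
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))                 # cluster ids; -1 marks noise points
core_idx = db.core_sample_indices_     # indices of the core samples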
DBSCAN advantages and disadvantages
Advantages
DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.
DBSCAN can find arbitrarily-shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
DBSCAN has a notion of noise, and is robust to outliers.
DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)
DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R* tree.
The parameters minPts and ε can be set by a domain expert, if the data is well understood
Disadvantages
DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result:[4] on both core points and noise points, DBSCAN is deterministic. DBSCAN*[8] is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.
The quality of DBSCAN depends on the distance measure used in the function regionQuery(P, ε). The most common distance metric used is Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance.
DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters.[9]
If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.
Parameter estimation for DBSCAN
Rules of thumb: set minPts to at least the data dimensionality plus one (a common default is minPts = 2 × dim), then pick ε from a k-distance plot: sort each point's distance to its (minPts − 1)-th nearest neighbor in descending order and choose ε at the elbow of that curve.
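A sketch of that k-distance heuristic with scikit-learn (minPts value and toy data are arbitrary):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_pts = 5
dist, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dist[:, -1])[::-1]   # distance to the (minPts-1)-th other neighbor, sorted descending
# plot k_dist and pick eps at the "elbow" of the curve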
Parameter estimation for kmeans
Elbow method
Silhouette score or other metrics such as the Davies-Bouldin index…
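A sketch of both heuristics with scikit-learn (the range of k is arbitrary):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))
# elbow: look for the k where inertia stops dropping sharply
# silhouette: pick the k with the highest score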
Affinity propagation
clustering algorithmbased on the concept of “message passing” between data points.
Similar to k-medoids, affinity propagation finds "exemplars," members of the input set that are representative of clusters.
Let x_1 through x_n be a set of data points, with no assumptions made about their internal structure, and let s be a function that quantifies the similarity between any two points, such that s(i, j) > s(i, k) iff x_i is more similar to x_j than to x_k. For this example, the negative squared distance of two data points was used.
The diagonal of s, i.e. s(i, i), is particularly important, as it represents the instance preference, meaning how likely a particular instance is to become an exemplar.
The algorithm proceeds by alternating between two message-passing steps, which update two matrices:[1]
The "responsibility" matrix R has values r(i, k) that quantify how well-suited x_k is to serve as the exemplar for x_i, relative to other candidate exemplars for x_i.
The "availability" matrix A contains values a(i, k) that represent how "appropriate" it would be for x_i to pick x_k as its exemplar, taking into account other points' preference for x_k as an exemplar.
Both matrices are initialized to all zeroes, and can be viewed as log-probability tables. Iterations are performed until either the cluster boundaries remain unchanged over a number of iterations, or some predetermined number of iterations is reached. The exemplars are extracted from the final matrices as those whose 'responsibility + availability' for themselves is positive.
The main drawback of affinity propagation is its complexity. The algorithm has a time complexity of the order O(N²T), where N is the number of samples and T is the number of iterations until convergence. Further, the memory complexity is of the order O(N²) if a dense similarity matrix is used.
The damping parameter is to avoid oscillations during the iterative update
Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, ie of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.
It does not enforce equal-size clusters, and it can automatically choose the number of clusters from the data.
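A minimal scikit-learn sketch (the damping value is illustrative):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
ap = AffinityPropagation(damping=0.9, preference=None, random_state=0).fit(X)  # preference=None -> median similarity
print(len(ap.cluster_centers_indices_))   # number of exemplars found, i.e. clusters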
Mean shift clustering
Mean shift clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm which works by updating candidates for centroids to be the mean of the points within a given region, using the bandwidth parameter. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.
Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function.
This is an iterative method, and we start with an initial estimate x. Let a kernel function K(x_i − x) be given; this function determines the weight of nearby points for re-estimation of the mean. Often a Gaussian kernel is used, K(x_i − x) = exp(−c‖x_i − x‖²). The mean is then estimated around x using its neighborhood N(x), the set of points where K(x_i − x) ≠ 0: m(x) = Σ_{x_i ∈ N(x)} K(x_i − x) x_i / Σ_{x_i ∈ N(x)} K(x_i − x). The algorithm then sets x ← m(x) and repeats until convergence or until the maximum number of iterations is reached; in each iteration, x ← m(x) is performed for all candidate points simultaneously.
Procedure
To explain mean-shift we will consider a set of points in two-dimensional space like the above illustration. We begin with a circular sliding window centered at a point C (randomly selected) and having radius r as the kernel. Mean shift is a hill-climbing algorithm that involves shifting this kernel iteratively to a higher density region on each step until convergence.
At every iteration, the sliding window is shifted towards regions of higher density by shifting the center point to the mean of the points within the window (hence the name). The density within the sliding window is proportional to the number of points inside it. Naturally, by shifting to the mean of the points in the window it will gradually move towards areas of higher point density.
We continue shifting the sliding window according to the mean until there is no direction at which a shift can accommodate more points inside the kernel. Check out the graphic above; we keep moving the circle until we no longer are increasing the density (i.e number of points in the window).
This process of steps 1 to 3 is done with many sliding windows until all points lie within a window. When multiple sliding windows overlap the window containing the most points is preserved. The data points are then clustered according to the sliding window in which they reside.
An illustration of the entire process from end-to-end with all of the sliding windows is shown below. Each black dot represents the centroid of a sliding window and each gray dot is a data point.
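A minimal scikit-learn sketch; estimate_bandwidth is one way to pick the bandwidth, and the quantile value here is arbitrary:

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
bw = estimate_bandwidth(X, quantile=0.2)            # bandwidth ~ kernel/window size
ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
print(ms.cluster_centers_)                          # discovered modes; number of clusters is not pre-set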
Mean shift clustering advantages and disadvantages
Strengths
Mean shift is an application-independent tool suitable for real data analysis.
Does not assume any predefined shape on data clusters.
It is capable of handling arbitrary feature spaces.
The procedure relies on choice of a single parameter: bandwidth.
The bandwidth/window size ‘h’ has a physical meaning
In contrast to K-means clustering, there is no need to select the number of clusters as mean-shift automatically discovers this. That’s a massive advantage. The fact that the cluster centers converge towards the points of maximum density is also quite desirable as it is quite intuitive to understand and fits well in a naturally data-driven sense. The drawback is that the selection of the window size/radius “r” can be non-trivial
Weaknesses
The selection of a window size is not trivial.
Inappropriate window size can cause modes to be merged, or generate additional “shallow” modes.
Often requires using adaptive window size
Outliers are still given some cluster label anyway.
Hierarchical clustering
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.
Agglomerative is a bottom up approach. Divisive is top down.
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
Linkage
linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.
Maximum or complete-linkage clustering: max { d(a, b) : a ∈ A, b ∈ B } is used as the distance between clusters A and B when deciding whether to merge/split.
Minimum or single-linkage clustering: min { d(a, b) : a ∈ A, b ∈ B }.
Unweighted average linkage clustering (or UPGMA)
Weighted average linkage clustering (or WPGMA)
Centroid linkage clustering
Minimum energy clustering
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. On each iteration, we combine two clusters into one. The two clusters to be combined are selected as those with the smallest (e.g. average) linkage, i.e. according to our selected distance metric these two clusters have the smallest distance between each other and are therefore the most similar and should be combined.
Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function, but tackled with an agglomerative hierarchical approach.
Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.
Average linkage minimizes the average of the distances between all observations of pairs of clusters.
Single linkage minimizes the distance between the closest observations of pairs of clusters.
Agglomerative clustering has a "rich get richer" behavior that leads to uneven cluster sizes. In this regard, single linkage is the worst strategy, and Ward gives the most regular sizes. However, the affinity (or distance used in clustering) cannot be varied with Ward, so for non-Euclidean metrics, average linkage is a good alternative. Single linkage, while not robust to noisy data, can be computed very efficiently and can therefore be useful for hierarchical clustering of larger datasets. Single linkage can also perform well on non-globular data.
Connectivity constraints
Connectivity constraints can be added to this algorithm (only adjacent clusters can be merged together) through a connectivity matrix that defines, for each sample, the neighboring samples following a given structure of the data. For instance, in the swiss-roll example, the connectivity constraints forbid the merging of points that are not adjacent on the swiss roll, and thus avoid forming clusters that extend across overlapping folds of the roll. This a priori information can also improve speed. Connectivity constraints with single, complete or average linkage can enhance the 'rich getting richer' aspect of agglomerative clustering.
Notes on distance metric
L1 distance is often good for sparse features or sparse noise, i.e. when many of the features are zero, as in text mining using occurrences of rare words.
Cosine distance is interesting because it is invariant to global scalings of the signal.
The guidelines for choosing a metric is to use one that maximizes the distance between samples in different classes, and minimizes that within each class
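A minimal scikit-learn sketch combining a linkage, a metric, and an optional connectivity constraint; note that older scikit-learn versions call the metric argument affinity instead of metric:

from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
conn = kneighbors_graph(X, n_neighbors=10, include_self=False)   # optional connectivity constraint
agg = AgglomerativeClustering(n_clusters=3, linkage="average", metric="cosine",  # affinity= on older versions
                              connectivity=conn).fit(X)
print(agg.labels_[:10])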
Hierarchical clustering ads and disadvantages
Don’t need to specify number of clusters
We can use the dendrogram to visually identify the clusters, based on our choice of linkage and metric.
A particularly good use case for hierarchical clustering methods is when the underlying data has a hierarchical structure and you want to recover that hierarchy.
Computation time is O(n³), much worse than k-means.
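A sketch of building and plotting the dendrogram with SciPy (Ward linkage chosen arbitrarily):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
Z = linkage(X, method="ward")   # agglomerative merge history
dendrogram(Z)                   # tree: leaves are samples, the root gathers everything
plt.show()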
Optics clustering
Ordering points to identify the clustering structure
basic idea is similar toDBSCAN,but it addresses one of DBSCAN’s major weaknesses: the problem of detecting meaningful clusters in data of varying density.
OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its ε-neighborhood N_ε(p) (including point p itself).
OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a core distance that describes the distance to its MinPts-th closest point: core-dist_{ε,MinPts}(p) is undefined if p is not a core point, and otherwise equals that MinPts-th nearest-neighbor distance. The reachability distance of o from p is reachability-dist_{ε,MinPts}(o, p) = max(core-dist(p), dist(p, o)), again undefined if p is not a core point.
See the full formulas for reachability distance and core distance on Wikipedia.
A generalization of DBSCAN that relaxes the eps requirement from a single value to a value range. The OPTICS algorithm builds a reachability graph, which assigns each sample both a reachability_ distance and a spot within the cluster ordering_ attribute; these two attributes are assigned when the model is fitted, and are used to determine cluster membership.
The HDBSCAN implementation is multithreaded, and has better algorithmic runtime complexity than OPTICS, at the cost of worse memory scaling. For extremely large datasets that would exhaust system memory using HDBSCAN, OPTICS will maintain n (as opposed to n²) memory scaling.
Unlike DBSCAN, keeps cluster hierarchy for a variable neighborhood radius. Better suited for usage on large datasets than the current sklearn implementation of DBSCAN.
Clusters are then extracted using a DBSCAN-like method (cluster_method='dbscan') or an automatic technique proposed in [1].
The default cluster extraction with OPTICS looks at the steep slopes within the graph to find clusters, and the user can define what counts as a steep slope using the parameter xi. There are also other possibilities for analysis on the graph itself, such as generating hierarchical representations of the data through reachability-plot dendrograms, and the hierarchy of clusters detected by the algorithm can be accessed through the cluster_hierarchy_ attribute.
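A minimal scikit-learn sketch (xi, min_samples and the toy data are arbitrary):

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.5, 1.5, 3.0], random_state=0)  # varying density
opt = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05).fit(X)
print(set(opt.labels_))                     # -1 marks noise
reach = opt.reachability_[opt.ordering_]    # reachability-plot values, in cluster order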
Clustering performance evaluation- Rand Index
Rand index
- A function that measures the similarity of the two assignments, ignoring permutations.
- The two assignments are label_true and label_pred for each sample…
- The Rand index does not guarantee a value close to 0.0 for a random labelling. The adjusted Rand index corrects for chance and provides such a baseline.
- Furthermore, both rand_score and adjusted_rand_score are symmetric: swapping the arguments does not change the scores. They can thus be used as consensus measures.
- Perfect labeling is scored 1.0
- Poorly agreeing labels (e.g. independent labelings) have lower scores, and for the adjusted Rand index the score will be negative or close to zero. However, for the unadjusted Rand index the score, while lower, will not necessarily be close to zero
Adjusted Rand index: ARI = (RI − E[RI]) / (max(RI) − E[RI])
Given a set S of n elements and two partitions (clusterings) X and Y of S, define:
a, the number of pairs of elements in S that are in the same subset in X and in the same subset in Y
b, the number of pairs of elements in S that are in different subsets in X and in different subsets in Y
c, the number of pairs of elements in S that are in the same subset in X and in different subsets in Y
d, the number of pairs of elements in S that are in different subsets in X and in the same subset in Y
The Rand index, R, is:[1][2]
RI = (a + b) / (a + b + c + d) = (a + b) / (n choose 2)
Normalized to take chance into account.
See example:
https://davetang.org/muse/2017/09/21/the-rand-index/
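A minimal scikit-learn sketch on made-up label vectors:

from sklearn.metrics import rand_score, adjusted_rand_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]                        # same grouping, permuted label names
print(rand_score(labels_true, labels_pred))             # 1.0 (permutation-invariant)
print(adjusted_rand_score(labels_true, labels_pred))    # 1.0
print(adjusted_rand_score(labels_true, [0, 1, 2, 0, 1, 2]))  # negative here: worse than chance agreement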
Clustering performance evaluation- mutual information score
Basically it is the expected information gain. The expected value of the information gain is the mutual information I(X; A) of X and A, i.e. the reduction in the entropy of X achieved by learning the state of the random variable A.
The mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable.
Mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.
I(X; Y) = D_KL( P_(X,Y) || P_X ⊗ P_Y )
For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. As a result, in this case the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X). Moreover, this mutual information is the same as the entropy of X and as the entropy of Y.
The adjusted MI (AMI) additionally normalizes and adjusts the score against chance, unlike plain MI or NMI.
Advantages
Random (uniform) label assignments have an AMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure, for instance).
Upper bound of 1: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, an AMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
Drawbacks
Contrary to inertia, MI-based measures require knowledge of the ground-truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However MI-based measures can also be useful in purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
NMI and MI are not adjusted against chance.
mutual information and also the normalized variant is not adjusted for chance and will tend to increase as the number of different labels (clusters) increases, regardless of the actual amount of “mutual information” between the label assignments.
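A minimal scikit-learn sketch contrasting the adjusted and unadjusted variants:

from sklearn.metrics import (mutual_info_score, normalized_mutual_info_score,
                             adjusted_mutual_info_score)

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]   # same partition, permuted names
print(adjusted_mutual_info_score(labels_true, labels_pred))    # 1.0 for identical partitions
print(normalized_mutual_info_score(labels_true, labels_pred))  # 1.0 as well
# for random labellings, AMI stays near 0, while MI/NMI tend to grow with the number of clusters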