Final Exam COPY Flashcards
Study for Final
Describe the Hill Climbing Algorithm
Move to the neighbor with the highest value.
1. Guess an x in X.
2. n* = argmax_{n in N(x)} f(n)
3. If f(n*) > f(x): x = n*, else stop.
4. Go to 2.
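A minimal sketch of steps 1-4 in Python, assuming a caller-supplied objective f and neighborhood function neighbors (both hypothetical names, with toy values):

```python
import random

def hill_climb(f, neighbors, x):
    """Greedy hill climbing: move to the best neighbor until no neighbor improves f."""
    while True:
        best = max(neighbors(x), key=f)   # n* = argmax over N(x) of f(n)
        if f(best) > f(x):                # if f(n*) > f(x): take the step
            x = best
        else:                             # otherwise we are at a (local) optimum
            return x

# Toy example: maximize f(x) = -(x - 3)^2 over the integers, stepping by +/- 1
f = lambda x: -(x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(hill_climb(f, neighbors, random.randint(-10, 10)))  # -> 3
```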
What are strengths and weaknesses of hill climbing?
It is very simple, very quick, and, given a monotonic surface, it will converge to the optimum.
It is NOT capable of dealing with very ‘rough’ surfaces and has difficulty finding optima with very narrow basins of attraction.
What is a method of helping hill climbing deal with rough surfaces?
Random restarts.
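A sketch of random restarts, assuming the hypothetical hill_climb, f, and neighbors from the hill-climbing sketch above are in scope:

```python
import random

def random_restart_hill_climb(f, neighbors, sample, restarts=20):
    """Run hill climbing from several random starting points and keep the best result."""
    return max((hill_climb(f, neighbors, sample()) for _ in range(restarts)), key=f)

# sample() draws a fresh random starting point for each restart
best = random_restart_hill_climb(f, neighbors, lambda: random.randint(-100, 100))
```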
Describe Simulated Annealing
It starts as a random walk and gradually segues into hill climbing, controlled by a temperature gradient / annealing schedule.
For a finite set of iterations:
1. Sample a new point x_t in N(x) [Neighborhood of x]
2. Jump to the new neighbor (move x to x_t) with a probability given by an acceptance probability function P(x, x_t, T)
3. Decrease temperature T
P(x, x_t, T) = {
1 if f(x_t) >= f(x)
exp( (f(x_t) - f(x)) / T ) elsewise
}
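A minimal sketch of that loop, reusing the same hypothetical toy objective / neighborhood as the hill-climbing example and a simple geometric cooling schedule:

```python
import math
import random

def simulated_annealing(f, neighbors, x, T=10.0, cooling=0.95, iters=1000):
    """Accept uphill moves always; accept downhill moves with prob exp((f(x_t) - f(x)) / T)."""
    for _ in range(iters):
        x_t = random.choice(neighbors(x))                 # 1. sample a new point in N(x)
        if f(x_t) >= f(x) or random.random() < math.exp((f(x_t) - f(x)) / T):
            x = x_t                                       # 2. jump with probability P(x, x_t, T)
        T = max(T * cooling, 1e-9)                        # 3. decrease temperature
    return x

# Toy objective (hypothetical): maximize -(x - 3)^2 over the integers
f = lambda x: -(x - 3) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(simulated_annealing(f, neighbors, random.randint(-50, 50)))
```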
What is the probability of ending at any point in a simulated annealing iteration?
exp(f(x) / T) / Z_T
This is known as the Boltzmann Distribution and is seen frequently in engineering.
When T is high it smooths out the probability density function (close to uniform); as the temperature falls the probability mass narrows onto the optima.
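A tiny numeric illustration of that smoothing / narrowing, with hypothetical fitness values:

```python
import math

# Boltzmann weights exp(f(x) / T) / Z_T over three points, at a high and a low temperature
fitness = [1.0, 2.0, 3.0]
for T in (10.0, 0.1):
    weights = [math.exp(f / T) for f in fitness]
    Z_T = sum(weights)
    print(T, [round(w / Z_T, 3) for w in weights])
# High T -> nearly uniform; low T -> the probability mass concentrates on the best point.
```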
Describe the genetic algorithm flow.
Start with your initial population P_0 of size K.
Repeat until converged:
{
Compute fitness of all x in P_t
Select the ‘most fit’ in a consistent manner.
Pair up individuals, replacing ‘least fit’ via crossover / mutation.
}
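A compact sketch of that flow on bit strings, with hypothetical parameters (truncation selection for the 'most fit', single-point crossover, per-bit mutation):

```python
import random

def genetic_algorithm(fitness, length=20, pop_size=30, generations=100, p_mut=0.01):
    """Bit-string GA: keep the fitter half each generation, refill by crossover + mutation."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)             # compute fitness of all x in P_t
        survivors = pop[: pop_size // 2]                # select the most fit
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)          # pair up individuals
            point = random.randrange(1, length)         # single-point crossover
            child = a[:point] + b[point:]
            child = [bit ^ 1 if random.random() < p_mut else bit for bit in child]  # mutation
            children.append(child)
        pop = survivors + children                      # least fit replaced by offspring
    return max(pop, key=fitness)

# Toy fitness (hypothetical): number of 1-bits ("one-max")
print(genetic_algorithm(sum))
```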
What are different kinds of crossover?
- Uniform crossover: Each bit has a uniform probability of swapping.
- Single point: pick a cut point and swap the heads / tails.
- Multiple point: swap alternating segments between several cut points.
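Sketches of the first two on equal-length sequences (hypothetical helper names):

```python
import random

def single_point_crossover(a, b):
    """Pick one cut point; the children swap tails after it."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def uniform_crossover(a, b, p=0.5):
    """Each position independently swaps between the parents with probability p."""
    child1, child2 = list(a), list(b)
    for i in range(len(a)):
        if random.random() < p:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2
```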
What is the probability model for MIMIC?
P^theta(x) = {
1 / Z_theta if f(x) >= theta
0 elsewise
}
P^(theta_min) (x) -> uniform distribution over all points.
P^(theta_max) (x) -> uniform distribution over global optima.
Describe the algorithm for MIMIC
"Rising water": raise a fitness threshold theta while a learned distribution remembers the structure of the good points, keeping only the high points.
- Generate samples from the current distribution P^theta_t(x).
- Raise the threshold: set theta_{t+1} to a higher fitness percentile (e.g. the median of the samples) and keep only the samples with f(x) >= theta_{t+1}.
- Fit a new distribution P^theta_{t+1}(x) to the retained samples (MIMIC uses a dependency tree built from pairwise mutual information, which is how structure is remembered).
- Repeat; theta rises toward theta_max and the distribution narrows onto the global optima.
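A heavily simplified sketch of that loop on bit strings. It keeps the rising-threshold idea but fits independent per-bit probabilities instead of MIMIC's dependency tree, so the structure-learning part is omitted; names and parameters are hypothetical:

```python
import random

def rising_threshold_sampler(fitness, length=20, n_samples=200, keep=0.5, iters=30):
    """Sample, keep the points above the current fitness percentile, refit the distribution."""
    probs = [0.5] * length                                    # P^(theta_min): uniform over all points
    for _ in range(iters):
        samples = [[int(random.random() < p) for p in probs] for _ in range(n_samples)]
        samples.sort(key=fitness, reverse=True)
        elite = samples[: int(n_samples * keep)]              # raise theta: keep only the high points
        probs = [sum(x[i] for x in elite) / len(elite) for i in range(length)]  # refit marginals
    return max(samples, key=fitness)

# Toy fitness (hypothetical): one-max
print(rising_threshold_sampler(sum))
```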
What is a short, idiomatic, descriptive way of separating supervised from unsupervised learning?
Supervised learning uses labeled training data to generalize labels to new instances; it is function approximation. Unsupervised learning makes sense of unlabeled data; it is data description.
How is domain knowledge injected in most clustering algorithms?
The measure of pairwise distance or similarity.
Describe a basic clustering problem.
Given a set of objects X with inter-object distances D, output a partition P_D that assigns each object to a cluster in a consistent manner, so that P_D(x) = P_D(y) whenever x and y belong to the same cluster.
What are some examples of trivial clustering partition methods?
- A partition for every point.
- A single partition.
What is Single Linkage Clustering (SLC)?
A hierarchical (top-down / bottom-up) agglomerative (slowly fold points in) clustering algorithm.
Describe the Algorithm and Runtime Complexity of Single Linkage Clustering
To create k clusters:
- Consider each object (point) a unique cluster (n objects exist.)
- Define the intercluster distance as the distance between the closest points in any two clusters. Note that two points within the same cluster have an intercluster distance of 0.
- Merge the two closest clusters.
- Repeat n-k times to make k clusters.
You eventually construct a dendrogram / tree with k root nodes; as k approaches 1 the fully built tree joins at a single root.
The naive runtime complexity is O(n^3): each merge step scans all O(n^2) pairwise cluster distances to find the closest pair (the distance from a point to itself is zero and is ignored), and this scan is repeated n - k times. When k is small relative to n this is roughly n * n^2 = O(n^3); when k is close to n (only a few merges) it is closer to O(n^2).
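A naive sketch of those merge steps on 1-D points, with a hypothetical distance function; real implementations cache the distance matrix instead of recomputing it:

```python
def single_linkage(points, k, dist):
    """Naive SLC: repeatedly merge the two clusters whose closest member points are nearest."""
    clusters = [[p] for p in points]                     # each point starts as its own cluster
    while len(clusters) > k:                             # n - k merges
        best = None
        for i in range(len(clusters)):                   # scan all pairs of clusters
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)                   # merge the two closest clusters
    return clusters

# Toy usage (hypothetical data): 1-D points with absolute-difference distance
print(single_linkage([1, 2, 10, 11, 50], k=3, dist=lambda a, b: abs(a - b)))
```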
Describe the Algorithm and Runtime Complexity of K-Means Clustering
To create k clusters:
- Initialize k centroids in the manner you choose (intelligently / randomly / …)
- Assign all n points to the nearest centroid. This requires n * k distance calculations
- Recalculate the centroids of each cluster by setting the centroid coordinates to be the intra cluster mean of the points.
- Go to step 2 and repeat until convergence
Each iteration costs O(n * k) distance calculations for the assignment step (approaching O(n^2) as k approaches n / 2) plus the centroid recomputation, and this is repeated until convergence. Intelligent seeding can get you close to an answer and converge in far fewer iterations, but a safe rough bound is O(n^2) per iteration in the worst case, so somewhere between O(n^2) and O(n^3) overall depending on how many iterations convergence takes.
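A minimal sketch of those steps on 1-D data with random seeding (hypothetical data and parameters):

```python
import random

def k_means(points, k, iters=100):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute means."""
    centroids = random.sample(points, k)                         # step 1: seed k centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                         # step 2: assign to nearest centroid
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]  # step 3: recompute centroids
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:                           # step 4: repeat until convergence
            break
        centroids = new_centroids
    return centroids, clusters

# Toy usage (hypothetical data)
print(k_means([1.0, 2.0, 10.0, 11.0, 50.0], k=3))
```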
Does K-Means Clustering Always Converge?
Yes, given enough time, though possibly only to a local minimum depending on centroid initialization. Many implementations run several randomized trials with different seeds, and some enforce that initial centroids be distant from one another, which generally improves results.
Does K-Means Clustering Always Converge to a Good Answer?
What is good? K-means will always partition your space so that points which are 'close' to each other, as defined by your distance metric, end up in the same cluster, assuming that k was selected correctly.
What is the Usual Distance Metric for K-Means?
Euclidean, though others may be used. Any smooth measure will work.
How is K-Means Related to Expectation Maximization?
K-Means can be seen as a special case of EM with a small all-equal, diagonal covariance matrix.
Why does Dimension Reduction Generally Improve K-Means Clustering Results?
K-Means, because it uses Euclidean distances, suffers from the curse of dimensionality: as dimensionality rises, distances become less discriminative and nearest-centroid assignments carry less information. Reducing the dimensionality can help with, though possibly not solve, this issue.
Define K-Means Mathematically
P^t(x): Partition / cluster assigned to object x at iteration t
C^t_i: Set of all points in cluster i -> {x s.t. P^t(x) = i}
- center^t_i = (\sum_{y in C^t_i} y) / |C^t_i| (this is a centroid, calculated with a mean)
- P^t(x) = argmin_i ||x - center^{t-1}_i||^2_2 (assign each point to its nearest centroid)
To run this, seed the initial centers center^0_i, then alternate the assignment and centroid updates until convergence.
How are clusters scored?
Tightness / deviation is a good metric.
One metric is intra cluster squared distances:
E(P, center) = \sum_x ||center_{P(x)} - x||^2_2
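A one-function sketch of that score on 1-D data (hypothetical names and values):

```python
def cluster_score(points, assignment, centroids):
    """Intra-cluster squared error: sum over x of ||center_{P(x)} - x||^2 (1-D version)."""
    return sum((centroids[assignment[i]] - p) ** 2 for i, p in enumerate(points))

# Toy usage: two points in cluster 0 (center 1.5), one in cluster 1 (center 10.0) -> 0.5
print(cluster_score([1.0, 2.0, 10.0], assignment=[0, 0, 1], centroids=[1.5, 10.0]))
```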
Which Optimization Routine is K-Means Most Like?
Hill climbing; K-Means only ever moves in one direction with respect to its score (the intra-cluster squared error never increases), so like hill climbing it converges monotonically to a local optimum.