12: Describe the Hill Climbing Algorithm

Move to the neighbor with the highest value.

- Guess an x in X.
- n* = argmax n in N(x) f(n)
- If f(n) > f(x): x=n, else stop
- Go to 2.

What are strengths and weaknesses of hill climbing?

It is very simple, very quick, and given a monotonic surface will converge.

It is NOT capable of dealing with very ‘rough’ surfaces and has difficulty finding optima with very narrow basins of attraction.

What is a method of helping hill climbing deal with rough surfaces?

Random restarts.

Describe Simulated Annealing

It starts as a random hill climb / walk and gradually segues into hill climbing by a temperature gradient / annealing schedule.

For a finite set of iterations:

- Sample a new point x_t in N(x) [Neighborhood of x]
- Jump to the new neighbor (move x to x_t) with a probability given by an acceptance probability function P(x, x_t, T)
- Decrease temperature T

P(x, x_t, T) = {

1 if f(x_t) >= f(x)

exp( f(x_t)-f(x) / T ) elsewise

}

What is the probability of ending at any point in a simulated annealing iteration?

exp(f(x)) / Z_T

This is known as the Boltzmann Distribution and is seen frequently in engineering.

As T is high it smooths out the probability density function; as the temperature falls it will narrow the probability band onto the optimum.

Describe the genetic algorithm flow.

Start with your initial population P_0 of size K.

Repeat until converged:

{

Compute fitness of all x in P_t

Select the ‘most fit’ in a consistent manner.

Pair up individuals, replacing ‘least fit’ via crossover / mutation.

}

What are different kinds of crossover?

- Uniform crossover: Each bit has a uniform probability of swapping.
- Single point: Head / tail shifted.
- Multiple point

What is the probability model for MIMIC?

P = {

1 / z_theta if f(x) >= theta

0 elsewise

}

P^(theta_min) (x) -> uniform distribution over all points.

P^(theta_max) (x) -> uniform distribution over global optima.

Describe the algorithm for MIMIC

Rising water keeps structure memory and only keeps the high points.

What is a short, idiomatic, descriptive way of separating supervised from unsupervised learning?

Supervised learning uses labeled training data to generalize labels to new instances and is function approximation. Unsupervised makes sense out of unlabeled data and is data description.

How is domain knowledge injected in most clustering algorithms?

The measure of pairwise distance or similarity.

Describe a basic clustering problem.

Given a set of objects X with inter-object distances D output a set of partitions P_D in a consistent manner such that two points in the same cluster are in the same partition.

What are some examples of trivial clustering partition methods?

- A partition for every point.

2. A single partition.

What is Single Linkage Clustering (SLC)?

A heirarchical (top down / bottom up) agglomerative (slowly fold points in) clustering algorithm

Describe the Algorithm and Runtime Complexity of Single Linkage Clustering

To create k clusters:

- Consider each object (point) a unique cluster (n objects exist.)
- Define the intercluster distance as the distance between the closest points in any two clusters. Note that two points within the same cluster have an intercluster distance of 0.
- Merge the two closest clusters.
- Repeat n-k times to make k clusters.

You eventually construct a dendrogram / tree with as many root nodes as k. As k approaches 1 a fully fleshed tree will join at a single root.

The runtime complexity is O(n^3) because there are n points which you must loop over to calculate distances for n-1 points (the distance to yourself is zero!) during the distance calculation step. You repeat this n-k times and as k approaches n/2 this will approach O(n^3) as a theoretical max and O(n^2) as a theoretical min.

Describe the Algorithm and Runtime Complexity of K-Means Linkage Clustering

To create k clusters:

- Initialize k centroids in the manner you choose (intelligently / randomly / …)
- Assign all n points to the nearest centroid. This requires n * k distance calculations
- Recalculate the centroids of each cluster by setting the centroid coordinates to be the intra cluster mean of the points.
- Go to step 2 and repeat until
*convergence*

The runtime complexity for this is a little more complex. As k approaches n / 2 the complexity for step two will approach O(n^2) and *this* step is repeated until convergence. Intelligent seeding could get you close to an answer and run in far under n iterations but it’s likely a safe assumption to say a theoretical max of O(n^3) and a min of O(n^2).

Does K-Means Clustering Always Converge?

Yes, given enough time, and possibly only to a local minima depending on centroid initialization. Many implementations will run j randomized trials with different seeds. Some implementations will enforce initial clusters to be distant from one another which generally improves results.

Does K-Means Clustering Always Converge to a Good Answer?

What is good? K-means is always going to partition your space in such a manner that the points which are ‘close’ to each other as defined by your distance metric are in the same cluster assuming that k was selected correctly.

What is the Usual Distance Metric for K-Means?

Euclidean, though others may be used. Any smooth measure will work. (Source)

How is K-Means Related to Expectation Maximization?

K-Means can be seen as a special case of EM with a small all-equal, diagonal covariance matrix.

Why does Dimension Reduction Generally Improve K-Means Clustering Results?

K-Means, as a result of using Euclidean distances, suffers from the curse of dimensionality and distances will quickly exaggerate as dimensionality rises. Reducing the dimensionality can *help*, though possibly not solve, this issue.

Define K-Means Mathematically

P^t(x): Partition / cluster of object x

C^t_i: Set of all points in cluster i -> {x s.t. P(x)=i}

- Center^t_i = \sum(y in C_i)/|C_i| (This is a centroid, calculated with a mean)
- Center^O_i - > P^t(x) = argmin_i ||x-center^t_i||^2_2

To run this seed clusters and then repeat 2, 1 until convergence.

How are clusters scored?

Tightness / deviation is a good metric.

One metric is intra cluster squared distances:

E(P, center) = \sum_x ||Center_P(x) -x||^2_2

Which Optimization Routine is K-Means Most Like?

Hill climbing; K-Means moves in a single direction (minimizing intra cluster squared error) and only towards a peak.

What is the Behaviour of the K-Means Error Curve as the Algorithm Progresses?

It is monontically non-increasing. Every time cluster centroids are shifted the error term (sum of intra cluster squared distances) will never decrease.

How Can You Avoid Getting Stuck in Local Optima in K-Means?

Either conduct random restarts or intelligently spread the centroid seeds across the dataset.

What is Soft Clustering?

A way of using probability to assign points to clusters.

What is the Algorithm and Runtime for EM?

- Expectation:

Derive the expected value of z from the centroid means by calculating the normalized probability for every point that it was drawn from that centroid defined as a probability distribution. (In this case a normal distribution.)

Across i points and j centroids:

E[z_ij] = (P(X=x_i|mu=mu_j)) / (sum_(i=1 to k) P(X=x_i|mu=mu_j))

P(X=x_i|mu=mu_j) = e^{-1/2 (x_i - mu_j)^2 sigma^2}

- Maximization:

Derive new centroid means by calculating the average weighted by the probabilities created in the expectation sep.

mu_j = sum_i(E[Z_ij] * x_i) / sum_i(E[Z_ij])

The runtime is similar to K-Means, but it’s slower.

How are EM and K-Means Alike?

K-Means *is* EM with some assumptions made.

What are some properties of EM?

- Monotonically non-decreasing likelihood.
- Does not converge (define a threshold for convergence).
- Will not diverge.
- Can get stuck in local optima.
- Can work with any distribution if E,M are solvable.

What are the three primary clustering properties of interest?

Richness, scale invariance, and consistency.

You can always have two of them, but not three. With a little hand waving you can get pretty close.

Why do we do feature selection

Discover knowledge and add interpretability.

Reduce or remove the curse of dimensionality; the amount of data necessary is exponential in the number of features (2^n)

What is the upper bound of computational complexity for finding a subset of features from n features?

2^n

This is NP hard.

What is filtering?

Filtering is where the feature selection algorithm is run, once, up front and the filtered features are passed on to the learner.

What is wrapping?

Wrapping is where the feature selection algorithm and the learner work cyclically, informing the other process.

What are the differences between wrapping and filtering?

Speed: Filtering is fast; you don’t care what the learner wants and this process ignores the learning problem. This is recommended for a well known problem. Wrapping is *slow*

What algorithms have wrapping embedded in them?

Decision trees and boosting implicitly filter.

How is filtering done?

Define a metric / criterion such as information gain to determine the usefulness of a set of features. You could use variance or entropy. Run a neural net and prune the features with low weights. This leaves you with ‘useful’ features.

How is wrapping done?

Similarly to filtering but it is an iterative process that *improves* on some metric. Run the learning algorithm with the down selected features and use the SSE as a point in an optimization surface. You could use many different optimization routines like RHC.

What is ‘Searching’ as it pertains to feature selection?

Forward search: Start with no features. Add the single feature which most improves the metric of concern (i.e. SSE from your classifier / regressor). Add another single feature in the same manner. Repeat. This is a form of hill climbing.

Backward search: Start with all features. Remove the single feature which, when removed, most improves the metric of concern.

Combination: Both.

When is a feature considered strongly relevant?

When removing it degrades the bayes optimal classifier.

When is a feature considered weakly relevant?

When it is not strongly relevant and there exists some subset of features where adding this feature back in improves the bayes optimal classifier.

When is a feature considered irrelevant?

When it is not strongly or weakly relevant.

How does usefulness of a feature differ from relevance?

Usefulness measures the effect of the feature on a particular predictor. It’s about minimizing error given an algorithm; it could be irrelevant but useful!

What is feature transformation?

Preprocessing a set of features to create a new (hopefully smaller and or more compact) set of features while retaining as much (relevant / useful) information as possible.

How does feature transformation differ from feature selection?

Although the goals are the same (new, relevant, and smaller set) feature selection is a subset of transformation. Transformation can be an arbitrary process but transformation is limited to linear transforms (normally).

How can you describe feature transformation mathematically?

x -> F^n ~ F^m

This takes instances down from some n dimensional to m dimensional space. (m is normally less than n)

P^{T}x

The transformation operator is normally a linear operator.

Does transformation always reduce the number of features?

No, you can project into higher dimensional space.

What is an implicit assumption in feature transformation?

You don’t need all the original features; you need a smaller set that might require bringing in information across the features.

Why should feature transformation be done explicitly outside of the learning algorithms?

You can minimize polysemy and synonmy.

What is polysemy?

An effect that leads to false positives. (A word can have many meanings)

What is synonmy?

An effect that leads to false negatives. (Many words can have one meaning)

What is the general aim of PCA?

To find the axes of maximum variance. What does this mean? It means it minimizes the L2 norm (SSE) from the axis to the points.

What is a requirement of the second principal component?

It must be orthogonal to the first.

What is a mathematical technique to find principal components?

Singular value decomposition.

What are some of the properties of the eigenvectors associated with PCA?

- They are guaranteed to be non-negative

2. They monotonically decrease.

Why would you do dimension reduction to a dataset?

You can remove noise and improve the accuracy of a classifier.

What is ICA?

Independent Components Analysis. It attempts to maximize indepence among the new dimensions by finding a linear transform of the old space where the new features are statistically independent by maximizing kurtoses.

What is mutual information?

A measure of how much one variable tells you about another.

What are major differences between PCA and ICA?

PCA guarantees your new axes / features will be mutually orthogonal.

ICA guarantees your new features will be mutually independent.

PCA maximizes variance.

ICA maximizes mutual information.

PCA returns an ordered set of features.

ICA returns an unordered set of features.

Can PCA do blind source separation?

No, it’s terrible at it. ICA can do it quite well.

Which process (PCA / ICA) is directional?

ICA

Which process (PCA / ICA) can process different answers every iteration?

ICA

Which process (PCA / ICA) finds local features / edges?

ICA

Which process (PCA / ICA) finds global features?

PCA

What are some alternatives to PCA / ICA?

RCA (Random Component Analysis) or Random Projection. It projects onto random axes, and turns out to work well.

LDA (Linear Discriminant Analysis). This finds projections which discriminate based on the label. Come up with a projection that puts things into separate clusters. LDA cares about the label.

How can we represent a message in bits in the shortest length?

Take the highest probability letter and assign it the bit 0 (or 1). The other goes down the right hand side of the tree and the *next most common* letter get 0 (or 1) for its *second* bit. Repeat ad nauseum.

Now you have something like:

- A = 0
- B = 111
- C = 110
- D = 10

The expected message size is $\sum P_i Length_i$ or $0.5 * 1 + 0.25 * 2 + 0.125 * 1 = 1.125$

This is called variable length encoding.

What is the formula for entropy and calculate entropy for our toy example.

$H(s) = -\sum_s P(s) log(P(s)) = -(0.5log(0.5) + 2 * 0.25log(0.25) + 0.125log(0.125) = -1.875$

Note these are log base 2.

What is the formula for *conditional* entropy?

$H(X|Y) = -\sum P(X|Y) log(P(X|Y))$

Note these are log base 2.

What is the formula for entropy given two independent variables?

$H(X,Y) = H(X) + H(Y)$

What is the formula for mutual information?

$I(X,Y) = H(Y)-H(Y|X)$

How should mutual information be interpreted?

High values of I indicate that Y depends heavily on X and there is a lot of mutual information.

Low values indicate that Y does not depend on X (it is an independent variable) and there is no mutual information.

What is KL divergence?

Kullback-Leibler Divergence is a distance measure used between two distributions.

$KL = - \integral P(X) log(P(X)/Q(X))$

It’s not a *complete* distance measure because it doesn’t follow the triangle law (I still need to look this up).

In ML we use a well known distribution for P and sample from Q to determine difference.

Describe how a Markov Decision Process works?

MDPs capture world information by dividing the world into states (s) which are a combination of the variables that describe your agents behavior at a given point in time.

Every state has a set of actions (a_s) that are available for it to use and a set of following states (s’) that it could transition into with some probability T(s,a,s’). Your probabilities across s’ from s must sum to 1. $\sum_si T(s,a,s’) = 1$

States: S Model: T(s, a, s') ~ Pr(s'|s,a) Actions: A(s), A Reward: R(s), R(s,a), R(s,a,s') Policy: $\Pi(s)$ -> a $\Pi^* $

What is the ‘Markovian property’?

Isn’t that from the avengers movies?

Just kidding…

You don’t have to condition on anything except the present! The past doesn’t matter.

What types of processes can be modeled as an MDP?

Almost anything, but given too much information it can get unwieldy.

How is domain knowledge encapsulated in an MDP / reinforcement learning process?

Inside the reward function..

What defines an optimal policy?

It maximizes total reward over a ‘lifetime’.

What is the ‘Temporal Credit Assignment’ problem?

Associating an action with ‘final rewards’ given that in this state you took this ation. This gives a sequence, a time, and a final reward.

How, in an MDP, do we prevent infinite reward accrual and avoiding termination?

Use a discounting factor (commonly called $\gamma$.)

hat is the Bellman Equation?

$U(s) = V(s) = R(s) + \gamma max_a \sum_s T(s,a,s’) U(s’)$

How is value iteration done?

Using the Bellman equation…

$\hat{U}__{t}(s) = R(s) + \gamma max_a \sum_s’ T(s,a,s’) \hat{U}__{t}(s’)$

Start the process initializing values / utility for every state to some value (0 is common.)

Then, essentially, for every state take the reward for that state and add the *discounted* sum of utilities for every state that it could transition into multiplied by the probability of getting to that state. This has the effect of penalizing bad routes over time.

How is policy iteration done?

Using the Bellman equation…

$\hat{U}__{t}(s) = R(s) + \gamma max_a \sum_s’ T(s,\pi_t(s),s’) \hat{U}__{t}(s’)$

$\pi_{t+1} = argmax_a \sum T(s,a,s’) U_t(s’)$

Notice that in this case the action is a function of the policy that you have *chosen*.

Start with a policy (randomly selected works fine.)

Evaluate the Bellman equations given that selected policy.

For every state update the policy to the *most optimal* action that could be taken in that policy given the utilities calculated.

Reevaluate and reupdate, repeat ad nauseum until convergence.

Why would you use policy iteration instead of value iteration?

It reduces the set of equations from n equations with n unknowns to n equations with n unknowns that is *linear*.

Now you can use matrix operations to solve for the policy in each iteration.

Policy iteration converges in less iterations than value iteration normally.

What are the two general components to RL?

A modeler (take the transitions and build a model) and a simulator (run the model to build transitions).

What are the three approaches to RL discussed?

- Use states into $\Pi$ to get actions (Policy)
- Use states into $U$ to get values (Value)
- Use states and actions into T,R to get the next state and reward (Transition / Reward)

Reinforcement algorithms that are searching in policy space are very indirect and are a temporal credit assignment problem. Those that target values learn to map states to values. Using transitions and rewards are model based learners and can be couched as a supervised learning problem.

What is the Q function?

$Q(s,a) = R(s) + \gamma \sum_s’ T(s,a,s’) [max_a’ Q(s’,a’)]$

This is the value for arriving in state s, leaving via a, and proceeding optimally thereafter.

What is a ‘transition’?

Moving from state s to s’ given some action and yielding some reward.

What is the Q learning update function?

$\hat{Q}(s,a)

What is the Q function calculating?

The average value you will receive for following the optimal discounted policy after you take this action in this state.

Does Q learning converge?

Yes, IF every state / pair tuple is visited infinitely often.

What is the biggest tradeoff in Q learning?

The exploitation / exploration tradeoff. Exploitation, or using what you know, tends to lean towards the states you’ve seen with high rewards. Exploration has the possibility fo finding *new* state / action pairs with higher reward!

One agent has conflicting objectives.