How does single linkage clustering work? (13.4)

- Consider each object (sample) a cluster (n objects)
- define intercluster distance as distance b/t the
__closest__two points in clusters with different labels - merge two closest clusters
- repeat n-k times to make k clusters

If two points are in the same cluster, the intercluster distance is 0.

You end up creating a tree structure with the root being everything in a single cluster.

How you define distance is one way of inserting domain knowledge.

This typically leads to chain like structures.

When might median be a good distance metric in single linkage clustering? (13.4)

When the value of the distance itself doesn’t really matter, but the order is what is most important.

What is the a metric vs non metric statistic? (13.4)

Metric: the specific values matter a lot (average, the details of the number matter)

Non-Metric: the specific values don’t matter, but rank order does (eg median)

What’s the running time of Single Linkage Clustering? (13.5)

O(n^3) because:

for k times:

find distance b/t each point and next closest point which is O(n^2) and do that n times.

How does K-Means clustering work? (13.7)

- Pick K rand points to be the centers of the k clusters
- Each center “claims” its closest points
- Recompute centers by averaging clustered points (new center doesn’t have to be a point!)
- Repeat until clusters no longer change

Why does K-means clustering converge? (13.11)

The error is monotonically non-increasing.

When updating the partition:

•P(x) = argmin_i || x - center_i ||^2

•You’re finding the partition that is closest to x

• You only move a point from from cluster to another if that distance (error) gets smaller. So the error can only every be 0 or decreasing

When updating the center of each partition (without changing the partition):

•center_i = sum(y) / |C_i|

•Average minimizes the squared error (take derivative of E and set it equal to 0)

For this to work you must have some way of breaking ties that is deterministic and consistent.

You will never revisit a configuration you’ve already looked at.

What randomized optimization technique is most like K-means clustering? (13.10/11)

Hill Climbing - each time you set P(x) for all points and then readjust the center, you are always either decreasing the error or the error doesn’t change. So always improvement or no improvement, but never de-provement.

What are the properties of k-means clustering? (13.12)

- Each iteration is polynomial: O(k*n) distance between the k centers and points (w/ extra factor of d if d dimensional points)
- Finite (exponential) iterations: O(k^n)
- Error decreases if ties are broken consistently
- Exception is that if it doesn’t improve, the next time you would see if has converged (double check that understanding?)

Can k-means clustering get stuck and if so how could you avoid it? (13.12)

Yes it can get stuck!

a(b) (c)(d)

e f

if (x) are the cluster centers, then (b) is going to claim a, b, e, and f and c and d will just stay where they are.

Can be avoided via random restarts or selecting points on the fringe or making sure the cluster centers are as far away form each other as possible.

What is soft clustering and how does it work? (13.14)

This is similar to bayesian learning where we are trying to find a hypothesis that maximizes the probability of the data.

Assume data was generated as follows:

• Select one of K gaussians [ with fixed variance] uniformly

• Sample x_i from that gaussian

• Repeat n times

Goal: Find a hypothesis h = that maximizes probability of the data (Max. likelihood)

This means that each data point __could__ probabilistically belong to more than one cluster.

What is the ML mean of the gaussian if there is just one K? (13.15)

If you know all the data is coming from the same gaussian, then the ML mean is just the mean of the data.

What is expectation maximization (EM)? (13.16/17)

Moving back and forth between soft clustering and computing the means for the soft clustering.

This is conceptually much like k-means, but you’re assigning the probability of being in each cluster to each given datapoint.

What are the properties of EM? (13.18)

- Each iteration of EM is a monotonically non-decreasing likelihood (finding higher and higher likelihoods)
- Does not
__have__to converge, but it will not diverge either - Can get stuck (happens ALL THE TIME in practice; many local optima. Solution: random restarts)
- Works with any distribution (if E, M are both solvable)
__usually__the E step is the harder part and M is just basic math

What are the desirable properties of clustering algorithms? (13.19)

- Richness: For any assignment of objects to clusters, there is
__some__distance matrix D that P_d returns that clustering. You can’t force it to have a specific number of clusters. can represent any number of clusters. (13.20). - Scale-Invariance: Scaling distances by a positive value does not change the clustering (going from miles to kilometers may change the values, but it doesn’t change the clustering)
- Consistency: Shrinking intra-cluster distances (dist b/t samples in same cluster) and expanding inter-cluster distances (dist b/t the clusters) does not change the clustering, P_d = P_d’

Which clustering scheme can achieve all the desirable properties of k means clustering? (13.21)

NONE! Impossibility theorem (Kleinberg) says no clustering scheme can achieve all three:

• richness

• scale invariance

• consistency

Why do we bother with feature selection? (14.2)

- Curse of dimensionality

* Allows you to focus on the features that matter and ignore those that don’t

What is filtering vs wrapping in feature selection? (14.4)

Filtering:

N Features -> Search Algorithm -> Fewer Features

• The criterion of how you know if you’re doing well is buried within the search algorithm

• Can use the labels in your filter

Wrapping:

N Features -> [ Search Learning Alg.] -> Fewer Features

• In this case you pass the search results to learning algorithm, it reports back how well it does with those features and this goes back and forth until you come to a solution

What are the pros and cons of filtering and wrapping? (14.5)

Filtering:

+ Speed!

- Speed … you’re really looking at isolated features and aren’t considering feature dependence

- ignores the learning problem

Wrapping:

+ take into account model bias and learning!

- But it’s really really slow

What are some ways in which you could do filtering? (14.6)

- Information gain
- variance entropy
- ‘useful’ features
- independent/non-redundant features

The goal is the make the learning problem easier by reducing the number of features which reduces the number of samples necessary, up until the point where it makes the learning problem harder because you just don’t have enough differentiating features.

What are some ways in which you could do wrapping? (14.6/7)

- Any randomized optimization algorithm
- Forward Search (a form of hill climbing)
- Backward Search (a form of hill climbing)

How does forward search work? (14.7)

- Go through each feature one at a time and pass it to the learning algorithm and get the score for each
- Keep the feature with the highest score
- Now combine that one feature with every other again, passing it to the learner and getting back a score
- Again keep the best combination
- Repeat this until adding a feature doesn’t produce significant gains to the learner

How does backward search work? (14.7)

Say you start with 5 (N) features: 1, 2, 3, 4, 5. In this case you’re trying to eliminate one feature at a time.

• Pass all combinations of N-1 to the learner and get result:

• 1, 2, 3, 4

• 1, 2, 4, 5

•1, 3, 4, 5

• 2, 3, 4, 5

• Keep the best combination and then repeat until your score peaks and then goes down

This is like eliminating people during tryouts.

When is a feature strongly relevant? (14.9)

If removing that feature degrades the Bayes Optimal Classifier, then it is strongly relevant.

That is to say, if you are considering ALL features and you remove this one feature and the BOC degrades, it is strongly relevant.

You can make a feature not strong just by putting a copy in there. (14.11)

When is a feature weakly relevant? (14.9)

If it is NOT strongly relevant AND adding it to some other subset of features would improve the Bayes Optimal Classifier.

So if you have a subset of features and when you add this feature it improve the BOC, it is weakly relevant.

If a feature is not strongly relevant and it is not weakly relevant then it is irrelevant.

What does feature relevance do? (14.10)

It measures the effect on the Bayes Optimal Classifier (increase or decrease). Relevance is about how much information a feature provides.

Example:

If a binary feature C is 1 for all samples, it has 0 entropy and is irrelevant.

What is feature usefulness? (14.10)

Usefulness represents the effect on the error of a model or learning algorithm. This is what we ultimately care about.

Revisit 14.8 - Quiz about minimum features required.

Make sure to understand what’s going on here.

What is the Bayes Optimal Classifier? (TBD) - Probably in the bayesian lectures.

Look this up.

Computes the best label given all probabilities you could computer over all the hypotheses. Any other algorithm has an inductive bias.

The gold standard - the best you could do if you had everything and all the time in the world.

Relevance is usefulness to the

Why is the Bayes Optimal Classifier used to measure relevance? (14.10)

I think this is teased out in this video, but needs to be summarized better.

Computes the best label given all probabilities you could computer over all the hypotheses. Any other algorithm has an inductive bias.

What is feature transformation? (15.1)

Pre-processing a set of features to create a new, smaller (usually) set while retaining as much information as possible (useful, relevant information).

Example, starting with features x1, x2, x3, x4:

- Feature Selection => x2, x3
- Feature Transformation => 2*x1 + x3

What is Principal Components Analysis (PCA)? (15.5)

Finds the direction of maximal variance in the data and finds directions that are mutually orthogonal.

So it first finds the direction of max variance (1st or principal component), then finds the line orthogonal (2nd or 2nd principal component).

It’s like doing a transformation into a new space where feature selection can work (15.7). An eigenvalue of 0 for a particular dimension would mean there is no variance and thus gives you no information (irrelevant, but doesn’t mean its useless).

About finding correlation by maximizing variance and ability to do reconstruction (15.8)

What are the properties of PCA? (15.6)

- It gives you the principal direction of maximum variance
- It gives you orthogonal directions beyond that
- It’s a global algorithm - You’re working across the entire, global space of the problem domain
- Gives you the best reconstruction
- The eigenvalues of successive principal are monotonically non-increasing, which means they tend to decrease. So each successive principal essentially has less variance, and you can throw away the ones with the last variance.
- Can look at eigenvalue to see which dimensions are important

What is reconstruction? (15.6)

If you were to perform PCA and get the principal component and project all the samples only onto that one dimension and then try to re-project onto the original dimension space, it would give you the smallest L2 (squared error) error of any other projection.

By maximizing variance, you’re maintaining distances as best you can for any given dimension.

What is Independent Components Analysis (ICA)? (15.8)

The goal of ICA is create a new set of features from the originals where each new feature is statistically independent from every other new feature, that is mutual information between all new feature pairs is 0 (knowing the value of one tells you nothing about the value of any other), and that it maximizes the mutual information between the new features and the original features.

Describe hidden variables and observables in the context of ICA (15.9/10)

The hidden variables are the sources - they are what you are trying to recover. Thy represent the new, statistically independent features.

The observables are the original features. They represent some linear combination of the hidden variables.

Describe the cocktail party problem in the context of ICA

(Blind source separation problem) (15.10)

There are N people in a room and N microphones. Each person represents a hidden variable while each microphone is a linear combination of each of the people talking.

So the people are the hidden variables and the microphones are the features.

The people are statistically ind ependent because they are producing their own sound waves which are independent of all other people’s sound waves.

What is the difference between feature transformation and feature selection?

Feature selection is a subset of feature transformation. Feature transformation has a goal of trying to reduce the number of features by creating new features (x1, x2, x3 => x1 + 3*x2) while feature selection is reducing the number of features by choosing amongst those which have the most influence on the learning output.

What are the differences between PCA and ICA? (15.12/13)

•Big One: ICA is local and does selects small features (nose, eyes, mouth, edge detection, etc) while PCA is global and would see things like ‘brightness’ of an image, or the ‘average’ of an image.

• ICA is highly directional while PCA is not (you could feed PCA the input matrix in any orientation and because it’s already providing something directional, it would just be a different transformation and you end up with the same results).

•ICA can help you find the underlying structure of the data

* ICA is good at the blind source separation problem while PCA is terrible at it. PCA doesn’t perform

•PCA has ordered features while ICA is more like a bag of features

•PCA’s features are mutually orthogonal while ICA’s are not

•ICA’s features are mutually independent while PCA’s are not

•PCA maximizes variances while ICA maximizes mutual information (b/t original and new features)

• ICA is focused on probability while PCA is usually solved by linear algebra

What are alternatives to PCA and ICA?

- RCA - random component analysis. Generates random direction. It is fast and tends to perform pretty well. The number of final dimensions is usually higher than that of PCA and ICA. Tends to produce correlations.
- LDA - linear discriminant analysis. Finds a projection that discriminates based on the label, so it at least uses the values of the labels as information.

Describe hierarchical complete link clustering

- The measure of distance between clusters is the distance between the farthest two points in each cluster.
- You will still merge clusters based on the shortest distance, but you’re using a max to define that distance.
- Tends to produce spherical clusters with consistent diameter

https://www.youtube.com/watch?v=VMyXc3SiEqs

What is a zero sum game? (19.3)

The sum of the two players rewards is constant. That constant may or may not be zero

constant is 0:

player 1 = + 5

player 2 = - 5

constant is 3:

player 1 = +3

player 2 = 0

player 1 = +1

player 2 = +2

etc…

What does it mean to have perfect vs hidden information? (19.3)

Perfect information means you know what state the game is in (and thus know the state of everything in the game)

What is the minimax strategy? (19.6)

One player is trying find the maximim, minimum (What’s best for itself) and the other player is trying to find the minimum, maximum (what is also best for itself, in a zero sum game)

What is the “Fundamental Result” for a two player zero-sum non/deterministic game of perfect information? (19.7)

Minimax == Maximin and there always exists an optimal, pure strategy for each player.

This assumes each player is going to play the same way, eg, maximize their own rewards, or they are being ‘rational’.

Can be deterministic or non deterministic.

What is a pure vs a mixed strategy? (19.13)

Mixed strategies have some distribution over strategies, as opposed to choosing a fixed strategy.

Ex: 25% holder, 75% resigner

How do you find the solution (expected value) to a mixed strategy? (19.15)

You can plot the lines and then take the maximum of the minimum. This will either be located at the beginning, end, or where they cross.

What is the Nash Equilibrium? (19.19)

A Nash Equilibrium exists if __neither__ player, if chosen at random, would switch their strategy, despite the other play keeping their strategy.

In other words, if no one would want to change their strategy given the choice to do so, you’re in a NE.

This works for both mixed and pure strategies.

What is strict domination? (19.20)

If you would always choose one strategy over another, no matter what, that chosen strategy strictly dominates the others.

What are the rules of determining Nash Equilibrium? (19.21)

- For N player pure strategy game, if eliminating all strictly dominated strategies leaves you with one, then it is the NE
- A NE will survive elimination of strictly dominated strategies
- if there is a finite number of players and a finite number of strategies then there exists at least one NE that may involve mixed strategies.

Will the number of games change the outcome for a minimax strategy? (19.23)

No … If you work it backwards, the last game is determined by the matrix, and thus each previous game is also just determined by the matrix. So it doesn’t matter how many repeated games you play, the last one is the only one that matters and it’s defined by the same matrix.

Does it matter which player starts?

No - you come to the same solution regardless of who starts.

What is the tit for tat strategy for iterated prisoners dilemma? (20.4)

You cooperate in the first round, then copy your opponents __previous__ action every round thereafter.

What are the best strategies to play when facing tit for tat? (20.6)

For a low gamma - you should always defect.

Expected value of game = 0 + -6 * gamma / (1 - gamma)

For a high gamma, you should always cooperate.

Expected value of game = -1 / (1 - gamma)

The gamma where the strategies are equal is 1/6.

You may need to recalculate these with different rewards, but the same concept applies.

What is the folk theorem? (20.14)

“Any feasible payoff profile that strictly dominates the minmax/security level profile can be realized as a set of Nash equilibrium payoff profile, with sufficiently large discount factor”

What is the minmax/security level profile? (20.13)

The best score you can guarantee yourself against a malicious adversary (eg, defect, defect).

Describe the “acceptable/prefereable” region in repeated games. (20.13)

The zone where a better score can be obtained for both players than either could guarantee themselves if they were playing against a malicious adversary.

What is Grim’s Trigger? (20.15)

Start in mutual cooperation, but if you ever ‘cross the line’ (defect), ‘I will deal out vengeance forever’ (always defect regardless of what you do).

What is subgame perfect? (20.16)

Both plays always play their best response independent of history.

They are not subgame perfect (eg, you’re making an implausible threat) if when the time comes for you to follow through with what your strategy says to do, it would actually be worse for you than doing something different.

You take over the brain of players, set the moves, then release them. Given the choice, would they still follow their strategy or would it be better to do something different? If better to do something different - not subgame perfect.

How does a Pavlov machine work? (20.18)

Cooperate if we agree, defect if we disagree. This is a NE for prisoners dilemma and it’s subgame perfect.

No matter what states two pavlov machines playing against each other are in, they will always end up in mutual cooperation. It makes plausible threats.

Why is the computational folk theorem important? (20.21)

For a bi matrix, repeated game looking at the long term average reward (assuming a high discount), if it’s possible to have a mutually beneficial relationship you can build a pavlov-like machine for any game and use it to construct subgame perfect NE in polynomial time.

- if it’s not possible to have mutual cooperation then it’s a zero-sumlike game
- then we would solve using LP which either produces a NE or it doesn’t and at most one player improves.

What are the properties of zero-sum stochastic games? (20.26)

- you use a minimax(Q*(s’, (a’, b’)) in the bellman equation instead of a max
- value iteration works
- minimax-Q converges
- unique solution to Q*
- policies can be computed independently
- Q Update can be computed efficiently (LP)
- Q functions sufficient to specify policy
- not known whether it can be solved in polynomial time like a standard MDP is known to be able to

What are the properties of general-sum stochastic games? (20.27)

- instead of computing minimax, you would compute the NE in the combined bellman equation (eg, Nash-Q)
- value iteration doesn’t work
- Nash-Q doesn’t converge
- No unique solution to Q*
- policies cannot be computed independently b/c it’s defined as a joint behavior
- update is not efficient
- Q functions not sufficient to specify a policy

:(

How is KM an optimization problem? (13.9)

You’re scoring different configurations.

You’re defining a configuration consisting of a partition and it’s center, then trying to minimize the error, which can be seen as:

Sum_x [ ||center_p(x) - x||^2], or the sum of the squared distance from each point to the center.

It is most like hill climbing

What is required for MDPs? (17.7)

Markov Property: Only the present matters; you don’t care about history

The environment must be stationary: The rules of the environment can’t change