Supervised Learning

Use labeled training data to generalize labels to new instances

(function approximation)

unsupervised learning

making sense out of unlabeled data

data description

Single Linkage Clustering

- Consider each object a cluster (n objects)
- Define intercluster distance as the distance between the two closest two points in the two clusters
- Merge two closest clusters
- Repeat n-k times to make k clusters

Intercluster Distance in Single Linkage Clustering

the distance between the 2 closest points in the 2 clusters

Inter cluster distance represents

domain knowledge

T/F Single Link Clustering is Deterministic

True. Unless there are ties, it will give us an answer.

In SLC, if distance can be represented by edge link of a graph, what algorithm is it the same as?

Minimum Spanning Tree

What is the running time of Single Link Clustering (SLC) in terms of n points and k clusters

O(n^3). O(n^2) to look at all distances to find closest pairs and have to do this n times

Issues with SLC

Could end up with weird clusters that surround other clusters

k-means clustering algorithm

- pick k centers (art random)
- each center “claims” its closest points
- recompute the centers by averaging the clustered points
- repeat until convergence

When does k-means converage

When recomputing the centers does not move

T/F the centers have to be a point in the collection of objects

False

k-means in euclidean space process

circle between partition assignment and recalculating the centers

Which optimization algorithm is most like k-means? HC, GA, or SA

Hill Climbing. You take a step towards a configuration that is better than you have before

K-means runs out of configurations and converges in finite time when

finite configurations, break ties consistently, and never go into a configuration with a higher error

Big O for Each iteration of k-means clustering is

polynomial (pretty fast). O(k n): look at distance between each k center and n points and run through all the points to redefine clusters. could also have extra d if d dimensions

Big O for Finite iterations of k-means

exponential O(k^n). first of n objects can be in any of k clusters, 2nd of n objects can be in k clusters, etc. In practice, it is much faster

If ties are broken consistently in k-means then error

decreases [with one exception]

T/F k-means cannot get stuck

False (local optima can still happen)

How to avoid k-means getting stuck

random restart OR do initial analysis and pick centers in good spots

Soft Clustering Overview

Points can be probabilistically from one cluster or another

Soft Clustering Idea

Assume the data was generated by

1. select one of k gaussians uniformly [fixed known variance]

2. sample x from Gaussian

3. repeat n times

Task: find a hypothesis that maximizes the prob of the data (ML)

What is the maximum likelihood gaussian (soft clustering)

The ML mean of the Gaussian is the mean of the data

Expectation Maximum

Has two phases expectation which is soft clustering and maximization

How is expectation maximization like k-means

k-means if cluster assignments use arg max (prob of 0 or 1). It improves in a probabilistic metric like k-means improves in squared error metric

Properties of Em

- Monotonically non-decreasing likelihood (it’s not getting worse)
- Does not converge b/c infinite configurations b/c infinite number of probabilities (partially does)
- Will not Diverge (numbers don’t blow up and become infinitely large)
- Can get stuck (local optima)
- Works with any distribution (If EM solvable)

Clustering Properties

- Richness
- Scale Invariance
- Consistency

Richness

For any assignment of objects to clusters, there is some distance matrix D such that your clustering algorithm returns that clustering.

Scale Invariance

scaling distances by a positive value does not change the clustering

Consistency

Shrinking intra cluster distances and expanding intercluster distances does not change the clustering

Impossibility Theorem

No clustering can achieve all three of richness, scale invariance, consistency

Curse of dimensionality

Data needed grows exponentially with features (2^n)

Big O of feature selection

np-hard, 2^n, exponential

Two approaches to feature selection

filtering and wrapping

filtering overview

set of features as input, passes to same search algorithm that outputs fewer features

wrapping overview

set of features, search over some subset of features, hands to leaning algorithm which reports how well those features performed

Pros of filtering

speed: faster than wrapping because you don’t have to worry about what the learner does.

Ignores the learning problem

cons of filtering

look at features in isolation.

Pros of wrapping

takes into account model bias and learning itself

con of wrapping

extremely slow compared to filtering

ways to do feature filtering

- information gain
- variance, entropy
- “useful features”
- independent/non redundant
- ex: DT, NN (trim features with low weights)

ways to do feature wrapping

- hill climbing
- randomized Optimization
- forward search
- backward search

forward search

Look at all features in isolation, keep whatever feature is best. Then look at adding any of the remaining features and choose whichever is best in combination with first feature, etc. This is similar to hill climbing

backward search

start with all the features and figure out which one can you eliminate, one at a time until you get to a subset that performs well

Relevance: strongly relevant

xi is strongly relevant if removing is degrades the Bayes optimal classifier

Relevance: weakly relevant

Xi is weakly relevant if

- not strongly relevant
- and there exists some subset of features such that adding xi to the subset improves the bayes optimal features

Relevance measures effect on

Bayes optimal classifier. INFORMATION

Usefulness measures effect on

a particular predictor. ERROR given model/learner (c is useful in neural nets)

feature transformation

the problem of pre-processing a set of features to create a new (smaller? more compact?) feature set, while retaining as much (relevant? useful?) information as possible

T/F Feature selection is a subset of feature transformaiton

True

feature transformation vs feature selection

transformation: linear combination of original features like 2x1 + x2 as 1 feature. selection: have x1, x2, x3, x4 and only take x1 and x2

ad hoc information retrieval problem (google problem)

a whole bunch of databases and you want to retrieve the subset of documents that are relevant to the query (features)

ad hoc - you don’t know what they queries are

polysemy

words can have multiple meanings

ex: car- Vehicle and also stupid thing in LISP

results in false positives

synonomy

many words mean the same thing. ex: car and automobile are same thing

results in false negative

principal components analysis overview

finds directions that maximizes variance (principal component)

and finds directions that are mutually orthogonal

principal component analysis properties

- maximize variance (principal component)
- mutually orthogonal - global algorithm
- you can prove best reconstruction (minimize l2 error by projecting onto a new line)
- you can throw away dimensions with the smallest eigenvalues

best reconstruction in principal component analysis (PCA)

it is simply a re-labeling of the dimensions. The directions just tell you how far along each axis it is. It is just a linear rotation of the original dimensions. the new dimensions that PCA returns does not lose any information. It’s just a rotation of the data. You can reconstruct your original data with the information

mutually orthogonal in PCA

Which means it’s a global algorithm. All of the directions and the new features that they find have a big global constraint → have to be mutually orthogonal.

What does PCS fundamentally ask

Is there another basis, which is a

linear combination of the original basis, that best expresses our data set?

eigenproblem

a computational problem that can be solved by finding the eigenvalues and/or eigenvectors of a matrix. In PCA, we are analyzing the covariance matrix (see the paper for details)

PCA vs ICA overview

PCA: correlation, maximizes variance => reconstruction

ICA: independence, linear transformation into new feature space such that each new feature is independent of each other and mutual information as high as possible between original and new features

Blind Source Separation Problem (cocktail)

It is difficult to listen to one conversation at once when there are a lot of conversations going on at the same time. People are hidden variables in here and microphones are the observables

Independent Components Analysis (ICA) goal

figure out the hidden variables from observables

PCA vs ICA

- mutually orthogonal
- mutually independent
- maximal variance
- maximal mutual information
- ordered features
- bag of features

- mutually orthogonal: PCA
- mutually independent: ICA
- maximal variance: PCA
- maximal mutual information: ICA
- ordered features: PCA
- bag of features: ICA and ish PCA

Which is global and which is local: PCA vs ICA

PCA is global (in face finds brightness then average face)

ICA finds local/structure (in faces it finds noses then eyes, in the natural world, it finds edges. In documents it’s topics)

RCA: Random Components Analysis

Generates random directions

projects data into those directions

works very well if the next thing you do is classification

What is the big advantage of RCA

Fast (much faster than PCA or ICA)

Other advantages: cheap, simple, easy,

LDA: Linear Discriminant Analysis

Find a projection that discriminates based on the label. Pays attention to how the resulting components are used

“In contrast to PCA, LDA is “supervised” and computes the directions (“linear discriminants”) that will represent the axes that maximize the separation between multiple classes.”

ICA is important for finding

structure and edges

PCA vs ICA: which uses primarily Probability vs primarily Linear algebra

ICA: probability

PCA: linear algebra

Markov Decision Process: Model

Transition State. T(s,a,s’) is transition probability of being in state s, take action a, and end up in state s’. This function produces the P(s’|s,a): The probability of a new state given your current state and given the action you take

Markov Decision Process: Actions (A)

The things that you can do in a state (up, down, left, right)

Markov Decision Process: Reward

Scalar Reward for being in a state, taking action, etc. (Red and green states in board example). Represents our Domain Knowledge

Markov Decision Process: Policy

A policy is a solution to Markov decision process. Tells you what action to take from a particular state

Markov: Infinite Horizons

We assumed there is not a time that is clicking. If there is then policy would depend on state and time

Markov: Utility of Sequences

if I prefer one sequence of events today, then I prefer the same sequence tomorrow

Markov: discounted

discounted rewards allows us to go infinite distance in finite time because it is geometric. Rmax/1-gamma

Markov: Utility of a policy at a state

the expected outcome from that point if we follow the utility function f

Markov: utility vs reward

Utility: reward for that state and all the reward from that point on (Long term)

Reward: Immediate: reward (Short term)

Bellman Equation

Key equation for RL. A recursive equation that defines the true value of being in some particular state including policy, rewards, transition matrix, gammas, etc. It is the utility of the state.

Markov Decision: Value Iteration

start w/ arbitrary utilities, update utilities based on neighbors, repeat until convergence

Markov Decision: Policy Iteration

start with some policy (guess), evaluate its utility, improve policy by updating it to the policy that takes the action that maximizes the expected utility based on what we just calculated. Repeat

What is Q-Learning

Evaluating the Bellman equations from data

Markov Property (2 things)

- only the present matters

- things are stationary (the rules don’t change)

Advantage of policies in Markov decisions vs “plans”

They are robust to the stochasticity of the world

Markov: Why use gamma^t in utility of sequences

Rewards are substantial at first and quickly trail off

RL: What is Q(s,a) function in plain english

the value for arriving in s, leaving via a, and proceeding optimally thereafter

RL: What happens if you set alpha (learning rate) of 1

Full learning, forget everything you learn and just jump to new v

RL: What happens if you set alpha (learning rate) of 0

Won’t learn at all. Will just assign v to v

Q-learning family of algorithms varies on what 3 things (generally)

- how initialize Q hat
- how to decay alpha sub t
- how to choose actions

How to choose actions in Q-learning

Can choose Q hat which is greedy (always choose best Q-hat), but can hit local mins. Solve this problem with simulated annealing like approach (take random actions sometimes)

Could choose Q hat randomly, but then you are not using what you learned about Q hat so far. You learn optimal policy, but don’t follow it.

Problem with random restarts in Q-learning

Very slow! takes a long time to get to infinity without restarts.. even longer with

Fundamental Resul Theorem

in a 2 player, zero sum deterministic game of perfect information, minimiax = maximin and there always exists an optimal pure strategy for each player (with rational agent)

nash equilibrium

if and only if each player won’t change their strategy given all the other player’s strategy

T/F Any N.E. will survive elimination of strictly dominated strategies

True

T/F There is always a N.E. (maybe mixed) for finite strategies and finite player games

True

T/F In the n player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination it he unique N.E

True

Tit-for-tat

First round = corporate. All future rounds = copy the opponent’s previous move. A strategy for infinite games.

Folk Theorem in Math

General idea: in repeated games, the possibility of retaliation opens the door for cooperation

Folk Theorem: In math: results known, at least to experts in the field, and considered to have established status, but not published in complete form

game theory: feasible region

average payoff of some point strategy

Game Theory: Folk Theorem

Any feasible payoff profile that strictly dominates the minimax/security profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor. Proof: if it strictly dominates the minimax profile, can use it as a threat. Better off doing what you are told

Grim Trigger

As long as you cooperate, you get mutual benefit. If you ever defect “deal out vengeance” forever

Implausible Threat

The vengeance could cost more than trying to cooperate so it’s not realistic to always punish opponent

Game Theory: Subgame Perfect

Each player is always taking a best response independent of history

Game Theory: Pavlov

cooperate if agree, defect is disagree

T/F is Pavlov sub game perfect?

Yes. No matter what state they are in, the average reward is mutual cooperation

Computational Folk Theorem

Can build pavlov like machine for any game. construct subgame perfect nash equilibrium for any game in polynomial time. This is because if it’s pavlov, we can get to pavlov quickly, and if not it is zero-sum like (solve an LP), or at most one player improves

bimatrix game

2 players and each player has its own reward matrix and is average reward repeated game

what makes stochastic games more interesting

actions that the players take impact not just the rewards but also future states

How can you use bellman equation in zero sum stochastic games

use minimax on q value instead of max

Properties of Q learning in zero sum stochastic games

value iteration works minimax-Q converges under same conditions that Q does unique solution to Q* policies can be computed independently update efficient Q functions sufficient to specify policy

General Sum Stochastic Games

Can’t do minimax anymore. Instead use Nash of Q.

What properties change for general sum stochastic games

Value iteration doesn’t work

Nash doesn’t converge

No unique solutions to Q*

Policies can not be computed independently

The update is not efficient unless P=PPAD

Q functions are not sufficient to specify policy

Expectation Maximization - soft clustering / expectation step

We know the means (assume the means) and then calculate how likely it is that the data came from the means. Knowing means and variances allows you to calculate which point came from which cluster. For each point, which cluster is it from?

Expectation Maximization - Maximization steps

We know the clusterings, so calculate the means and variances for these clusters.

monotonically non decreasing

finding higher and higher likelihoods

Why care about feature selection

- knowledge discovery - it is useful to keep understand what features are meaningful (interpretability and insight)
- curse of dimensionality

What algorithm looks at feature selection

Decision Trees (a type of filtering) and in particular, information gain

What algorithm did we learn in the first half of the course that does feature transformation

neural nets

principal component analysis. Why can throw away eigenvalues

You project onto n dimensions with n features. As you move from 1st principal component to n principal components, the eigenvalues monotonically non-increasing. You can throw away the ones with the least eigenvalue = throw away the ones with the least variance

Con of PCA

If one of your original dimensions is directly related to the data but there is gaussian noise, PCA will end up throwing away that original dimension

PCA is like filtering or wrapping

filtering. It transforms into a new space where features can be filtered

T/F PCA goal is to find independent projects

False. It can happen to find independent projects. It’s finding uncorrelated dimensions

How can ICA want to maximize mutual information and also mutually independent

Max mutual info between new features and old features

Mutually independent between new features

What feature transformation problem does much better in blind source separation problem and is directional

ICA. Directional meaning it gives very different answers if you rotate the matrix that it is given.

RL: 3 different types of RL

Policy search: map states to actions through policies

Value function based: map state to value through utilities

Model-Based: map states and actions to new states and rewards through transitions and rewards

RL: Q-Learning what is U(s) and pi(s)

U(s) = max over all actions of Q(s,a) in that state pi(s) = argmax over all action of Q(s,a)

RL: What is Q-learning

evaluating the Bellman equation from data. We know the transitions, not the model. We know

Q-learning converges if

- we visit s,a infinitely often so it needs to run a very long time
- s’ needs to be drawn from the actual transition probabilities T(s,a,s’)
- rewards need to be drawn from reward function R(s)

minimax

Players consider worst case counter. A is trying to max and B is trying to min. So A finds the maximum minimum and B is trying to find the minimum maximum. There always exists an optimal pure strategy for each player

Relaxing Which constraint makes minimax fail?

change from perfect information to hidden information. Minimax works in 2-sum zero-sum deterministic OR nondeterministic games of perfect information with pure strategies

mixed strategy vs pure strategy

mixed strategy, you choose with probability

Game Theory: finite state strategy. How to solve?

You can use MDPs! Only the last state matters, etc.

minmax profile

pair of payoffs, one for each player, that represents the payoff achieved a player defending itself against a malicious adversary

security level

when you can have mixed strategy with minmax profile

q-learning does not work with general stochastic games, what could work?

repeated stochastic games (folk theorem)

cheap talk -> correlated equilibrium. players can talk a little bit

cognitive hierarchy -> best response. Rather than solve for equilibria, best response to what you believe the other player will do

side payments - (cocoa values) players can pay the opponent to help them