Exam 2 Flashcards

1
Q

Unsupervised Learning

A
  • Data Description
2
Q

Single Linkage Cluster - SLC

A
  • Consider each object a cluster (n objects)
  • Define inter-cluster distance as the distance between the closest points in the two clusters
  • Merge the two closest clusters
  • Repeat n-k times to make k clusters
  • How the inter-cluster distance is defined is a matter of domain knowledge (a minimal sketch of the algorithm follows below)
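A minimal sketch of SLC in plain numpy, assuming a small 2-D dataset; the function name `single_linkage` and the example data are illustrative, not from the course.

```python
# Minimal sketch of single linkage clustering (SLC) with plain numpy.
import numpy as np

def single_linkage(X, k):
    """Merge the two closest clusters (closest-pair distance) until k remain."""
    clusters = [[i] for i in range(len(X))]                   # start: every point is its own cluster
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # pairwise point distances
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # inter-cluster distance = distance between the closest points
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]                            # merge the two closest clusters
        del clusters[b]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])
print(single_linkage(X, k=2))
```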
3
Q

Single Linkage Cluster Run Time

A

O(n^3)

4
Q

Issues with SLC

A

Might end up with wrong clusters, depending on the distance definition

5
Q

k-means clustering

A
  • Pick k random center points
  • Each center claims its closest points
  • Recompute the centers by averaging the clustered points
  • Repeat until convergence (see the sketch below)
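A minimal numpy sketch of the loop above; the function name, toy data, and convergence test are illustrative (a real implementation would also guard against empty clusters).

```python
# Minimal k-means sketch in numpy, following the steps above.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick k random centers
    for _ in range(iters):
        # each center claims its closest points
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=-1), axis=1)
        # recompute the centers by averaging the clustered points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                 # repeat until convergence
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
centers, labels = kmeans(X, k=2)
```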
6
Q

Configurations (k-means as an optimization problem)

A

A configuration is a pair: the set of cluster centers together with the partition P that assigns each point to a center

7
Q

Scores (k-means as an optimization problem)

A

How much error is introduced by representing the points by their centers, i.e. the total squared distance from each point to its assigned center (see the expression below)
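In symbols (a restatement of this card, with P(x) denoting the cluster that point x is assigned to):

```latex
% k-means score: error introduced by representing each point by its center
E(P, \text{center}) = \sum_{x} \left\lVert \text{center}_{P(x)} - x \right\rVert^{2}
```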

8
Q

Neighborhood (k-means as an optimization problem)

A

The neighborhood of a configuration is the set of configurations obtained by either keeping the centers the same and changing the partition, or keeping the partition the same and moving the centers

9
Q

Properties of k-means clustering

A
  • Each iteration is polynomial: O(kn)
  • Finitely many, but possibly exponentially many, iterations: O(k^n)
  • Error decreases (if ties are broken consistently)
  • Can get stuck if it starts with poor centers, which amounts to converging to a local optimum. One solution is to use random restarts.
10
Q

Soft Clustering

A

Attaches a cluster probability to each point instead of a hard assignment to a single cluster

The task is to find the hypothesis that maximizes the probability of the data (maximum likelihood)

11
Q

Maximum Likelihood Gaussian

A

The maximum likelihood estimate of the Gaussian mean μ is the mean of the data
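A one-line derivation (assuming a fixed, known variance σ²; x_1..x_n are the data points):

```latex
% Setting the derivative of the Gaussian log-likelihood w.r.t. \mu to zero
\frac{\partial}{\partial \mu} \sum_{i=1}^{n} \left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) = 0
\quad\Longrightarrow\quad
\hat{\mu}_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} x_i
```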

12
Q

Expectation Maximization

A
  • Assigns a cluster probability to each point rather than a hard assignment
  • Expectation: E[Z_ij] is the probability that point i was produced by cluster j
  • Maximization: μ_j, the mean of cluster j, is recomputed from the points weighted by E[Z_ij]
  • Use the new μ_j to recompute E[Z_ij] and repeat (a sketch follows below)
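A hedged numpy sketch of EM for soft clustering with k spherical Gaussians of fixed variance and uniform priors (a common simplification); the function name, the fixed sigma², and the toy data are illustrative assumptions.

```python
# Hedged EM sketch: soft clustering with k spherical Gaussians of fixed variance.
import numpy as np

def em_soft_cluster(X, k, iters=50, sigma2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]            # initial means
    for _ in range(iters):
        # Expectation: E[Z_ij] = P(point i was produced by cluster j)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        resp = np.exp(-sq / (2 * sigma2))
        resp /= resp.sum(axis=1, keepdims=True)
        # Maximization: mu_j = mean of the points, weighted by E[Z_ij]
        mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mu, resp

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
mu, resp = em_soft_cluster(X, k=2)
```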
13
Q

EM properties

A
  • Monotonically non-decreasing likelihood: it never gets worse
  • Does not necessarily converge
  • Will not diverge
  • Will most likely get stuck in a local optimum; use random restarts
  • Works with any distribution (as long as the E and M steps are solvable)
14
Q

Richness (Clustering Properties)

A

For any assignment C of objects to clusters, there’s some distance matrix D such that P_d returns that clustering C

15
Q

Scale-invariance (Clustering Properties)

A

Scaling distances by a positive value doesn’t change the clustering

16
Q

Consistency (Clustering Properties)

A

Shrinking intra-cluster distances (moving points towards each other) and expanding inter-cluster (moving clusters away from each other) distances doesn’t change the clustering

17
Q

Impossibility Theorem (Clustering Properties)

A

There’s no clustering algorithm that achieves all three properties (richness, scale-invariance, and consistency) at once

18
Q

Time to select the most relevant subset of M features out of N features, where M <= N

A

2^N

19
Q

Approaches to Feature Selection

A

Filtering and Wrapping

20
Q

Filtering (Feature Selection)

A
  • Apply a search algorithm to produce fewer features to be passed to the learning algorithm
  • Faster than wrapping
  • Doesn’t account for the relationships between features
  • Ignores the learning process (no feedback)
  • We can use an algorithm (ex: Decision Tree) that can select the best features, then use another algorithm, with another inductive bias, for learning
  • We can use different criteria to evaluate the usefulness of a subset of features:
    information gain, variance, entropy, or eliminating dependent features (a sketch using one such criterion follows below)
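A hedged scikit-learn sketch of filtering: features are scored by mutual information with the label before any learner sees them; the synthetic data and the choice of k=5 are illustrative.

```python
# Filtering: score features independently of the learner, then pass the survivors on.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

# Filter: keep the 5 features with the highest mutual information with the label
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)

# Pass the reduced feature set to any learner (its bias played no role in the filter)
clf = DecisionTreeClassifier(random_state=0).fit(X_reduced, y)
```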
21
Q

Wrapping

A
  • Given a set of features, apply a search algorithm to produce fewer features, pass them to the learning algorithm, then use the output to update the selected set of features
  • Takes into account the learning model’s bias
  • Very slow

We can use different techniques to search:

  1. Randomized Optimization algorithms
  2. Forward selection: select a feature and evaluate the effect of combining it with the other features; stop once the score stagnates. This is similar to hill climbing (see the sketch after this list)
  3. Backward Elimination: start with all the features and evaluate the effect of eliminating each feature
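A hedged sketch of wrapping via forward selection, where the learner's own cross-validated score decides which feature to add next; the k-NN model, synthetic data, and stagnation test are illustrative.

```python
# Wrapping via forward selection: the learner scores every candidate feature set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=15, n_informative=4, random_state=0)
model = KNeighborsClassifier()

selected, best_score = [], -np.inf
while True:
    candidates = [f for f in range(X.shape[1]) if f not in selected]
    if not candidates:
        break
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in candidates}
    f, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:          # stop once adding features stops helping (stagnation)
        break
    selected, best_score = selected + [f], score

print(selected, best_score)
```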
22
Q

Forward selection

A

Select a feature and evaluate the effect of creating different combinations of this feature with other features. Stop once the score stagnates. This is similar to hill climbing

23
Q

Backward Elimination

A

Start with all the features and evaluate the effect of eliminating each feature

24
Q

x_i is <> relevant if removing it degrades the Bayes Optimal Classifier

A

strongly

25
Q

x_i is <> relevant if it’s not strongly relevant and there’s a subset of features S such that adding x_i to S improves the Bayes Optimal Classifier

A

weakly

26
Q

Relevance (Feature Selection)

A

Measures the effect on the Bayes Optimal Classifier: how much information a particular feature provides

27
Q

Usefulness (Feature Selection)

A

Measures the effect on error: whether a feature helps in minimizing error given a particular model

28
Q

PCA

A

An example of an eigenproblem. It transforms the feature set by:

  • Finding the direction (vector) that maximizes variance; this is called the principal component
  • Finding further directions that are orthogonal to the principal component
  • Each principal component has an associated eigenvalue
  • PCA allows reconstruction, because it is a linear rotation of the original space that minimizes L2 error when moving from N to M dimensions, so we lose as little information as possible
  • We can center the problem around the origin by subtracting the mean of the data
  • In effect, PCA produces a set of orthogonal Gaussians
  • PCA is a global algorithm and is very fast (a sketch follows below)
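A hedged numpy sketch of PCA as an eigenproblem on the covariance matrix; the function name and toy data are illustrative (a production implementation would typically use the SVD).

```python
# PCA via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, m):
    Xc = X - X.mean(axis=0)                      # center the data around the origin
    cov = np.cov(Xc, rowvar=False)               # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues/eigenvectors (ascending)
    order = np.argsort(eigvals)[::-1]            # principal components = largest variance first
    W = eigvecs[:, order[:m]]                    # top-m orthogonal directions
    Z = Xc @ W                                   # project N dims down to m
    X_rec = Z @ W.T + X.mean(axis=0)             # linear reconstruction (minimizes L2 error)
    return Z, W, X_rec

X = np.random.randn(100, 5) @ np.random.randn(5, 5)   # correlated toy data
Z, W, X_rec = pca(X, m=2)
```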
29
Q

ICA

A
  • Attempts to maximize independence
  • Tries to find a linear transformation of the feature space such that each of the new features is mutually statistically independent
  • The mutual information between any two of the new features equals 0
  • The mutual information between the new feature set and the old feature set is as high as possible
30
Q

Random Components Analysis

A
  • Similar to PCA, but instead of generating directions that maximize variance, it generates random directions
  • It still captures some of the correlations, which works well in classification settings
  • Faster than PCA and ICA (a sketch follows below)
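A hedged numpy sketch of random components analysis as a plain random projection; names and sizes are illustrative.

```python
# Random components analysis: project onto m random (not variance-maximizing) directions.
import numpy as np

def random_projection(X, m, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(X.shape[1], m))   # m random directions
    return X @ P                           # cheap linear projection to m dimensions

X = np.random.randn(100, 50)
Z = random_projection(X, m=10)
```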
31
Q

Linear Discriminant Analysis

A
  • Finds a projection that discriminates based on the label, i.e. projections of the features that ultimately align best with the desired output (closer to wrapping than to filtering)
32
Q

5 Parts to a Markov Decision Process

A

Markov Decision Problem

  • State - representation
  • Model - Transition model, probability of transitioning to a new state given a state and an action
  • Actions - Things the object is allowed to do in a particular state s
  • Reward - Reward the object gets when transitioning to a state s, which tells it the “usefulness” of transitioning to that state

Markov Decision Problem Solution
  • Policy - Takes a state and returns the proper action. A problem might have different policies

33
Q

Temporal Credit Assignment

A

We don’t get an instant reward for each action, so we have the problem of figuring out which specific action(s) led us to a positive/negative outcome

34
Q

A small negative reward will encourage the agent to take the <> path to the positive outcome

A

Shortest

35
Q

A big negative reward will encourage the agent to take the <> path no matter what the result is, so it might end up with a negative outcome

A

shortest

36
Q

You can take <> paths to avoid possible risks if you have an infinite time horizon

A

longer

37
Q

If discount rate is 0 then:

A

we only count the first (immediate) reward, and everything else falls off to nothing

38
Q

If discount rate is 1 then:

A

future rewards count as much as the immediate one, so the utility is the plain (possibly unbounded) sum of all rewards

39
Q

Utility of Sequences

A

The discounted sum of rewards: the sum over t of gamma^t times the reward of the state at time t

This sum is bounded above by R_max divided by (1 - gamma), as worked out below
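Worked out (a restatement of the card; gamma in [0, 1) is the discount factor and R_max the largest possible reward):

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
= \frac{R_{\max}}{1 - \gamma}
```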

40
Q

Value Iteration (Bellman Equation)

A
  1. Start with arbitrary utilities
  2. Update utilities based on neighbors
  3. Repeat until convergence

Guaranteed to converge because with each step we fold in R(s), which is a true value (a sketch on a toy MDP follows below).
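A hedged numpy sketch of value iteration on a made-up 3-state, 2-action MDP; the transition probabilities, rewards, gamma, and tolerance are all illustrative.

```python
# Value iteration on a tiny illustrative MDP.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
R = np.array([0.0, 0.0, 1.0])                       # reward for being in each state
# T[s, a, s'] = probability of landing in s' after taking a in s
T = np.zeros((n_states, n_actions, n_states))
T[0, 0] = [0.9, 0.1, 0.0]; T[0, 1] = [0.1, 0.9, 0.0]
T[1, 0] = [0.9, 0.1, 0.0]; T[1, 1] = [0.0, 0.1, 0.9]
T[2, 0] = [0.0, 0.0, 1.0]; T[2, 1] = [0.0, 0.0, 1.0]

U = np.zeros(n_states)                              # 1. arbitrary utilities
while True:
    # 2. Bellman update: reward + discounted utility of the best action
    U_new = R + gamma * np.max(T @ U, axis=1)       # (T @ U)[s, a] = sum_s' T[s,a,s'] * U[s']
    if np.max(np.abs(U_new - U)) < 1e-6:            # 3. repeat until convergence
        break
    U = U_new

policy = np.argmax(T @ U, axis=1)
print(U, policy)
```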

41
Q

Policy Iteration (Bellman Equation)

A
  1. Start with an arbitrary policy pi_0
  2. Evaluate how good that policy is by calculating the utility of using the policy
  3. Improve the policy
42
Q

In MDP, our input is

A

a model consisting of a transition function T and a reward function R, and the intended output is to compute the policy pi (Planning)

43
Q

In Reinforcement Learning, the inputs are

A

transitions (initial state, action, reward, result state, …), and the intended output is to “learn” the policy pi.

44
Q

Policy Search Algorithm (RL)

A

Mapping states to actions.

The problem with this approach is that the data doesn’t reveal which actions need to be taken (the Temporal Credit Assignment problem)

45
Q

Value Function based Algorithms (RL)

A

Mapping states to values.

Use argmax to turn into policy

46
Q

Model-based Algorithm (RL)

A

Mapping (states & actions) to (next state & reward)

Turn this into a utility function using Bellman equations then use argmax to get the policy. This is computationally indirect

47
Q

Q-function

A

The utility of leaving state s via action a: the reward of state s, plus the discounted expected value (over next states s_prime) of the value of the best action in s_prime (see the expression below)
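In symbols (a restatement of the card, with T the transition model and gamma the discount factor):

```latex
Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \, \max_{a'} Q(s', a')
```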

48
Q

Q-Learning

A

Estimating the value of Q(s,a) based on transitions and rewards

The issue is that we don’t have R(s) and T(s, a, s_prime), since we’re not in an MDP (planning) setting

49
Q

Q_hat(s,a)

A

An estimate of the Q-function that we update by a learning rate alpha_t in the direction of the immediate reward r plus the discounted estimated value of the next state (a sketch follows below)
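A hedged sketch of the Q_hat update driven by observed transitions; the toy chain environment, epsilon-greedy action choice, and all constants are illustrative.

```python
# Q-learning sketch: update Q_hat(s, a) from observed transitions (s, a, r, s').
import numpy as np

n_states, n_actions = 5, 2
gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))                 # Q_hat(s, a) estimates
rng = np.random.default_rng(0)

def step(s, a):
    """Stand-in for the unknown world: returns (reward, next state)."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == n_states - 1 else 0.0), s_next

s = 0
for _ in range(5000):
    # explore with probability epsilon, otherwise exploit the current estimate
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    r, s_next = step(s, a)
    # move Q_hat(s, a) by alpha toward r + gamma * value of the best action in s'
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next if s_next != n_states - 1 else 0     # restart after reaching the goal

print(np.argmax(Q, axis=1))                         # greedy policy from Q_hat
```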

50
Q

Exploration

A

Getting the data that you need (learning)

51
Q

Exploitation

A

Stop learning and actually use what you’ve already learned

52
Q

Exploration-Exploitation dilemma

A

There is only one agent interacting with the world, and its actions must serve two conflicting goals: gathering the data it still needs (exploration) and using what it has already learned (exploitation)

53
Q

Game Theory

A

Mathematics of conflicts of interest when making decisions.

Relevant to AI when there are multiple agents acting in the environment

54
Q

In MDP we have the notion of a “policy”, which is mapping states to actions.

In Game Theory, we have the notion of a <>

A

Strategy

55
Q

Strategy (Game Theory)

A

Mapping of “all possible” states to actions

56
Q

Fundamental Result (Game Theory)

A

Everyone tries to maximize his/her own reward while knowing that everyone else is trying to do the same

57
Q

Nash Equilibrium (Game Theory)

A

A situation in which no player has any reason to unilaterally change his or her strategy

58
Q

Mixed Strategies (Game Theory)

A

Instead of always making the same decision given the same circumstances (a pure strategy), we assign probabilities to our choices (a distribution over strategies)

59
Q

Pure Strategies (Game Theory)

A

Make same decisions given the same circumstances

60
Q

3 Theorems resulting from Nash Equilibrium

A
  1. In an n-player pure strategy game, if iterated elimination of strictly dominated strategies eliminates all but one combination of strategies, that combination is the unique Nash equilibrium
  2. Any Nash equilibrium will survive the iterated elimination of strictly dominated strategies
  3. If the number of players is finite and each player has a finite number of strategies, there exists at least one (possibly mixed) Nash equilibrium