Final Exam Flashcards

Question

What is the Behaviour of the K-Means Error Curve as the Algorithm Progresses?

Answer 1

It is monotonically non-increasing. Every time cluster centroids are shifted, the error term (sum of intra-cluster squared distances) will never increase.
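
The behaviour of the error curve can be checked numerically. The sketch below is a minimal 1-D K-Means written only for illustration; the function name `kmeans_sse_history`, the toy data, and the seed are all assumptions, not part of the original card.

```python
import random

def kmeans_sse_history(points, k, iters=10, seed=0):
    """Run 1-D K-Means and return the SSE measured after each update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    history = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[j].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
        # Error term: squared distance from each point to its nearest centroid.
        sse = sum(min((x - c) ** 2 for c in centroids) for x in points)
        history.append(sse)
    return history

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.7]
history = kmeans_sse_history(points, k=3)
```

Each entry of `history` should be no larger than the one before it, whichever seeds are drawn.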

Answer 2

Either conduct random restarts or intelligently spread the centroid seeds across the dataset.

Answer 3

A way of using probability to assign points to clusters.

Answer 4

1. Expectation: Derive the expected value of z from the centroid means by calculating, for every point, the normalized probability that it was drawn from each centroid, where each centroid defines a probability distribution (in this case a normal distribution). Across points i and k centroids j:
$E[z_{ij}] = \frac{P(X=x_i|\mu=\mu_j)}{\sum_{l=1}^{k} P(X=x_i|\mu=\mu_l)}$
$P(X=x_i|\mu=\mu_j) = e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}$ (up to a constant that cancels in the normalization)
2. Maximization: Derive new centroid means by calculating the average weighted by the probabilities created in the expectation step.
$\mu_j = \frac{\sum_i E[z_{ij}] x_i}{\sum_i E[z_{ij}]}$
The procedure is structurally similar to K-Means, but it runs slower.
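
The two steps can be sketched for 1-D data as follows. This is an illustrative version that assumes a fixed, shared sigma and equal mixing weights; the function name `em_step` and the toy data are hypothetical.

```python
import math

def em_step(xs, mus, sigma=1.0):
    """One EM iteration for a 1-D Gaussian mixture with fixed, shared sigma."""
    # Expectation: E[z_ij] is the normalized likelihood of x_i under centroid j.
    resp = []
    for x in xs:
        likes = [math.exp(-0.5 * (x - mu) ** 2 / sigma ** 2) for mu in mus]
        total = sum(likes)
        resp.append([like / total for like in likes])
    # Maximization: each mean becomes the responsibility-weighted average.
    new_mus = []
    for j in range(len(mus)):
        weight = sum(r[j] for r in resp)
        new_mus.append(sum(r[j] * x for r, x in zip(resp, xs)) / weight)
    return resp, new_mus

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus = [0.0, 6.0]  # deliberately poor starting means
for _ in range(20):
    resp, mus = em_step(xs, mus)
```

After a few iterations the means settle near the two groups of points, and every row of `resp` is a soft assignment that sums to 1.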

Answer 5

K-Means *is* EM with some assumptions made.

Answer 6

1. Monotonically non-decreasing likelihood. 2. Is not guaranteed to converge in finite time (in practice, define a threshold for convergence). 3. Will not diverge. 4. Can get stuck in local optima. 5. Can work with any distribution if E,M are solvable.

Answer 7

Richness, scale invariance, and consistency. You can always have two of them, but not three. With a little hand waving you can get pretty close.

Answer 8

Discover knowledge and add interpretability. Reduce or remove the curse of dimensionality; the amount of data necessary is exponential in the number of features (2^n)

Answer 9

2^n | This is NP hard.

Answer 10

Filtering is where the feature selection algorithm is run once, up front, and the filtered features are passed on to the learner.

Answer 11

Wrapping is where the feature selection algorithm and the learner work cyclically, informing the other process.

Answer 12

Speed: Filtering is fast; you don't care what the learner wants and this process ignores the learning problem. This is recommended for a well-known problem. Wrapping is *slow*.

Answer 13

Decision trees and boosting implicitly filter.

Answer 14

Define a metric / criterion such as information gain to determine the usefulness of a set of features. You could use variance or entropy. Run a neural net and prune the features with low weights. This leaves you with 'useful' features.

Answer 15

Similarly to filtering but it is an iterative process that *improves* on some metric. Run the learning algorithm with the down selected features and use the SSE as a point in an optimization surface. You could use many different optimization routines like RHC.

Answer 16

Forward search: Start with no features. Add the single feature which most improves the metric of concern (i.e. SSE from your classifier / regressor). Add another single feature in the same manner. Repeat. This is a form of hill climbing. Backward search: Start with all features. Remove the single feature which, when removed, most improves the metric of concern. Combination: Both.
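
Forward search can be sketched as a greedy loop. In the snippet below, `error` stands in for "train the learner on these features and measure its SSE"; `toy_error` is a made-up scoring function used only so the example runs, and all names are hypothetical.

```python
def forward_search(all_features, error, max_features=None):
    """Greedy forward selection: repeatedly add the feature that lowers error most."""
    selected = []
    best_err = error(selected)
    remaining = list(all_features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining feature on its own and keep the best one.
        candidate = min(remaining, key=lambda f: error(selected + [f]))
        cand_err = error(selected + [candidate])
        if cand_err >= best_err:
            break  # no single addition improves the metric: a local optimum
        selected.append(candidate)
        remaining.remove(candidate)
        best_err = cand_err
    return selected, best_err

# Hypothetical stand-in for the learner's error: "a" and "c" carry signal,
# "b" and "d" only add a small per-feature cost.
def toy_error(features):
    gain = {"a": 5.0, "b": 0.0, "c": 3.0, "d": 0.0}
    return 10.0 - sum(gain[f] for f in features) + 0.1 * len(features)

selected, err = forward_search(["a", "b", "c", "d"], toy_error)
```

Backward search would mirror this loop, starting from the full set and removing one feature at a time.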

Answer 17

When removing it degrades the Bayes optimal classifier.

Answer 18

When it is not strongly relevant and there exists some subset of features where adding this feature back in improves the Bayes optimal classifier.

Answer 19

When it is not strongly or weakly relevant.

Answer 20

Usefulness measures the effect of the feature on a particular predictor. It's about minimizing error given an algorithm; it could be irrelevant but useful!

Answer 21

Preprocessing a set of features to create a new (hopefully smaller and/or more compact) set of features while retaining as much (relevant / useful) information as possible.

Answer 22

Although the goals are the same (new, relevant, and smaller set), feature selection is a subset of transformation. Transformation can in principle be an arbitrary process, but it is normally limited to linear transforms.

Answer 23

$x \in F^n \to F^m$ This takes instances from an n-dimensional space down to an m-dimensional space (m is normally less than n). The transformation operator is normally a linear operator: $P^T x$.

Answer 24

No, you can project into higher dimensional space.

Answer 25

You don't need all the original features; you need a smaller set that might require bringing in information across the features.

Answer 26

You can minimize polysemy and synonymy.

Answer 27

An effect that leads to false positives. (A word can have many meanings)

Answer 28

An effect that leads to false negatives. (Many words can have one meaning)

Answer 29

To find the axes of maximum variance. What does this mean? It means it minimizes the L2 norm (SSE) from the axis to the points.
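
A small numeric illustration of "axes of maximum variance", restricted to 2-D so the eigenvalues of the covariance matrix have a closed form. The function name `principal_axes_2d` and the sample points are assumptions made for this sketch.

```python
import math

def principal_axes_2d(points):
    """Return (eigenvalues, first_axis) of the 2x2 covariance matrix, largest first."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance entries of the mean-centered data.
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    mid = (sxx + syy) / 2
    rad = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = mid + rad, mid - rad
    # The eigenvector for lam1 is the direction of maximum variance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    return (lam1, lam2), axis

points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2)]
(lam1, lam2), axis = principal_axes_2d(points)
```

For these nearly diagonal points the first axis comes out close to the 45-degree line, and almost all of the variance sits in `lam1`.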

Answer 30

It must be orthogonal to the first.

Answer 31

Singular value decomposition.

Answer 32

1. They are guaranteed to be non-negative | 2. They monotonically decrease.

Answer 33

You can remove noise and improve the accuracy of a classifier.

Answer 34

Independent Components Analysis. It attempts to maximize independence among the new dimensions by finding a linear transform of the old space where the new features are statistically independent, for example by maximizing kurtosis.

Answer 35

A measure of how much one variable tells you about another.

Answer 36

PCA guarantees your new axes / features will be mutually orthogonal. ICA aims to make your new features mutually independent. PCA maximizes variance. ICA maximizes mutual information between the original and the new features. PCA returns an ordered set of features. ICA returns an unordered set of features.

Answer 37

No, it's terrible at it. ICA can do it quite well.

Answer 38

RCA (Random Component Analysis) or Random Projection. It projects onto random axes, and turns out to work well. LDA (Linear Discriminant Analysis). This finds projections which discriminate based on the label. Come up with a projection that puts things into separate clusters. LDA cares about the label.

Answer 39

Take the highest probability letter and assign it the bit 0 (or 1). The other letters go down the right hand side of the tree, and the *next most common* letter gets 0 (or 1) for its *second* bit. Repeat ad nauseam. Now you have something like:
* A = 0
* B = 111
* C = 110
* D = 10
The expected message size is $\sum_i P_i \cdot Length_i$, or $0.5 \cdot 1 + 0.25 \cdot 2 + 2 \cdot 0.125 \cdot 3 = 1.75$ bits per letter with P(A) = 0.5, P(D) = 0.25, and P(B) = P(C) = 0.125. This is called variable length encoding.

Answer 40

$H(S) = -\sum_s P(s) \log(P(s)) = -(0.5\log(0.5) + 0.25\log(0.25) + 2 \cdot 0.125\log(0.125)) = 1.75$ Note these are log base 2.
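
The arithmetic can be checked with a few lines of Python. The probabilities below are an assumption: they are the distribution that matches the code lengths A = 0, D = 10, B = 111, C = 110, namely P(A) = 0.5, P(D) = 0.25, P(B) = P(C) = 0.125.

```python
import math

# Assumed distribution, chosen to match the variable-length code lengths.
probs = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}
codes = {"A": "0", "B": "111", "C": "110", "D": "10"}

def entropy(p):
    """Shannon entropy in bits: H = -sum p * log2(p)."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def expected_length(p, code):
    """Expected bits per symbol: sum of P_i * Length_i."""
    return sum(p[s] * len(code[s]) for s in p)

h = entropy(probs)
avg_bits = expected_length(probs, codes)
```

Under this assumed distribution both quantities come out to 1.75 bits, so the code is as short as the entropy allows.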

Answer 41

$H(X|Y) = -\sum_{x,y} P(x,y) \log(P(x|y))$ Note these are log base 2.

Answer 42

$H(X,Y) = H(X) + H(Y)$ when X and Y are independent. In general, $H(X,Y) = H(X) + H(Y|X)$.

Answer 43

$I(X,Y) = H(Y)-H(Y|X)$

Answer 44

High values of I indicate that Y depends heavily on X and there is a lot of mutual information. Low values indicate that Y does not depend on X (it is an independent variable) and there is no mutual information.
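
The two extremes can be illustrated with small joint tables. This sketch computes $I(X,Y) = H(Y) - H(Y|X)$ directly; the function names and the two example tables are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X,Y) = H(Y) - H(Y|X) for a joint table joint[x][y] = P(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    # H(Y|X) = -sum over x,y of P(x,y) * log2 P(y|x)
    h_y_given_x = -sum(
        pxy * math.log2(pxy / px[i])
        for i, row in enumerate(joint)
        for pxy in row
        if pxy > 0
    )
    return entropy(py) - h_y_given_x

independent = [[0.25, 0.25], [0.25, 0.25]]  # two fair, unrelated coins
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y always equals X
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The unrelated coins give 0 bits of mutual information, while the copied coin gives 1 bit, the full entropy of Y.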

Answer 45

Kullback-Leibler Divergence is a distance measure used between two distributions. $D_{KL}(P \| Q) = \int P(x) \log\frac{P(x)}{Q(x)} dx$ It's not a *complete* distance measure (a metric) because it is not symmetric and does not satisfy the triangle inequality. In ML we use a well known distribution for P and sample from Q to determine difference.
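
A quick way to see why this is not a true distance is to compute it in both directions. The sketch uses the discrete form $\sum_x P(x) \log_2(P(x)/Q(x))$ and assumes Q is nonzero wherever P is; the function name and the two distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D(P || Q) = sum p * log2(p / q), in bits."""
    # Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)
```

Here `d_pq` and `d_qp` are both positive but differ from each other, and `d_pp` is zero.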

Answer 46

MDPs capture world information by dividing the world into states (s) which are a combination of the variables that describe your agent's behavior at a given point in time. Every state has a set of actions (a_s) that are available for it to use and a set of following states (s') that it could transition into with some probability T(s,a,s'). For a given s and a, your probabilities across s' must sum to 1. $\sum_{s'} T(s,a,s') = 1$
```
States: S
Model: T(s, a, s') ~ Pr(s'|s,a)
Actions: A(s), A
Reward: R(s), R(s,a), R(s,a,s')
Policy: $\Pi(s)$ -> a, $\Pi^*$
```

Answer 47

Isn't that from the avengers movies? Just kidding... You don't have to condition on anything except the present! The past doesn't matter.

Answer 48

Almost anything, but given too much information it can get unwieldy.

Answer 49

Inside the reward function.

Answer 50

It maximizes total reward over a 'lifetime'.

Answer 51

Associating an action with 'final rewards' given that in this state you took this action. This gives a sequence, a time, and a final reward.

Answer 52

Use a discounting factor (commonly called $\gamma$.)

Answer 53

$U(s) = V(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s') U(s')$

Answer 54

Using the Bellman equation... $\hat{U}_{t+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s') \hat{U}_{t}(s')$ Start the process by initializing the values / utility for every state to some value (0 is common). Then, essentially, for every state take the reward for that state and add the *discounted* expected utility of the best action: the sum of utilities for every state that it could transition into, multiplied by the probability of getting to that state. Repeat until the utilities stop changing. This has the effect of penalizing bad routes over time.
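
The update can be sketched on a tiny MDP. Everything concrete here (the two states, the two actions, the transition table, gamma = 0.9, and the stopping tolerance) is a made-up example, not something from the card.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman update until utilities change by less than tol."""
    U = {s: 0.0 for s in states}
    while True:
        new_U = {}
        for s in states:
            # Best expected utility over actions, then discount and add reward.
            best = max(
                sum(p * U[s2] for s2, p in T[(s, a)].items())
                for a in actions
            )
            new_U[s] = R[s] + gamma * best
        if max(abs(new_U[s] - U[s]) for s in states) < tol:
            return new_U
        U = new_U

# Hypothetical two-state MDP: "stay" keeps the state, "move" tries to switch it.
states = ["cold", "warm"]
actions = ["stay", "move"]
T = {
    ("cold", "stay"): {"cold": 1.0},
    ("cold", "move"): {"warm": 0.8, "cold": 0.2},
    ("warm", "stay"): {"warm": 1.0},
    ("warm", "move"): {"cold": 1.0},
}
R = {"cold": 0.0, "warm": 1.0}
U = value_iteration(states, actions, T, R)
```

In this toy problem the utility of "warm" approaches 1 / (1 - 0.9) = 10, and "cold" ends up slightly lower because it takes time to get to the rewarding state.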

Answer 55

Using the Bellman equation... $U_t(s) = R(s) + \gamma \sum_{s'} T(s,\pi_t(s),s') U_t(s')$ $\pi_{t+1}(s) = \arg\max_a \sum_{s'} T(s,a,s') U_t(s')$ Notice that in this case the action is a function of the policy that you have *chosen*, so there is no max in the evaluation step. Start with a policy (randomly selected works fine). Evaluate the Bellman equations given that selected policy. For every state update the policy to the best action that could be taken given the utilities calculated. Reevaluate and reupdate, repeat ad nauseam until convergence.

Answer 56

It reduces the set of equations from n nonlinear equations in n unknowns (nonlinear because of the max) to n equations in n unknowns that are *linear*. Now you can use matrix operations to solve for the utilities in each iteration. Policy iteration normally converges in fewer iterations than value iteration.
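
A minimal sketch of policy iteration with an exact linear solve in the evaluation step. The small Gaussian-elimination helper `solve`, the two-state MDP, and gamma = 0.9 are all assumptions made so the example is self-contained; a real implementation would more likely call a linear algebra library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def policy_iteration(states, actions, T, R, gamma=0.9):
    policy = {s: actions[0] for s in states}
    while True:
        # Evaluation: with the policy fixed there is no max, so
        # (I - gamma * T_pi) U = R is a linear system in U.
        A = [[(1.0 if s == s2 else 0.0) - gamma * T[(s, policy[s])].get(s2, 0.0)
              for s2 in states] for s in states]
        U = dict(zip(states, solve(A, [R[s] for s in states])))
        # Improvement: act greedily with respect to the evaluated utilities.
        new_policy = {
            s: max(actions, key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
            for s in states
        }
        if new_policy == policy:
            return policy, U
        policy = new_policy

# Hypothetical two-state MDP used only for illustration.
states = ["cold", "warm"]
actions = ["stay", "move"]
T = {
    ("cold", "stay"): {"cold": 1.0},
    ("cold", "move"): {"warm": 0.8, "cold": 0.2},
    ("warm", "stay"): {"warm": 1.0},
    ("warm", "move"): {"cold": 1.0},
}
R = {"cold": 0.0, "warm": 1.0}
policy, U = policy_iteration(states, actions, T, R)
```

On this example the loop stops after two evaluations, with the policy "move" when cold and "stay" when warm.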

Answer 57

A modeler (take the transitions and build a model) and a simulator (run the model to build transitions).

Answer 58

1. Use states into $\Pi$ to get actions (Policy)
2. Use states into $U$ to get values (Value)
3. Use states and actions into T,R to get the next state and reward (Transition / Reward)
Reinforcement learning algorithms that search in policy space are very indirect and face a temporal credit assignment problem. Those that target values learn to map states to values. Those that use transitions and rewards are model-based learners and can be couched as a supervised learning problem.

Answer 59

$Q(s,a) = R(s) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q(s',a')$ This is the value for arriving in state s, leaving via a, and proceeding optimally thereafter.

Answer 60

Moving from state s to s' given some action and yielding some reward.

Answer 61

$\hat{Q}(s,a)

Answer 62

The average value you will receive for following the optimal discounted policy after you take this action in this state.

Answer 63

Yes, IF every state / action pair is visited infinitely often.

Answer 64

The exploitation / exploration tradeoff. Exploitation, or using what you know, tends to lean towards the states you've seen with high rewards. Exploration has the possibility of finding *new* state / action pairs with higher reward! One agent has conflicting objectives.
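
One common way to balance the two objectives is epsilon-greedy action selection. The card does not name a specific strategy, so this choice is an assumption, as are the function name, the Q estimates, and the epsilon values.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-known one."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    return max(actions, key=q_values.get)   # exploit

rng = random.Random(0)
q_values = {"left": 0.2, "right": 1.0}      # hypothetical Q estimates for one state
greedy_picks = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]
mixed_picks = [epsilon_greedy(q_values, 0.5, rng) for _ in range(1000)]
```

With epsilon set to 0 the agent only ever exploits; with epsilon at 0.5 it still favours the best-known action but keeps sampling the other one.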

Study for Final (92 cards)