Final Prep Flashcards

Question

What does feature relevance do? (14.10)

Answer 1

It measures the effect on the Bayes Optimal Classifier (increase or decrease). Relevance is about how much information a feature provides. Example: If a binary feature C is 1 for all samples, it has 0 entropy and is irrelevant.
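A quick way to see the entropy example above is to compute it. This is only a minimal sketch: the function name `feature_entropy` and the sample values are made up for illustration, not taken from the lecture.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of a feature."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

constant_c = [1, 1, 1, 1]  # hypothetical feature C: 1 for every sample
balanced = [0, 1, 0, 1]    # hypothetical feature that actually varies

print(feature_entropy(constant_c))  # 0.0
print(feature_entropy(balanced))    # 1.0
```

The constant feature carries no information, which is why it counts as irrelevant.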

Answer 2

Usefulness represents the effect on the error of a model or learning algorithm. This is what we ultimately care about.

Answer 3

Make sure to understand what's going on here.

Answer 4

Look this up. Computes the best label given all the probabilities you could compute over all the hypotheses. Any other algorithm has an inductive bias. The gold standard - the best you could do if you had everything and all the time in the world. Relevance is usefulness to the

Answer 5

I think this is teased out in this video, but needs to be summarized better. Computes the best label given all the probabilities you could compute over all the hypotheses. Any other algorithm has an inductive bias.

Answer 6

Pre-processing a set of features to create a new, smaller (usually) set while retaining as much information as possible (useful, relevant information). Example, starting with features x1, x2, x3, x4:
- Feature Selection => x2, x3
- Feature Transformation => 2*x1 + x3
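The x1..x4 example above can be sketched in a few lines. The sample values and the function names `select_features` and `transform_features` are hypothetical, chosen only to mirror the card.

```python
# Hypothetical sample with four original features x1..x4.
sample = {"x1": 1.0, "x2": 2.0, "x3": 3.0, "x4": 4.0}

def select_features(x):
    """Feature selection: keep a subset of the original features (x2, x3)."""
    return [x["x2"], x["x3"]]

def transform_features(x):
    """Feature transformation: build a new feature, here 2*x1 + x3."""
    return [2 * x["x1"] + x["x3"]]

print(select_features(sample))     # [2.0, 3.0]
print(transform_features(sample))  # [5.0]
```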

Answer 7

Finds the direction of maximal variance in the data, then finds further directions that are mutually orthogonal. So it first finds the direction of max variance (the 1st principal component), then finds the direction orthogonal to it with the most remaining variance (the 2nd principal component). It's like doing a transformation into a new space where feature selection can work (15.7). An eigenvalue of 0 for a particular dimension means there is no variance along it, so it gives you no information (irrelevant, but that doesn't mean it's useless). PCA is about finding correlation by maximizing variance, which also gives it the ability to do reconstruction (15.8).

Answer 8

* It gives you the principal direction of maximum variance
* It gives you orthogonal directions beyond that
* It's a global algorithm - you're working across the entire, global space of the problem domain
* Gives you the best reconstruction
* The eigenvalues of successive principal components are monotonically non-increasing, which means they tend to decrease. So each successive principal component essentially has less variance, and you can throw away the ones with the least variance.
* Can look at the eigenvalues to see which dimensions are important

Answer 9

If you perform PCA, take the first principal component, project all the samples onto that one dimension, and then re-project back into the original space, you get the smallest L2 (squared) error of any projection onto a single direction. By maximizing variance, you're maintaining distances as best you can for any given number of dimensions.
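To make the reconstruction claim concrete, here is a rough pure-Python sketch for 2-D data only. It uses the closed-form eigen-decomposition of a 2x2 covariance matrix rather than a library routine; the function names and the sample points are assumptions for illustration.

```python
import math

def pca_2d(points):
    """Return (eigenvalues, eigenvectors, mean) for 2-D points, largest variance first."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # First principal direction; fall back to a coordinate axis when b == 0.
    if b != 0:
        v1 = (b, lam1 - a)
    else:
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(*v1)
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])  # orthogonal to v1
    return (lam1, lam2), (v1, v2), (mx, my)

def reconstruction_error(points, direction, mean):
    """Sum of squared distances after projecting onto one unit direction and back."""
    err = 0.0
    for x, y in points:
        dx, dy = x - mean[0], y - mean[1]
        t = dx * direction[0] + dy * direction[1]
        err += (dx - t * direction[0]) ** 2 + (dy - t * direction[1]) ** 2
    return err

# Hypothetical, roughly diagonal data.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
(lam1, lam2), (v1, v2), mean = pca_2d(points)
print(lam1 >= lam2)  # eigenvalues come out in non-increasing order
print(reconstruction_error(points, v1, mean) <= reconstruction_error(points, v2, mean))
```

Projecting onto the first component throws away only the variance along the second one, which is why its reconstruction error is the smaller of the two.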

Answer 10

The goal of ICA is to create a new set of features from the originals where each new feature is statistically independent of every other new feature, that is, the mutual information between all new feature pairs is 0 (knowing the value of one tells you nothing about the value of any other), while maximizing the mutual information between the new features and the original features.

Answer 11

The hidden variables are the sources - they are what you are trying to recover. They represent the new, statistically independent features. The observables are the original features. They represent some linear combination of the hidden variables.

Answer 12

There are N people in a room and N microphones. Each person represents a hidden variable while each microphone records a linear combination of all the people talking. So the people are the hidden variables and the microphones are the observed features. The people are statistically independent because they are producing their own sound waves, which are independent of all other people's sound waves.
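The forward half of the cocktail party setup (sources mixed into microphone signals) is easy to sketch. The signals and mixing weights below are invented for illustration, and this does not perform ICA itself; ICA would try to recover `sources` from `microphones` alone.

```python
# Two hypothetical speakers (hidden sources), sampled at four time steps.
sources = [
    [0.0, 1.0, 0.0, -1.0],   # person A
    [1.0, 1.0, -1.0, -1.0],  # person B
]

# Assumed mixing weights: row i says how strongly microphone i hears each person.
mixing = [
    [0.8, 0.2],  # microphone 1 is closer to person A
    [0.3, 0.7],  # microphone 2 is closer to person B
]

def mix(mixing, sources):
    """Each microphone signal is a linear combination of the source signals."""
    steps = len(sources[0])
    return [
        [sum(w * s[t] for w, s in zip(row, sources)) for t in range(steps)]
        for row in mixing
    ]

microphones = mix(mixing, sources)
print(microphones[0])
print(microphones[1])
```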

Answer 13

Feature selection is a subset of feature transformation. Feature transformation has a goal of trying to reduce the number of features by creating new features (x1, x2, x3 => x1 + 3*x2) while feature selection is reducing the number of features by choosing amongst those which have the most influence on the learning output.

Answer 14

• Big One: ICA is local and selects small features (nose, eyes, mouth, edge detection, etc.) while PCA is global and would see things like the 'brightness' of an image, or the 'average' of an image.
• ICA is highly directional while PCA is not (you could feed PCA the input matrix in any orientation and because it's already providing something directional, it would just be a different transformation and you end up with the same results).
• ICA can help you find the underlying structure of the data
• ICA is good at the blind source separation problem while PCA is terrible at it. PCA doesn't perform
• PCA has ordered features while ICA is more like a bag of features
• PCA's features are mutually orthogonal while ICA's are not
• ICA's features are mutually independent while PCA's are not
• PCA maximizes variance while ICA maximizes mutual information (b/t original and new features)
• ICA is focused on probability while PCA is usually solved by linear algebra

Answer 15

* RCA - random component analysis. Generates random directions. It is fast and tends to perform pretty well. The number of final dimensions is usually higher than that of PCA and ICA. Tends to produce correlations.
* LDA - linear discriminant analysis. Finds a projection that discriminates based on the label, so it at least uses the values of the labels as information.

Answer 16

- The measure of distance between clusters is the distance between the farthest two points in each cluster.
- You will still merge clusters based on the shortest distance, but you're using a max to define that distance.
- Tends to produce spherical clusters with consistent diameter

https://www.youtube.com/watch?v=VMyXc3SiEqs
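The "max to define the distance, min to pick the merge" idea above can be sketched like this. The function names and the toy clusters are hypothetical; a real implementation would repeat the merge step until the desired number of clusters remains.

```python
import math
from itertools import combinations

def complete_link_distance(cluster_a, cluster_b):
    """Distance between the two farthest points, one taken from each cluster."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

def closest_pair(clusters):
    """Indices of the two clusters with the smallest complete-link distance."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: complete_link_distance(clusters[ij[0]], clusters[ij[1]]),
    )

# Hypothetical clusters of points on a line.
clusters = [[(0, 0), (1, 0)], [(2, 0)], [(10, 0), (11, 0)]]
print(complete_link_distance(clusters[0], clusters[1]))  # 2.0
print(closest_pair(clusters))                            # (0, 1)
```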

Answer 17

The sum of the two players' rewards is constant. That constant may or may not be zero.
constant is 0: player 1 = +5, player 2 = -5
constant is 3: player 1 = +3, player 2 = 0; player 1 = +1, player 2 = +2; etc.

Answer 18

Perfect information means you know what state the game is in (and thus know the state of everything in the game)

Answer 19

One player is trying to find the maximum of the minimums (what's best for itself) and the other player is trying to find the minimum of the maximums (what is also best for itself, in a zero-sum game).

Answer 20

Minimax == Maximin and there always exists an optimal, pure strategy for each player. This assumes each player is going to play the same way, e.g., maximize their own rewards; that is, they are being 'rational'. Can be deterministic or non-deterministic.

Answer 21

Mixed strategies have some distribution over strategies, as opposed to choosing a fixed strategy. Ex: 25% holder, 75% resigner

Answer 22

You can plot the lines and then take the maximum of the minimum. This will either be located at the beginning, end, or where they cross.
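The "beginning, end, or crossing point" recipe above can be sketched for a 2x2 zero-sum game. This is a hedged illustration: `maximin_mixed` is a made-up name, and the matching-pennies payoff matrix is an assumed example, not the game from the lecture.

```python
def maximin_mixed(payoff):
    """Best mixing probability p for row 0 in a 2x2 zero-sum game.

    payoff[i][j] is the row player's reward for row i against column j.
    """
    def line(j, p):
        # Expected reward against column j when row 0 is played with probability p.
        return p * payoff[0][j] + (1 - p) * payoff[1][j]

    candidates = [0.0, 1.0]  # the two endpoints
    # Crossing point of the two lines, if they are not parallel.
    denom = (payoff[0][0] - payoff[1][0]) - (payoff[0][1] - payoff[1][1])
    if denom != 0:
        p_cross = (payoff[1][1] - payoff[1][0]) / denom
        if 0 <= p_cross <= 1:
            candidates.append(p_cross)

    # Take the maximum of the minimum over the candidate points.
    best_p = max(candidates, key=lambda p: min(line(0, p), line(1, p)))
    return best_p, min(line(0, best_p), line(1, best_p))

# Assumed example: matching pennies, from the row player's point of view.
print(maximin_mixed([[1, -1], [-1, 1]]))  # (0.5, 0.0)
```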

Answer 23

A Nash Equilibrium exists if _no_ player, picked at random and given the chance, would switch their strategy while the other players keep theirs. In other words, if no one would want to change their strategy given the choice to do so, you're in a NE. This works for both mixed and pure strategies.
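For pure strategies in a two-player game, the "would anyone switch?" test can be written directly. A minimal sketch, assuming prisoner's dilemma rewards of -1 (mutual cooperation), -6 (mutual defection), and 0 / -9 (defector / cooperator); the function name is hypothetical.

```python
def is_pure_nash(payoffs, row, col):
    """True if neither player gains by unilaterally deviating from (row, col).

    payoffs[r][c] is a (row_player_reward, column_player_reward) pair.
    """
    row_reward, col_reward = payoffs[row][col]
    row_can_improve = any(
        payoffs[r][col][0] > row_reward for r in range(len(payoffs))
    )
    col_can_improve = any(
        payoffs[row][c][1] > col_reward for c in range(len(payoffs[row]))
    )
    return not (row_can_improve or col_can_improve)

# Assumed prisoner's dilemma rewards; index 0 = cooperate, 1 = defect.
pd = [
    [(-1, -1), (-9, 0)],
    [(0, -9), (-6, -6)],
]
print(is_pure_nash(pd, 1, 1))  # True: mutual defection
print(is_pure_nash(pd, 0, 0))  # False: either player would switch to defect
```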

Answer 24

If you would always choose one strategy over another, no matter what, that chosen strategy strictly dominates the others.

Answer 25

* For an N-player pure strategy game, if eliminating all strictly dominated strategies leaves you with exactly one combination, then it is the unique NE
* Any NE will survive elimination of strictly dominated strategies
* If there is a finite number of players and a finite number of strategies, then there exists at least one NE, which may involve mixed strategies.
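The first bullet can be tried out on a two-player game with a small elimination loop. This is a sketch under assumptions: it handles only two players and pure strategies, the function name is made up, and the prisoner's dilemma rewards are assumed values.

```python
def eliminate_dominated(payoffs):
    """Iteratively remove strictly dominated pure strategies in a two-player game.

    payoffs[r][c] is a (row_player_reward, column_player_reward) pair.
    Returns the surviving row indices and column indices.
    """
    rows = list(range(len(payoffs)))
    cols = list(range(len(payoffs[0])))
    changed = True
    while changed:
        changed = False
        for r in list(rows):
            # r is strictly dominated if some other row beats it against every column.
            if any(all(payoffs[o][c][0] > payoffs[r][c][0] for c in cols)
                   for o in rows if o != r):
                rows.remove(r)
                changed = True
        for c in list(cols):
            # Same check for the column player, using the second reward in each pair.
            if any(all(payoffs[r][o][1] > payoffs[r][c][1] for r in rows)
                   for o in cols if o != c):
                cols.remove(c)
                changed = True
    return rows, cols

# Assumed prisoner's dilemma rewards; index 0 = cooperate, 1 = defect.
pd = [
    [(-1, -1), (-9, 0)],
    [(0, -9), (-6, -6)],
]
print(eliminate_dominated(pd))  # ([1], [1])
```

Only (defect, defect) survives, so by the first bullet it is the NE of this game.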

Answer 26

No ... If you work it backwards, the last game is determined by the matrix, and thus each previous game is also just determined by the matrix. So it doesn't matter how many repeated games you play, the last one is the only one that matters and it's defined by the same matrix.

Answer 27

No - you come to the same solution regardless of who starts.

Answer 28

You cooperate in the first round, then copy your opponent's _previous_ action every round thereafter.
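As a tiny sketch of that rule (the "C"/"D" encoding and the opponent's move list are assumptions for illustration):

```python
def tit_for_tat(opponent_history):
    """Cooperate first, then repeat the opponent's previous action."""
    if not opponent_history:
        return "C"
    return opponent_history[-1]

# Hypothetical opponent moves: C = cooperate, D = defect.
opponent_moves = ["C", "D", "D", "C"]
my_moves = [tit_for_tat(opponent_moves[:t]) for t in range(len(opponent_moves) + 1)]
print(my_moves)  # ['C', 'C', 'D', 'D', 'C']
```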

Answer 29

For a low gamma - you should always defect. Expected value of game = 0 + -6 * gamma / (1 - gamma)
For a high gamma - you should always cooperate. Expected value of game = -1 / (1 - gamma)
The gamma where the strategies are equal is 1/6. You may need to recalculate these with different rewards, but the same concept applies.
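The two formulas above can be compared numerically. A minimal sketch, assuming the opponent plays tit-for-tat and the rewards are 0 / -6 / -1 as in the formulas; the function names are hypothetical.

```python
def always_defect_value(gamma):
    """Discounted value of always defecting: 0 in round one, then -6 every round after."""
    return 0 + -6 * gamma / (1 - gamma)

def always_cooperate_value(gamma):
    """Discounted value of always cooperating: -1 every round."""
    return -1 / (1 - gamma)

# Below 1/6 defecting is better, above it cooperating is better, at 1/6 they tie.
for gamma in (0.1, 1 / 6, 0.5):
    print(gamma, always_defect_value(gamma), always_cooperate_value(gamma))
```

Setting -6 * gamma / (1 - gamma) = -1 / (1 - gamma) gives 6 * gamma = 1, which is where the 1/6 threshold comes from.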

Answer 30

"Any feasible payoff profile that strictly dominates the minmax/security level profile can be realized as a set of Nash equilibrium payoff profile, with sufficiently large discount factor"

Answer 31

The best score you can guarantee yourself against a malicious adversary (eg, defect, defect).

Answer 32

The zone where a better score can be obtained for both players than either could guarantee themselves if they were playing against a malicious adversary.

Answer 33

Start in mutual cooperation, but if you ever 'cross the line' (defect), 'I will deal out vengeance forever' (always defect regardless of what you do).

Answer 34

Both players always play their best response independent of history. A strategy is not subgame perfect (e.g., you're making an implausible threat) if, when the time comes for you to follow through with what your strategy says to do, it would actually be worse for you than doing something different. You take over the brains of the players, set the moves, then release them. Given the choice, would they still follow their strategy or would it be better to do something different? If it's better to do something different - not subgame perfect.

Answer 35

Cooperate if we agree, defect if we disagree. This is a NE for the prisoner's dilemma and it's subgame perfect. No matter what states two Pavlov machines playing against each other are in, they will always end up in mutual cooperation. It makes plausible threats.
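The claim that two Pavlov machines always end up cooperating can be checked by brute force over the four starting states. A small sketch; the "C"/"D" encoding and the function names are assumptions for illustration.

```python
from itertools import product

def pavlov(my_last, their_last):
    """Cooperate if the two previous actions agreed, defect if they disagreed."""
    return "C" if my_last == their_last else "D"

def play(start_a, start_b, rounds=5):
    """Run two Pavlov machines against each other from a given pair of actions."""
    a, b = start_a, start_b
    history = [(a, b)]
    for _ in range(rounds):
        a, b = pavlov(a, b), pavlov(b, a)
        history.append((a, b))
    return history

for start in product("CD", repeat=2):
    print(start, "->", play(*start)[-1])
```

Every starting state reaches ('C', 'C') within two rounds: a disagreement leads to mutual defection, and mutual defection is itself an agreement, so both cooperate next.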

Answer 36

For a bimatrix repeated game, looking at the long-term average reward (assuming a high discount factor): if it's possible to have a mutually beneficial relationship, you can build a Pavlov-like machine for any game and use it to construct a subgame perfect NE in polynomial time.
- if it's not possible to have mutual cooperation, then it's a zero-sum-like game
- then we would solve using LP, which either produces a NE or it doesn't, and at most one player improves.

Answer 37

- you use a minimax over Q*(s', (a', b')) in the Bellman equation instead of a max
- value iteration works
- minimax-Q converges
- unique solution to Q*
- policies can be computed independently
- Q update can be computed efficiently (LP)
- Q functions are sufficient to specify the policy
- not known whether it can be solved in polynomial time, the way a standard MDP can

Answer 38

- instead of computing minimax, you would compute the NE in the combined Bellman equation (eg, Nash-Q)
- value iteration doesn't work
- Nash-Q doesn't converge
- no unique solution to Q*
- policies cannot be computed independently b/c it's defined as a joint behavior
- update is not efficient
- Q functions not sufficient to specify a policy :(

Answer 39

You're scoring different configurations. You're defining a configuration consisting of a partition and its centers, then trying to minimize the error, which can be seen as: Sum_x [ ||center_p(x) - x||^2], or the sum of the squared distance from each point to the center of its partition. It is most like hill climbing.
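The error being minimized can be computed directly for a given configuration. A rough sketch: the points, the assignment, and the function names are invented for illustration, and only the scoring step is shown, not the full k-means loop.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_error(points, centers, assignment):
    """Sum over points x of ||center_{P(x)} - x||^2, where assignment[i] is P(x_i)."""
    return sum(squared_distance(centers[assignment[i]], x) for i, x in enumerate(points))

def centroid(cluster):
    """Mean of the points in one partition."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

# Hypothetical data: two obvious groups, scored with two choices of centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
assignment = [0, 0, 1, 1]
rough_centers = [(0.0, 0.0), (10.0, 0.0)]
better_centers = [
    centroid([x for x, k in zip(points, assignment) if k == j]) for j in range(2)
]

print(kmeans_error(points, rough_centers, assignment))   # 8.0
print(kmeans_error(points, better_centers, assignment))  # 4.0
```

Moving each center to the mean of its partition never raises the error, which is the hill-climbing flavor of the algorithm.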

Answer 40

Markov Property: Only the present matters; you don't care about history.
The environment must be stationary: The rules of the environment can't change.

Clustering Onwards (64 cards)