# CS 7641 Final Flashcards Preview

## CS - 7641 - Final > CS 7641 Final > Flashcards

Flashcards in CS 7641 Final Deck (136)
1
Q

Supervised Learning

A

Use labeled training data to generalize labels to new instances
(function approximation)

2
Q

unsupervised learning

A

making sense out of unlabeled data

data description

3
Q

A
• Consider each object a cluster (n objects)
• Define intercluster distance as the distance between the two closest two points in the two clusters
• Merge two closest clusters
• Repeat n-k times to make k clusters
4
Q

Intercluster Distance in Single Linkage Clustering

A

the distance between the 2 closest points in the 2 clusters

5
Q

Inter cluster distance represents

A

domain knowledge

6
Q

T/F Single Link Clustering is Deterministic

A

True. Unless there are ties, it will give us an answer.

7
Q

In SLC, if distance can be represented by edge link of a graph, what algorithm is it the same as?

A

Minimum Spanning Tree

8
Q

What is the running time of Single Link Clustering (SLC) in terms of n points and k clusters

A

O(n^3). O(n^2) to look at all distances to find closest pairs and have to do this n times

9
Q

Issues with SLC

A

Could end up with weird clusters that surround other clusters

10
Q

k-means clustering algorithm

A
• pick k centers (art random)
• each center “claims” its closest points
• recompute the centers by averaging the clustered points
• repeat until convergence
11
Q

When does k-means converage

A

When recomputing the centers does not move

12
Q

T/F the centers have to be a point in the collection of objects

A

False

13
Q

k-means in euclidean space process

A

circle between partition assignment and recalculating the centers

14
Q

Which optimization algorithm is most like k-means? HC, GA, or SA

A

Hill Climbing. You take a step towards a configuration that is better than you have before

15
Q

K-means runs out of configurations and converges in finite time when

A

finite configurations, break ties consistently, and never go into a configuration with a higher error

16
Q

Big O for Each iteration of k-means clustering is

A

polynomial (pretty fast). O(k n): look at distance between each k center and n points and run through all the points to redefine clusters. could also have extra d if d dimensions

17
Q

Big O for Finite iterations of k-means

A

exponential O(k^n). first of n objects can be in any of k clusters, 2nd of n objects can be in k clusters, etc. In practice, it is much faster

18
Q

If ties are broken consistently in k-means then error

A

decreases [with one exception]

19
Q

T/F k-means cannot get stuck

A

False (local optima can still happen)

20
Q

How to avoid k-means getting stuck

A

random restart OR do initial analysis and pick centers in good spots

21
Q

Soft Clustering Overview

A

Points can be probabilistically from one cluster or another

22
Q

Soft Clustering Idea

A

Assume the data was generated by
1. select one of k gaussians uniformly [fixed known variance]
2. sample x from Gaussian
3. repeat n times
Task: find a hypothesis that maximizes the prob of the data (ML)

23
Q

What is the maximum likelihood gaussian (soft clustering)

A

The ML mean of the Gaussian is the mean of the data

24
Q

Expectation Maximum

A

Has two phases expectation which is soft clustering and maximization

25
Q

How is expectation maximization like k-means

A

k-means if cluster assignments use arg max (prob of 0 or 1). It improves in a probabilistic metric like k-means improves in squared error metric

26
Q

Properties of Em

A
• Monotonically non-decreasing likelihood (it’s not getting worse)
• Does not converge b/c infinite configurations b/c infinite number of probabilities (partially does)
• Will not Diverge (numbers don’t blow up and become infinitely large)
• Can get stuck (local optima)
• Works with any distribution (If EM solvable)
27
Q

Clustering Properties

A
• Richness
• Scale Invariance
• Consistency
28
Q

Richness

A

For any assignment of objects to clusters, there is some distance matrix D such that your clustering algorithm returns that clustering.

29
Q

Scale Invariance

A

scaling distances by a positive value does not change the clustering

30
Q

Consistency

A

Shrinking intra cluster distances and expanding intercluster distances does not change the clustering

31
Q

Impossibility Theorem

A

No clustering can achieve all three of richness, scale invariance, consistency

32
Q

Curse of dimensionality

A

Data needed grows exponentially with features (2^n)

33
Q

Big O of feature selection

A

np-hard, 2^n, exponential

34
Q

Two approaches to feature selection

A

filtering and wrapping

35
Q

filtering overview

A

set of features as input, passes to same search algorithm that outputs fewer features

36
Q

wrapping overview

A

set of features, search over some subset of features, hands to leaning algorithm which reports how well those features performed

37
Q

Pros of filtering

A

speed: faster than wrapping because you don’t have to worry about what the learner does.
Ignores the learning problem

38
Q

cons of filtering

A

look at features in isolation.

39
Q

Pros of wrapping

A

takes into account model bias and learning itself

40
Q

con of wrapping

A

extremely slow compared to filtering

41
Q

ways to do feature filtering

A
• information gain
• variance, entropy
• “useful features”
• independent/non redundant
• ex: DT, NN (trim features with low weights)
42
Q

ways to do feature wrapping

A
• hill climbing
• randomized Optimization
• forward search
• backward search
43
Q

forward search

A

Look at all features in isolation, keep whatever feature is best. Then look at adding any of the remaining features and choose whichever is best in combination with first feature, etc. This is similar to hill climbing

44
Q

backward search

A

start with all the features and figure out which one can you eliminate, one at a time until you get to a subset that performs well

45
Q

Relevance: strongly relevant

A

xi is strongly relevant if removing is degrades the Bayes optimal classifier

46
Q

Relevance: weakly relevant

A

Xi is weakly relevant if

• not strongly relevant
• and there exists some subset of features such that adding xi to the subset improves the bayes optimal features
47
Q

Relevance measures effect on

A

Bayes optimal classifier. INFORMATION

48
Q

Usefulness measures effect on

A

a particular predictor. ERROR given model/learner (c is useful in neural nets)

49
Q

feature transformation

A

the problem of pre-processing a set of features to create a new (smaller? more compact?) feature set, while retaining as much (relevant? useful?) information as possible

50
Q

T/F Feature selection is a subset of feature transformaiton

A

True

51
Q

feature transformation vs feature selection

A

transformation: linear combination of original features like 2x1 + x2 as 1 feature. selection: have x1, x2, x3, x4 and only take x1 and x2

52
Q

A

a whole bunch of databases and you want to retrieve the subset of documents that are relevant to the query (features)
ad hoc - you don’t know what they queries are

53
Q

polysemy

A

words can have multiple meanings
ex: car- Vehicle and also stupid thing in LISP
results in false positives

54
Q

synonomy

A

many words mean the same thing. ex: car and automobile are same thing
results in false negative

55
Q

principal components analysis overview

A

finds directions that maximizes variance (principal component)
and finds directions that are mutually orthogonal

56
Q

principal component analysis properties

A
1. maximize variance (principal component)
2. mutually orthogonal - global algorithm
3. you can prove best reconstruction (minimize l2 error by projecting onto a new line)
4. you can throw away dimensions with the smallest eigenvalues
57
Q

best reconstruction in principal component analysis (PCA)

A

it is simply a re-labeling of the dimensions. The directions just tell you how far along each axis it is. It is just a linear rotation of the original dimensions. the new dimensions that PCA returns does not lose any information. It’s just a rotation of the data. You can reconstruct your original data with the information

58
Q

mutually orthogonal in PCA

A

Which means it’s a global algorithm. All of the directions and the new features that they find have a big global constraint → have to be mutually orthogonal.

59
Q

A

Is there another basis, which is a

linear combination of the original basis, that best expresses our data set?

60
Q

eigenproblem

A

a computational problem that can be solved by finding the eigenvalues and/or eigenvectors of a matrix. In PCA, we are analyzing the covariance matrix (see the paper for details)

61
Q

PCA vs ICA overview

A

PCA: correlation, maximizes variance => reconstruction
ICA: independence, linear transformation into new feature space such that each new feature is independent of each other and mutual information as high as possible between original and new features

62
Q

Blind Source Separation Problem (cocktail)

A

It is difficult to listen to one conversation at once when there are a lot of conversations going on at the same time. People are hidden variables in here and microphones are the observables

63
Q

Independent Components Analysis (ICA) goal

A

figure out the hidden variables from observables

64
Q

PCA vs ICA

• mutually orthogonal
• mutually independent
• maximal variance
• maximal mutual information
• ordered features
• bag of features
A
• mutually orthogonal: PCA
• mutually independent: ICA
• maximal variance: PCA
• maximal mutual information: ICA
• ordered features: PCA
• bag of features: ICA and ish PCA
65
Q

Which is global and which is local: PCA vs ICA

A

PCA is global (in face finds brightness then average face)
ICA finds local/structure (in faces it finds noses then eyes, in the natural world, it finds edges. In documents it’s topics)

66
Q

RCA: Random Components Analysis

A

Generates random directions
projects data into those directions
works very well if the next thing you do is classification

67
Q

What is the big advantage of RCA

A

Fast (much faster than PCA or ICA)

68
Q

LDA: Linear Discriminant Analysis

A

Find a projection that discriminates based on the label. Pays attention to how the resulting components are used
“In contrast to PCA, LDA is “supervised” and computes the directions (“linear discriminants”) that will represent the axes that maximize the separation between multiple classes.”

69
Q

ICA is important for finding

A

structure and edges

70
Q

PCA vs ICA: which uses primarily Probability vs primarily Linear algebra

A

ICA: probability
PCA: linear algebra

71
Q

Markov Decision Process: Model

A

Transition State. T(s,a,s’) is transition probability of being in state s, take action a, and end up in state s’. This function produces the P(s’|s,a): The probability of a new state given your current state and given the action you take

72
Q

Markov Decision Process: Actions (A)

A

The things that you can do in a state (up, down, left, right)

73
Q

Markov Decision Process: Reward

A

Scalar Reward for being in a state, taking action, etc. (Red and green states in board example). Represents our Domain Knowledge

74
Q

Markov Decision Process: Policy

A

A policy is a solution to Markov decision process. Tells you what action to take from a particular state

75
Q

Markov: Infinite Horizons

A

We assumed there is not a time that is clicking. If there is then policy would depend on state and time

76
Q

Markov: Utility of Sequences

A

if I prefer one sequence of events today, then I prefer the same sequence tomorrow

77
Q

Markov: discounted

A

discounted rewards allows us to go infinite distance in finite time because it is geometric. Rmax/1-gamma

78
Q

Markov: Utility of a policy at a state

A

the expected outcome from that point if we follow the utility function f

79
Q

Markov: utility vs reward

A

Utility: reward for that state and all the reward from that point on (Long term)
Reward: Immediate: reward (Short term)

80
Q

Bellman Equation

A

Key equation for RL. A recursive equation that defines the true value of being in some particular state including policy, rewards, transition matrix, gammas, etc. It is the utility of the state.

81
Q

Markov Decision: Value Iteration

A

start w/ arbitrary utilities, update utilities based on neighbors, repeat until convergence

82
Q

Markov Decision: Policy Iteration

A

start with some policy (guess), evaluate its utility, improve policy by updating it to the policy that takes the action that maximizes the expected utility based on what we just calculated. Repeat

83
Q

What is Q-Learning

A

Evaluating the Bellman equations from data

84
Q

Markov Property (2 things)

A
• only the present matters

- things are stationary (the rules don’t change)

85
Q

Advantage of policies in Markov decisions vs “plans”

A

They are robust to the stochasticity of the world

86
Q

Markov: Why use gamma^t in utility of sequences

A

Rewards are substantial at first and quickly trail off

87
Q

RL: What is Q(s,a) function in plain english

A

the value for arriving in s, leaving via a, and proceeding optimally thereafter

88
Q

RL: What happens if you set alpha (learning rate) of 1

A

Full learning, forget everything you learn and just jump to new v

89
Q

RL: What happens if you set alpha (learning rate) of 0

A

Won’t learn at all. Will just assign v to v

90
Q

Q-learning family of algorithms varies on what 3 things (generally)

A
• how initialize Q hat
• how to decay alpha sub t
• how to choose actions
91
Q

How to choose actions in Q-learning

A

Can choose Q hat which is greedy (always choose best Q-hat), but can hit local mins. Solve this problem with simulated annealing like approach (take random actions sometimes)
Could choose Q hat randomly, but then you are not using what you learned about Q hat so far. You learn optimal policy, but don’t follow it.

92
Q

Problem with random restarts in Q-learning

A

Very slow! takes a long time to get to infinity without restarts.. even longer with

93
Q

Fundamental Resul Theorem

A

in a 2 player, zero sum deterministic game of perfect information, minimiax = maximin and there always exists an optimal pure strategy for each player (with rational agent)

94
Q

nash equilibrium

A

if and only if each player won’t change their strategy given all the other player’s strategy

95
Q

T/F Any N.E. will survive elimination of strictly dominated strategies

A

True

96
Q

T/F There is always a N.E. (maybe mixed) for finite strategies and finite player games

A

True

97
Q

T/F In the n player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination it he unique N.E

A

True

98
Q

Tit-for-tat

A

First round = corporate. All future rounds = copy the opponent’s previous move. A strategy for infinite games.

99
Q

Folk Theorem in Math

A

General idea: in repeated games, the possibility of retaliation opens the door for cooperation
Folk Theorem: In math: results known, at least to experts in the field, and considered to have established status, but not published in complete form

100
Q

game theory: feasible region

A

average payoff of some point strategy

101
Q

Game Theory: Folk Theorem

A

Any feasible payoff profile that strictly dominates the minimax/security profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor. Proof: if it strictly dominates the minimax profile, can use it as a threat. Better off doing what you are told

102
Q

Grim Trigger

A

As long as you cooperate, you get mutual benefit. If you ever defect “deal out vengeance” forever

103
Q

Implausible Threat

A

The vengeance could cost more than trying to cooperate so it’s not realistic to always punish opponent

104
Q

Game Theory: Subgame Perfect

A

Each player is always taking a best response independent of history

105
Q

Game Theory: Pavlov

A

cooperate if agree, defect is disagree

106
Q

T/F is Pavlov sub game perfect?

A

Yes. No matter what state they are in, the average reward is mutual cooperation

107
Q

Computational Folk Theorem

A

Can build pavlov like machine for any game. construct subgame perfect nash equilibrium for any game in polynomial time. This is because if it’s pavlov, we can get to pavlov quickly, and if not it is zero-sum like (solve an LP), or at most one player improves

108
Q

bimatrix game

A

2 players and each player has its own reward matrix and is average reward repeated game

109
Q

what makes stochastic games more interesting

A

actions that the players take impact not just the rewards but also future states

110
Q

How can you use bellman equation in zero sum stochastic games

A

use minimax on q value instead of max

111
Q

Properties of Q learning in zero sum stochastic games

A
```value iteration works
minimax-Q converges under same conditions that Q does
unique solution to Q*
policies can be computed independently
update efficient
Q functions sufficient to specify policy```
112
Q

General Sum Stochastic Games

A

Can’t do minimax anymore. Instead use Nash of Q.

113
Q

What properties change for general sum stochastic games

A

Value iteration doesn’t work
Nash doesn’t converge
No unique solutions to Q*
Policies can not be computed independently
The update is not efficient unless P=PPAD
Q functions are not sufficient to specify policy

114
Q

Expectation Maximization - soft clustering / expectation step

A

We know the means (assume the means) and then calculate how likely it is that the data came from the means. Knowing means and variances allows you to calculate which point came from which cluster. For each point, which cluster is it from?

115
Q

Expectation Maximization - Maximization steps

A

We know the clusterings, so calculate the means and variances for these clusters.

116
Q

monotonically non decreasing

A

finding higher and higher likelihoods

117
Q

A
1. knowledge discovery - it is useful to keep understand what features are meaningful (interpretability and insight)
2. curse of dimensionality
118
Q

What algorithm looks at feature selection

A

Decision Trees (a type of filtering) and in particular, information gain

119
Q

What algorithm did we learn in the first half of the course that does feature transformation

A

neural nets

120
Q

principal component analysis. Why can throw away eigenvalues

A

You project onto n dimensions with n features. As you move from 1st principal component to n principal components, the eigenvalues monotonically non-increasing. You can throw away the ones with the least eigenvalue = throw away the ones with the least variance

121
Q

Con of PCA

A

If one of your original dimensions is directly related to the data but there is gaussian noise, PCA will end up throwing away that original dimension

122
Q

PCA is like filtering or wrapping

A

filtering. It transforms into a new space where features can be filtered

123
Q

T/F PCA goal is to find independent projects

A

False. It can happen to find independent projects. It’s finding uncorrelated dimensions

124
Q

How can ICA want to maximize mutual information and also mutually independent

A

Max mutual info between new features and old features

Mutually independent between new features

125
Q

What feature transformation problem does much better in blind source separation problem and is directional

A

ICA. Directional meaning it gives very different answers if you rotate the matrix that it is given.

126
Q

RL: 3 different types of RL

A

Policy search: map states to actions through policies
Value function based: map state to value through utilities
Model-Based: map states and actions to new states and rewards through transitions and rewards

127
Q

RL: Q-Learning what is U(s) and pi(s)

A
```U(s) = max over all actions of Q(s,a) in that state
pi(s) = argmax over all action of Q(s,a)```
128
Q

RL: What is Q-learning

A

evaluating the Bellman equation from data. We know the transitions, not the model. We know

129
Q

Q-learning converges if

A
• we visit s,a infinitely often so it needs to run a very long time
• s’ needs to be drawn from the actual transition probabilities T(s,a,s’)
• rewards need to be drawn from reward function R(s)
130
Q

minimax

A

Players consider worst case counter. A is trying to max and B is trying to min. So A finds the maximum minimum and B is trying to find the minimum maximum. There always exists an optimal pure strategy for each player

131
Q

Relaxing Which constraint makes minimax fail?

A

change from perfect information to hidden information. Minimax works in 2-sum zero-sum deterministic OR nondeterministic games of perfect information with pure strategies

132
Q

mixed strategy vs pure strategy

A

mixed strategy, you choose with probability

133
Q

Game Theory: finite state strategy. How to solve?

A

You can use MDPs! Only the last state matters, etc.

134
Q

minmax profile

A

pair of payoffs, one for each player, that represents the payoff achieved a player defending itself against a malicious adversary

135
Q

security level

A

when you can have mixed strategy with minmax profile

136
Q

q-learning does not work with general stochastic games, what could work?

A

repeated stochastic games (folk theorem)
cheap talk -> correlated equilibrium. players can talk a little bit
cognitive hierarchy -> best response. Rather than solve for equilibria, best response to what you believe the other player will do
side payments - (cocoa values) players can pay the opponent to help them