Final Prep Flashcards

2nd Half Semester Information

1
Q

Unsupervised Learning

A

Make sense of unlabeled data.
Data description: find a more compact way to describe the data.

Versus supervised learning, which is function approximation over labeled data.

2
Q

Basic Clustering Problem

A

Given:
- a set of objects X
- inter-object distances D(x, y), with D(x, y) = D(y, x)

Output:
- a partition function P_D(x), with P_D(x) = P_D(y) whenever x and y
  belong to the same cluster; the cluster label is a compact descriptor
  of all equivalent data

3
Q

Single Linkage Clustering (SLC)

A

Start with n objects and the pairwise distances between them.

Link the two closest points; then repeatedly merge the two closest
clusters, where the distance between clusters is the distance between
their closest points (that is what makes it "single" linkage).

Stop when only k clusters remain.

Only distances between un-linked items matter. A sketch below.
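A minimal pure-Python sketch of the loop, assuming a precomputed symmetric distance matrix D (function and variable names are my own, not from the lectures):

~~~python
def single_linkage(D, k):
    """Merge the two closest clusters until only k remain.
    D: n x n symmetric distance matrix (list of lists or array)."""
    clusters = [{i} for i in range(len(D))]   # every point starts as its own cluster
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link distance: closest pair of points across the two clusters
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]            # merge the closest pair of clusters
        del clusters[b]
    return clusters
~~~

Swapping the min for a max or a mean would give complete or average linkage; the min is what makes it single linkage.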

4
Q

Hierarchical Agglomerative Cluster Structure

A

A tree structure used to represent the hierarchy of cluster linkages derived from SLC.

5
Q

Running Time of SLC

A
For n points and k clusters: O(n^3).

There are O(n^2) distances, and each of the roughly n merge steps scans
them to find the closest pair, giving O(n^3) overall.

Similar to the minimum spanning tree problem.
6
Q

k-means clustering algorithm

A

- Pick k "centers" (at random)
- Each center claims the points closest to it
- Recompute each center by averaging the points in its cluster
- Repeat until converged

The center can be a point in the collection or any point in the space.
Assigning points to their closest center is a lot like kNN (see the
sketch below).
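A minimal numpy sketch of the loop (my own names, not a canonical implementation):

~~~python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """X: (n, d) data. Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # pick k centers at random
    for _ in range(iters):
        # each center claims the points closest to it (a 1-NN query against the centers)
        labels = ((X[:, None] - centers[None]) ** 2).sum(axis=-1).argmin(axis=1)
        # recompute each center as the average of its cluster (keep it if the cluster is empty)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):                       # repeat until converged
            return new, labels
        centers = new
    return centers, labels
~~~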

7
Q

Euclidean space

A

In geometry, a two- or three-dimensional space in which the axioms and postulates of Euclidean geometry apply; also, a space in any finite number of dimensions, in which points are designated by coordinates (one for each dimension) and the distance between two points is given by a distance formula.

8
Q

k-means Proof

A

P(x): partition/cluster of object x
C_i: set of all points in cluster i
center_i = (sum of y in C_i) / |C_i|

The algorithm alternates P(x) -> centers -> P(x); neither step can
increase the error, which is the heart of the convergence argument.

9
Q

K-Means as Optimization

A

Configurations (inputs): the centers and the partition P.

Score: error grows with distance from the center,
E(P, center) = sum over x of ||center_P(x) - x||^2.

Neighborhood: configurations where P is fixed and the centers move, or
the centers are fixed and P changes. A score function below.
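The score as a one-liner, assuming numpy arrays like those in the k-means sketch above:

~~~python
import numpy as np

def kmeans_error(X, centers, labels):
    # E(P, center) = sum over x of ||center_P(x) - x||^2
    return ((X - centers[labels]) ** 2).sum()
~~~

Both neighborhood moves (re-averaging centers with P fixed, reassigning P with centers fixed) can only keep this score the same or lower it, which is why k-means behaves like hill climbing.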

10
Q

Randomized Optimization most like K-Means

A

Hill climbing.
Each step moves to a neighboring configuration that scores better than the one before.

11
Q

Properties of K-means clustering

A

- Each iteration is polynomial: O(kn)
- Finite (but exponential) number of iterations: O(k^n)
  possible ways of assigning points to partitions;
  in practice convergence is quick
- Error decreases (if ties are broken consistently)
- Can get stuck

12
Q

Mitigating K-means Getting Stuck

A

- Random restarts
- An initial review of the data, to pick starting centers far apart

13
Q

Equidistant Points between Clusters

A

A point equidistant between two clusters may be assigned to either one;
consistent tie-breaking is what keeps the error monotonically decreasing.

14
Q

Soft Clustering

A

Assume the data were generated by:

1) Select one of k Gaussians (with known variance)
2) Sample x_i from that Gaussian
3) Repeat n times

Task: find a hypothesis h = <mu_1, ..., mu_k> that maximizes the
probability of the data (maximum likelihood). A sampling sketch below.
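A sketch of that generative story in numpy (names mine; one shared, known variance):

~~~python
import numpy as np

def sample_mixture(mus, sigma, n, seed=0):
    """Pick one of the k Gaussians uniformly, then sample x_i from it; repeat n times."""
    rng = np.random.default_rng(seed)
    z = rng.integers(len(mus), size=n)         # hidden variable: which Gaussian was chosen
    x = rng.normal(np.asarray(mus)[z], sigma)  # sample x_i from that Gaussian
    return x, z                                # the learner sees only x, never z
~~~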

15
Q

Maximum Likelihood Gaussian

A

- The ML mean of the Gaussian is the mean of the data.
- That holds for a single Gaussian, not directly for k different means.
- With k different means, each point gets a hidden variable indicating
  which Gaussian it came from.

16
Q

Expectation Maximization

A

Tick-tock between expectation, which defines z from mu (soft assignments
from the means), and maximization, which defines mu from z (means from
the soft assignments). A sketch below.
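A minimal sketch of the tick-tock for 1-D Gaussians with a known, shared variance (my own names; not the general EM recipe):

~~~python
import numpy as np

def em_gmm(x, k, sigma=1.0, iters=50, seed=0):
    """x: 1-D data array. Returns the k means and the soft assignments."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)            # initial guess at the means
    for _ in range(iters):
        # E-step: define z from mu (each Gaussian's responsibility for each point)
        logp = -((x[:, None] - mu[None]) ** 2) / (2 * sigma ** 2)
        z = np.exp(logp - logp.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)
        # M-step: define mu from z (responsibility-weighted averages)
        mu = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)
    return mu, z
~~~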

17
Q

Properties of EM

A

- Monotonically non-decreasing likelihood: each step moves to something
  at least as good
- Has a chance of not converging (though in practice it does); k-means,
  by contrast, must converge in finitely many steps
- Will not diverge
- Can get stuck in a local optimum (a good but not best assignment);
  fix with random restarts
- Works with any distribution, given the domain knowledge to derive the
  E and M steps for it

18
Q

Clustering Properties

A

- Richness
  for any desired way of clustering the inputs, some distance function
  produces it
- Scale invariance
  doubling or halving all distances does not change how the points
  should be clustered
- Consistency
  shrinking within-cluster distances and expanding between-cluster
  distances does not change the clustering; this is an application of
  domain knowledge

19
Q

Impossibility Theorem

A

Richness, scale invariance, and consistency cannot all be true at the same time.

20
Q

Feature Selection

A

Two reasons:
1) Knowledge discovery
   interpretability and insight
2) Curse of dimensionality
   the data you need grows exponentially with the number of features

21
Q

How hard is it to reduce a problem from n features to m features?

A

An exponentially hard problem: there are 2^n subsets of n features, or
(n choose m) if exactly m must be kept. This is NP-hard; it can encode
3-SAT-style NP-hard problems.

22
Q

Filtering - Features

A

Features flow into some kind of search algorithm and from there to the
learning algorithm: the flow is forward only, with no feedback. The
search can look at the labels, so a criterion like information gain
works; you could even use a decision tree as the filter.
Pro:
  Speed
Cons:
  Looks at features in isolation; a feature may only matter in
  combination with another
  Ignores the learner
A library sketch below.
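One common library route for a filter, assuming scikit-learn is available (mutual information here standing in for information gain):

~~~python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)
# score every feature against the labels once, up front, and keep the 4 best;
# the downstream learner never feeds anything back into this choice
X_kept = SelectKBest(mutual_info_classif, k=4).fit_transform(X, y)
print(X_kept.shape)  # (200, 4)
~~~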

23
Q

Wrapping - Features

A

Takes in the features, searches over a subset, hands it to the learning
algorithm, and then updates the search with the scores that come back.
Pro:
  Takes the learner and its model bias into account
Con:
  Very slow, since the learner's training time sits inside the loop

24
Q

Filtering Implementation

A
Information gain
Variance, entropy
Gini index
A neural net where low-weight inputs are removed
"Useful" features
Independent / non-redundant
25
Q

Wrapping Implementation

A

Hill climbing, driven by the scores the learner returns
Randomized optimization in general
Forward search

26
Q

Forward Search

A

Score each feature on its own and keep the best; then try the best in
combination with each remaining feature, keep the best pair, and
continue until the increase in score plateaus. A sketch below.
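A greedy sketch, assuming a scikit-learn style model and numpy arrays (names and the plateau threshold are my own):

~~~python
from sklearn.model_selection import cross_val_score

def forward_search(model, X, y, min_gain=0.01):
    """Add the single feature that most improves the learner's
    cross-validated score; stop when the gain plateaus."""
    chosen, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {f: cross_val_score(model, X[:, chosen + [f]], y).mean()
                  for f in remaining}
        f, s = max(scores.items(), key=lambda kv: kv[1])
        if s - best < min_gain:        # score has plateaued: stop
            break
        chosen.append(f)
        remaining.remove(f)
        best = s
    return chosen
~~~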

27
Q

Backward Search

A

Start from all n features and try every subset of n-1 features, then
n-2, and so on. Stop at the point where the subset still does well and
the error is about to increase dramatically. A sketch below.
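The mirror image of the forward-search sketch, same assumptions:

~~~python
from sklearn.model_selection import cross_val_score

def backward_search(model, X, y, max_drop=0.01):
    """Drop the feature whose removal hurts the cross-validated score
    least; stop just before the error increases dramatically."""
    kept = list(range(X.shape[1]))
    best = cross_val_score(model, X, y).mean()
    while len(kept) > 1:
        scores = {f: cross_val_score(model, X[:, [g for g in kept if g != f]], y).mean()
                  for f in kept}
        f, s = max(scores.items(), key=lambda kv: kv[1])
        if best - s > max_drop:        # removing anything else costs too much: stop
            break
        kept.remove(f)
        best = s
    return kept
~~~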

28
Q

Relevance

A

x_i is strongly relevant if removing it degrades the Bayes Optimal Classifier.
x_i is weakly relevant if it is not strongly relevant and there is some
subset of features S such that adding x_i to S improves the BOC.
Otherwise x_i is irrelevant.

29
Q

Bayes Optimal Classifier

A

Takes the weighted average of all hypotheses.

The BOC is the best you can do on average, if you can find it.

30
Q

Relevance vs Usefulness

A

Relevance measures the effect on the BOC.
Usefulness measures the effect on a particular predictor.
Relevance ~ information.
Usefulness ~ error, given a model/learner.

31
Q

Feature Transformation

A

Pre-process the set of features to create a new set that is smaller or
more compact, while retaining the relevant and useful information.

32
Q

Polysemy

A

One word can mean many things; this causes false positives.

33
Q

Synonymy

A

The same thing can be expressed by different words; this causes false negatives.

34
Q

Principal Components Analysis

A

Allows transformation into a new space that can be used for feature
selection by looking at the eigenvalues: anything with an eigenvalue
near 0 can be removed. It finds correlation by maximizing variance, and
it allows reconstruction of the original data.

Mutually orthogonal components
Maximal variance
Ordered features (by eigenvalue)
Bag of features (ordered features are a kind of bag of features)

I need to research more on this. A sketch below.
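A minimal numpy sketch of the eigenvalue view (my own; a real pipeline would use a library PCA/SVD):

~~~python
import numpy as np

def pca(X, m):
    """Project X onto its m highest-variance directions."""
    Xc = X - X.mean(axis=0)                          # center the data
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]                   # ordered features: biggest eigenvalue first
    W = vecs[:, order[:m]]                           # mutually orthogonal directions
    return Xc @ W, W                                 # reconstruct via (Xc @ W) @ W.T
~~~

Directions whose eigenvalues come out near zero carry almost no variance, which is the sense in which PCA doubles as feature selection.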

35
Q

Independent Components Analysis

A

Aims for independence of features: creates new features from the base
features through a linear transformation such that the new features are
statistically independent of one another (pairwise mutual information
of zero), while maximizing the mutual information between the new
features and the originals.

Mutually independent
Maximal mutual information
Bag of features (unordered)
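A small blind-source-separation sketch, assuming scikit-learn's FastICA (the sources and mixing matrix are invented for illustration):

~~~python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = np.c_[rng.uniform(-1, 1, 1000), rng.laplace(size=1000)]  # two independent sources
X = S @ np.array([[1.0, 0.5],
                  [0.4, 1.0]])                               # observed linear mixtures
S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
# S_hat recovers the sources, up to order and scale
~~~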

36
Q

Random Components Analysis

A

Generates random directions and projects the data onto them. It works
well when the next step is some kind of classification. The projected
dimension is lower than n, but not as low as PCA achieves. The big
advantage is speed. A sketch below.
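A sketch in plain numpy (names mine):

~~~python
import numpy as np

def random_projection(X, m, seed=0):
    """Project X onto m random directions; cheap, and often good enough
    for a downstream classifier."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(X.shape[1], m))   # random directions
    return X @ P
~~~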

37
Q

Linear Discriminant Analysis

A

Finds a projection that discriminates based on the labels, so that the
classes project into separate clumps or clusters.

38
Q

Three Types of Learning

A

Supervised learning: y = f(x)
  like function approximation

Unsupervised learning: f(x)
  clustering / description

Reinforcement learning: y = f(x), given reward z
  a lot like function approximation, with the added z

39
Q

Markov Decision Process

A
States: S
Model: T(s, a, s') ~ Pr(s' | s, a)
Actions: A(s), A
Rewards: R(s), R(s, a), R(s, a, s')
-----------------------------------------
Policy: pi(s) -> a
40
Q

Markov State

A

The set of possible configurations of the world the agent can be in.

41
Q

Markov Action

A

The things you can do when you are in a given state.

42
Q

Markov Model

A

Given a state s, an action a, and a next state s', the model gives the
probability of s' given s and a: Pr(s' | s, a).

43
Q

Markov Property

A

Only the present matters: Pr(s' | s, a) depends only on the most recent
state. You can work around this by folding everything the process needs
to remember into the current state.

44
Q

Markov Reward

A

R(s), R(s, a), R(s, a, s'): there are several ways to define rewards,
and one may be more convenient than another. We focus on R(s), a reward
for being in a certain state.

45
Q

Markov Policy

A

pi(s) -> a: a function that takes in a state and gives you the action to
take, a "command". pi* is the optimal policy, the one that maximizes
total reward.

46
Q

Temporal Credit Assignment Problem

A

The problem of assigning blame/credit to each move in a sequence when
only the final state matters.

47
Q

Utility of Sequences

A

Stationary preferences: if one sequence of states has higher utility
than another, the same ordering holds for their tails after the shared
first state. This lets the utility of a sequence be written as the
(discounted) sum of the rewards of its states.

48
Q

Utility of State

A

The reward for that state, plus all the reward we can expect to collect
over the rest of the sequence by following the policy.

49
Q

Finding the Policy (Policy Iteration)

A

1) Start with a guess at the policy
2) Evaluate the policy by calculating the utility of each state under
   it; because there is no max here, this is a linear equation solve
3) Improve the policy based on the utility function, and repeat

A sketch below.
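A minimal numpy sketch (names mine; it assumes R(s)-style rewards, a T(s, a, s') transition array, and a discount factor gamma, which these cards leave implicit):

~~~python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """T: (S, A, S) transition probabilities; R: (S,) rewards."""
    S = len(R)
    pi = np.zeros(S, dtype=int)                    # 1) start with a guessed policy
    while True:
        # 2) evaluate: U = R + gamma * T_pi @ U has no max in it,
        #    so solve the linear system (I - gamma * T_pi) U = R
        T_pi = T[np.arange(S), pi]                 # (S, S) transitions under pi
        U = np.linalg.solve(np.eye(S) - gamma * T_pi, R)
        # 3) improve: act greedily with respect to U (R(s) is constant in a)
        pi_new = (T @ U).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, U
        pi = pi_new
~~~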

50
Q

Modeler

A

Takes in transitions and puts out a model.

51
Q

Simulator

A

Takes in a model and produces transitions.

52
Q

Learner

A

Takes in transitions and puts out a policy.

53
Q

Planner

A

Takes in a model and puts out a policy

54
Q

Model Based Reinforcement Learning

A

Transitions -> modeler -> model -> planner -> policy.

55
Q

RL Based Planner

A

Model -> simulator -> transitions -> learner -> policy.

56
Q

Three Main Approaches to RL

A

Policy search: learn the policy directly, mapping states to actions

Value-function based: map states to values/utilities, then derive the
policy from those values

Model based: learn the transition and reward model from observed states
and actions (fairly direct learning), then plan against it