Unsupervised Learning Flashcards

1
Q

reason for clustering

A

categorize data without labels

2
Q

cluster searching

A
  1. search for closest cluster
  2. search within the cluster

Note: much faster than searching each document

3
Q

k-means training

A
  1. initialize k random centers
  2. assign each point to its closest center
  3. recalculate each cluster center as the mean of its assigned points
  (repeat steps 2-3 until the assignments stop changing; see the sketch below)
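A minimal numpy sketch of the loop above (the `kmeans` helper and the random data are illustrative assumptions, not the course's code):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]            # 1. k random centers
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                                 # 2. assign points to closest center
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
                            for j in range(k)])                       # 3. recompute centers as cluster means
    return centers, labels

centers, labels = kmeans(np.random.randn(200, 2), k=3)
```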
4
Q

responsibility

A
  • how strongly a point belongs to each cluster; a point's responsibilities sum to 1 across the clusters
5
Q

fuzzy k-means (soft k-means) algorithm

A
  • responsibilities reflect confidence in each cluster assignment
  1. initialize k random centers
  2. calculate responsibilities (a softmax over negative distances)
  3. recalculate each center as the responsibility-weighted mean of the points
  (hard k-means is the special case where each r is 0 or 1; see the sketch below)
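A sketch of the responsibility step, assuming the usual soft k-means form with a stiffness parameter `beta` (the parameter name is an assumption):

```python
import numpy as np

def responsibilities(X, centers, beta=1.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances, shape (N, K)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)                    # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)                        # softmax: each row sums to 1

# centers update: m_k = sum_n r[n, k] * x[n] / sum_n r[n, k]
# as beta -> infinity, r becomes one-hot and this reduces to hard k-means
```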
6
Q

k-means cost function

A
  • Euclidean (distortion) cost - sensitive to scale and to K; K=N gives 0 cost
  • Davies-Bouldin index - rewards low intra-cluster spread and high inter-cluster separation; lower is better (see the sketch below)
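A small sketch computing both on toy data (sklearn's `KMeans` and `davies_bouldin_score` are assumed available; the random data is a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.randn(300, 2)                                           # placeholder data
km = KMeans(n_clusters=3, n_init=10).fit(X)

euclidean_cost = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()   # drops to 0 as K -> N
db = davies_bouldin_score(X, km.labels_)                              # lower is better
```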
7
Q

where k-means fails

A
  1. donut
  2. asymmetric (high-variance) gaussian
  3. smaller cluster in bigger one
8
Q

k-means problems

A
  1. choice of k
  2. local minimum is bad in this case
  3. sensitive to initial clusters
  4. can only find spherical clusters
  5. does not take density into account
9
Q

ideal k value

A
  • at the "elbow" of the cost-vs-k curve, where the steep part meets the flat part (see the sketch below)
  • not at the minimum (the cost keeps falling as k grows, reaching 0 at k = N)
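A quick elbow-method sketch (sklearn's `KMeans.inertia_` is the Euclidean cost; the data is a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)
costs = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 11)]
# plot k (1..10) against costs and pick k at the bend, not at the minimum
```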
10
Q

hierarchical (agglomerative) clustering

A
  • greedy algorithm
  • repeatedly merge the two closest clusters (starting from single points) until one group contains all of the points
  • threshold the intra-cluster distance to cut the tree into clusters (see the sketch below)
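A minimal sketch with scipy (the Ward linkage and the distance threshold are example choices, not prescribed by the card):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(50, 2)                           # placeholder data
Z = linkage(X, method='ward')                        # greedily merge the two closest clusters
labels = fcluster(Z, t=5.0, criterion='distance')    # cut the tree at a distance threshold
```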
11
Q

joining clusters

A
  1. single linkage: min distance between any two points
  2. complete linkage: max distance b/w any two points
  3. mean distance (UPGMA): mean distance between all pts
  4. Ward's: minimize the increase in variance
12
Q

dendrogram

A
  • tree diagram of the merges in hierarchical clustering; the height of each join is the distance at which the two clusters merged
13
Q

distance equations

A
14
Q

valid distance metrics

A
  1. non-negative: d(x, y) ≥ 0 (positive)
  2. identity: d(x, y) = 0 <=> x = y (definite)
  3. symmetry: d(x, y) = d(y, x)
  4. subadditive (triangle inequality): d(x, z) ≤ d(x, y) + d(y, z) (see the check below)
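A toy check of the four axioms on sample points (the `check_metric` helper is hypothetical, just to make the properties concrete):

```python
import numpy as np

def check_metric(d, points, tol=1e-9):
    for x in points:
        assert abs(d(x, x)) <= tol                          # identity: d(x, x) = 0
        for y in points:
            assert d(x, y) >= -tol                          # non-negativity
            assert abs(d(x, y) - d(y, x)) <= tol            # symmetry
            for z in points:
                assert d(x, z) <= d(x, y) + d(y, z) + tol   # subadditivity (triangle inequality)

euclidean = lambda a, b: float(np.linalg.norm(a - b))
check_metric(euclidean, [np.random.randn(3) for _ in range(5)])
```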
15
Q

self-organized maps

A
  • apply competitive learning as opposed to error-correction learning
  • reduce the input to a lower-dimensional, discretized representation
  • this representation is called a map
16
Q

gaussian mixture model (GMM)

A
  • form of density estimation
  • gives approximation of the probability distribution
  • use when notice multimodal data
  • p(x) = π1·N(x | µ1, Σ1) + π2·N(x | µ2, Σ2), and in general p(x) = Σk πk·N(x | µk, Σk)
  • πk = probability that x belongs to gaussian k
  • the πk sum to 1
  • introduce a latent variable Z for which πk = p(Z = k) (see the sketch below)
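A density-evaluation sketch for the two-component mixture above (made-up parameters; scipy's `multivariate_normal` is assumed available):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, pis, mus, covs):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

pis  = [0.6, 0.4]                       # mixture weights, sum to 1
mus  = [np.zeros(2), 3.0 * np.ones(2)]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_pdf(np.array([0.5, 0.5]), pis, mus, covs))
```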
17
Q

GMM algorithm

A
  1. initialize πk, µk, Σk
  2. E-step: calculate the responsibilities
  3. M-step: re-estimate πk, µk, Σk from the responsibility-weighted points
  4. repeat until the log-likelihood converges
18
Q

responsibility k-means v. GMM

A
  • soft k-means computes responsibilities with a softmax over (negative, scaled) distances
  • GMM computes them as πk times the gaussian density, normalized across clusters
19
Q

independent component analysis (ICA)

A
  • special type of blind source separation
  • cocktail party problem of listening in on one person’s speech in a noisy room
  • data: x = (x1, …, xm)^T
  • hidden components (sources): s = (s1, …, sn)^T
  • mixing matrix: A (n by n)
  • unmixing matrix recovers the sources: W = A^(-1)
  • generative model (noiseless): x = As
  • noisy: x = As + n (see the sketch below)
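A cocktail-party-style sketch with sklearn's FastICA (the sources, mixing matrix, and component count are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two hidden sources
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])                         # mixing matrix
X = S @ A.T                                        # observed mixtures, x = As

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                       # recovered sources (up to scale/permutation)
W = ica.components_                                # estimated unmixing matrix
```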
20
Q

benefits of GMM over fuzzy k-means

A
  • GMM has a separate πk per cluster as opposed to a single β
  • GMM can handle elliptical clusters, whereas fuzzy k-means assumes spherical ones
21
Q

GMM v. Fuzzy K-means

A
22
Q

singular covariance problem

A
  • when a gaussian collapses onto points that are very close together, its covariance approaches 0
    • inverting that covariance amounts to division by 0: the singularity
  • a form of bad local minimum
23
Q

ways of dealing with singular covariance problem

A
  1. diagonal covariance
    • solution to the singular covariance problem
    • speeds up computation (easy to take the inverse)
  2. spherical gaussian
  3. shared covariance
    • each gaussian has the same covariance
24
Q

diagonal covariance

A
25
kernel density estimation
fitting PDFs with kernels; can use GMM
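A quick sketch with scipy's `gaussian_kde` on a bimodal sample (the data is a placeholder):

```python
import numpy as np
from scipy.stats import gaussian_kde

x = np.concatenate([np.random.randn(300), np.random.randn(300) + 5])  # bimodal sample
kde = gaussian_kde(x)                       # fit a sum of gaussian kernels to the sample
print(kde.evaluate([0.0, 2.5, 5.0]))        # estimated density at a few points
```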
26
expectation-maximization (EM)
  • general version of the procedure used for GMMs
  • handles latent variables
  • a form of coordinate descent
  • guaranteed to increase the likelihood at each iteration (maximum likelihood)
27
dimensionality reduction
1) cleaner signal; 2) decreased transformation load -> reduced time; 3) allows visualization of the data
28
singular value decomposition (SVD)
29
PCA
  • Z = XQ, where Q holds the eigenvectors of the covariance of X
  • rotates the data (into the eigenvector basis); doesn't change lengths
  • high variance = signal, low variance = noise
  • keep the left-most columns of Z (enough to retain ~95% of the variance; see the sketch below)
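A numpy sketch of PCA via SVD (the 95% threshold follows the card; the data is a placeholder):

```python
import numpy as np

X = np.random.randn(500, 10)
Xc = X - X.mean(axis=0)                                    # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Q = Vt.T                                                   # principal directions (eigenvectors)
Z = Xc @ Q                                                 # rotated data
var = s ** 2 / (len(X) - 1)                                # variance along each direction
k = int(np.searchsorted(np.cumsum(var) / var.sum(), 0.95)) + 1
Z_reduced = Z[:, :k]                                       # keep the left-most, high-variance columns
```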
30
how does PCA help?
1. cleaner signal 2. decreased transformation load -> reduced time 3. allow visualization of data
31
naive bayes connection to PCA
PCA decorrelates the features, making the naive Bayes independence (IID) assumption approximately true
32
greedy layer-wise pretraining
  • helps with supervised learning
  • the only objective of each layer is to train itself
  • only requires a few epochs of training to fine-tune afterwards
33
autoencoder
  • J = |X - s(s(XW)W^T)|^2
  • non-linear PCA; a linear activation gives PCA
  • predicts itself (auto = self)
  • squared error works for x < 0 or x > 1; cross-entropy works for 0 ≤ x ≤ 1
  • different biases for the hidden and output layers
  • shared weights (a type of regularization); see the sketch below
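A forward-pass/cost sketch of the shared-weight autoencoder above (sigmoid as s, random placeholder data, no training loop):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_cost(X, W, b_h, b_o):
    Z = sigmoid(X @ W + b_h)            # encode
    X_hat = sigmoid(Z @ W.T + b_o)      # decode with the shared (transposed) weights
    return ((X - X_hat) ** 2).sum()     # J = |X - s(s(XW)W^T)|^2, here with separate biases

X = np.random.rand(100, 20)             # inputs in [0, 1], so cross-entropy would also work
W = 0.1 * np.random.randn(20, 5)
print(autoencoder_cost(X, W, np.zeros(5), np.zeros(20)))
```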
34
types of autoencoders
1. Denoising autoencoder 2. Sparse autoencoder 3. Variational autoencoder (VAE) 4. Contractive autoencoder (CAE)
35
bottleneck network
* find useful and efficient representation * compress V dimension into D in the encoding phase
36
stacking
  • we only care about the "inner" representation
  • discard the 2nd half (the decoder) each time
  • each layer is smaller in size
  1. train an autoencoder on X with hidden-layer output Z
  2. train a 2nd autoencoder on Z
  3. repeat
37
denoising autoencoder
  • an autoencoder trained on input that is missing information
  • attempts to address the identity-function risk by randomly corrupting the input
  • it's like training with dropout
  • missing data is not always represented by 0
  • mask the cost in backpropagation
38
sparse autoencoder
  • when you train a plain autoencoder, the hidden units in the middle layer fire (activate) too frequently, for most training samples
  • sparsity constraint: lower the activation rate so that units only activate for a small fraction of the training examples
  • "sparse" because each unit only activates for a certain type of input
39
how to reproduce missing data
  • use an autoencoder or a similar model
40
recommender systems
  • used by Google, YouTube, Netflix
  • can use RBMs and autoencoders
41
naming convention
  • Recommender systems:
    • N is instances
    • M is movies
    • K is latent variables
  • Typical deep learning:
    • N is instances (same)
    • M is hidden units
    • K is number of classes
    • D is input dimensions
42
boltzmann machine
  • everything is connected to everything
  • only works on trivial problems, unlike the RBM
  • intractable
43
restricted boltzmann machine (RBM)
  • bipartite graph: hidden units are only connected to visible units
  • produces the visible vector from the hidden one and vice versa (see the sketch below)
  • assumes the hidden units (and likewise the visible units) are conditionally independent (IID) given the other layer
  • uses sigmoid (to keep outputs in 0-1)
  • acts much like an autoencoder: finds latent variables, used for greedy pretraining
  • intractable, therefore needs to be approximated
  • squared error for general X; cross-entropy works for 0 ≤ x ≤ 1
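A single Gibbs-step sketch for a Bernoulli RBM (the weight shapes and bias names are assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(v, W, b, c, rng):
    p_h = sigmoid(v @ W + c)                              # p(h = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b)                            # p(v = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((6, 3))                     # 6 visible units, 3 hidden units
v0 = rng.integers(0, 2, size=6).astype(float)
v1, h = gibbs_step(v0, W, np.zeros(6), np.zeros(3), rng)
```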
44
categorical RBM
45
categorical RBM shapes
  • h = M
  • V = D x K
    • D is the number of movies
    • K is the number of classes
  • W = D x K x M
  • b = D x K
  • x = M
46
deep belief network
deep network of stacked RBMs
47
latent variable
variable that is hidden
48
markov random field
  • graph of states; each node is a state
  • the state of each node only depends on its adjacent nodes
49
relaxed Bernoulli constraint
in an RBM, either 1. threshold the values, or 2. let the values go to whatever they want
50
hidden markov model
  • unsupervised learning method
  • models sequences and classifies them
  • modified using Bayes' rule to create a separate model for each class
  • choose the class with the highest likelihood
51
HMM applications
  1. recognizing a male or female voice
  2. NLP language modeling: "macbook was created by ___" via p(x | x-, x--)
  3. stock prediction
  4. SEO and bounce rate optimization
  5. speech to text: sounds to words
52
markov property
  • p(x | all previous states) = p(x | x-)
  • a strong assumption
53
joint probability
  • chain rule of probability combined with the Markov property
  • p(s3, s2, s1) = p(s3|s2) p(s2|s1) p(s1)
54
markov order
  • 1st order: p(y2 | y1)
  • 2nd order: p(y3 | y2, y1)
  • 3rd order: p(y4 | y3, y2, y1)
55
markov model
  • M states
  • π = 1 x M vector of initial state probabilities
  • A = M x M transition weights
  • A(i, :).sum() = 1
56
smoothing
  • add smoothing to account for unobserved values
  • p = (count(x) + λ) / (N + λµ0)
  • can smooth average ratings too (see the sketch below)
  • r = (∑ xi + λµ0) / (N + λ)
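A sketch of the smoothed average rating r = (∑ xi + λµ0) / (N + λ); the values of λ and µ0 here are made-up examples:

```python
import numpy as np

def smoothed_rating(ratings, lam=5.0, mu0=3.0):
    ratings = np.asarray(ratings, dtype=float)
    return (ratings.sum() + lam * mu0) / (len(ratings) + lam)

print(smoothed_rating([5, 5]))       # a few 5-star ratings get pulled toward the prior mean mu0
print(smoothed_rating([5] * 200))    # many ratings overwhelm the prior
```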
57
HMM model
  1. π(i) = probability of starting at state i
  2. A(i, j)
    1. state transition matrix
    2. probability of going to state j from state i
    3. p(tag1 | tag0)
    4. if each row sums to 1, A is a Markov matrix
  3. B(j, k)
    1. observation probability matrix
    2. probability of observing k given state j
    3. p(word | tag)
58
double embedded
  • layer 1: Markov model
  • layer 2: choose observation
59
forward, forward-backward algorithm
60
backward, forward-backwards algorithm
61
forward backwards code
62
viterbi algorithm
  • find the most probable hidden state sequence given the observed sequence
  • the same as the forward-backward algorithm except taking the max instead of the sum
  • the delta variable is like alpha
  • psi is similar to delta but uses argmax and has no B term
  • have to backtrack through the states (see the sketch below)
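A log-space Viterbi sketch matching the description above (π, A, B as in the HMM-model card; log probabilities are an implementation choice to avoid underflow):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    T, M = len(obs), len(pi)
    delta = np.zeros((T, M))                      # best log-probability ending in each state
    psi = np.zeros((T, M), dtype=int)             # argmax back-pointers (no B term)
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])                  # initialization
    for t in range(1, T):                                         # recursion: max instead of sum
        scores = delta[t - 1][:, None] + np.log(A)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    states = np.empty(T, dtype=int)                               # termination + backtracking
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states

# toy example: 2 hidden states, 3 possible observations
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, obs=[0, 1, 2, 2]))
```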
63
viterbi initialization
64
viterbi recursion
65
viterbi termination
66
Baum-Welch algorithm
  • similar in spirit to Gaussian mixture models
  • uses expectation-maximization (an iterative algorithm) instead of direct maximum likelihood
  • review at some other date; it is a very expensive algorithm and should be avoided if possible
67
variational autoencoder
  • Bayesian ML + deep learning
68
generative adversarial network (GAN)
  • 2 networks competing against each other
69
generative versus discriminitive
  • generative: models P(X|Y)
  • discriminative: models P(Y|X)
  • with discriminative models it is hard to tell what is happening with the weights; not satisfactory to statisticians
  • research shows that discriminative models tend to work better
70
types of recommender systems
  • autorec: uses an autoencoder to fill in (recommend) missing ratings
  • non-personalized:
    • shows popular items
    • uses confidence
    • takes time into account
    • product associations
  • collaborative filtering: what YOU like and what OTHERS similar to you like
71
matrix factorization
  • an example is SVD: X = W S U^T
    • S is a diagonal matrix of the variances; more variance = more important
  • we only want the squared error where the rating is known
  • break X up into components so fewer parameters are stored
  • use K of roughly 50-100
  • A_ij = log(X_ij) ≈ w_i^T u_j
72
account for age in recommender system?
  • hacker news: penalty * (ups - downs - 1)^0.8 / (age + 2)^gravity
    • gravity = 1.8, age in hours
    • business rules: penalize self posts, controversial posts, and posts from popular websites
  • reddit: sign(ups - downs) * log(max(1, |ups - downs|)) + age/45000
    • age in seconds
  • (see the sketch below)
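A small sketch of both scores (the base-10 log for the reddit-style formula and positive net votes for the HN-style one are assumptions):

```python
import numpy as np

def hn_score(ups, downs, age_hours, penalty=1.0, gravity=1.8):
    # assumes ups - downs >= 1 so the 0.8 power stays real
    return penalty * (ups - downs - 1) ** 0.8 / (age_hours + 2) ** gravity

def reddit_score(ups, downs, age_seconds):
    diff = ups - downs
    return np.sign(diff) * np.log10(max(1, abs(diff))) + age_seconds / 45000

print(hn_score(ups=120, downs=4, age_hours=3))
print(reddit_score(ups=120, downs=4, age_seconds=8 * 3600))
```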
73
ranking with big systems
  • discard users with few movies in common
  • sum over nearest neighbors
    • take the ones with the highest weights, say k = 25 to 50
    • closest absolute correlation or closest raw correlation
  • or use everybody
74
training matrix factorization
  • J = ∑ (r_ij - w_i^T u_j)^2, summed over the known ratings
  • setting dJ/dw_i = 0 gives w_i = (∑_j u_j u_j^T)^(-1) ∑_j r_ij u_j, with j ranging over Ψ_i (items rated by user i)
  • setting dJ/du_j = 0 gives u_j = (∑_i w_i w_i^T)^(-1) ∑_i r_ij w_i, with i ranging over Ω_j (users who rated item j)
  • but really r_ij ≈ w_i^T u_j + b_i + c_j + µ
    • b_i = average over j of (r_ij - w_i^T u_j - c_j - µ)
    • c_j = average over i of (r_ij - w_i^T u_j - b_i - µ)
    • µ = global average
  • (see the ALS sketch below)
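An alternating-least-squares sketch of the two updates above (biases omitted; the small ridge term `lam` is an addition for numerical stability, not part of the card's formulas):

```python
import numpy as np

def als_step(R, known, W, U, lam=0.1):
    # R: (N, M) ratings, known: boolean mask, W: (N, K) user vectors, U: (M, K) item vectors
    N, M = R.shape
    K = W.shape[1]
    for i in range(N):                                   # w_i = (sum_j u_j u_j^T)^-1 sum_j r_ij u_j
        js = np.where(known[i])[0]                       # j in Psi_i
        if len(js):
            W[i] = np.linalg.solve(U[js].T @ U[js] + lam * np.eye(K), R[i, js] @ U[js])
    for j in range(M):                                   # u_j = (sum_i w_i w_i^T)^-1 sum_i r_ij w_i
        is_ = np.where(known[:, j])[0]                   # i in Omega_j
        if len(is_):
            U[j] = np.linalg.solve(W[is_].T @ W[is_] + lam * np.eye(K), R[is_, j] @ W[is_])
    return W, U
```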