L9: Unsupervised machine learning Flashcards

1
Q

Unsupervised ML

A

What: Works without a target outcome. It's more concerned with identifying associations within the data.

Clustering:
Ability to process a set of previously unknown data points and create groups of them based on similarities

2
Q

Cluster analysis

A

Divides data into clusters
Similar items are placed in the same cluster
* Intra-cluster differences are minimized
* Inter-cluster differences are maximized

3
Q

What can you do with clusters

A

1) Looking into the results themselves can yield insights
* Customer segmentation
2) It can be input for predictive models, so cluster labels become target variables
for classification, e.g. you can use the distance to the closest cluster to classify a
new observation (see the sketch below)
3) Anomaly detection
4) Recommender systems
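A minimal sketch (scikit-learn assumed, synthetic data) of point 2: once a clustering is fitted, a new observation can be labelled with its closest cluster, and the distances to the centroids can be used as features for a downstream model.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # pretend these are customer features

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

new_obs = np.array([[0.5, -1.2]])
print(kmeans.predict(new_obs))           # label of the closest cluster
print(kmeans.transform(new_obs))         # distances to each centroid (usable as features)
```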

4
Q

K MEANS

A

0: Initialize the k centroids (randomly, see the next card)

1: Assign each point to its closest centroid

2: Recompute the centroids as the mean of the points assigned to them

Repeat steps 1 and 2 until the assignments stop changing
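A bare-bones NumPy sketch of the two alternating k-means steps (assumed details: random initialization from the data, a fixed number of iterations, empty clusters keep their old centroid).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 0: initialize
    for _ in range(n_iter):
        # step 1: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: recompute each centroid as the mean of its assigned points
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids
```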

5
Q

IN PRACTICE

A

Because k-means depends on random initialization, you might end up with a
suboptimal solution depending on the data at hand.
To avoid that: rerun the clustering (a rule of thumb says 50-100 times, but of course it
also depends on the data and the time available) and choose the solution with the lowest cost
function.
TL;DR: just because you got your algorithm to converge doesn't mean you
found the best way of clustering your data.
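A sketch of the rerun-and-keep-the-best idea (scikit-learn assumed): KMeans' n_init parameter runs the algorithm that many times from different random initializations and keeps the solution with the lowest cost (inertia). The 50 here is just the rule-of-thumb figure from the card.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

best = KMeans(n_clusters=4, n_init=50, random_state=0).fit(X)
print(best.inertia_)  # cost of the best of the 50 runs
```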

6
Q

HOW DO YOU DECIDE HOW MANY
CLUSTERS??

A

ELBOW METHOD: plot the within-cluster sum of squares for a range of k and pick the k at the "elbow", where adding more clusters stops giving a large improvement.
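A minimal elbow-method sketch (scikit-learn and matplotlib assumed, synthetic data): fit k-means for several values of k and look for the bend in the inertia curve.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.show()
```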

7
Q

SIMILARITY

A

Euclidean distance
Jaccard similarity

8
Q

How to choose a similarity measure?

A

1) Domain knowledge – e.g. for text we use cosine
2) Type of variable – Jaccard for nominal, etc.
3) If in doubt, use Euclidean
4) But always check!
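A small sketch of the measures mentioned above, using scipy.spatial.distance: Euclidean and cosine for numeric vectors, Jaccard for binary/nominal indicators (the vectors here are made up for illustration).

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 5.0])
print(distance.euclidean(a, b))      # Euclidean distance
print(1 - distance.cosine(a, b))     # cosine similarity (1 - cosine distance)

u = np.array([1, 0, 1, 1, 0], dtype=bool)
v = np.array([1, 1, 1, 0, 0], dtype=bool)
print(1 - distance.jaccard(u, v))    # Jaccard similarity (1 - Jaccard distance)
```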

9
Q

HIERARCHICAL CLUSTERING

A

See slide: build a tree (dendrogram) of clusters by successively merging the most similar clusters, given a similarity measure and a linkage method.
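A minimal SciPy sketch of agglomerative hierarchical clustering (matplotlib and synthetic data assumed): compute the linkage matrix, draw the dendrogram, and cut the tree into a chosen number of clusters.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="average", metric="euclidean")  # choose linkage + similarity measure
dendrogram(Z)
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")       # cut the tree into 3 clusters
print(labels)
```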

10
Q

Linking clusters: Four linking methods

A
  • Complete method
  • Single method
  • Average method
  • Centroid method
11
Q
  • Complete method
  • Single method
  • Average method
  • Centroid method
A

Linking clusters: Four linking methods

12
Q

Complete method:

A

o Pairwise dissimilarity between all observations in cluster 1 and cluster 2
o Uses the largest of these dissimilarities
o Tends to produce more balanced trees
o Sensitive to noise

13
Q

Single method:

A

o Pairwise dissimilarity between all observations in cluster 1 and cluster 2
o Uses the smallest of these dissimilarities
o Tends to produce more unbalanced trees

14
Q

Average method:

A

o Pairwise dissimilarity between all observations in cluster 1 and cluster 2
o Uses the average of these dissimilarities
o Tends to produce more balanced trees
o Often considered the best choice

15
Q

Centroid method:

A

o Finds the centroid of cluster 1 and the centroid of cluster 2
o Uses the dissimilarity between the two centroids
o Can cause inversion problems (the merge dissimilarity need not increase as you move up the tree) and violates the fundamental assumption that small clusters are more coherent than large clusters

16
Q

CONSIDER SCALING!

A

o Data on different scales can cause undesirable results in clustering methods
o The solution is to scale the data so that all features have the same mean (0) and standard
deviation (1)
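A small sketch of scaling before clustering (scikit-learn assumed): after StandardScaler, every feature has mean 0 and standard deviation 1, so no single feature dominates the distance calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])          # features on very different scales

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~0 and 1 per feature
```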

17
Q

BUT HOW DO WE VALIDATE CLUSTERING

A

Clustering:
o Within-cluster sum of squares (scree/elbow plot)
o Silhouette width
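A sketch of the silhouette width as a validation measure (scikit-learn assumed, synthetic data): values close to 1 mean observations sit well inside their own cluster and far from the others.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # average silhouette width over all observations
```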

18
Q

Anomaly detection

A

Manufacturing example:
You have data on two features for engine
performance – Vibration and Heat (!?)
You know that these engines haven’t failed
Can you figure out if a new engine might?
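One common way to turn this into a model (a sketch, not necessarily the lecture's exact method): fit a Gaussian density to the healthy engines' Vibration/Heat readings and flag a new engine as anomalous if its density falls below a threshold. The data and the epsilon threshold here are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
healthy = rng.normal(loc=[2.0, 70.0], scale=[0.3, 5.0], size=(500, 2))  # vibration, heat

mu = healthy.mean(axis=0)
cov = np.cov(healthy, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov)

epsilon = 1e-4                            # threshold, e.g. tuned on a few labeled cases
new_engine = np.array([3.5, 95.0])
print(density.pdf(new_engine) < epsilon)  # True -> looks anomalous
```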

19
Q

ANOMALY DETECTION VS SUP.
LEARNING

A

Anomaly detection:
- Small number of positive examples (fraud instances), e.g. 0 to 20
- Large number of negative examples (people who don't commit fraud)
- Many different types of anomalies – future anomalies might look nothing like any of the anomalous examples we've seen so far
- Typical example: FRAUD

Supervised learning:
- Large number of both positive and negative examples
- Enough positive examples for an algorithm to get a sense of what positives are like; future positive examples are likely to be similar to the ones in the training set
- Typical example: SPAM

20
Q

RECOMMENDER SYSTEMS

A

Popularity:
- Recommend the most popular or trending item(s) to everyone

Content-based:
- Items are similar if their attributes are similar
- Often hand-engineered (domain-specific) attributes

Collaborative filtering:
- Recommends items chosen by similar users
- Domain-free
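A toy user-based collaborative-filtering sketch (my own illustration, not a specific library's recommender): similar users are found with cosine similarity on a small ratings matrix, and unseen items are scored by similarity-weighted ratings.

```python
import numpy as np

# rows = users, columns = items; 0 means "not rated"
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

user = 0
sims = np.array([cosine_sim(R[user], R[other]) for other in range(len(R))])
sims[user] = 0.0                          # ignore self-similarity

scores = sims @ R / (sims.sum() + 1e-9)   # similarity-weighted ratings
scores[R[user] > 0] = -np.inf             # only recommend items the user hasn't rated
print("recommend item", scores.argmax())
```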

21
Q

SUM UP: CLUSTER ANALYSIS
PIPELINE

A

1) Data clean-up:
* Standardize/scale your variables
* Pay attention to outliers! (clustering forces all observations into a cluster)
2) Choose the type of clustering technique
* For hierarchical clustering:
  * Choose a similarity measure
  * Choose a linkage type
* For k-means:
  * Choose k
3) Cluster
4) Make sense of the clusters
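A minimal end-to-end sketch of this pipeline (scikit-learn assumed, synthetic data, k-means with k=3 as the chosen technique): scale, cluster, then inspect the clusters via their centroids and sizes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# 1) clean up: standardize the variables (and check for outliers beforehand)
X_scaled = StandardScaler().fit_transform(X)

# 2) + 3) choose the technique (k-means, k=3 here) and cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# 4) make sense of the clusters, e.g. via centroids and cluster sizes
print(kmeans.cluster_centers_)
print(np.bincount(kmeans.labels_))
```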