G6. Unsupervised learning Flashcards

1
Q

What is unsupervised learning? Explain its general principle.

A

Unsupervised learning is a machine learning technique in which the data consists of a set of input vectors X without any corresponding output values Y. The user does not need to supervise the model; it discovers patterns on its own. General principle (as in clustering): observe, identify patterns, classify patterns and draw conclusions.

2
Q

What is the objective of unsupervised learning?

A

Find hidden structure in unlabeled data.

3
Q

What type of questions can unsupervised learning methods answer? Give examples or use cases.

A

Clustering
Dimensionality reduction
Outlier detection
Novelty criteria

4
Q
A
5
Q

Data points are partitioned into groups. (UL)

A

Clustering

6
Q

Dimensionality reduction (UL)

A

Find the variables that are actually representative of an observation, for example its principal components.
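As a rough illustration, assuming "principal component" here refers to principal component analysis, the sketch below finds the first principal direction of a small 2-D data set and projects each observation onto it, turning two variables into one. The function name, the closed-form angle formula for the 2x2 covariance matrix, and the sample points are illustrative choices, not part of the card.

```python
import math

def first_principal_component_2d(points):
    """Return the first principal direction of 2-D points and the 1-D scores."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Centre the data so the covariance is measured around the mean.
    centred = [(x - mean_x, y - mean_y) for x, y in points]
    var_x = sum(dx * dx for dx, _ in centred) / n
    var_y = sum(dy * dy for _, dy in centred) / n
    cov_xy = sum(dx * dy for dx, dy in centred) / n
    # For a symmetric 2x2 covariance matrix, the major axis has this angle.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))
    # Project each centred observation onto the principal direction.
    scores = [dx * direction[0] + dy * direction[1] for dx, dy in centred]
    return direction, scores

# Hypothetical sample: points lying exactly on the line y = 2x.
sample = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component_2d(sample)
print(direction, scores)
```

Because the sample lies on a straight line, a single score per observation keeps all of the variation in the data.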

7
Q

Find unusual events that distinguish part of the data from the rest according to certain criteria (UL)

A

Outlier detection
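One possible criterion, shown only as a sketch: flag an observation when its distance to the centre of the data is much larger than the typical distance. The function name, the use of component-wise medians as the centre, and the factor of 3 are assumptions made for illustration.

```python
import math
import statistics

def flag_outliers(points, factor=3.0):
    """Flag points whose distance to the data centre exceeds factor * median distance."""
    # Component-wise median as a centre that is not dragged by extreme values.
    centre = tuple(statistics.median(axis) for axis in zip(*points))
    distances = [math.dist(p, centre) for p in points]
    cutoff = factor * statistics.median(distances)
    return [d > cutoff for d in distances]

# Hypothetical sample: four similar observations and one unusual one.
sample = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (10.0, 10.0)]
flags = flag_outliers(sample)
print(flags)  # only the last point is flagged
```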

8
Q

Deals with cases when changes occur in the data (UL)

A

Novelty criteria

9
Q

High intraclass similarity

A

Same cluster

10
Q

Low interclass similarity

A

Different cluster

11
Q

Observations are records of variables, so each one can be treated as a vector in a multidimensional space where every dimension is a variable. With gender, age and income, the space has 3 dimensions and every observation is a 3-value vector (vector = observation contained in the data collection). Using some strategy (for example, at random), k vectors are chosen as representatives of the data collection; these are called centroids. For every remaining observation, the distance (e.g. Euclidean) to each centroid is computed, and the observation joins the cluster of the closest centroid. Each centroid is then recomputed as the mean of its cluster, and the assignment is repeated until the clusters stop changing. Once this is done for every element of the data collection, we end up with clusters. The quality of the clustering depends on how close the observations are to their centroid (the closer, the better the case for calling them one cluster) and how far the clusters are from each other (the bigger the separation, the better the case for calling them different clusters). Standard K-Means needs no threshold for "close" or "far", because every observation is assigned to its nearest centroid; a threshold only matters for deciding whether observations that lie far from every centroid should be treated as outliers.
How to define the number of clusters or identify the initial centroids: at random or by expert knowledge.

A

The general principle of the K-Means clustering algorithm
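The principle on this card can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the random choice of initial centroids, the fixed seed and the sample points are all assumptions, and production code would normally rely on a library.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iter=100, seed=0):
    """Cluster points (tuples) into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Pick k observations at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical sample: two well-separated groups of observations.
sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(sample, k=2)
print(labels, centroids)
```

The number of clusters k is an input here, which matches the note that it has to be chosen at random or by an expert before the algorithm runs.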

12
Q

Explain which measures can be used to assess the result of applying such an algorithm to data.

A

The quality of a clustering depends on how close the observations are to their centroid (the closer they are, the better the case for calling them the same cluster) and how far the clusters are from each other (the bigger the separation, the better the case for calling them different clusters).
How to measure: Euclidean distance, inertia (the within-cluster sum of squared distances).
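A small sketch of both measures, assuming the clusters are already given as lists of points: inertia sums the squared Euclidean distances of each point to its own centroid (lower means tighter clusters), and the smallest distance between centroids indicates how well separated the clusters are. The function names and sample groupings are hypothetical.

```python
import math

def centroid(points):
    """Mean vector of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def inertia(clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for members in clusters:
        c = centroid(members)
        total += sum(math.dist(p, c) ** 2 for p in members)
    return total

def min_centroid_separation(clusters):
    """Smallest Euclidean distance between any two cluster centroids."""
    cents = [centroid(m) for m in clusters]
    return min(
        math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:]
    )

# Two hypothetical groupings of the same four observations.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
poor = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
print(inertia(good), min_centroid_separation(good))  # 4.0 10.0
print(inertia(poor), min_centroid_separation(poor))  # 100.0 2.0
```

The first grouping has low inertia and a large separation, while the second has the opposite, so the first is the better clustering by both measures.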

13
Q

What is the role of visualisation of results of the K-Means algorithm applied to a data collection?

A
  1. Show how data points are assigned to different clusters
  2. Illustrate the position of cluster centroids
  3. Determine the optimal number of clusters (k)
  4. Evaluate the quality of clustering
  5. Understand the characteristics of each cluster
  6. For time-series or spatial data, visualization helps in understanding the evolution or distribution of clusters over time or space