Class Two Flashcards

1
Q

What is unsupervised machine learning?

A

Unsupervised machine learning is a type of machine learning where the algorithm learns patterns and structures in the data without being provided with explicit labels or target variables.

2
Q

What is K-Means clustering?

A

K-Means clustering is an unsupervised machine learning algorithm used for partitioning data into K clusters based on similarity. It aims to minimize the sum of squared distances between data points and their cluster centroids.
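
A minimal sketch of the idea using scikit-learn's KMeans on synthetic data (the make_blobs dataset, K = 3, and the random seeds are illustrative assumptions, not from the slides):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic 2-D data with three well-separated groups (illustrative only).
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Partition into K=3 clusters; K-Means minimizes the within-cluster
    # sum of squared distances to the centroids (the "inertia").
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

    print(kmeans.labels_[:10])       # cluster index assigned to each point
    print(kmeans.cluster_centers_)   # coordinates of the three centroids
    print(kmeans.inertia_)           # the quantity K-Means tries to minimize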

3
Q

What are the advantages of K-Means clustering?

A

Advantages of K-Means clustering include its simplicity, scalability to large datasets, and effectiveness in identifying well-separated spherical clusters.

  • Advantages:
    • Easy to implement
    • Scales well to large datasets
    • Few hyperparameters (essentially just the number of clusters K)
    • Guaranteed to converge
  • Disadvantages:
    • The number of clusters K must be chosen in advance
    • Sensitive to centroid initialization and to outliers
    • Assumes roughly spherical, similarly sized clusters
4
Q

When should you use K-Means clustering?

A

K-Means clustering is suitable when the data is continuous and there is a need to partition it into distinct groups based on similarity or proximity. It is useful for:

  1. Customer segmentation: grouping customers by purchasing behavior or
    website activity so each segment can be targeted separately.
  2. Data analysis: exploring a new dataset by clustering it and examining
    each cluster on its own.
  3. Image segmentation / color quantization: grouping pixels by color to
    reduce an image to K representative colors.
  4. Anomaly detection: instances that lie far from every centroid are
    candidate outliers.
  5. Semi-supervised learning: clustering unlabeled data and propagating the
    labels of a few labeled instances within each cluster.
5
Q

What are the limitations of K-Means clustering?

A

Limitations of K-Means clustering include sensitivity to the initial placement of cluster centroids, the requirement to specify the number of clusters in advance, and the assumption of spherical clusters.

From slides:

  • Necessary to run the algorithm several times to avoid sub-optimal solutions.
  • Need to specify the number of clusters.
  • Does not behave very well when the clusters have:
    • Varying sizes
    • Different densities
    • Non-spherical shapes
  • Must scale input features before running K-Means; scaling the features
    improves performance (see the sketch below).
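
A small sketch of the two mitigations above, scaling the features and restarting K-Means several times; the StandardScaler pipeline, the data, and n_init=10 are illustrative assumptions:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # Scale features so no single dimension dominates the distance computation,
    # then let K-Means restart from 10 random initializations and keep the best
    # (lowest-inertia) solution.
    model = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=4, n_init=10, random_state=0),
    )
    labels = model.fit_predict(X)
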
6
Q

What is DBSCAN clustering?

A

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised machine learning algorithm that groups data points into clusters based on density. It can find clusters of arbitrary shapes and handle outliers.
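
A brief sketch with scikit-learn's DBSCAN on two interleaving half-moons, a non-spherical shape that K-Means handles poorly; the eps and min_samples values are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two crescent-shaped clusters plus a little noise (illustrative data).
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

    # Points with at least min_samples neighbors within eps form dense regions;
    # points belonging to no dense region are labeled -1 (noise/outliers).
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)

    print(np.unique(db.labels_))   # e.g. [0 1], or [-1 0 1] if noise is found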

7
Q

What are the advantages of DBSCAN clustering?

A

Advantages of DBSCAN clustering include its ability to discover clusters of various shapes, its robustness to noise and outliers, and the ability to determine the number of clusters automatically.

8
Q

When should you use DBSCAN clustering?

A

DBSCAN clustering is suitable when the data has varying density, there are irregularly shaped clusters, and when noise or outliers need to be identified.

9
Q

What are the limitations of DBSCAN clustering?

A

Limitations of DBSCAN clustering include sensitivity to the choice of its distance parameters (eps and the minimum number of points), difficulty in handling data whose clusters have widely varying densities, and degraded performance on high-dimensional data, where distance-based density estimates become less meaningful.

10
Q

What is hierarchical clustering?

A

Hierarchical clustering is an unsupervised machine learning algorithm that creates a hierarchy of clusters. It iteratively merges or divides clusters based on their similarity, forming a tree-like structure called a dendrogram.
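
A short sketch of agglomerative (bottom-up) hierarchical clustering with SciPy, building the linkage matrix that a dendrogram visualizes; Ward linkage and the synthetic data are illustrative assumptions:

    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

    # Agglomerative clustering: repeatedly merge the two closest clusters;
    # Ward linkage merges the pair that least increases within-cluster
    # variance. Z encodes the full merge tree.
    Z = linkage(X, method="ward")

    # Cut the tree to obtain a flat clustering with 3 clusters.
    labels = fcluster(Z, t=3, criterion="maxclust")

    # scipy.cluster.hierarchy.dendrogram(Z) draws the tree-like structure
    # described in the answer.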

11
Q

What are the advantages of hierarchical clustering?

A

Advantages of hierarchical clustering include its ability to reveal the hierarchical structure of the data, its flexibility in handling different similarity measures, and the visualization provided by dendrograms.

12
Q

When should you use hierarchical clustering?

A

Hierarchical clustering is suitable when the data has a hierarchical structure, and the goal is to explore relationships and similarities at different levels of granularity.

13
Q

What are the limitations of hierarchical clustering?

A

Limitations of hierarchical clustering include its computational complexity for large datasets, sensitivity to the choice of distance or similarity measures, and difficulty in handling noise and outliers.

14
Q

How do you determine the optimal number of clusters in K-Means clustering?

A

The optimal number of clusters in K-Means clustering can be determined using techniques such as the elbow method, silhouette analysis, or visual inspection of cluster quality.
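
A sketch of the elbow method: fit K-Means for a range of K values and look for the point where the inertia stops dropping sharply (the data and the K range are illustrative assumptions):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # Inertia (within-cluster sum of squares) always drops as K grows; the
    # "elbow" where the drop levels off is a common heuristic for choosing K.
    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1))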

15
Q

What is the silhouette coefficient used for in clustering?

A

The silhouette coefficient is a measure of how well each data point fits into its assigned cluster in terms of both cohesion and separation. It ranges from -1 to 1, where higher values indicate better clustering quality.
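
For each point, cohesion is its mean distance a to the other points in its own cluster and separation is its mean distance b to the points of the nearest other cluster, giving s = (b - a) / max(a, b). A small sketch computing the mean silhouette with scikit-learn (the data and K values are illustrative assumptions):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # Mean silhouette coefficient over all points for several values of K;
    # values near 1 indicate tight, well-separated clusters.
    for k in (2, 3, 4, 5, 6):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))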

16
Q

What is the difference between K-Means and Hierarchical Clustering?

A

K-Means Clustering is a partitioning-based algorithm that requires specifying the number of clusters in advance, while Hierarchical Clustering is an agglomerative or divisive algorithm that creates a hierarchy of clusters without the need for a predetermined number of clusters.

17
Q

What are the Three C’s of ML?

A

Three C’s of ML:
1. Collaborative filtering: a technique for making recommendations.
2. Clustering: algorithms that discover structure in collections of data.
3. Classification: a form of ‘supervised’ learning.