Final Exam: Predictive Analytics Flashcards
What is Clustering?
The process of grouping a set of data points into Clusters.
What is Intra-Cluster Distance?
The distance between data points within the same cluster. It should be small: data within the same cluster will be similar.
What is inter-cluster distance?
The distance between data points in different clusters. It should be large: data from different clusters will be different.
What are the inputs and outputs of clustering?
Input: Unlabeled data.
Output: A cluster assignment for each data point.
Is Clustering Supervised or Unsupervised Learning?
Unsupervised Learning
What are the motivations of clustering?
- Clustering breaks a large heterogeneous population into homogeneous subgroups
- Common step in the exploratory data analysis (EDA) to gain insights
- Useful to reduce data dimension when analyzing high-dimensional data
What are some applications of clustering?
Targeted advertising, recommendation, and fraud detection.
What is k-means clustering?
A clustering algorithm that partitions a data set into K distinct, non-overlapping clusters.
What is the process for the K-means algorithm?
An iterative process that alternates between updating the cluster centroids and updating the group assignments.
1. Randomly assign each point to one of the K clusters
2. Iterate until (1) no change in assignments or (2) reaching the pre-set max number of iterations:
   a. Compute the centroid (mean) of each cluster
   b. Reassign each point to the cluster with the nearest centroid
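The loop of centroid updates and reassignments can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the `seed` parameter, the toy data, and the choice to re-seed an empty cluster with a random point are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (assignments, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: randomly assign each point to one of the K clusters.
    assignments = [rng.randrange(k) for _ in points]
    centroids = [None] * k

    # Step 2: iterate until assignments stop changing or max_iter is reached.
    for _ in range(max_iter):
        # 2a. Recompute each centroid from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids[c] = centroid(members) if members else rng.choice(points)

        # 2b. Reassign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments

    return assignments, centroids


# Hypothetical toy data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = k_means(data, k=2)
print(labels)
print(centers)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets the number 0 and which gets 1 depends on the random start.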
What is the Euclidean Distance?
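The straight-line distance between two points: the square root of the sum of squared differences across all features. For the k-th point and l-th point with p features:
d(xk, xl) = sqrt((xk1 - xl1)^2 + (xk2 - xl2)^2 + ... + (xkp - xlp)^2)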
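As a quick check of the definition, here is a small Python sketch. The function name `euclidean_distance` and the example points are chosen for illustration only.

```python
import math


def euclidean_distance(x_k, x_l):
    """Square root of the summed squared differences across all features."""
    if len(x_k) != len(x_l):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_k, x_l)))


# A 3-4-5 right triangle: differences of 3 and 4 give a distance of 5.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

In Python 3.8 and later, `math.dist` computes the same quantity.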
What are some practical details of K-means clustering?
- K is a hyperparameter, which is pre-set before running the algorithm.
- Results may vary with the random initialization. Run the algorithm multiple times and select the most reasonable result.
- A good clustering produces high-quality clusters where (1) Intra-cluster distance is small (2) Inter-cluster distance is large (3) Information gained is meaningful and can be applied
What are some limitations of k-means clustering?
K-means tends to perform poorly when the true clusters have:
- Different sizes
- Different densities
- Non-spherical shapes
What is hierarchical clustering?
A clustering method that produces a tree-based representation of the data points, called a dendrogram.
What are the characteristics of a dendrogram?
- A dendrogram is built (interpreted) starting from the leaves at the bottom.
- Each leaf contains one data point
- Moving up, some leaves fuse into branches
- Fused leaves are “similar” to each other
- The earlier (lower in the tree) a fusion occurs, the more similar the fused leaves are
How do you choose the number of clusters in a dendrogram?
Make a horizontal cut across the dendrogram. The number of branches the cut crosses is the number of clusters.
What are some practical details of hierarchical clustering?
- Similar to K-means, we need to choose how to measure the distance between points (Euclidean or correlation-based). We also need to choose how to measure the distance between clusters, called the linkage (complete, single, or average).
- It is part of EDA: understand the implications of each cut, and choose the one that best serves the core problem.
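The linkage choice and the effect of a cut can be sketched with a small bottom-up (agglomerative) procedure in plain Python. This is an illustrative sketch under assumptions: the names `agglomerative` and `LINKAGES` and the toy data are made up for the example, and Euclidean distance is used between points.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Linkage: how pairwise point distances are summarized between two clusters.
LINKAGES = {
    "single": min,                         # closest pair
    "complete": max,                       # farthest pair
    "average": lambda d: sum(d) / len(d),  # mean over all pairs
}


def agglomerative(points, n_clusters, linkage="complete"):
    """Fuse the two closest clusters until n_clusters remain.

    Stopping at n_clusters plays the role of cutting the dendrogram
    at a height that leaves that many branches.
    """
    summarize = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # each leaf is one point

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [euclidean(points[a], points[b])
                         for a in clusters[i] for b in clusters[j]]
                d = summarize(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # fuse the closest pair
        del clusters[j]

    return [sorted(c) for c in clusters]


# Hypothetical toy data: two tight pairs and one isolated point.
data = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(data, 3, linkage="complete"))  # [[0, 1], [2, 3], [4]]
print(agglomerative(data, 2, linkage="single"))    # [[0, 1, 2, 3], [4]]
```

The clusters are lists of point indices. Asking for fewer clusters corresponds to a higher cut, and different linkages can fuse clusters in a different order.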
What key attributes are we investigating in regards to clusters in RapidMiner?
- Average within-cluster distance: intra-cluster (The smaller, the better)
- Davies-Bouldin Index (DBI): A combination of intra- and inter-cluster distances (The smaller, the better)
- The purity of each cluster according to a certain variable (The purer, the better)
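To make the last two measures concrete, here is a plain-Python sketch of the textbook Davies-Bouldin index and a simple purity score. This is an assumption-laden illustration: it is not confirmed to match how RapidMiner computes or reports these values, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def davies_bouldin(clusters):
    """clusters: list of lists of points (at least two clusters). Lower is better."""
    centers = [centroid(c) for c in clusters]
    # Intra-cluster scatter: mean distance from each point to its centroid.
    scatter = [sum(euclidean(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    worst = []
    for i in range(len(clusters)):
        # Ratio of combined scatter to inter-cluster (centroid) distance.
        ratios = [(scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
                  for j in range(len(clusters)) if j != i]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


def purity(labels):
    """Share of a cluster's members that carry its most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Hypothetical toy clusters: same scatter, different separation.
far_apart = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
print(davies_bouldin(far_apart))       # 0.2
print(davies_bouldin(close_together))  # 0.5
print(purity(["yes", "yes", "yes", "no"]))  # 0.75
```

The better-separated pair of clusters gets the smaller index, which matches "the smaller, the better".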
What is a support vector machine?
A supervised learning algorithm that was developed for classification problems and is often one of the best “out of the box” classifiers.
What is the intuition behind a support vector machine?
To look for a hyperplane separating two classes.
What is a hyperplane?
A decision boundary used for classification and regression. With two features it is the line where f equals 0:
f(x1, x2) = B0 + B1x1 + B2x2 = 0
How is a prediction made for an observation using a hyperplane?
Predictions are made by checking which side of the hyperplane a point falls on.
If f(xi1, xi2) is greater than or equal to 0, then y = 1.
If f(xi1, xi2) is less than 0, then y = -1.
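The sign rule can be written as a few lines of Python. The coefficient values below are hypothetical, chosen only so the boundary is the line x1 + x2 = 4. In practice they would be learned from labeled data.

```python
def f(x1, x2, b0, b1, b2):
    """Value of the linear function B0 + B1*x1 + B2*x2."""
    return b0 + b1 * x1 + b2 * x2


def predict(x1, x2, b0, b1, b2):
    """Return 1 if the point is on or above the hyperplane, else -1."""
    return 1 if f(x1, x2, b0, b1, b2) >= 0 else -1


# Hypothetical coefficients: the hyperplane is the line x1 + x2 = 4.
B0, B1, B2 = -4.0, 1.0, 1.0
print(predict(3, 3, B0, B1, B2))  # 1   (f = 2)
print(predict(1, 1, B0, B1, B2))  # -1  (f = -2)
print(predict(2, 2, B0, B1, B2))  # 1   (f = 0, exactly on the boundary)
```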
What is machine learning?
The science (art) of programming computers so they can learn from data, and generalize/predict unknown information.
What is unsupervised learning?
Learning from data that does NOT contain labels.
Goal: Understand data pattern.
What is supervised learning?
Learning from data that contains labels.
Goal: Predict label.