Final Exam: Predictive Analytics Flashcards

1
Q

What is Clustering?

A

The process of grouping a set of data points into Clusters.

2
Q

What is Intra-Cluster Distance?

A

The distance between data points within the same cluster. In a good clustering it is small: data within the same cluster are similar.

3
Q

What is inter-cluster distance?

A

The distance between data points in different clusters. In a good clustering it is large: data from different clusters are different.

4
Q

What are the inputs and outputs of clustering?

A

Input: Unlabeled data.
Output: A cluster assignment for each data point.

5
Q

Is Clustering Supervised or Unsupervised Learning?

A

Unsupervised Learning

6
Q

What are the motivations of clustering?

A
  1. Clustering breaks a large heterogeneous population into homogeneous subgroups
  2. Common step in the exploratory data analysis (EDA) to gain insights
  3. Useful to reduce data dimension when analyzing high-dimensional data
7
Q

What are some applications of clustering?

A

Targeted advertising, recommendation, and fraud detection.

8
Q

What is k-means clustering?

A

A clustering algorithm that partitions a data set into K distinct, non-overlapping clusters.

9
Q

What is the process for K-Means algorithm?

A

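An iterative process that alternates between updating the cluster centroids and updating the group assignments.
1. Randomly assign each point to one of the K clusters.
2. Iterate until (1) the assignments no longer change or (2) the pre-set maximum number of iterations is reached:
   a. Compute the centroid (mean) of each cluster.
   b. Reassign each point to the cluster whose centroid is closest.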
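
The loop of updating centroids and group assignments can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters, and the choice to let an empty cluster keep its previous centroid are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into at most k groups."""
    rng = random.Random(seed)
    # Step 1: randomly assign each point to one of the K clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_iter):  # stopping rule (2): pre-set max iterations
        # Update step: each centroid becomes the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Assignment step: move each point to the closest existing centroid.
        new_labels = [
            min(
                (c for c in range(k) if centroids[c] is not None),
                key=lambda c: math.dist(p, centroids[c]),
            )
            for p in points
        ]
        if new_labels == labels:  # stopping rule (1): no change in assignments
            break
        labels = new_labels
    return labels, centroids
```

Calling `kmeans(points, k=2)` returns one label per point plus the final centroids. Because the starting assignment is random, different `seed` values can give different results.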

10
Q

What is the Euclidean Distance?

A

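The straight-line distance between two points. For the k-th point xk and the l-th point xl, each with p features, take the difference in each feature, square it, sum the squares, and take the square root of the total:
d(xk, xl) = sqrt((xk1 - xl1)^2 + (xk2 - xl2)^2 + ... + (xkp - xlp)^2)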
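
As a small worked example, the distance between two points can be computed as below. The function name `euclidean_distance` is chosen here for illustration only.

```python
import math


def euclidean_distance(xk, xl):
    """Square root of the summed squared feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xk, xl)))


# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```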

11
Q

What are some practical details to K-means clustering?

A
  1. K is a hyperparameter, which is pre-set before running the algorithm.
  2. Results may vary based on the random initial centroid selection. Run the algorithm multiple times and select the most reasonable result.
  3. A good clustering produces high-quality clusters where (1) the intra-cluster distance is small, (2) the inter-cluster distance is large, and (3) the information gained is meaningful and can be applied.
12
Q

What are some limitations to k-means clustering?

A
K-means can produce poor clusters when the natural groups in the data have:
  1. Different sizes
  2. Different densities
  3. Non-spherical shapes
13
Q

What is hierarchical clustering?

A

A clustering method that produces a tree-based representation of the data points, called a dendrogram.

14
Q

What are the characteristics of a dendrogram?

A
  1. A dendrogram is built (and interpreted) starting from the leaves at the bottom.
  2. Each leaf contains one data point.
  3. Moving up, some leaves fuse into branches.
  4. Fused leaves are “similar” to each other.
  5. The earlier (lower in the tree) a fusion occurs, the more similar the fused leaves are.
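
The bottom-up fusing of leaves can be sketched as follows. This is an illustrative sketch only: the function name `agglomerate` is made up for this example, and single linkage (closest pair of points) is assumed as the distance between clusters.

```python
import math
from itertools import combinations


def agglomerate(points):
    """Return the fusions as (height, cluster_a, cluster_b), lowest first."""
    clusters = [[p] for p in points]  # each leaf holds one data point

    def gap(a, b):
        # Assumed linkage: distance between the closest pair of points.
        return min(math.dist(p, q) for p in a for q in b)

    merges = []
    while len(clusters) > 1:
        # Fuse the two most similar clusters next.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((gap(clusters[i], clusters[j]), clusters[i], clusters[j]))
        fused = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [fused]
    return merges


for height, a, b in agglomerate([(0,), (1,), (4,), (9,)]):
    print(height, a, b)
```

The fusion heights come out in increasing order, which mirrors reading the dendrogram from the bottom up: the lowest fusions join the most similar points.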
15
Q

How to choose the number of clusters in a dendrogram?

A

Make a horizontal cut across the dendrogram. The number of branches the cut crosses is the number of clusters.

16
Q

What are some practical details of hierarchical clustering?

A
  1. Similar to K-means, we need to choose how distance is measured: the distance between data points (Euclidean or correlation-based) and the inter-cluster distance, called linkage (complete, single, or average).
  2. It is part of EDA: understand the implications of each cut, and choose the one that best serves the core problem.
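
The three linkage options differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance between points; the function name `linkage_distance` and its `method` argument are hypothetical:

```python
import math
from itertools import product


def linkage_distance(cluster_a, cluster_b, method="complete"):
    """Inter-cluster distance under complete, single, or average linkage."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "complete":  # farthest pair of points
        return max(pairwise)
    if method == "single":  # closest pair of points
        return min(pairwise)
    if method == "average":  # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


a = [(0,), (1,)]
b = [(3,), (5,)]
print(linkage_distance(a, b, "single"))    # 2.0
print(linkage_distance(a, b, "complete"))  # 5.0
print(linkage_distance(a, b, "average"))   # 3.5
```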
17
Q

What key attributes are we investigating in regards to clusters in RapidMiner?

A
  1. Average within-cluster distance: an intra-cluster measure (The smaller, the better)
  2. Davies Bouldin Index (DBI): A combination of intra- and inter- cluster distances (The smaller, the better)
  3. The purity of each cluster according to a certain variable (The purer, the better)
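
To see how the Davies Bouldin Index combines intra- and inter-cluster distances, here is a sketch of its standard textbook definition. The function names are illustrative, and RapidMiner's own operator may report the value with a different sign or scaling, which this sketch does not attempt to reproduce.

```python
import math


def centroid(cluster):
    """Mean of the points in one cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def davies_bouldin(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / separation."""
    centers = [centroid(c) for c in clusters]
    # Intra-cluster part: average distance from each point to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Inter-cluster part: the distance between centroids is the denominator.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


far_apart = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
print(davies_bouldin(far_apart) < davies_bouldin(close_together))  # True
```

Well-separated clusters with the same internal spread get the smaller index, consistent with "the smaller, the better".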
18
Q

What is a support vector machine?

A

A supervised learning algorithm that was developed for classification problems and is often one of the best “out of the box” classifiers.

19
Q

What is the intuition behind a support vector machine?

A

To look for a hyperplane separating two classes.

20
Q

What is a hyperplane?

A

A decision boundary used for classification and regression.
In two dimensions, the hyperplane is the line where f(x1, x2) = B0 + B1x1 + B2x2 = 0.

21
Q

How is a prediction made for an observation using a hyperplane?

A

Predictions are made by checking which side of the hyperplane a point falls on.
If f(xi1, xi2) is greater than or equal to 0, then y = 1.
If f(xi1, xi2) is less than 0, then y = -1.
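
The rule above translates directly into code. The coefficient values used in the example are made up for illustration; in practice they would come from a fitted model.

```python
def f(x1, x2, b0, b1, b2):
    """Value of the hyperplane function at the point (x1, x2)."""
    return b0 + b1 * x1 + b2 * x2


def predict_side(x1, x2, b0, b1, b2):
    """Return 1 if the point is on or above the hyperplane, else -1."""
    return 1 if f(x1, x2, b0, b1, b2) >= 0 else -1


# Hypothetical coefficients: B0 = -1, B1 = 1, B2 = 1.
print(predict_side(2, 2, -1, 1, 1))  # 1, since f = 3
print(predict_side(0, 0, -1, 1, 1))  # -1, since f = -1
```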

22
Q

What is machine learning?

A

The science (art) of programming computers so they can learn from data, and generalize/predict unknown information.

23
Q

What is unsupervised learning?

A

Learning from data that does NOT contain labels.
Goal: Understand data pattern.

24
Q

What is supervised learning?

A

Learning from data that contains labels.
Goal: Predict label.

25
What are some examples of supervised learning?
  1. Detect tumors in brain scans: yes or no?
  2. Recognize a handwritten digit: 0, 1, ..., 9?
  3. Forecast your company's revenue next year, based on many performance metrics.
26
What are some examples of unsupervised learning?
  1. Segment customers based on their purchases so that you can design a different marketing strategy for each segment.
  2. Represent a high-dimensional dataset in a clear and insightful diagram.
  3. Understand the associations between products, such as how likely a customer is to buy milk after buying bread.
27
What is regression?
A supervised learning model utilized when the label variable is numerical.
28
What is classification?
A supervised learning model used when the label variable is categorical.
29
What does the Mean Squared Error tell us?
The average of the squared differences between predicted and actual values, providing a measure of how well a model's predictions align with the real data, with lower values indicating better model fit.
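
A short worked example of the calculation; the function name and the sample values are illustrative only.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between predictions and actual values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Errors are 1, 0, and -2, so the squared errors are 1, 0, and 4.
print(mean_squared_error([3, 5, 2], [2, 5, 4]))  # (1 + 0 + 4) / 3
```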
30
What is training data?
Data used to train the prediction model.
31
What is test data?
Data used to evaluate the prediction model (for example, its accuracy, precision, and recall) on "fresh and unseen" observations that were not used in training.
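
One common way to obtain training and test data is to shuffle the rows and hold some of them back. The sketch below assumes a 70/30 split; the function name, the `test_fraction` parameter, and the seed are hypothetical choices for the example.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of `rows` and hold back a fraction as test data."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The model never sees the held-back rows during training.
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(list(range(10)))
print(len(train), len(test))  # 7 3
```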
32
What kind of model is the right fit for the general data?
A model where both the training error and the test error are low.
33
What kind of model is an overfit for the general data?
A model where the training error is low but the test error is high.
34
What kind of model is an underfit for the general data?
A model where the training error is high and the test error is also high.
35
What is a decision tree?
A decision tree is a popular and versatile prediction model that can perform both classification and regression tasks.
36
Examples of decision nodes?
The root node and the internal nodes.
37
What are branches?
They are lines connecting nodes that specify the possible outcomes.
38
What are leaf nodes?
The terminal nodes of the tree, which hold the model's predictions.
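
The pieces of a decision tree (decision nodes, branches, leaf nodes) can be illustrated with a small hand-built tree. Everything here is hypothetical: the features `safety` and `price`, their values, and the labels are invented for the example and are not learned from data.

```python
# Decision nodes are dicts; leaf nodes are plain strings (the predictions).
tree = {
    "feature": "safety",  # root node
    "branches": {  # each key is one possible outcome
        "low": "unacceptable",  # leaf node
        "high": {
            "feature": "price",  # internal node
            "branches": {"low": "acceptable", "high": "unacceptable"},
        },
    },
}


def predict_tree(node, observation):
    """Follow the branches matching the observation until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][observation[node["feature"]]]
    return node


print(predict_tree(tree, {"safety": "high", "price": "low"}))  # acceptable
```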
39
What is a confusion matrix?
A table that summarizes the performance of a classification model by comparing its predictions against the actual values.
40
What is accuracy?
The percentage of predictions that match the actual labels. Equation: (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
41
What is precision?
The percentage of the model's predicted positives that are truly positive. Equation: TP / (TP + FP)
42
What is recall?
The percentage of the actual positives that the model correctly identifies. Equation: TP / (TP + FN)
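
Accuracy, precision, and recall can all be computed from the four confusion-matrix counts. A minimal sketch with made-up labels; the function names are illustrative.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary label."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)            # 3 1 1 3
print(accuracy(tp, fp, fn, tn))  # 0.75
print(precision(tp, fp))         # 0.75
print(recall(tp, fn))            # 0.75
```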
43
How would you describe precision to someone with no technical background?
When I predict _________, 98% of my predictions are truly __________. When I predict _________, 94% of my predictions are truly __________.
44
How would you describe recall to someone with no technical background?
Among all the acceptable cars, my prediction model can identify 92% of them.
45
What does R Squared tell us in this dataset?
How well a model fits a dataset: the proportion of the variance in the label variable that the model explains. Values close to 1 indicate strong predictive power; values close to 0 indicate weak predictive power.
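
R Squared can be computed as one minus the ratio of the model's squared error to the squared error of always predicting the mean. A short sketch with illustrative names and values:

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1, 2, 3, 4]
print(r_squared(actual, [1, 2, 3, 4]))          # 1.0, a perfect fit
print(r_squared(actual, [2.5, 2.5, 2.5, 2.5]))  # 0.0, no better than the mean
```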