Important 3: Clustering; Classification Flashcards

Question 1

Q

What are the two methods of distance-based clustering?

Answer

A

Hierarchical clustering
k-means clustering

–> Minimize the distance between members of the same group while maximizing the distance to members of other groups

Question 2

Q

What are the two methods of Model-based clustering?

General description of Model-based clustering?

Answer

A

Model-based clustering
Latent class analysis

–> Model the data so that the observed variance can be represented by a small number of groups with specific distributional characteristics

Question 3

Q

How does hierarchical clustering work?

Answer

A

Observations are grouped according to their similarity (distance matrix); the clustering method used here is complete linkage
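
As a rough illustration of the idea, here is a minimal Python sketch of agglomerative clustering with complete linkage (cluster distance = largest pairwise distance between members). It is not the R workflow the card refers to; the function name `complete_linkage` and the four sample points are made up for this example.

```python
import math
from itertools import combinations


def complete_linkage(points):
    """Agglomerative clustering with complete linkage.

    Returns the merges as (cluster_a, cluster_b, height), where each
    cluster is a tuple of point indices and height is the dissimilarity
    at which the two clusters were joined.
    """
    def linkage(a, b):
        # Complete linkage: the largest distance between any two members
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters that is closest under complete linkage
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical sample data: two tight pairs that are far from each other
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = complete_linkage(points)
for a, b, height in merges:
    print(a, b, round(height, 2))
```

The merge heights are what a dendrogram would draw on its vertical axis: the two tight pairs join at low heights, and the final merge happens at a much larger height.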

Question 4

Q

Dendrogram: hierarchical clustering
(distance-based clustering)

Answer

A

At the lowest level, observations are combined into small groups that are relatively similar.
–> These groups are successively combined with less similar groups

Height = dissimilarity

Question 5

Q

What is k-means-based clustering?

Answer

A

(also k-means clustering) = finds groups based on the sum of squared deviations from the multivariate center of the assigned group

–> the number of centers (k) needs to be specified

Question 6

Q

What are the steps in k-means-based clustering?

Answer

A

Choose the number of clusters and the maximum distance
–> requires numeric data
Find an observation for cluster 1
Take the second observation; if it is far enough from 1 –> cluster 2
Take the next observation and compare it with 1 and 2 (if applicable, start cluster 3)
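
The card describes a sequential way of seeding clusters. For comparison, here is a minimal Python sketch of the standard k-means (Lloyd) iteration: assign every observation to its nearest center, recompute each center as the mean of its members, and repeat until nothing changes. This is a swapped-in textbook version, not the deck's exact procedure; the function name `kmeans`, the starting centers, and the sample points are assumptions for illustration.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iteration for k-means on numeric tuples."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers


# Hypothetical numeric data with two visibly separate groups
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

Because the result depends on k and on the starting centers, it is common to rerun the procedure with several values of k and compare the solutions.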

Question 7

Q

What do k-means cluster plots show?
What are the limitations?

Answer

A

Whether it is possible to differentiate groups based on key variables

Limitation:
K-means requires an arbitrary specification of the number of clusters (try different values for k)
–> difficult to determine whether one solution is better than another

Question 8

Q

What is the problem with K-means cluster plots?

Answer

A

It is difficult to determine whether one solution is better than another
–> Repeat the analysis for several numbers of clusters and compare the results

Question 9

Q

Key facts about model-based clustering?
(mclust)

Answer

A

Observations come from groups with different statistical distributions –> the algorithm tries to find the best set of such underlying distributions
It models the clusters as being drawn from a mixture of normal distributions
Can only be used with numerical data

Question 10

Q

What is Latent Class Analysis (LCA)?
(Model-based Clustering)

Answer

A

differences are attributable to unobserved groups that one wishes to uncover (nclass is predetermined) –>Bayesian technique

–> Goal: estimate the probabilities of membership in each class and assign each individual to their most likely class

Question 11

Q

Steps in Latent-class analysis?
(Model-based clustering)

Answer

A

Variable scores are caused by the hidden groups
LCA posits a latent variable that maximizes the likelihood of observing the scores on the variables
It creates a probability of each observation belonging to each segment
Each observation is placed in the segment for which its probability is highest
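
The last two steps can be sketched in Python as follows. The sketch assumes the class shares and conditional item probabilities have already been estimated (the estimation itself is not shown); all numbers and the function name `class_posteriors` are hypothetical.

```python
def class_posteriors(responses, class_shares, item_probs):
    """Posterior class membership for one respondent with 0/1 answers.

    class_shares[c]  = estimated share of class c in the population
    item_probs[c][j] = P(item j answered 1 | class c)
    """
    joint = []
    for share, probs in zip(class_shares, item_probs):
        likelihood = share
        for answer, p in zip(responses, probs):
            likelihood *= p if answer == 1 else 1.0 - p
        joint.append(likelihood)
    total = sum(joint)
    # Bayes' rule: normalize so the probabilities sum to 1
    return [x / total for x in joint]


# Hypothetical two-class model with three binary items
class_shares = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.1], [0.2, 0.3, 0.7]]

post = class_posteriors([1, 1, 0], class_shares, item_probs)
assigned = max(range(len(post)), key=post.__getitem__)  # most likely class
print(post, assigned)
```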

Question 12

Q

Advantages of LCA?

Answer

A

Possible for complex data
Provides optimal number of clusters
Provides an indicator for significant variables
Segment probability score
Provides diagnostic test for the best number of segments

Question 13

Q

What are diagnostics to test for statistical fit? ( Latent class analysis - model based clustering)

Answer

A

BIC: Bayesian information criterion (lower values are better)
Error rate (better if lower)
Log-likelihood (usually negative; better if less negative)
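
A small Python sketch of how BIC trades fit against complexity, using the standard definition BIC = k * ln(n) - 2 * logLik (k parameters, n observations). The log-likelihoods and parameter counts below are invented purely for illustration, and some packages report BIC with the opposite sign, so check the convention of the tool being used.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Standard BIC: lower is better under this sign convention."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical results: number of classes -> (log-likelihood, parameter count)
fits = {2: (-1250.0, 9), 3: (-1180.0, 14), 4: (-1175.0, 19)}
n_obs = 300  # hypothetical sample size

scores = {k: bic(ll, p, n_obs) for k, (ll, p) in fits.items()}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

In this made-up example the 4-class model has the least negative log-likelihood, but its extra parameters are penalized, so the 3-class model gets the lowest BIC.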

Question 14

Q

What does Naive Bayes Classification do? (Supervised learning)

Answer

A

= Training data is used to learn the probability of class membership as a function of each predictor variable considered independently –> using Bayes' rule

–> starts with the observed probabilities of the variables conditional on the segments found in the training data
–> only uses one model
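
A minimal Python sketch of the mechanics for categorical predictors: a-priori class probabilities are multiplied by per-predictor conditional probabilities (each predictor treated independently) and then normalized. Add-one smoothing is an assumption added here to avoid zero probabilities; the data, the names `train` and `predict_proba`, and the segment labels are hypothetical.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, predictor index) -> value counts
    levels = defaultdict(set)            # predictor index -> observed values
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(c, j)][v] += 1
            levels[j].add(v)
    return class_counts, value_counts, levels


def predict_proba(row, model):
    """Posterior probability of each class for one observation."""
    class_counts, value_counts, levels = model
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n  # a-priori probability of the class
        for j, v in enumerate(row):
            # P(value | class) with add-one smoothing (an added assumption)
            score *= (value_counts[(c, j)][v] + 1) / (n_c + len(levels[j]))
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical training data: (age group, housing) -> segment
rows = [("young", "renter"), ("young", "renter"), ("young", "owner"),
        ("old", "owner"), ("old", "owner"), ("old", "renter")]
labels = ["A", "A", "A", "B", "B", "B"]

model = train(rows, labels)
probs = predict_proba(("young", "renter"), model)
print(probs)
```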

Question 15

Q

What is Random forest classification (compared to Naive Bayes classification)? (Supervised learning)

Answer

A

Instead of using a single model, it builds an ensemble of models that jointly classify the data by fitting many classification trees (forest)

–>not providing class membership

Question 16

Q

Advantages of the Random forest model?

Answer

A

Many classification trees (forest) instead of one model
More accurate because more models are applied
Useful to estimate the importance of predictor variables

Question 17

Q

Classification trees are used for?

Answer

A

Used to predict a categorical (and usually binary) dependent variable

Question 18

Q

What is the output of model-based clustering?

Answer

A

Shows the optimal number of clusters if G was not predetermined, together with:
- BIC
- Log-likelihood

Question 19

Q

How can the solution of hierarchical clustering be tested?

Distance based clustering

Answer

A

Zoom in and focus on certain branches of the dendrogram
Use the cophenetic correlation coefficient (CPCC) = measures the correlation between the original dissimilarities and the cophenetic distances
CPCC close to 1 –> strong positive correlation
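
To make the CPCC concrete, here is a hedged Python sketch: the cophenetic distance of two observations is the dendrogram height at which they are first merged, and the CPCC is the Pearson correlation between those heights and the original distances. Complete linkage, the helper names, and the sample points are assumptions for this example.

```python
import math
from itertools import combinations


def cophenetic_distances(points):
    """Complete-linkage clustering; returns {(i, j): height of first merge}."""
    def linkage(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    clusters = [(i,) for i in range(len(points))]
    coph = {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        height = linkage(a, b)
        # Every pair split across a and b is joined at this height
        for i in a:
            for j in b:
                coph[tuple(sorted((i, j)))] = height
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return coph


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical sample data: two tight pairs that are far apart
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
coph = cophenetic_distances(points)
pairs = sorted(coph)
original = [math.dist(points[i], points[j]) for i, j in pairs]
cpcc = pearson(original, [coph[p] for p in pairs])
print(round(cpcc, 3))
```

A value near 1 suggests the dendrogram preserves the original pairwise dissimilarities well.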

Question 20

Q

How can the outcome of the k-means-based model be tested?

Distance based

Answer

A

Check mean values by using aggregate()
Plot the k-means clusters to check whether it is possible to differentiate groups based on key variables
Alternatively, plot two continuous variables by segment

Question 21

Q

How can the solution of model-based clustering be tested? (Mclust())
How to compare between models?

Answer

A

Check mean values by using aggregate()
Plot the model-based clusters
Use other values for G, and compare the model outputs:
- Log-likelihood –> less negative
- BIC –> lowest value
–> also used for model comparison

Question 22

Q

How can the solution of the Latent class analysis be tested?

Answer

A

Check mean values using aggregate()
Plot the LCA clusters
compare predicted class memberships

Question 23

Q

How can the outcome of the Naive Bayes Classification be tested?

Answer

A

Use predict() with the trained model on the test data
Review the segment frequencies and compare them to the initial a-priori frequencies based on the training data

Question 24

Q

How can the performance of the Naive Bayes classification and Random forest be assessed?

Answer

A

Consider the raw agreement rate
mean(raw$Segment == prediction) = 0.92 –> 92% correct predictions
Compare performance against random chance using the ARI
- 1 = perfect agreement, 0 = random, negative values = agreement worse than chance
Assess performance for each class using a confusion matrix
- actual segment is left (rows)
- predicted (columns)
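
The three checks can be sketched in plain Python as below. The adjusted Rand index follows the usual pair-counting formula; the labels are hypothetical, and the function names are chosen for this example rather than taken from any package.

```python
from collections import Counter
from math import comb


def raw_agreement(actual, predicted):
    """Share of observations whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_matrix(actual, predicted):
    """Counts keyed by (actual, predicted)."""
    return Counter(zip(actual, predicted))


def adjusted_rand_index(actual, predicted):
    """Agreement corrected for chance (undefined if both labelings are trivial)."""
    cells = Counter(zip(actual, predicted))
    rows, cols = Counter(actual), Counter(predicted)
    index = sum(comb(n, 2) for n in cells.values())
    row_pairs = sum(comb(n, 2) for n in rows.values())
    col_pairs = sum(comb(n, 2) for n in cols.values())
    expected = row_pairs * col_pairs / comb(len(actual), 2)
    max_index = (row_pairs + col_pairs) / 2
    return (index - expected) / (max_index - expected)


# Hypothetical actual and predicted segments for six respondents
actual = ["A", "A", "A", "B", "B", "B"]
predicted = ["A", "A", "B", "B", "B", "B"]

agreement = raw_agreement(actual, predicted)
confusion = confusion_matrix(actual, predicted)
ari = adjusted_rand_index(actual, predicted)
print(agreement, dict(confusion), ari)
```

In this small example raw agreement is about 83%, while the ARI is noticeably lower, because part of the agreement would be expected by chance alone.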

Question 25

Q

How can the outcome of the Random Forest be tested?

Answer

A

Use predict() with the trained model on the test data
plot the clusters based on test data

Question 26

Q

How can the performance of the Random forest be assessed?

Answer

A

Compare performance against random chance using the ARI (adjustedRandIndex)
1 = perfect agreement, 0 = random, negative values = agreement worse than chance
Assess performance for each class using a confusion matrix on the test data
- actual segment is left (rows)
- predicted (columns)

Question 27

Q

What is meant by importance analysis in the Random Forest model?

Answer

A

the model uses many predictor variables, thus it is useful to know the importance of different classification variables

–>randomForest( importance = TRUE)

Question 28

Q

What is class imbalance, and how can it be resolved in Random forest models?

Answer

A

When randomForest is used for prediction, the model might assign as much as 90% of the predictions to one group –> imbalance

–> Resolving:
- Look at the frequency table of the training data to see the group allocation –> pick the smallest group size
- Set sampsize = "value" in the randomForest model
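
The idea behind sampling every class at the size of the smallest one can be sketched in Python as follows. This only illustrates balanced sampling of row indices and is not how randomForest is implemented; the 90/10 split, the seed, and the function name `balanced_sample` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(labels, rng):
    """Draw the same number of rows from every class (size of the smallest)."""
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    size = min(len(idx) for idx in by_class.values())  # smallest group size
    chosen = []
    for idx in by_class.values():
        chosen.extend(rng.sample(idx, size))  # sample without replacement
    return sorted(chosen)


# Hypothetical training labels with a 90/10 imbalance
labels = ["A"] * 90 + ["B"] * 10
picked = balanced_sample(labels, random.Random(1))
balanced_counts = Counter(labels[i] for i in picked)
print(balanced_counts)
```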

Question 29

Q

What are the steps of Market basket analysis?

Answer

A

If the data are not transaction data: the method can only handle discrete and categorical values
1. If numeric, convert to ordered factors using cut()
cut(data$variable, breaks = cut points of the intervals, labels = names for the resulting intervals, ordered_result = TRUE (ordered factor), right = FALSE (intervals are left-closed))

2. Convert into a formal transactions object using as(x, "transactions")
3. Find association rules using apriori()

–> inspect the rules using inspect()
–> plot rule confidence against support
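
A compact Python sketch of the same pipeline on toy data: bin a numeric variable into left-closed intervals, turn each row into a set of items, and compute support and confidence for a candidate rule. It does not reproduce the apriori() search itself; the `cut` helper here is a hypothetical stand-in for the R function, and the ages, breaks, and labels are invented.

```python
from bisect import bisect_right


def cut(value, breaks, labels):
    """Left-closed binning: [b0, b1), [b1, b2), ... (the idea of right = FALSE)."""
    i = bisect_right(breaks, value) - 1
    if not 0 <= i < len(labels):
        raise ValueError(f"{value} lies outside the breaks")
    return labels[i]


# Hypothetical non-transaction data: (age, housing)
rows = [(25, "rent"), (34, "own"), (61, "own"), (45, "own"), (22, "rent")]
breaks = [0, 30, 60, 120]
labels = ["young", "middle", "old"]

# Each row becomes one transaction: a set of "variable=value" items
transactions = [{f"age={cut(a, breaks, labels)}", f"home={h}"} for a, h in rows]


def support(itemset):
    """Share of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs):
    """How often rhs occurs among the transactions that contain lhs."""
    return support(lhs | rhs) / support(lhs)


rule_support = support({"age=young", "home=rent"})
rule_confidence = confidence({"age=young"}, {"home=rent"})
print(rule_support, rule_confidence)
```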

Question 30

Q

What does the output of Latent class analysis consist of?

Answer

A

Top: Conditional item probabilities (for each predictor)
Bottom:
- Estimated class population shares
- Predicted class membership (modal posterior probability)

Question 31

Q

What are the outputs of Naive Bayes?

Answer

A

Top: A-priori probabilities per segment (class membership)

Middle: Conditional probabilities for each predictor

Question 32

Q

What are the outputs of random forest?

Answer

A

Confusion matrix with class error
OOB estimate of error rate:
but no predicted class membership

Question 33

Q

What is a formal transaction object?

Answer

A

Represents a set of transactions where each transaction consists of a unique identifier and a collection of items

Question 34

Q

How to get the estimated likelihoods of each respondent belonging to each of the different segments? (Naive Bayes)

Answer

A

Set type = "raw" in predict()
Returns the probabilities of all segments for each respondent
