Important 3: Clustering; Classification Flashcards

1
Q

What are the two methods of distance-based clustering?

A

Hierarchical clustering
K-means clustering

–> Minimize the distance between members of the same group while maximizing the distance to members of other groups

2
Q

What are the two methods of Model-based clustering?

General description of Model-based clustering?

A

Model-based clustering (mclust)
Latent class analysis

–> Model the data so that the observed variance can be explained by a small number of groups, each with specific distributional characteristics

3
Q

How does hierarchical clustering work?

A

Observations are grouped according to their similarity (distance matrix); the clustering method used here is complete linkage
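A minimal sketch of complete-linkage agglomerative clustering, written in plain Python only to illustrate the idea. The function name `complete_linkage` and the sample points are assumptions for illustration and are not taken from the course material.

```python
import math
from itertools import combinations

def complete_linkage(points, n_clusters):
    """Merge clusters bottom-up until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per observation
    merges = []  # (members_a, members_b, merge height)

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        merges.append((clusters[a], clusters[b], linkage(clusters[a], clusters[b])))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical observations (two numeric variables each).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = complete_linkage(points, n_clusters=3)
```

The merge heights collected in `merges` correspond to the heights at which branches join in a dendrogram.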

4
Q

Dendrogram: Hierarchical clustering
(distance-based clustering)

A

At the lowest level, observations are combined into small groups that are relatively similar.
–> These groups are successively combined with less similar groups

Height = dissimilarity

5
Q

What is k-means clustering?

A

(also k-means clustering) = find groups based on the sum-of-squares deviation from the multivariate center of the assigned group

–> the number of centers (k) needs to be specified
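A minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python, assuming the starting centers are supplied by hand. The names `assign` and `kmeans` and the sample data are illustrative assumptions; R's kmeans() chooses its starting centers differently.

```python
import math

def assign(points, centers):
    # Each point joins its nearest center.
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def kmeans(points, centers, n_iter=20):
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        labels = assign(points, centers)
        updated = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # New center = multivariate mean of the members.
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centers[k])  # keep the center of an empty cluster
        if updated == centers:
            break
        centers = updated
    labels = assign(points, centers)
    # Within-cluster sum of squares: the quantity k-means tries to make small.
    wss = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, wss

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centers, wss = kmeans(points, centers=[(0, 0), (10, 10)])
```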

6
Q

What are the steps in k-means clustering?

A
  1. Choose the number of clusters and the maximum distance
    –> requires numeric data
  2. Find an observation for cluster 1
  3. Take the second observation; if it is far enough from 1 –> new cluster
  4. Take the next observation and compare it with 1 and 2 (if applicable, cluster 3)
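The steps above read like a sequential, distance-threshold assignment. The sketch below follows that reading in plain Python; the function name, the parameter names `max_clusters` and `min_separation`, and the data are assumptions for illustration, and the sketch assumes at least one cluster is allowed.

```python
import math

def sequential_clusters(points, max_clusters, min_separation):
    seeds = []   # first observation of each cluster
    labels = []  # cluster index per observation
    for p in points:
        dists = [math.dist(p, s) for s in seeds]
        far_from_all = not seeds or min(dists) > min_separation
        if far_from_all and len(seeds) < max_clusters:
            seeds.append(p)                  # far enough: open a new cluster
            labels.append(len(seeds) - 1)
        else:
            labels.append(dists.index(min(dists)))  # join the nearest cluster
    return labels, seeds

points = [(0, 0), (0.5, 0), (10, 10), (10.2, 9.9), (0, 20)]
labels, seeds = sequential_clusters(points, max_clusters=3, min_separation=3)
```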
7
Q

What do k-means cluster plots show?
What are the limitations?

A

Whether it is possible to differentiate groups based on key variables

Limitation:
K-means requires an arbitrary specification of the number of clusters (try different values of k)
–> difficult to determine whether one solution is better than another

8
Q

What is the problem with K-means cluster plots?

A

Difficult to judge whether one solution is better than another
–> Repeat the analysis for several numbers of clusters and compare the results

9
Q

Key facts about model-based clustering?
(mclust)

A
  • Observations come from groups with different statistical distributions –> the algorithm tries to find the best set of such underlying distributions
  • It models clusters as being drawn from a mixture of normal distributions
  • Can only be used with numeric data
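The idea of fitting a mixture of normal distributions can be sketched with a small expectation-maximization loop for one variable. This is a simplified illustration in plain Python, not the mclust implementation; `fit_mixture`, the starting means, and the data are assumptions.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def fit_mixture(values, start_means, n_iter=50):
    means = list(start_means)
    k = len(means)
    sds = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: probability that each value belongs to each component.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate share, mean and sd of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # floor keeps a component from collapsing
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, sds, labels

values = [1.0, 1.2, 0.8, 1.1, 0.9, 10.0, 10.3, 9.7, 10.1, 9.9]
weights, means, sds, labels = fit_mixture(values, start_means=[0.0, 8.0])
```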
10
Q

What is Latent Class Analysis (LCA)?
(Model-based Clustering)

A

Observed differences are attributable to unobserved (latent) groups that one wishes to uncover (nclass is predetermined) –> Bayesian technique

–> Goal: estimate the probabilities of membership in each class and assign individuals to their most likely class

11
Q

Steps in Latent-class analysis?
(Model-based clustering)

A
  1. Variable scores are assumed to be caused by the hidden groups
  2. LCA posits a latent variable that maximizes the likelihood of observing the scores on the variables
  3. It creates a probability of each observation belonging to each segment
  4. Each observation is placed in the segment for which it has the highest probability
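Steps 3 and 4 can be sketched with Bayes' rule, assuming the class shares and the conditional item probabilities are already known. In a real analysis these are estimated from the data; the numbers and the function name `class_posteriors` below are hypothetical.

```python
def class_posteriors(answers, class_shares, item_probs):
    """answers: 0/1 responses; item_probs[c][i] = P(item i = 1 | class c)."""
    joint = []
    for share, probs in zip(class_shares, item_probs):
        likelihood = share
        for a, p in zip(answers, probs):
            likelihood *= p if a == 1 else (1 - p)
        joint.append(likelihood)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class model with three yes/no items.
class_shares = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.7],
              [0.2, 0.3, 0.1]]
posteriors = class_posteriors((1, 1, 0), class_shares, item_probs)
# Modal assignment: the class with the highest posterior probability.
assigned_class = max(range(len(posteriors)), key=lambda c: posteriors[c])
```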
12
Q

Advantages of LCA?

A
  • Possible for complex data
  • Provides the optimal number of clusters
  • Provides an indicator for significant variables
  • Segment probability score
  • Provides diagnostic tests for the best number of segments
13
Q

What are diagnostics to test for statistical fit? (Latent class analysis, model-based clustering)

A
  • BIC: Bayesian information criterion (lower values are better)
  • Error rate (lower is better)
  • Log-likelihood (a negative number; better if less negative, i.e. closer to zero)
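The usual definition of BIC combines the log-likelihood with a penalty for the number of estimated parameters. A small sketch follows; the log-likelihoods, parameter counts, and sample size are made-up numbers for illustration. Note that some packages report BIC with the opposite sign, so the direction of "better" should be checked against the package documentation.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    # Standard form: lower values indicate a better trade-off of fit and complexity.
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical comparison of a smaller and a larger model on 300 observations.
bic_small = bic(log_likelihood=-1200.0, n_params=10, n_obs=300)
bic_large = bic(log_likelihood=-1190.0, n_params=21, n_obs=300)
preferred = "small" if bic_small < bic_large else "large"
```

Here the larger model has a slightly better log-likelihood, but the penalty for its extra parameters outweighs the gain.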
14
Q

What does Naive Bayes Classification do? (Supervised learning)

A

= Training data is used to learn the probability of class membership as a function of each predictor variable, considered independently –> using Bayes' rule

–> starts with the observed probabilities of the variables, conditional on the segments found in the training data
–> only uses one model
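A minimal sketch of a Naive Bayes classifier for categorical predictors, showing the a-priori probabilities, the conditional probabilities, and the posterior obtained with Bayes' rule. The add-one smoothing, the function names, and the data are assumptions for illustration and are not the defaults of any particular R package.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    class_sizes = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_sizes.items()}
    counts = defaultdict(lambda: defaultdict(Counter))  # counts[class][feature][value]
    levels = defaultdict(set)                           # observed values per feature
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            counts[c][i][v] += 1
            levels[i].add(v)

    def cond(c, i, v):
        # P(feature i = v | class c), with add-one smoothing (an assumption here).
        return (counts[c][i][v] + 1) / (class_sizes[c] + len(levels[i]))

    return priors, cond

def posterior(row, priors, cond):
    joint = dict(priors)
    for c in joint:
        for i, v in enumerate(row):
            joint[c] *= cond(c, i, v)  # predictors treated as independent
    total = sum(joint.values())
    return {c: j / total for c, j in joint.items()}

rows = [("yes", "urban"), ("yes", "urban"), ("no", "urban"),
        ("no", "rural"), ("no", "rural"), ("yes", "rural")]
labels = ["A", "A", "A", "B", "B", "B"]
priors, cond = train_naive_bayes(rows, labels)
probs = posterior(("yes", "urban"), priors, cond)
predicted = max(probs, key=probs.get)
```

The dictionary `probs` plays the role of the per-segment probabilities for one respondent, and `predicted` is the most likely segment.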

15
Q

What is Random Forest classification (compared to Naive Bayes classification)? (Supervised learning)

A

Instead of using a single model, it builds an ensemble of models that jointly classify the data by fitting many classification trees (a forest)

–> the output does not report class membership

16
Q

Advantages of the Random Forest model?

A
  • Many classification trees (a forest) instead of one model
  • More accurate because more models are applied
  • Useful to estimate the importance of predictor variables
17
Q

Classification trees are used for?

A

Used to predict a categorical (and usually binary) dependent variable

18
Q

What is the output of model-based clustering?

A

Shows the optimal number of clusters (if G was not predetermined), together with:
- BIC
- Log-likelihood

19
Q

How can the solution of hierarchical clustering be tested?

Distance based clustering

A
  • Zoom in and focus on certain branches of the dendrogram
  • Use the cophenetic correlation coefficient (CPCC) = measures the correlation between the original dissimilarities and the cophenetic distances
  • CPCC close to 1 –> strong positive correlation
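The cophenetic distance between two observations is the height at which they first end up in the same cluster. The sketch below computes these heights under complete linkage and correlates them with the original distances; the function names and the data are assumptions for illustration, not the R functions themselves.

```python
import math
from itertools import combinations

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def cophenetic_correlation(points):
    n = len(points)
    original = {pair: math.dist(points[pair[0]], points[pair[1]])
                for pair in combinations(range(n), 2)}
    cophenetic = {}
    clusters = [[i] for i in range(n)]

    def height(ab):
        # Complete linkage: largest pairwise distance between the two clusters.
        return max(original[min(i, j), max(i, j)]
                   for i in clusters[ab[0]] for j in clusters[ab[1]])

    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2), key=height)
        h = height((a, b))
        # Every pair first joined by this merge gets the merge height.
        for i in clusters[a]:
            for j in clusters[b]:
                cophenetic[min(i, j), max(i, j)] = h
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]

    pairs = sorted(original)
    return pearson([original[p] for p in pairs], [cophenetic[p] for p in pairs])

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
cpcc = cophenetic_correlation(points)
```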
20
Q

How can the outcome of the k-means model be tested?

Distance based

A
  1. Check the mean values by segment using aggregate()
  2. Plot the k-means clusters to check whether it is possible to differentiate groups based on key variables
  3. Alternatively, plot two continuous variables by segment
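Checking mean values by segment amounts to grouping the observations by their cluster label and averaging each variable. A small plain-Python sketch of that idea follows; the function name `group_means` and the data are hypothetical.

```python
from collections import defaultdict

def group_means(rows, labels):
    grouped = defaultdict(list)
    for row, lab in zip(rows, labels):
        grouped[lab].append(row)
    # Average every variable (column) within each segment.
    return {lab: [sum(col) / len(col) for col in zip(*members)]
            for lab, members in grouped.items()}

# Hypothetical data: (age, income) per respondent plus the assigned cluster.
rows = [(25, 30000), (35, 50000), (60, 90000), (50, 70000)]
labels = [1, 1, 2, 2]
means = group_means(rows, labels)
```

Clearly different means across segments suggest that the clusters separate respondents on those variables.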
21
Q

How can the solution of model-based clustering be tested? (Mclust() from the mclust package)
How to compare between models?

A
  1. Check mean values by using aggregate()
  2. Plot the model-based clusters
  3. Use other values for G and compare the model outputs:
    - Log-likelihood –> less negative is better
    - BIC –> lowest value is better
    –> both are also used for model comparison
22
Q

How can the solution of the Latent Class Analysis be tested?

A
  1. Check mean values using aggregate()
  2. Plot the LCA clusters
  3. compare predicted class memberships
23
Q

How can the outcome of the Naive Bayes Classification be tested?

A
  1. Use predict() with the trained model to predict values for the test data
  2. Review the segment frequencies and compare them to the initial a-priori frequencies based on the training data
24
Q

How can the performance of the Naive Bayes Classification and Random Forest be assessed?

A
  1. Consider the raw agreement rate
    mean(raw$Segment == prediction) = 0.92 –> 92% correct predictions
  2. Compare performance against random chance using the ARI (adjusted Rand index)
    - 1 = perfect agreement, 0 = random, negative = worse than chance
  3. Assess performance for each class using a confusion matrix
    - actual segment on the left (rows)
    - predicted segment in the columns
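The three checks above can be sketched in plain Python. The adjusted Rand index follows its standard pair-counting definition; the function names and the labels are assumptions for illustration, and the index is undefined when both label vectors put everything into a single group.

```python
from collections import Counter
from math import comb

def agreement_rate(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_matrix(actual, predicted):
    # Keys are (actual, predicted) pairs, values are counts.
    return Counter(zip(actual, predicted))

def adjusted_rand_index(actual, predicted):
    cells = confusion_matrix(actual, predicted)
    sum_cells = sum(comb(n, 2) for n in cells.values())
    sum_rows = sum(comb(n, 2) for n in Counter(actual).values())
    sum_cols = sum(comb(n, 2) for n in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(len(actual), 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

actual = ["A", "A", "A", "B", "B", "B"]
predicted = ["A", "A", "B", "B", "B", "B"]
rate = agreement_rate(actual, predicted)
matrix = confusion_matrix(actual, predicted)
ari = adjusted_rand_index(actual, predicted)
```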
25
Q

How can the outcome of the Random Forest be tested?

A
  1. Use predict() with the trained model to predict values for the test data
  2. plot the clusters based on test data
26
Q

How can the performance of the Random Forest be assessed?

A
  1. Compare performance against random chance using the ARI (adjustedRandIndex)
    1 = perfect agreement, 0 = random, negative = worse than chance
  2. Assess performance for each class using a confusion matrix on the test data
    - actual segment on the left (rows)
    - predicted segment in the columns
27
Q

What is meant by importance analysis in the Random Forest model?

A

The model uses many predictor variables, so it is useful to know the importance of the different classification variables

–> randomForest(importance = TRUE)
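One generic way to think about variable importance is permutation: shuffle one predictor and see how much accuracy drops. The sketch below illustrates that idea with a hypothetical stand-in classifier; it is not how any particular package computes its importance scores, and all names and data are assumptions.

```python
import random

def accuracy(classify, rows, labels):
    return sum(classify(r) == l for r, l in zip(rows, labels)) / len(labels)

def permutation_importance(classify, rows, labels, seed=0):
    rng = random.Random(seed)
    baseline = accuracy(classify, rows, labels)
    importance = []
    for col in range(len(rows[0])):
        shuffled = [row[col] for row in rows]
        rng.shuffle(shuffled)  # break the link between this predictor and the class
        permuted = [row[:col] + (v,) + row[col + 1:] for row, v in zip(rows, shuffled)]
        importance.append(baseline - accuracy(classify, permuted, labels))
    return importance

# Hypothetical stand-in for a trained model: it only looks at the first predictor.
def classify(row):
    return "A" if row[0] < 5 else "B"

rows = [(1, 7), (2, 3), (3, 9), (8, 4), (9, 8), (7, 2)]
labels = ["A", "A", "A", "B", "B", "B"]
importance = permutation_importance(classify, rows, labels)
```

A predictor the model never uses shows no drop in accuracy when shuffled, which is why the second value is zero here.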

28
Q

What is class imbalance, and how can it be resolved in randomForest models?

A

When using randomForest for prediction, the model might place, for example, 90% of the predictions in one group –> imbalance

–> Resolving:
- Look at the frequency table of the training data to see the group allocation –> pick the smallest group size
- Set sampsize = that value in the randomForest model
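The idea behind limiting the sample size per class can be sketched as drawing the same number of observations from every group, using the size of the smallest group. This plain-Python sketch draws one balanced sample; the function name and data are hypothetical, and a random forest would typically repeat such a draw for every tree.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(rows, labels, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, lab in zip(rows, labels):
        by_class[lab].append(row)
    # Size of the smallest group, as read off the frequency table.
    size = min(len(members) for members in by_class.values())
    sample = [(row, lab) for lab, members in by_class.items()
              for row in rng.sample(members, size)]
    rng.shuffle(sample)
    return sample

rows = list(range(12))
labels = ["A"] * 9 + ["B"] * 3   # imbalanced training data
sample = balanced_sample(rows, labels)
sample_counts = Counter(lab for _, lab in sample)
```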

29
Q

What are the steps of Market basket analysis?

A

If the data are not transaction data: market basket analysis can only handle discrete and categorical values
  1. If numeric, convert to ordered factors using cut():
    cut(data$variable, breaks = cut points of the intervals, labels = names for the resulting intervals, ordered_result = TRUE (returns an ordered factor), right = FALSE (intervals are closed on the left))
  2. Convert into a formal transaction object using as(x, "transactions")
  3. Find association rules using apriori()

–> inspect rules using inspect()
–> plot rule confidence against support
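The steps above can be sketched in plain Python: bin a numeric variable into left-closed intervals, turn each record into a set of items, and compute support and confidence for a rule. The helper names, cut points, and data are assumptions for illustration; the sketch does not search for rules the way apriori() does.

```python
from bisect import bisect_right

def cut(value, breaks, labels):
    """Left-closed binning: [b0, b1), [b1, b2), ... (None outside the range)."""
    index = bisect_right(breaks, value) - 1
    if index < 0 or index >= len(labels):
        return None
    return labels[index]

def support(itemset, transactions):
    # Share of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # Share of transactions containing lhs that also contain rhs.
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical non-transaction data: age (numeric) and home ownership (categorical).
ages = [25, 45, 70, 33, 62]
owns_home = ["no", "yes", "yes", "no", "yes"]
age_groups = [cut(a, breaks=[0, 30, 60, 120], labels=["young", "middle", "old"])
              for a in ages]
transactions = [{f"age={g}", f"home={h}"} for g, h in zip(age_groups, owns_home)]

rule_support = support({"age=old", "home=yes"}, transactions)
rule_confidence = confidence({"age=old"}, {"home=yes"}, transactions)
```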

30
Q

What does the output of Latent Class Analysis consist of?

A
  1. Top: Conditional item probabilities (for each predictor)
  2. Bottom:
    - Estimated class population shares
    - Predicted class membership (modal posterior probability)
31
Q

What are the outputs of Naive Bayes?

A

Top: A-priori probabilities per segment (class membership)

Middle: Conditional probabilities for each predictor

32
Q

What are the outputs of random forest?

A
  1. Confusion matrix with class error
  2. OOB estimate of error rate:
  3. But no predicted class membership
33
Q

What is a formal transaction object?

A

Represents a set of transactions where each transaction consists of a unique identifier and a collection of items

34
Q

How to get the estimated likelihoods of each respondent belonging to each of the different segments? (Naive Bayes)

A

Set type = "raw" in predict()
–> returns the probabilities of all segments for each respondent