Classification Flashcards
Week 7
Classification probabilities
With P(class|features) we are essentially asking how likely it is that an observation belongs to a particular class, given its features.
Approaches to Classification
Generative classifiers:
Generative classifiers try to understand how the data are generated, modeling the features and the classes jointly.
They obtain P(class|predictors) indirectly, by first estimating the class priors and the class-conditional distributions of the predictors.
They rely on statistical theory such as Bayes' theorem.
Discriminative classifiers:
Discriminative classifiers focus on predicting the class directly based on the observed features, without necessarily understanding the underlying data generation process.
Estimate P(class|predictors) directly.
Also referred to as conditional classifiers.
Prior probability for a class
The prior probability for a class is the probability of that class occurring before any evidence or features are considered. It represents our initial belief about the likelihood of each class before we observe any data.
In classification problems, prior probabilities are used as a starting point for making predictions.
Posterior Probability for a Class:
P(j∣x) represents the probability of class j given a feature vector x. This is what we want to find: the updated probability of each class after we have observed the features.
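As a reminder of how prior and posterior fit together, Bayes' theorem gives the posterior in terms of the prior \pi_j and the class-conditional density f_j of the features:

    P(j \mid x) = \frac{\pi_j f_j(x)}{\sum_{k=1}^{C} \pi_k f_k(x)}

The denominator sums over all C classes, so the posteriors add to one.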
Misclassification rate
The performance of a classifier is usually measured by its misclassification rate. The misclassification rate is the proportion of observations assigned to the wrong class.
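A minimal sketch in R of computing the misclassification rate (assuming vectors predicted and actual of class labels; the names are illustrative):

    # Proportion of observations assigned to the wrong class
    misclass_rate <- mean(predicted != actual)

    # Equivalently, via a cross-tabulation (confusion matrix)
    tab <- table(predicted, actual)
    1 - sum(diag(tab)) / sum(tab)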
Linear Discriminant Analysis
- Linear Discriminant Analysis is often abbreviated to LDA.
- LDA is applicable when all the features are quantitative.
- We assume that fj is a (multivariate) normal probability distribution.
- In addition, we assume that the covariance matrix is the same from class to class.
- The classes are differentiated only by the locations of their means.
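A minimal LDA sketch in R using MASS::lda() on the built-in iris data (illustrative only, not from the notes):

    library(MASS)

    fit <- lda(Species ~ ., data = iris)  # estimates class means and a common covariance
    pred <- predict(fit, iris)            # pred$class gives labels, pred$posterior gives P(class | x)
    mean(pred$class != iris$Species)      # training misclassification rate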
Kernel Discriminant Analysis (KDA):
KDA extends LDA by allowing for more complex, nonlinear decision boundaries between classes. It achieves this by mapping the data into a higher-dimensional space using a kernel function. KDA essentially “lifts” the data into a higher-dimensional space where the classes might be more easily separable and then applies LDA in this new space.
In the context of classification trees, the Gini index (or Gini coefficient) at a node is defined by

    G = \sum_{i=1}^{C} p_i (1 - p_i)
Explain why the Gini index is a measure of node impurity. As part of your answer you should define the meaning of the quantities p_i and C in the above equation.
C is the number of classes [1], and p_i is the proportion of observations at the node belonging to class i [1]. G can be thought of as the probability of incorrectly classifying an observation at the node if it were randomly reassigned a class according to the node's class proportions [1]. G = 0 is the smallest value, attained when p_i = 1 for some i (i.e. zero impurity), while G is largest when the proportions are all equal, p_i = 1/C [1].
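A small R helper to compute the Gini index from a vector p of class proportions (a sketch; the function name is illustrative):

    gini <- function(p) sum(p * (1 - p))  # G = sum_i p_i (1 - p_i)

    gini(c(1, 0, 0))   # 0: pure node, zero impurity
    gini(rep(1/3, 3))  # 2/3: maximal impurity when C = 3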
Give a short description of how the term f_j(x) is estimated in (a) linear discriminant analysis, and (b) kernel discriminant analysis.
In LDA we assume that f_j in each class follows a multivariate normal distribution with possibly different means μ_j [1] but a common covariance matrix [1]. In KDA we estimate f_j using a kernel density estimate [2].
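In symbols, with n_j training observations x_1, \ldots, x_{n_j} in class j and K_h a kernel with bandwidth h:

    LDA: \hat{f}_j(x) = N(x; \hat{\mu}_j, \hat{\Sigma})  (common \hat{\Sigma} across classes)
    KDA: \hat{f}_j(x) = \frac{1}{n_j} \sum_{i=1}^{n_j} K_h(x - x_i)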
Describe the difference between prediction, classification and clustering problems.
Prediction and classification are problems where the data have a target variable, either continuous (prediction) or categorical (classification), and the task is to train a model to predict that target [1]. Clustering, on the other hand, has no target variable in the data. Instead the purpose is to group the data into clusters based on their similarity [1].
cost_complexity=0.005 argument.
The cost_complexity argument controls the pruning of the tree. The tree will only allow a split if the improvement in fit is at least 0.005 [2].
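For example, in tidymodels the argument is set in the model specification (a sketch assuming an rpart engine and a classification problem):

    library(tidymodels)

    tree_spec <- decision_tree(cost_complexity = 0.005) %>%
      set_engine("rpart") %>%
      set_mode("classification")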
augment() function
The augment() function generates predictions on the test set. It has been used because we must assess performance on independent data.
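A sketch of this step in tidymodels (assuming a fitted model tree_fit, a test set test_data, and an outcome column class; the names are illustrative):

    preds <- augment(tree_fit, new_data = test_data)  # adds .pred_class and per-class probabilities
    accuracy(preds, truth = class, estimate = .pred_class)
    conf_mat(preds, truth = class, estimate = .pred_class)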
Describe the aim of cluster analysis and give one application to demonstrate its use.
Cluster analysis is a method of data grouping, or data segmentation.
The aim of cluster analysis is to delineate ‘natural groups’ of data, with high within-class similarity and low between-class similarity.
Examples: grouping investors into classes, customer segmentation for marketing purposes.
Give three possible choices of distance measures for use in clustering, commenting on their applicability to different types of data.
Euclidean distance: “as the crow flies”, the straight-line distance in variable space. By Pythagoras’ theorem it is the square root of the sum of squared differences on each dimension. Most commonly used for continuous data.
Manhattan distance: “as the taxi drives”, simply the sum of the absolute differences in each dimension. Less affected by outliers.
Mahalanobis distance: takes the covariances between variables into account, effectively rescaling elliptical clusters to spherical ones.
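In symbols, for points x and y in p dimensions, with S the sample covariance matrix:

    Euclidean: d(x, y) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2}
    Manhattan: d(x, y) = \sum_{k=1}^{p} |x_k - y_k|
    Mahalanobis: d(x, y) = \sqrt{(x - y)^\top S^{-1} (x - y)}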
Explain the difference between supervised and unsupervised classification. Which of these is k-means cluster analysis, and why?
Supervised classification has a target variable; unsupervised does not. k-means is unsupervised, because it doesn’t have a target variable.
Briefly describe the steps of a hierarchical cluster analysis with complete linkage. Identify which step(s) make it different from other linkage methods.
1. Compute the distance matrix between all observations.
2. Merge the two clusters separated by the smallest distance.
3. Update the matrix using the largest distance between any members of the two merged clusters.
4. Repeat steps 2 and 3 until everything has been merged into the root node.
Step 3 is what makes complete linkage different from other hierarchical clustering (linkage) methods.
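In R the whole procedure is carried out by hclust() (a sketch on a numeric data frame x):

    d <- dist(x)                          # step 1: distance matrix (Euclidean by default)
    hc <- hclust(d, method = "complete")  # steps 2-4: merge closest pair, update with the maximum distance
    plot(hc)                              # dendrogram; method = "single" or "average" changes only step 3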
Single linkage method
The single linkage method, also known as the minimum linkage method, is a technique used in hierarchical clustering to determine the distance between two clusters. In single linkage clustering, the distance between two clusters is defined as the shortest distance between any two points in the two clusters.
Explain why a complete (exhaustive) association rules analysis can be unfeasible when the number of items is large. What method can be used to ameliorate this problem and, in basic terms, how does it work?
The number of possible rules increases exponentially with the number of items.
The Apriori algorithm reduces the number of rules under consideration by eliminating itemsets that must be infrequent, on the basis that any superset of an infrequent itemset is itself infrequent.
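A sketch using the arules package (assuming a transactions object trans; the support and confidence thresholds are illustrative):

    library(arules)

    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
    inspect(head(sort(rules, by = "lift")))  # strongest associations first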
Classification Methods
The goal is to assign each observation in a test dataset to one of a (finite) number of pre-specified categories. Classification is also referred to as supervised learning.
In association rule analysis, are rules with lift less than one potentially useful?
Yes. A rule with lift less than one means the left- and right-hand sides are negatively associated, which might itself be useful information. Also, a high-confidence, low-lift rule might be useful, for example, for recommending a popular product. Lift measures how much more likely the antecedent and consequent of a rule are to occur together than would be expected if they were independent, so rules with lift below one indicate weaker or negative associations between the items.
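For reference, the lift of a rule A \Rightarrow B is

    \text{lift}(A \Rightarrow B) = \frac{\text{supp}(A \cup B)}{\text{supp}(A)\,\text{supp}(B)} = \frac{\text{conf}(A \Rightarrow B)}{\text{supp}(B)}

so lift below one means A and B occur together less often than expected under independence.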
Confusion matrix
It is often useful to cross-tabulate the predicted classes against the actual classes on a validation set.
Visualizing the classifier
This is a very easy problem for LDA to solve.
The classes are very clearly separated.
They have roughly the same variance.
A bivariate normal distribution isn't clearly awful.
LDA makes strong assumptions, and when they hold it will do well.
LDA Dislikes Jam Donuts: LDA fails when one class forms a ring around another, since no linear boundary can separate the classes.
Naive Bayes classification
Make the strong assumption that all the predictors are statistically independent within each class. Naive Bayes has high bias due to the independence assumption, but low variance because it does not try to fit the data too closely.
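Under this assumption the class-conditional density factorizes over the p predictors, so the posterior is proportional to the prior times a product of one-dimensional densities:

    f_j(x) = \prod_{k=1}^{p} f_{jk}(x_k), \qquad P(j \mid x) \propto \pi_j \prod_{k=1}^{p} f_{jk}(x_k)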
Logistic Regression
Logistic regression is one of a class of methods known as generalised linear models (GLMs). It provides a means of estimating the probability of a 1 outcome in a binary response based on predictor variables. When using logistic regression for making predictions, there are two different scales on which you can interpret the results:
Response Scale (Probability Scale)
Linear Predictor Scale (Logit Scale)
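A base-R sketch of the two scales (assuming a binary outcome y and a predictor x in a data frame df; the names are illustrative):

    fit <- glm(y ~ x, data = df, family = binomial)

    predict(fit, type = "link")      # linear predictor (logit) scale: log(p / (1 - p))
    predict(fit, type = "response")  # response scale: probabilities p in (0, 1)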