Classification Flashcards
Week 7
Classification probabilities
With P(class|features) we are essentially asking how likely it is that an observation belongs to a particular class, given its features.
Approaches to Classification
Generative classifiers:
Generative classifiers try to understand how the data are generated, modeling the features and the classes jointly.
They obtain P(class|predictors) indirectly, by first estimating the class priors and the class-conditional distributions of the predictors.
They rely on statistical theory such as Bayes' theorem.
Discriminative classifiers:
Discriminative classifiers focus on predicting the class directly based on the observed features, without necessarily understanding the underlying data generation process.
Estimate P(class|predictors) directly.
Also referred to as conditional classifiers.
Prior probability for a class
The prior probability for a class is the probability of that class occurring before any evidence or features are considered. It represents our initial belief about the likelihood of each class before we observe any data.
In classification problems, prior probabilities are used as a starting point for making predictions.
Posterior Probability for a Class:
P(j∣x) represents the probability of class j given a feature vector x. This is what we want to find: the updated probability of each class after we have observed the features.
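As a reminder of how prior and posterior fit together, Bayes' theorem gives the posterior in terms of the prior \pi_j and the class-conditional density f_j of the features:

    P(j \mid x) = \frac{\pi_j f_j(x)}{\sum_{k=1}^{C} \pi_k f_k(x)}

The denominator sums over all C classes, so the posteriors add to one.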
Misclassification rate
The performance of a classifier is usually measured by its misclassification rate. The misclassification rate is the proportion of observations assigned to the wrong class.
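A minimal sketch in R of computing the misclassification rate (assuming vectors predicted and actual of class labels; the names are illustrative):

    # Proportion of observations assigned to the wrong class
    misclass_rate <- mean(predicted != actual)

    # Equivalently, via a cross-tabulation (confusion matrix)
    tab <- table(predicted, actual)
    1 - sum(diag(tab)) / sum(tab)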
Linear Discriminant Analysis
- Linear Discriminant Analysis is often abbreviated to LDA.
- LDA is applicable when all the features are quantitative.
- We assume that fj is a (multivariate) normal probability distribution.
- In addition, we assume that the covariance matrix is the same from class to class.
- The classes are differentiated only by the locations of their means.
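A minimal LDA sketch in R using MASS::lda() on the built-in iris data (illustrative only, not from the notes):

    library(MASS)

    fit <- lda(Species ~ ., data = iris)  # estimates class means and a common covariance
    pred <- predict(fit, iris)            # pred$class gives labels, pred$posterior gives P(class | x)
    mean(pred$class != iris$Species)      # training misclassification rate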
Kernel Discriminant Analysis (KDA):
KDA extends LDA by allowing for more complex, nonlinear decision boundaries between classes. It achieves this by mapping the data into a higher-dimensional space using a kernel function. KDA essentially “lifts” the data into a higher-dimensional space where the classes might be more easily separable and then applies LDA in this new space.
In the context of classification trees, the Gini index (or Gini coefficient) at a node is defined by

    G = \sum_{i=1}^{C} p_i (1 - p_i)
Explain why the Gini index is a measure of node impurity. As part of your answer you should define the meaning of the quantities p_i and C in the above equation.
C is the number of classes [1], and p_i is the proportion of observations at the node belonging to class i [1]. G can be thought of as the probability of incorrectly classifying an observation at the node if it were randomly reassigned a class according to the node's class proportions [1]. G = 0 is the smallest value, attained when p_i = 1 for some i (i.e. zero impurity), while G is largest when the proportions are all equal, p_i = 1/C [1].
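A small R helper to compute the Gini index from a vector p of class proportions (a sketch; the function name is illustrative):

    gini <- function(p) sum(p * (1 - p))  # G = sum_i p_i (1 - p_i)

    gini(c(1, 0, 0))   # 0: pure node, zero impurity
    gini(rep(1/3, 3))  # 2/3: maximal impurity when C = 3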
Give a short description of how the term f_j(x) is estimated in (a) linear discriminant analysis, and (b) kernel discriminant analysis.
In LDA we assume that f_j in each class follows a multivariate normal distribution with possibly different means μ_j [1] but a common covariance matrix [1]. In KDA we estimate f_j using a kernel density estimate [2].
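In symbols, with n_j training observations x_1, \ldots, x_{n_j} in class j and K_h a kernel with bandwidth h:

    LDA: \hat{f}_j(x) = N(x; \hat{\mu}_j, \hat{\Sigma})  (common \hat{\Sigma} across classes)
    KDA: \hat{f}_j(x) = \frac{1}{n_j} \sum_{i=1}^{n_j} K_h(x - x_i)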
Describe the difference between prediction, classification and clustering problems.
Prediction and classification are problems where the data have a target variable, either continuous (prediction) or categorical (classification), and the task is to train a model to predict that target [1]. Clustering, on the other hand, has no target variable in the data. Instead the purpose is to group the data into clusters based on their similarity [1].
cost_complexity=0.005 argument.
The cost_complexity argument controls the pruning of the tree. The tree will only allow a split if the improvement in fit is at least 0.005 [2].
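For example, in tidymodels the argument is set in the model specification (a sketch assuming an rpart engine and a classification problem):

    library(tidymodels)

    tree_spec <- decision_tree(cost_complexity = 0.005) %>%
      set_engine("rpart") %>%
      set_mode("classification")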
augment() function
The augment() function generates predictions on the test set. It has been used because we must assess performance on independent data.
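A sketch of this step in tidymodels (assuming a fitted model tree_fit, a test set test_data, and an outcome column class; the names are illustrative):

    preds <- augment(tree_fit, new_data = test_data)  # adds .pred_class and per-class probabilities
    accuracy(preds, truth = class, estimate = .pred_class)
    conf_mat(preds, truth = class, estimate = .pred_class)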
Describe the aim of cluster analysis and give one application to demonstrate its use.
Cluster analysis is a method of data grouping, or data segmentation.
The aim of cluster analysis is to delineate ‘natural groups’ of data, with high within-class similarity and low between-class similarity.
Examples: grouping investors into classes, customer segmentation for marketing purposes.
Give three possible choices of distance measures for use in clustering, commenting on their applicability to different types of data.
Euclidean distance: “as the crow flies”, the straight-line distance in variable space. By Pythagoras’ theorem it is the square root of the sum of squared differences on each dimension. Most commonly used for continuous data.
Manhattan distance: “as the taxi drives”, simply the sum of the absolute differences in each dimension. Less affected by outliers.
Mahalanobis distance: takes the covariances between variables into account, effectively rescaling elliptical clusters to spherical ones.
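In symbols, for points x and y in p dimensions, with S the sample covariance matrix:

    Euclidean: d(x, y) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2}
    Manhattan: d(x, y) = \sum_{k=1}^{p} |x_k - y_k|
    Mahalanobis: d(x, y) = \sqrt{(x - y)^\top S^{-1} (x - y)}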
Explain the difference between supervised and unsupervised classification. Which of these is k-means cluster analysis, and why?
Supervised classification has a target variable; unsupervised does not. k-means is unsupervised, because it doesn’t have a target variable.
Briefly describe the steps of a hierarchical cluster analysis with complete linkage. Identify which step(s) make it different from other linkage methods.
1. Compute the distance matrix between all observations.
2. Merge the two clusters separated by the smallest distance.
3. Update the matrix using the largest distance between any members of the two merged clusters.
4. Repeat steps 2 and 3 until everything has been merged into the root node.
Step 3 is what makes complete linkage different from other hierarchical clustering (linkage) methods.
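In R the whole procedure is carried out by hclust() (a sketch on a numeric data frame x):

    d <- dist(x)                          # step 1: distance matrix (Euclidean by default)
    hc <- hclust(d, method = "complete")  # steps 2-4: merge closest pair, update with the maximum distance
    plot(hc)                              # dendrogram; method = "single" or "average" changes only step 3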
Single linkage method
The single linkage method, also known as the minimum linkage method, is a technique used in hierarchical clustering to determine the distance between two clusters. In single linkage clustering, the distance between two clusters is defined as the shortest distance between any two points in the two clusters.
Explain why a complete (exhaustive) association rules analysis can be unfeasible when the number of items is large. What method can be used to ameliorate this problem and, in basic terms, how does it work?
The number of possible rules increases exponentially with the number of items.
The Apriori algorithm reduces the number of rules under consideration by eliminating itemsets that must be infrequent, on the basis that any superset of an infrequent itemset is itself infrequent.
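A sketch using the arules package (assuming a transactions object trans; the support and confidence thresholds are illustrative):

    library(arules)

    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
    inspect(head(sort(rules, by = "lift")))  # strongest associations first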
Classification Methods
The goal is to assign each observation in a test dataset to one of a (finite) number of pre-specified categories. Classification is also referred to as supervised learning.
In association rule analysis, are rules with lift less than one potentially useful?
Yes. A rule with lift less than one means the left- and right-hand sides are negatively associated, which might itself be useful information. Also, a high-confidence, low-lift rule might be useful, for example, for recommending a popular product. Lift measures how much more likely the antecedent and consequent of a rule are to occur together than would be expected if they were independent, so rules with lift below one indicate weaker or negative associations between the items.
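For reference, the lift of a rule A \Rightarrow B is

    \text{lift}(A \Rightarrow B) = \frac{\text{supp}(A \cup B)}{\text{supp}(A)\,\text{supp}(B)} = \frac{\text{conf}(A \Rightarrow B)}{\text{supp}(B)}

so lift below one means A and B occur together less often than expected under independence.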
Confusion matrix
It is often useful to cross-tabulate the predicted classes against the actual classes on a validation set.
Visualizing the classifier
This is a very easy problem for LDA to solve.
The classes are very clearly separated.
They have roughly the same variance.
A bivariate normal distribution isn't clearly awful.
LDA makes strong assumptions, and when they hold it will do well.
LDA Dislikes Jam Donuts: LDA fails when one class forms a ring around another, since no linear boundary can separate the classes.
Naive Bayes classification
Make the strong assumption that all the predictors are statistically independent within each class. Naive Bayes has high bias due to the independence assumption, but low variance because it does not try to fit the data too closely.
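Under this assumption the class-conditional density factorizes over the p predictors, so the posterior is proportional to the prior times a product of one-dimensional densities:

    f_j(x) = \prod_{k=1}^{p} f_{jk}(x_k), \qquad P(j \mid x) \propto \pi_j \prod_{k=1}^{p} f_{jk}(x_k)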
Logistic Regression
Logistic regression is one of a class of methods known as generalised linear models (GLMs). It provides a means of estimating the probability of a 1 outcome in a binary response based on predictor variables. When using logistic regression for making predictions, there are two different scales on which you can interpret the results:
Response Scale (Probability Scale)
Linear Predictor Scale (Logit Scale)
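A base-R sketch of the two scales (assuming a binary outcome y and a predictor x in a data frame df; the names are illustrative):

    fit <- glm(y ~ x, data = df, family = binomial)

    predict(fit, type = "link")      # linear predictor (logit) scale: log(p / (1 - p))
    predict(fit, type = "response")  # response scale: probabilities p in (0, 1)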