# Data Mining, Classification Flashcards

Approaches to classification

Decision trees

bayesian classification

classification rules

neural networks

k-nearest neighbours

SVM

Classification Requirements

- Accuracy
- Interpretability
- Scalability
- Noise and outlier management

Definition of classification

Given a collection of class labels and a collection of data objects, each labelled with a class label, find a descriptive model that makes it possible to distinguish objects of one class from objects of the others.

Criteria to evaluate classification techniques

- Accuracy
- Efficiency
- Scalability
- Robustness
- Interpretability

Decision Trees, Hunt’s Algo

Dt -> set of training records that reach node t.

Steps:

- If Dt contains records that belong to more than one class, select the best attribute A on which to split Dt and label node t as A; split Dt into smaller subsets and recursively apply the procedure to each subset (see the sketch after this list).
- If Dt contains only records that belong to the same class yt, then t is a leaf node labelled as yt.
- If Dt is empty, then t is a leaf node labelled as the default (most common) class.
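
A minimal Python sketch of this recursion, under simplifying assumptions: records are dictionaries mapping attribute names to values, and the attribute-selection helper uses a simple GINI-based criterion as a stand-in for whatever measure a concrete algorithm uses.

```python
from collections import Counter

def gini(labels):
    """GINI impurity of a list of class labels: 1 - sum_j p(j)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_attribute(records, labels, attributes):
    """Pick the attribute whose multiway split gives the lowest weighted GINI."""
    def split_impurity(a):
        by_value = {}
        for r, y in zip(records, labels):
            by_value.setdefault(r[a], []).append(y)
        return sum(len(ys) / len(labels) * gini(ys) for ys in by_value.values())
    return min(attributes, key=split_impurity)

def hunt(records, labels, attributes, default=None):
    """Build the node reached by the training records Dt (= records/labels)."""
    if not records:                            # Dt is empty -> leaf labelled with the default class
        return {"leaf": default}
    if len(set(labels)) == 1:                  # all records belong to the same class yt -> leaf yt
        return {"leaf": labels[0]}
    if not attributes:                         # nothing left to split on -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}

    A = best_attribute(records, labels, attributes)   # select the best attribute A, label node t as A
    majority = Counter(labels).most_common(1)[0][0]
    node = {"attribute": A, "children": {}}
    for value in set(r[A] for r in records):          # split Dt into smaller subsets and recurse
        idx = [i for i, r in enumerate(records) if r[A] == value]
        node["children"][value] = hunt([records[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != A],
                                       default=majority)
    return node
```

Calling `hunt(records, labels, attributes)` on a small labelled dataset returns a nested dictionary of test nodes and leaves.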

Decision tree induction

Adopts a greedy strategy -> the resulting tree is not guaranteed to be a global optimum.

Issues:

- structure of test conditions (binary splits vs multiway split):
- depends on the attribute type: nominal, ordinal, continuous.
- selection of the best attribute for the split
- stopping condition for the algorithm

Selection of best attribute in decision tree

Attributes that produce a homogeneous class distribution in the children are preferred: the degree of impurity must be low.

Node impurity needs a measure:

- gini index
- entropy
- misclassification error

Different algorithms rely on different measures of impurity.

GINI index

- Used to compute node impurity in decision trees
- Maximum (1 - 1/n_c) when records are equally distributed among all classes.
- Minimum (0) when all records belong to one class.
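
A small sketch of the GINI computation for one node, assuming the class counts at the node are given:

```python
def gini(class_counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from the class counts at node t."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts) if n else 0.0

print(gini([5, 5]))   # equally distributed over n_c = 2 classes -> 1 - 1/2 = 0.5 (maximum)
print(gini([10, 0]))  # all records in one class -> 0.0 (minimum)
```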

Computing quality of a split using GINI

This technique is used in CART, SLIQ, SPRINT

When a node p is split into k partitions (children), the quality of the split is computed as GINI_split = sum_{i=1..k} (n_i / n) * GINI(i), where n_i is the number of records at child i and n is the total number of records at node p.

Categorical attribute:

- For each distinct value gather counts for each class in the dataset
- use the count matrix to make decisions

Continuous attribute:

- Binary decision on one splitting value
- the number of possible split points equals the number of distinct values

- Each splitting value has a count matrix
- Sort the records on the attribute value
- Linearly scan the values, each time updating the count matrix and computing the GINI index
- Choose the split position with the lowest GINI index (see the sketch below)
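
A sketch of the split quality and of the linear scan over a continuous attribute, assuming GINI_split is the weighted average of the children's GINI indexes as in the formula above:

```python
def gini(class_counts):
    """GINI impurity from the class counts at a node."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts) if n else 0.0

def gini_split(children_counts):
    """GINI_split = sum_i (n_i / n) * GINI(child i)."""
    n = sum(sum(counts) for counts in children_counts)
    return sum(sum(counts) / n * gini(counts) for counts in children_counts)

def best_continuous_split(values, labels, classes):
    """Sort on the attribute value, then scan the candidate split points once,
    updating the class counts on each side and keeping the lowest GINI_split."""
    pairs = sorted(zip(values, labels))
    left = {c: 0 for c in classes}
    right = {c: 0 for c in classes}
    for _, y in pairs:
        right[y] += 1
    best_score, best_threshold = float("inf"), None
    for i in range(len(pairs) - 1):
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        if pairs[i + 1][0] == value:
            continue  # a threshold only makes sense between two distinct values
        threshold = (value + pairs[i + 1][0]) / 2
        score = gini_split([list(left.values()), list(right.values())])
        if score < best_score:
            best_score, best_threshold = score, threshold
    return best_threshold, best_score
```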

Entropy impurity measure (INFO)

INFO (entropy of node t): Entropy(t) = - sum_j p(j|t) log2 p(j|t), where p(j|t) is the relative frequency of class j at node t.

- Maximum (log n_c) when records are equally distributed among all classes
- Minimum (0) when all records belong to one class
- Entropy computations are very similar to those of the GINI index.

Information gain:

- measures reduction in entropy achieved because of the split.
- Disadvantage: tends to prefer splits that yield a large number of partitions, each small but pure.

Gain Ratio:

- Adjusts information gain by the entropy of the partitioning; partitionings with higher entropy are penalized (see the sketch below).
- Designed to overcome shortcomings of information gain.
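
A sketch of entropy, information gain, and gain ratio, assuming class counts are given for the parent node and for each child, and taking SplitINFO as the entropy of the partition sizes:

```python
from math import log2

def entropy(class_counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t); maximum log2(n_c), minimum 0."""
    n = sum(class_counts)
    return -sum((c / n) * log2(c / n) for c in class_counts if c) if n else 0.0

def information_gain(parent_counts, children_counts):
    """Reduction in entropy achieved by the split."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children_counts):
    """Information gain penalized by the entropy of the partitioning (SplitINFO)."""
    n = sum(parent_counts)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children_counts if sum(c))
    gain = information_gain(parent_counts, children_counts)
    return gain / split_info if split_info else 0.0
```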

Classification error impurity measure

Measures the classification error made by the current node: Error(t) = 1 - max_j p(j|t).

Same maximum and minimum as the GINI index.
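
A one-function sketch of the error measure, assuming class counts at the node:

```python
def classification_error(class_counts):
    """Error(t) = 1 - max_j p(j|t): 0 for a pure node, 1 - 1/n_c when classes are equally distributed."""
    n = sum(class_counts)
    return 1.0 - max(class_counts) / n if n else 0.0

print(classification_error([5, 5]))   # 0.5 = 1 - 1/2 (maximum)
print(classification_error([10, 0]))  # 0.0 (minimum)
```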

Stopping criteria for tree induction

Stop expanding node when:

- All records belong to the same class
- All the records have similar attribute values
- Early termination:
- pre-pruning
- post-pruning

Overfitting vs Underfitting

Overfitting: model is too complex; training error keeps decreasing while test error starts increasing. Noise in the training data is a typical cause of overfitting.

Underfitting: model is too simple; both training and test set errors are high.

Addressing overfitting

Pre-pruning (early stopping):

- Stop the algorithm before the tree is fully grown
- Typical stopping conditions for a node
- if all instances belong to the same class
- if all the attribute values are the same

- More restrictive conditions:
- Stop if the number of instances is less than some user-specified threshold
- Stop if the class distribution is independent of the available features
- Stop if expanding the current node does not improve impurity measures (a sketch of these checks follows below)
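
A sketch of how these pre-pruning checks could be combined before expanding a node; the threshold parameters `min_records` and `min_impurity_decrease` are illustrative, not taken from a specific algorithm:

```python
def should_stop(labels, min_records, impurity_before, best_impurity_after,
                min_impurity_decrease):
    """Return True if the node should become a leaf instead of being split."""
    if len(set(labels)) <= 1:                  # all instances belong to the same class
        return True
    if len(labels) < min_records:              # too few instances (user-specified threshold)
        return True
    if impurity_before - best_impurity_after < min_impurity_decrease:
        return True                            # the best split does not improve impurity enough
    return False
```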

Post-pruning:

- Grow the decision tree to its full size
- Trim the nodes of the decision tree in a bottom-up fashion
- If the generalization error improves after trimming, replace the sub-tree by a leaf node
- The class label of the leaf node is determined from the majority class of instances in the sub-tree (see the sketch below)
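
A sketch of bottom-up post-pruning on a held-out validation set, reusing the node dictionaries produced by the Hunt sketch above; the validation error stands in for whatever generalization-error estimate a concrete algorithm uses:

```python
from collections import Counter

def classify(tree, record):
    """Route a record down the tree dictionaries from the Hunt sketch."""
    while "leaf" not in tree:
        tree = tree["children"].get(record[tree["attribute"]], {"leaf": None})
    return tree["leaf"]

def errors(tree, records, labels):
    return sum(classify(tree, r) != y for r, y in zip(records, labels))

def post_prune(tree, records, labels):
    """Trim nodes bottom-up: replace a sub-tree by a majority-class leaf
    whenever that does not worsen the error on the validation data."""
    if "leaf" in tree or not records:
        return tree
    # First prune the children recursively, each on the records routed to it
    for value, child in list(tree["children"].items()):
        idx = [i for i, r in enumerate(records) if r[tree["attribute"]] == value]
        tree["children"][value] = post_prune(child,
                                             [records[i] for i in idx],
                                             [labels[i] for i in idx])
    # Then compare the whole sub-tree against a single majority-class leaf
    leaf = {"leaf": Counter(labels).most_common(1)[0][0]}
    if errors(leaf, records, labels) <= errors(tree, records, labels):
        return leaf
    return tree
```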

Issues of decisions tree

- Data fragmentation
- The number of instances gets smaller as you traverse down the tree and can become too small to make any statistically significant decision.

- Handling missing attribute values
- affects how impurity measures are computed
- affects how instances with missing values are distributed to child nodes
- affects how a test instance with missing values is classified

- Search strategy
- Finding optimal decision tree is NP-hard
- The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy.

- Expressiveness
- Decision trees provide an expressive representation for learning discrete-valued functions (example: the parity function)
- Not expressive enough for modelling continuous variables, in particular when the test condition involves only a single attribute at a time

- Tree replication
- the same sub-tree can appear in different branches of the tree, making the model larger and harder to interpret