# Data Mining: Association Rules Flashcards

Knowledge Discovery Process

Analysis techniques, methods

Descriptive methods:

Extract interpretable models describing data, for example client segmentation.

Predictive methods:

Exploit some known variables to predict unknown or future values of other variables, for example spam email detection.

Attribute types

Nominal: ID, eye color, zip codes

Ordinal: Rankings, grades, size in {tall, medium, short}

Interval: calendar dates, temperatures in celsius

Ratio: temperature in Kelvin, length, time, counts

Nominal attribute possesses

Distinctness

Ordinal attribute possesses

Distinctness, order

Interval attribute

Distinctness, order, addition

Ratio attribute

Distinctness, order, addition, multiplication

Data quality problems

Noise, outliers, missing values, duplicate data

Important characteristics of structured data

Dimensionality: curse of dimensionality

Sparsity: Only presence counts

Resolution: Patterns depend on the scale

Aggregation

Combining two or more attributes (or objects) into a single one.

Purpose:

Data reduction: reducing the number of attributes or objects (sampling, feature selection, discretization)

Change of scale: for example, cities aggregated into regions or states

Stability: aggregated data tends to be more stable (less variability)
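
A change of scale can be sketched as a group-and-sum over records. This is only an illustration: the `aggregate` helper and the record layout (city, state, sales) are assumptions, not something taken from the cards.

```python
from collections import defaultdict

def aggregate(records, key, value):
    """Sum value(record) over all records that share the same key(record)."""
    totals = defaultdict(float)
    for record in records:
        totals[key(record)] += value(record)
    return dict(totals)

# Hypothetical records: (city, state, sales)
sales = [
    ("city_a", "state_x", 10.0),
    ("city_b", "state_x", 20.0),
    ("city_c", "state_y", 5.0),
]

# Three city-level objects collapse into two state-level objects.
by_state = aggregate(sales, key=lambda r: r[1], value=lambda r: r[2])
```

The result has fewer objects than the input, which is the data-reduction effect; the coarser totals are also what the stability card refers to.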

Sampling

Sampling is necessary because processing the entire data set is often too expensive.

Sampling works if the sample set is representative of the entire dataset.

A sample is representative if it has approximately the same properties (of interest) as the original data set.

Types:

Simple Random: Randomly selected

Without replacement: An object can be taken only once

With replacement: The same object can be taken more than once.

Stratified: Split data into several partitions, take random samples from each partition
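
The sampling types above can be sketched with Python's standard `random` module. The function names and the `key` parameter (which decides the partition of each object) are illustrative assumptions.

```python
import random
from collections import defaultdict

def simple_random_sample(data, k, replacement=False, seed=None):
    """Draw k objects uniformly at random from data."""
    rng = random.Random(seed)
    if replacement:
        # The same object may be drawn more than once.
        return rng.choices(data, k=k)
    # Each object can be drawn at most once (requires k <= len(data)).
    return rng.sample(data, k)

def stratified_sample(data, key, k_per_stratum, seed=None):
    """Partition data by key(item), then sample randomly inside each partition."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Small partitions contribute all of their members.
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample
```

Taking the same number of objects from every partition is one possible design; drawing a number proportional to each partition's size is another common choice.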

Dimensionality reduction

When dimensionality increases, data becomes increasingly sparse in the space it occupies, and the notions of distance and density between points become less meaningful (curse of dimensionality). Dimensionality reduction techniques counter this:

Principal component analysis (PCA): find a projection that captures the largest amount of variation in the data.

Singular value decomposition (SVD)

Feature subset selection: remove redundant features and irrelevant features

Feature subset selection techniques

Brute force: try all possible feature subsets as input to the data mining algorithm

Embedded: features are selected naturally as part of the data mining algorithm

Filter: features are selected before the data mining algorithm is run

Wrapper: use the data mining algorithm as a black box to find the best subset
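
A minimal sketch of the brute-force technique follows. The `score` callable stands in for "run the data mining algorithm on this subset and measure its quality"; `toy_score` and the feature names are hypothetical and exist only to make the example runnable.

```python
from itertools import combinations

def brute_force_selection(features, score):
    """Evaluate every non-empty feature subset and keep the best-scoring one."""
    best_subset, best_score = (), float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            current = score(subset)
            if current > best_score:
                best_subset, best_score = subset, current
    return best_subset, best_score

def toy_score(subset):
    """Hypothetical black-box score: reward useful features, penalize subset size."""
    usefulness = {"age": 0.4, "income": 0.5}  # made-up values for illustration
    return sum(usefulness.get(f, 0.0) for f in subset) - 0.05 * len(subset)

best, value = brute_force_selection(["age", "income", "zip", "id"], toy_score)
```

With n features this evaluates 2^n - 1 subsets, which is why brute force is only feasible for small n. A wrapper approach treats the algorithm as a black box in the same way but searches only part of the subset space.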

Feature Creation

Create new attributes that represent the important information in the data set better than the original attributes.

Feature Extraction: domain specific

Mapping Data to new space: for example Fourier Transform

Feature Construction: combine features

Discretization

Split a continuous attribute domain into discrete intervals.

Reduces cardinality of attribute domain.

Techniques:

- N intervals with the same width (incremental, easy to do, can be badly affected by outliers and sparse data)
- N intervals with approximately the same cardinality (non-incremental, good for sparse data and outliers)
- Clustering (fits sparse data well)
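
The first two techniques can be sketched as follows. Both functions return a bin index per value; the names are illustrative, and the equal-frequency version is a simplification that may place equal values in different bins.

```python
def equal_width_bins(values, n):
    """Assign each value to one of n intervals of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n or 1.0  # fall back to 1.0 when all values are equal
    # The maximum value would land in bin n, so clamp it into the last bin.
    return [min(int((v - lo) / width), n - 1) for v in values]

def equal_frequency_bins(values, n):
    """Assign each value to one of n intervals holding roughly the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n // len(values)
    return bins

values = [1, 2, 3, 4, 5, 100]            # 100 is an outlier
print(equal_width_bins(values, 3))       # [0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(values, 3))   # [0, 0, 1, 1, 2, 2]
```

The outlier stretches the equal-width intervals so that almost every value falls into the first bin, while the equal-frequency bins stay balanced.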

Attribute transformation

Function that maps the entire set of values of an attribute to a new set of values.

Example: Normalization (min-max, z-score, decimal scaling)
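
The three normalizations can be sketched as below. This assumes the attribute is not constant (otherwise min-max and z-score divide by zero) and uses the population standard deviation for the z-score; both choices are assumptions of this example.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([10, 20, 30]))            # [0.0, 0.5, 1.0]
print(decimal_scaling([-991, 45, 500])) # [-0.991, 0.045, 0.5]
```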

Similarity/Dissimilarity for simple attributes

Minkowski distance

r = 1: city block (Manhattan, L1) distance; on binary vectors it equals the Hamming distance

r = 2: euclidean distance

r -> ∞: supremum (L∞) distance, the maximum difference between any pair of corresponding components of the vectors.
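
The Minkowski distance between vectors x and y is d(x, y) = (Σ_k |x_k - y_k|^r)^(1/r). A minimal sketch covering the three cases above, where the function name and the use of `math.inf` for the limit case are assumptions of this example:

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r; pass r=math.inf for the supremum distance."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit case: only the largest component difference survives.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))         # 7.0 (city block)
print(minkowski(x, y, 2))         # 5.0 (Euclidean)
print(minkowski(x, y, math.inf))  # 4   (supremum)
```

On binary vectors the r = 1 case counts the positions that differ, for example `minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1)` gives 2.0.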