4 - Unsupervised learning: clustering and association rules Flashcards
What is data mining?
- methods to extract information from large data sets using algorithms
- considers data where each (denormalized) entry redundantly describes all attributes of an instance
What is an algorithm?
A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer
Some applications of data mining
CRM
- profiles, cross-selling, bundling
Web Optimisation
- click paths, search
Traffic forecasts
Fraud detection
Product Placement
Data mining methods
- data pre-processing, description and reduction
- search explanations through regression
- segmentation through cluster analysis
- reveal relationships through visualization, correlation, association rule learning
- classification, discriminant analysis
- anomaly recognition
- prognosis
Data Mining Methods
Components from …
traditional statistics and data analysis
- e.g. regression, clustering, factor analysis
artificial intelligence
- e.g. machine learning, artificial neural networks, evolutionary algorithms
pattern recognition
- e.g. artificial neurons as identifiers
data base theory and practice
- e.g. association analysis, OLAP macros
What is an attribute?
provides information on one specific characteristic of each instance
e.g. the column stating the patient’s name for each recorded visit
What is an instance?
One specific entry in the data set
e.g. the row describing one patient’s specific visit
What is a data set?
All the data that you plan to analyse
e.g. all patients who visited this hospital during the last 5 years
Denormalised: one long table
Evaluating Data Mining
To evaluate the outcome of data mining efforts, we usually split the data set into three parts:
Training set:
- Use an algorithm to generate a model from the data
Validation set:
- Validate and improve the model
Test set:
- Test the applicability and performance of the model
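The three-way split can be sketched as follows. This is an illustrative sketch: the function name `split_dataset`, the 60/20/20 ratio, and the fixed seed are assumptions for the example, not prescribed by the flashcards.

```python
import random

def split_dataset(instances, train=0.6, valid=0.2, seed=42):
    """Shuffle a data set and split it into training, validation and
    test parts. The 60/20/20 ratio is an illustrative assumption."""
    rng = random.Random(seed)       # fixed seed for a reproducible split
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],                      # train the model
            shuffled[n_train:n_train + n_valid],     # validate/improve it
            shuffled[n_train + n_valid:])            # test final performance

train_set, valid_set, test_set = split_dataset(list(range(100)))
```

Shuffling before splitting matters: if the data set is ordered (e.g. by date), an unshuffled split would give the three parts systematically different instances.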
Supervised learning
For training, validation and test sets, the solution is known a priori
e.g. classification
Unsupervised learning
There is no structured solution (yet)
e.g. clustering, association rules
What is clustering and when is it applied?
Unsupervised segmentation of an instance set based on attributes
Is applied, when instance segments are not known a priori, but the possibility of segmentation is expected
Clustering
Conditions
- instances belong to different segments
- instances are described by multi-dimensional attributes
- attribute values can be quantified, normalized or weighted
Clustering
Unique
Every instance belongs to exactly one cluster
e.g. bank note identification
Clustering
Overlapping
One Instance can belong to several clusters
e.g. expert profiles
Clustering
Probabilistic
Each instance has a certain probability of belonging to each cluster
e.g. customer segments
What is hierarchical clustering?
Clusters on lower aggregation levels are part of clusters on higher aggregation levels
e.g. biological classification by family, genus, species
Hierarchical clustering
Agglomerative clustering
Initially, every instance represents an individual cluster
-> larger clusters emerge through agglomeration
Hierarchical clustering
Divisive clustering
Initially, the whole instance set is a single cluster
-> smaller clusters emerge through division
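The agglomerative variant can be sketched as a repeated merge of the two closest clusters. This is a minimal single-linkage sketch on 1-D points; the function name, the example points and the target cluster count are illustrative assumptions.

```python
def agglomerative(points, k):
    """Single-linkage agglomerative clustering on 1-D points (sketch).
    Start with one cluster per instance and repeatedly merge the two
    closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]       # every instance is its own cluster
    while len(clusters) > k:
        best = None                        # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the two nearest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # agglomerate the pair
        del clusters[j]
    return clusters

result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 2)
```

Divisive clustering runs the same idea in reverse: start with one cluster and split until the desired number is reached.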
Offline vs. online clustering
Offline:
- a complete set of instances is provided before clustering starts
Online:
- instances are added iteratively, clustering can be started when at least three instances are available
Offline vs. online clustering
Online additional info
- the human brain continuously clusters “online”
- growing data sets, e.g. describing customer choices in online retailing, have to be clustered online
- repeating offline algorithms on growing data sets is computationally costly
- streaming clustering processes only the latest n instances
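The sliding-window idea behind streaming clustering can be sketched with a bounded buffer. The class name and the window size n = 4 are illustrative assumptions; a real system would run a clustering step on each returned window.

```python
from collections import deque

class StreamingWindow:
    """Keep only the latest n instances; each new arrival evicts the
    oldest one, so clustering always sees a bounded, current window."""
    def __init__(self, n):
        self.window = deque(maxlen=n)   # deque drops the oldest item itself

    def add(self, instance):
        self.window.append(instance)
        return list(self.window)        # instances a clustering step would see

stream = StreamingWindow(4)
for value in [1, 2, 3, 4, 5, 6]:
    current = stream.add(value)         # only the latest 4 instances remain
```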
k-Means clustering algorithm
Attributes
unique, flat, offline
k-Means clustering algorithm
Steps
1. Let k be the number of desired clusters
2. Randomly select k cluster centroids
3. Assign all instances to the nearest cluster centroid
4. Calculate the means of the attributes of the instances in each cluster
5. Set these means as the new cluster centroids
6. Assign all instances to the nearest cluster centroid
7. If step 6 yields the same assignment as step 3: successful exit
Else: go to step 4 and calculate new means
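The steps above can be sketched in Python. This is an illustrative sketch, not a canonical implementation: the function name `k_means`, the use of tuples for instances, the fixed seed and the example points are assumptions for the example.

```python
import math
import random

def k_means(instances, k, seed=0):
    """Plain k-means on points given as tuples of numbers (sketch):
    pick k random instances as initial centroids, assign every instance
    to the nearest centroid, recompute each centroid as the mean of its
    cluster, and repeat until the assignment no longer changes."""
    rng = random.Random(seed)
    centroids = rng.sample(instances, k)    # random initial centroids
    assignment = None
    while True:
        # assign each instance to the index of its nearest centroid
        new_assignment = [min(range(k),
                              key=lambda c: math.dist(x, centroids[c]))
                          for x in instances]
        if new_assignment == assignment:    # unchanged -> successful exit
            return centroids, assignment
        assignment = new_assignment
        for c in range(k):                  # means become the new centroids
            members = [x for x, a in zip(instances, assignment) if a == c]
            if members:                     # keep old centroid if cluster empty
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(points, 2)
```

On these two well-separated groups the algorithm converges after a few iterations; the empty-cluster guard is one of several possible policies for that corner case.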
k-Means clustering algorithm
Improvements, variations, problem
Improvements:
Do not choose the initial centroids purely at random, but spread them out, picking each new centroid with probability proportional to its squared distance from the centroids already chosen: k-means++
Variations:
Diverse distance measures depending on attribute types
Problem:
What is the best k?
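A common heuristic for choosing k, not covered on the card above, is the elbow method: run k-means for several values of k, plot the within-cluster sum of squares (WCSS) against k, and pick the k where the curve stops dropping sharply. The sketch below only computes the WCSS for a given clustering; the function name, the example points and the hand-picked centroids are illustrative assumptions.

```python
def wcss(instances, centroids, assignment):
    """Within-cluster sum of squared distances: the quantity the elbow
    method plots against k. Lower is tighter; it always decreases as k
    grows, so one looks for the 'elbow' rather than the minimum."""
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[a]))
               for x, a in zip(instances, assignment))

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
# k = 1: a single centroid in the middle of all four points
w1 = wcss(points, [(5.0, 1.0)], [0, 0, 0, 0])
# k = 2: one centroid per obvious pair -> much tighter clusters
w2 = wcss(points, [(0.0, 1.0), (10.0, 1.0)], [0, 0, 1, 1])
```

Here the drop from w1 to w2 is large, while going to k = 3 would gain little: the elbow is at k = 2.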