Machine Learning Flashcards
Inductive learning
generalize from a given set of (training) examples so that accurate predictions can be made about future examples; learn an unknown function
how to represent a “thing” in machine learning
x: example or instance of a specific object; represented by a feature vector; each dimension is one feature
feature vector representation
extract a feature vector x that describes all attributes relevant to an object; each x is a list of (attribute, value) pairs
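A minimal sketch of the idea, with made-up attribute names: an object is described by (attribute, value) pairs, and the feature vector keeps the values in a fixed attribute order.

```python
# Hypothetical object described by (attribute, value) pairs.
example = [("height_cm", 172.0), ("weight_kg", 68.5), ("eye_color", "brown")]

# The feature vector x keeps just the values, in a fixed attribute order.
x = [value for _, value in example]
print(x)
```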
types of features
numerical features - discrete or continuous
categorical features - no intrinsic ordering
ordinal features - similar to categorical but clear ordering
point in feature vector representation
each example can be interpreted as a point in a D-dimensional feature space, where D is the number of features/attributes
Training set
A training set is a collection of examples (instances), which is the input to the learning process; assume instances are independent and identically distributed. training set = experience given to learning algorithm
iid
independent and identically distributed
Unsupervised learning
training set = x1, …, xn; no “teacher” to show how examples should be handled; tasks: clustering, discovery, novelty detection, dimensionality reduction
goal of clustering
group training samples into clusters such that examples in the same cluster are similar, and examples in different clusters are different
Clustering methods
Hierarchical Agglomerative Clustering
K-means Clustering
Mean Shift Clustering
Hierarchical Clustering General Idea
initially every point is in its own cluster
find the pair of clusters that are the closest
merge the two into a single cluster
repeat
end result: binary tree
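The steps above can be sketched as a few lines of plain Python. This is an illustrative sketch, not an efficient implementation; it assumes points are numeric tuples and uses single-linkage with Euclidean distance, and it stops at a target number of clusters rather than building the full binary tree.

```python
def euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def single_linkage(c1, c2):
    # Closeness of two clusters = shortest cross-cluster point distance.
    return min(euclidean(a, b) for a in c1 for b in c2)

def agglomerative(points, target_k=1):
    # Initially every point is in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the pair of clusters that are the closest...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # ...and merge the two into a single cluster; repeat.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

result = agglomerative([(0, 0), (0, 1), (10, 10), (10, 11)], target_k=2)
```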
How to measure closeness between 2 clusters (hierarchical clustering)
Single linkage, complete-linkage, average linkage
single-linkage
the shortest distance from any member of 1 cluster to any member of another cluster
complete linkage
the largest distance from any member of 1 cluster to any member of another cluster
average linkage
the average distance between all pairs of members, one from each cluster
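The three linkage definitions above can be written directly as code. A sketch assuming numeric tuple points and Euclidean point-to-point distance:

```python
def dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def single_linkage(c1, c2):
    # Shortest distance from any member of one cluster to any member of the other.
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    # Largest distance from any member of one cluster to any member of the other.
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2):
    # Average distance over all cross-cluster pairs.
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```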
How to measure the distance between a pair of examples?
Euclidean, manhattan/city block, hamming
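The three point-to-point metrics, sketched in plain Python (Hamming is typically used for categorical/binary feature vectors of equal length):

```python
def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(p - q) for p, q in zip(a, b))

def hamming(a, b):
    # Number of positions at which the two vectors differ.
    return sum(p != q for p, q in zip(a, b))
```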
Dendrogram
binary tree resulting from hierarchical clustering; the tree can be cut at any level to produce different numbers of clusters
What factors affect the outcome of hierarchical clustering?
features used, range of values for each feature, linkage method, distance metric used, weight of each feature
K-Means Clustering
Specify the desired number of clusters and use an iterative algorithm to find them
K-means clustering general idea
given cluster centers, assign each point to its closest center; given point assignments, set each cluster center to the mean (centroid) of the points in its cluster; repeat until convergence
K-means algorithm
input: x1, …, xn; number of clusters k
select k initial cluster centers c1, …, ck
repeat: for each point x, determine its cluster by finding the closest cluster center; then update each cluster center to the centroid (mean) of its assigned points
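A minimal sketch of the algorithm (Lloyd's iteration), assuming points are numeric tuples; initial centers are drawn at random from the data, and an empty cluster keeps its old center:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k initial cluster centers
    for _ in range(iters):
        # Assignment step: each point joins its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((p[d] - centers[i][d]) ** 2 for d in range(len(p))))
            clusters[i].append(p)
        # Update step: move each center to the centroid (mean) of its cluster.
        new_centers = [
            tuple(sum(p[d] for p in c) / len(c) for d in range(len(c[0]))) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments no longer change
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```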
Distortion
Sum of squared distances of each data point to its cluster center; optimal clustering minimizes distortion (over all possible cluster locations/assignments)
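The definition above translates directly into code. A sketch, assuming numeric tuple points and an `assignment` list mapping each point to the index of its center:

```python
def distortion(points, centers, assignment):
    # Sum of squared distances from each point to its assigned cluster center.
    return sum(
        sum((p[d] - centers[a][d]) ** 2 for d in range(len(p)))
        for p, a in zip(points, assignment)
    )
```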
How to pick k in k-means clustering
distortion always decreases as k grows, so don't simply minimize it; plot distortion against k and pick a value near the elbow of the curve, where adding more clusters stops helping much
Does K-means always terminate?
Yes: each iteration never increases distortion, and there are only finitely many ways of partitioning a finite number of points into k groups, so the algorithm cannot cycle forever