Clustering


Problem Description

The clustering problem reads as follows: given a dataset of points in some d-dimensional Euclidean space, partition the data into K clusters such that some empirical loss function (performance measure) is minimized. A typical loss function is the sum of the squared distances between the points and their respective cluster centres. Clustering problems are described in Section 8.3 of [7].


Programs

NCE.m
This program finds a set of cluster means via the CE method with (independent) Normal updating.
Usage:
Call the program from MATLAB, with the following syntax:

[mu,count,score]=NCE(N,g,alpha,k,data,modif,drplot,c,sigma0)

Example: [mu,count,score]=NCE(1000,10,0.7,5,DATA,0,1)
Inputs:    
N - Number of samples to generate each iteration
g - Number of these samples to use to update parameters
alpha - Smoothing parameter
k - Number of cluster means to find
data - The data we are trying to fit means to (Should be n x d,
    where there are n points, and d dimensions)
modif - If 1, use modified smoothing,
    otherwise use standard smoothing
drplot - If 1, draws the cluster means
    and the data (for 2-dimensions)
c - Optional starting centroids
sigma0 - Optional starting standard deviation
     
Outputs:    
mu - The centroids found via the CE method, using Normal updating
    with the parameter set
count - The number of iterations taken
score - The final score of these centroids
genNCE.m This program is used internally to generate the cluster means.
scoreNCE.m This program is used internally to evaluate the performance of a particular set of means against the data.
MCE.m
This program finds a set of cluster means via the CE, looking at clustering as a ``mincut'' type problem.
Usage:
Call the program from MATLAB, with the following syntax:

x=MCE(N,rho,alpha,k,data)

Example: x=MCE(2000,0.01,0.6,3,DATA)
Inputs:    
N - Number of samples to generate each iteration
rho - The fraction of samples used to update the probabilities
alpha - Smoothing parameter
k - Number of clusters to assign points to
data - The data we are trying to assign to clusters (Should be n x d,
    where there are n points, and d dimensions)
     
Outputs:    
x - The best found assignment of the data points
genMCE.m This program is used internally to assign data points to clusters.

MCEJ.m
This is slight modification of the above program, using the ``injection'' idea (due to Zdravko Botev). It generally produces superior results to the unmodified MCE method.
Usage:
Call the program from MATLAB, with the following syntax:

x=MCEJ(N,rho,alpha,k,data)

Example: x=MCEJ(1400,0.03,0.75,4,DATA)
Inputs:    
N - Number of samples to generate each iteration
rho - The fraction of samples used to update the probabilities
alpha - Smoothing parameter
k - Number of clusters to assign points to
data - The data we are trying to assign to clusters (Should be n x d,
    where there are n points, and d dimensions)
     
Outputs:    
x - The best found assignment of the data points

scoreMCE.m This program is used internally to evaluate the performance of a particular assignment against the data.
clNCE.m This is a small program which labels the points in a dataset according to a set of cluster means.
Usage:
Call the program from MATLAB, with the following syntax:

x=clNCE(c,data,k)

Example: x=clNCE(mu,DATA,5)
Inputs:    
c - A set of cluster means
data - The data we are trying to assign to clusters (Should be n x d,
    where there are n points, and d dimensions)
k - Number of clusters to assign points to
     
Outputs:    
x - The assignment of the data points to clusters
cMCE.m This is a small program which calculates cluster means from a given cluster labelling.
Usage:
Call the program from MATLAB, with the following syntax:

c=cMCE(x,y,k)

Example: mu=cMCE(x,data,6)
Inputs:    
x - A labelling of data points
data - The data we are trying to fit means to (Should be n x d,
    where there are n points, and d dimensions)
k - Number of cluster means
     
Outputs:    
x - The cluster means calculated for this labelling of data points

imim.mat - this is the image data file : it contains 2 variables, I : the 20 by 20 subimage (ie the actual image), Ib: the 256 by 25 set of all 5x5 subimages of I.

NCEICJ.m - nce with component upd. and injection

clim.m - create all pxp subimages from a given image (ie to generate Ib, can use this)

mu2im.m - converts the K cluster means back into pxp matrices so its easier to plot.



cetoolbox www user
2004-12-17