Clustering

Problem Description

The clustering problem reads as follows: given a dataset of points in some d-dimensional Euclidean space, partition the data into K clusters such that some empirical loss function (performance measure) is minimized. A typical loss function is the sum of the squared distances between the points and their respective cluster centres. Clustering problems are described in Section 8.3 of [7].

Programs

: NCE.m
This program finds a set of cluster means via the CE method with (independent) Normal updating.

Usage:
Call the program from MATLAB, with the following syntax:

[mu,count,score]=NCE(N,g,alpha,k,data,modif,drplot,c,sigma0)

Example: [mu,count,score]=NCE(1000,10,0.7,5,DATA,0,1)

Inputs:
N	-	Number of samples to generate each iteration
g	-	Number of these samples to use to update parameters
alpha	-	Smoothing parameter
k	-	Number of cluster means to find
data	-	The data we are trying to fit means to (Should be n x d,
		where there are n points, and d dimensions)
modif	-	If 1, use modified smoothing,
		otherwise use standard smoothing
drplot	-	If 1, draws the cluster means
		and the data (for 2-dimensions)
c	-	Optional starting centroids
sigma0	-	Optional starting standard deviation

Outputs:
mu	-	The centroids found via the CE method, using Normal updating
		with the parameter set
count	-	The number of iterations taken
score	-	The final score of these centroids

: genNCE.m This program is used internally to generate the cluster means.

: scoreNCE.m This program is used internally to evaluate the performance of a particular set of means against the data.

: MCE.m
This program finds a set of cluster means via the CE, looking at clustering as a ``mincut'' type problem.

Usage:
Call the program from MATLAB, with the following syntax:

x=MCE(N,rho,alpha,k,data)

Example: x=MCE(2000,0.01,0.6,3,DATA)

Inputs:
N	-	Number of samples to generate each iteration
rho	-	The fraction of samples used to update the probabilities
alpha	-	Smoothing parameter
k	-	Number of clusters to assign points to
data	-	The data we are trying to assign to clusters (Should be n x d,
		where there are n points, and d dimensions)

Outputs:
x	-	The best found assignment of the data points

: genMCE.m This program is used internally to assign data points to clusters.

: MCEJ.m
This is slight modification of the above program, using the ``injection'' idea (due to Zdravko Botev). It generally produces superior results to the unmodified MCE method.

Usage:
Call the program from MATLAB, with the following syntax:

x=MCEJ(N,rho,alpha,k,data)

Example: x=MCEJ(1400,0.03,0.75,4,DATA)

Inputs:
N	-	Number of samples to generate each iteration
rho	-	The fraction of samples used to update the probabilities
alpha	-	Smoothing parameter
k	-	Number of clusters to assign points to
data	-	The data we are trying to assign to clusters (Should be n x d,
		where there are n points, and d dimensions)

Outputs:
x	-	The best found assignment of the data points

: scoreMCE.m This program is used internally to evaluate the performance of a particular assignment against the data.

: clNCE.m This is a small program which labels the points in a dataset according to a set of cluster means.

Usage:
Call the program from MATLAB, with the following syntax:

x=clNCE(c,data,k)

Example: x=clNCE(mu,DATA,5)

Inputs:
c	-	A set of cluster means
data	-	The data we are trying to assign to clusters (Should be n x d,
		where there are n points, and d dimensions)
k	-	Number of clusters to assign points to

Outputs:
x	-	The assignment of the data points to clusters

: cMCE.m This is a small program which calculates cluster means from a given cluster labelling.

Usage:
Call the program from MATLAB, with the following syntax:

c=cMCE(x,y,k)

Example: mu=cMCE(x,data,6)

Inputs:
x	-	A labelling of data points
data	-	The data we are trying to fit means to (Should be n x d,
		where there are n points, and d dimensions)
k	-	Number of cluster means

Outputs:
x	-	The cluster means calculated for this labelling of data points

imim.mat - this is the image data file : it contains 2 variables, I : the 20 by 20 subimage (ie the actual image), Ib: the 256 by 25 set of all 5x5 subimages of I.

NCEICJ.m - nce with component upd. and injection

clim.m - create all pxp subimages from a given image (ie to generate Ib, can use this)

mu2im.m - converts the K cluster means back into pxp matrices so its easier to plot.

cetoolbox www user
2004-12-17