README for EMMIX-GENE
=====================

(This software is available freely for NON-COMMERCIAL use only.)

EMMIX-GENE consists of three programs, "select-genes", "cluster-genes"
and "cluster-tissues".  All are provided as executable files for Linux.

Each program takes three arguments on the command line: the filename
of the microarray data (an N x M matrix, where the rows correspond to
the N genes and the columns to the M tissues), N (the number of rows),
and M (the number of columns).
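
As a rough illustration of this calling convention, the following Python
sketch assembles the three command-line arguments for one of the programs.
The helper name `build_command`, the `./` prefix (executable in the current
directory), the file name "mdata" and the matrix dimensions are assumptions
made for the example.  Any further input a program asks for after it starts
is not modelled here.

```python
def build_command(program, data_file, n_rows, n_cols):
    """Return the argument list: program, data file, N (rows), M (columns)."""
    if n_rows <= 0 or n_cols <= 0:
        raise ValueError("N and M must be positive integers")
    # Assumption: the executable sits in the current working directory.
    return [f"./{program}", data_file, str(n_rows), str(n_cols)]


# Example dimensions only: 2000 genes (rows) by 62 tissues (columns).
cmd = build_command("select-genes", "mdata", 2000, 62)
print(" ".join(cmd))

# To actually launch the program from Python, something like this should work:
#   import subprocess
#   subprocess.run(cmd, check=True)
```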

In the following, we describe for each program the input the user is
required to provide, the function of the program, and the format of the
output files.

select-genes
------------

User-supplied input: 
* Number of random and k-means starts for the fitting of mixtures 
of t components to individual genes
* Three random seeds
* Thresholds for the likelihood ratio statistic and minimum cluster size

If the input file containing the microarray data is called "mdata",
then the output files are called "mdata.cut" and "mdata.stats".
mdata.cut is a G x M array where the rows correspond to the G selected
genes, and mdata.stats is a G x 3 array with one row per selected gene.
The first column is the row number of the selected gene in mdata,
the second column contains the likelihood ratio statistic -2 log \lambda
for that gene, and the third column contains the size of the smaller group
when fitting one t-distribution against two.
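
The column layout of the ".stats" file can be read back with a few lines of
Python.  This is only a sketch: it assumes three whitespace-separated numeric
columns per row, as in the column description above, and that the row number
and group size are written as integers.  The function name `parse_stats` and
the sample values are made up for illustration and are not real program
output.

```python
def parse_stats(text):
    """Parse rows of: gene row number, -2 log lambda, smaller-group size."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        rows.append((int(fields[0]), float(fields[1]), int(fields[2])))
    return rows


# Made-up values for illustration only; not real program output.
sample = """\
17 25.4 12
103 9.8 8
250 31.0 5
"""

stats = parse_stats(sample)
# Rank the selected genes by the likelihood ratio statistic, largest first.
ranked = sorted(stats, key=lambda row: row[1], reverse=True)
print([gene for gene, _, _ in ranked])
```

In practice the text would come from reading "mdata.stats" rather than from
an inline string.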

cluster-genes
-------------

User-supplied input:
* Number of random and k-means starts for the fitting of mixtures 
of normal components to group means
* Number of random and k-means starts for the fitting of common
spherical components to the selected genes
* Number of groups into which to cluster the selected genes
* Three random seeds
* Option to reorder the tissues as clustered on the basis of the fitted
  group means (yes or no)

If the number of groups is N_0=20 and the input file containing the
microarray data is called "input", the output files of "cluster-genes" are
called "input.group0, input.group1, ..., input.group19".  These files contain
the 20 groups sorted in decreasing order of -2 log \lambda for the
corresponding fitted group means.  That is, input.group0 contains the genes
of the group whose fitted mean has the highest -2 log \lambda.  There are also
corresponding files "input.list0, ..., input.list19" containing the row
numbers of the genes (as in the input file) which are placed in each group.

The program "cluster-genes" also outputs "input.groupmeans"
which contains the fitted group means, sorted in decreasing order of
-2 log \lambda.  

Also, a file "input.gstats" is output, which is an N_0 x 3 array with one
row per group.  The first column contains the group number, the second column
is the number of genes in the group, and the third column is the likelihood
ratio statistic, -2 log \lambda, for the corresponding fitted group
mean.
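
The naming scheme for the "cluster-genes" output files can be summarised in a
short Python sketch.  It assumes that every output name is the input file
name followed by the suffixes listed above, and that groups are numbered from
0 to N_0 - 1.  The function name `expected_outputs` is hypothetical.

```python
def expected_outputs(prefix, n_groups):
    """List the output file names for a given input name and group count N_0."""
    names = [f"{prefix}.group{i}" for i in range(n_groups)]   # genes per group
    names += [f"{prefix}.list{i}" for i in range(n_groups)]   # gene numbering
    names += [f"{prefix}.groupmeans", f"{prefix}.gstats"]     # summaries
    return names


files = expected_outputs("input", 20)
print(len(files))
print(files[0], files[19], files[20], files[39])
```

A list like this can be used to check that a run produced all of the files
it was expected to.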

Usually, the input for "cluster-genes" will be the ".cut" file produced
by "select-genes".  The user can specify that the number of rows used, N_u,
is less than the actual number of rows in the input file, in which case 
only the first N_u rows are read from the input file.
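
The effect of choosing N_u smaller than the number of rows can be mimicked in
Python as follows.  This sketch only mirrors the described behaviour (keep
the first N_u rows, ignore the rest).  It assumes whitespace-separated
numeric rows, and the function name `first_rows` and the sample matrix are
made up for illustration.

```python
from itertools import islice


def first_rows(lines, n_used):
    """Return the first n_used non-empty rows, each split into float values."""
    rows = (line.split() for line in lines if line.strip())
    return [[float(x) for x in row] for row in islice(rows, n_used)]


# Made-up 4 x 3 matrix (4 genes, 3 tissues), for illustration only.
sample = ["0.1 0.2 0.3", "1.0 1.1 1.2", "2.0 2.1 2.2", "3.0 3.1 3.2"]
subset = first_rows(sample, 2)  # N_u = 2: only the first two genes are kept
print(subset)
```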

cluster-tissues
---------------

User-supplied input:
* Number of factors
* Number of components
* Number of random and k-means starts
* Three random seeds

The program "cluster-tissues" clusters the columns of the microarray data
matrix, and outputs this clustering to the screen.

Usually, the input for "cluster-tissues" will be the ".cut", ".groupmeans"
or ".group*" files produced by "select-genes" and "cluster-genes".  In a
similar manner to cluster-genes, the user can specify that the number of
rows used, N_u, is less than the actual number of rows in the input file,
in which case only the first N_u rows are read from the input file.