README for EMMIX-GENE
=====================

(This software is freely available for NON-COMMERCIAL use only.)

EMMIX-GENE consists of three programs: "select-genes", "cluster-genes" and
"cluster-tissues". All are provided as executable files for Linux.

Each program takes three arguments on the command line: the filename of the
microarray data (an N x M matrix whose rows correspond to the N genes and
whose columns correspond to the M tissues), N (the number of rows), and M
(the number of columns).

In the following, we describe for each program the input the user is
required to provide, the function of the program, and the format of the
output files.

select-genes
------------

User-supplied input:

* Number of random and k-means starts for the fitting of mixtures of
  t components to individual genes
* Three random seeds
* Thresholds for the likelihood ratio statistic and minimum cluster size

If the input file containing the microarray data is called "mdata", then
the output files are called "mdata.cut" and "mdata.stats". mdata.cut is a
G x M array whose rows correspond to the G selected genes, and mdata.stats
is a G x 3 array with one row per selected gene. The first column is the
row number of the selected gene in mdata, the second column contains the
likelihood ratio statistic -2 log \lambda for that gene, and the third
column contains the size of the smaller of the two groups when a single
t distribution is tested against a mixture of two.
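
The three columns of the ".stats" file described above can be read back for
inspection, for example to rank the selected genes by their likelihood
ratio statistic. The following is a minimal sketch, not part of EMMIX-GENE.
It assumes the file is plain text with whitespace-separated fields, which
this README does not state. The names GeneStat, parse_stats and rank_by_lrt
are hypothetical.

```python
from typing import List, NamedTuple


class GeneStat(NamedTuple):
    row: int         # row number of the selected gene in the original data file
    lrt: float       # likelihood ratio statistic, -2 log lambda
    min_size: int    # size of the smaller of the two groups


def parse_stats(text: str) -> List[GeneStat]:
    """Parse the contents of a ".stats" file.

    Assumption: fields are separated by whitespace, one gene per line.
    Only the first three fields of each line are used.
    """
    stats = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        stats.append(
            GeneStat(
                row=int(float(fields[0])),
                lrt=float(fields[1]),
                min_size=int(float(fields[2])),
            )
        )
    return stats


def rank_by_lrt(stats: List[GeneStat]) -> List[GeneStat]:
    """Return the genes sorted by decreasing -2 log lambda."""
    return sorted(stats, key=lambda s: s.lrt, reverse=True)
```

To use it, pass in the text of the file, e.g.
parse_stats(open("mdata.stats").read()).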

cluster-genes
-------------

User-supplied input:

* Number of random and k-means starts for the fitting of mixtures of
  normal components to group means
* Number of random and k-means starts for the fitting of common spherical
  components to the selected genes
* Number of groups into which to cluster the selected genes
* Three random seeds
* Option to reorder the tissues as clustered on the basis of the fitted
  group means (yes or no)

If the number of groups, N_0, is 20 and the input file containing the
microarray data is called "input", the output files of "cluster-genes" are
called "input.group0", "input.group1", ..., "input.group19". These files
contain the 20 groups sorted in decreasing order of -2 log \lambda for the
corresponding fitted group means. That is, input.group0 contains the genes
of the group whose fitted group mean has the largest value of
-2 log \lambda. There are also corresponding files "input.list0", ...,
"input.list19" containing the row numbers of the genes (as in the input
file) that are placed in each group.

The program "cluster-genes" also outputs "input.groupmeans", which contains
the fitted group means, sorted in decreasing order of -2 log \lambda. A
file "input.gstats" is also output, which is an N_0 x 3 array. The first
column contains the group number, the second column is the number of genes
in the group, and the third column is the likelihood ratio statistic,
-2 log \lambda, for the corresponding fitted group mean.

Usually, the input for "cluster-genes" will be the ".cut" file produced by
"select-genes". The user can specify a number of rows to use, N_u, that is
less than the actual number of rows in the input file, in which case only
the first N_u rows are read from the input file.

cluster-tissues
---------------

User-supplied input:

* Number of factors
* Number of components
* Number of random and k-means starts
* Three random seeds

The program "cluster-tissues" clusters the columns of the microarray data
matrix and prints the resulting clustering to the screen.

Usually, the input for "cluster-tissues" will be a ".cut", ".groupmeans" or
".group*" file produced by "select-genes" or "cluster-genes". As with
"cluster-genes", the user can specify a number of rows to use, N_u, that is
less than the actual number of rows in the input file, in which case only
the first N_u rows are read from the input file.
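
Because each program needs the number of rows and columns on the command
line, it can help to count them before chaining the programs (for example,
before passing a ".cut" file to "cluster-genes"). The following is a
minimal sketch, not part of EMMIX-GENE. It assumes the data files are plain
text with whitespace-separated values, and it assumes that N_u is given by
passing a smaller row count as the second argument; this README does not
state either point. The names matrix_shape and build_command are
hypothetical.

```python
from typing import List, Optional, Tuple


def matrix_shape(text: str) -> Tuple[int, int]:
    """Return (rows, columns) of a whitespace-delimited matrix.

    Assumption: one matrix row per non-blank line.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("no data rows found")
    n_cols = len(rows[0])
    for number, fields in enumerate(rows, start=1):
        if len(fields) != n_cols:
            raise ValueError(f"row {number} has {len(fields)} columns, expected {n_cols}")
    return len(rows), n_cols


def build_command(
    program: str,
    data_file: str,
    n_rows: int,
    n_cols: int,
    rows_used: Optional[int] = None,
) -> List[str]:
    """Build the argument list: program, filename, N, M.

    Assumption: rows_used (N_u) replaces N on the command line so that
    only the first N_u rows are read.
    """
    if rows_used is not None:
        if not 1 <= rows_used <= n_rows:
            raise ValueError("rows_used must be between 1 and n_rows")
        n_rows = rows_used
    return [program, data_file, str(n_rows), str(n_cols)]
```

The resulting list has the form expected by Python's subprocess.run. How
the remaining user-supplied input (starts, seeds, thresholds) is passed to
each program is not described here, so it is left out of the sketch.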