MBBC (Model-Based Bayesian Clustering)


MBBC v2.0 is based on Booth et al. (2007). The program requires Ox (installation guide in oxinstall.html, installation program available at www.doornik.com/download oxcons.html, and documentation at www.doornik.com/ox/) and R (http://www.r-project.org/) to be installed beforehand. Acrobat Reader (http://www.adobe.com/products/acrobat/readstep2.html) is not required, but it is recommended for viewing the high-quality pictures in the MBBC report and other related documents.


1. Input arguments

  • R program: MBBC detects the location of R.exe automatically. If MBBC cannot find R.exe on your computer, you may specify its location by clicking the browse button at the end of this line.

  • Ox program: MBBC detects the location of oxl.exe automatically. If MBBC cannot find oxl.exe on your computer, you may specify its location by clicking the browse button at the end of this line.

  • Output Directory: Users may specify a directory in which the analysis results (picture files, cluster memberships of genes in the optimal partition, report) are saved. By default, “…/MBBC2.0/output” is used.

  • Data file: The data file should contain an n by r*p matrix whose entries are delimited by spaces. Here, n is the number of genes, r is the number of replicates, and p is the number of time points. The first replicate set of gene expression measurements occupies the first p columns, the second occupies the next p columns, and so on.

  • Number of Time Points, Number of Genes, and Number of Replications: Once the data file is specified, MBBC sets Number of Time Points, Number of Genes, and Number of Replications automatically. By default, MBBC sets Number of Time Points to the number of columns in the data file and Number of Replications to 1. If the user adjusts Number of Time Points, Number of Replications is recalculated automatically by dividing the number of columns by Number of Time Points.

  • Registration: Users may specify how to register each profile. The options are Centering, Standardization, and No Registration.

  • Number of Remaining Genes after Filtering: A one-way ANOVA model with time as the predictor is fitted to each profile. Then, based on the F-values, MBBC keeps the most time-varying genes, up to the number specified in Number of Remaining Genes after Filtering. The MCMC algorithm clusters these remaining gene profiles.

  • Number of Iterations: You may specify the number of MCMC iterations used to search for the optimal partition. In practice, the MCMC algorithm may not find the optimal partition within a reasonable time because many genes are considered for clustering even after filtering. However, it can find a suboptimal partition that is close enough from a practical perspective. See Booth et al. (2007) for a detailed discussion. To determine a suitable number of iterations, first run MBBC with a small number of iterations and examine the simulation history plots.

  • Log(m): The tuning parameter log(m) in Crowley's prior affects the number of clusters in the optimal partition; the higher the value of log(m), the higher the number of clusters.

  • Initial Partition: Users may specify an initial partition using the kmeans algorithm, a uniformly random partition, or a file that contains a column vector of group identification numbers. When the kmeans algorithm is used, the number of clusters must be specified. Users may choose a number from 1 to the number of genes, or set it to 10% of the number of genes.

  • Smoothing Parameters: Lambda1 and Lambda2 are the smoothing parameters for the gene-specific and gene-time-specific random effects, respectively. Users can specify both smoothing parameters or ask MBBC to estimate them. The estimation algorithm is briefly introduced in Booth et al. (2007), and details are in notes-lambda.pdf or note_lambda.png.
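
The Registration and filtering steps described above can be illustrated with a minimal Python sketch. This is not MBBC's own code (MBBC runs on Ox and R). The function names register, anova_f, and filter_genes are hypothetical, and two details are assumptions not stated in this document: standardization uses the population standard deviation, and the F-statistic treats the p time points as groups with the r replicates as observations within each group. The sketch needs r of at least 2; how MBBC filters unreplicated data is not described here.

```python
from statistics import mean, pstdev


def register(profile, method="centering"):
    """Register one profile (hypothetical helper, not MBBC's code)."""
    if method == "none":
        return list(profile)
    m = mean(profile)
    if method == "centering":
        return [x - m for x in profile]
    if method == "standardization":
        s = pstdev(profile)  # assumption: population standard deviation
        if s == 0:
            raise ValueError("a constant profile cannot be standardized")
        return [(x - m) / s for x in profile]
    raise ValueError("unknown registration method: " + method)


def anova_f(row, p):
    """One-way ANOVA F-value for one gene, with time as the factor.

    row holds r*p values: the first replicate in the first p entries,
    the second replicate in the next p entries, and so on.
    """
    if len(row) % p != 0:
        raise ValueError("row length must be a multiple of p")
    reps = [row[i:i + p] for i in range(0, len(row), p)]
    r = len(reps)
    df_between, df_within = p - 1, p * (r - 1)
    if df_between == 0 or df_within == 0:
        raise ValueError("this sketch needs p >= 2 and r >= 2")
    grand = mean(row)
    time_means = [mean(rep[t] for rep in reps) for t in range(p)]
    ss_between = r * sum((m - grand) ** 2 for m in time_means)
    ss_within = sum((rep[t] - time_means[t]) ** 2
                    for rep in reps for t in range(p))
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def filter_genes(data, p, keep):
    """Return the row indices of the `keep` genes with the largest F-values."""
    f_values = [anova_f(row, p) for row in data]
    order = sorted(range(len(data)), key=lambda i: f_values[i], reverse=True)
    return order[:keep]
```

For example, filter_genes(data, p, keep) with keep set to the value of Number of Remaining Genes after Filtering returns the indices of the genes that would be passed on to clustering under these assumptions.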

2. Buttons on MBBC window

  • Load and View Data: Users must load a data file by clicking this button before clicking “Search for the Optimal Partition”. 

 

  • Filter Data: The user may reduce the number of genes to be clustered by clicking this button.

  • Search for the Optimal Partition: Once this button is clicked, the MCMC algorithm runs and searches for the optimal partition.

3. Plots

  • Whole Data: After registration, a plot of all profiles is drawn on the View Data panel. Because all pictures are saved as png files, users can easily edit the plots with Paintbrush (mspaint.exe).

  • Remaining Data after Filtering: After filtering, the subset of profiles from the Whole Data plot that remain is shown.

  • Simulation History:
    a) The Objective Function plot shows the value of the objective function for each simulated partition. Users should check whether the simulation has converged in distribution.
    b) The Highest Value of Objective Function plot shows, for each iteration, the highest value of the objective function found up to that iteration. If the curve levels off at a certain value and does not increase for a large number of iterations, users may conclude that the optimal partition, or a suboptimal partition that is close enough to the optimal one in a practical sense, has been found.
    c) The Number of Clusters plot shows the number of clusters in each simulated partition.
    d) The second Number of Clusters plot is similar to the Highest Value of Objective Function plot, except that it shows the number of clusters instead of the value of the objective function.
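
The relationship between the simulation history plots can be sketched in a few lines of Python. This is only an illustration, not MBBC's code. It assumes that plot d) shows the number of clusters in the best partition found so far, which is one reading of the description above. The function names best_so_far and iterations_since_improvement are hypothetical.

```python
def best_so_far(objective, n_clusters):
    """Running best objective value and the cluster count of that partition.

    objective[i] and n_clusters[i] describe the partition simulated at
    iteration i (the quantities shown in plots a and c).
    """
    best_values, best_clusters = [], []
    best_value, best_k = float("-inf"), None
    for value, k in zip(objective, n_clusters):
        if value > best_value:
            best_value, best_k = value, k
        best_values.append(best_value)   # curve for plot b
        best_clusters.append(best_k)     # curve for plot d (assumed reading)
    return best_values, best_clusters


def iterations_since_improvement(best_values):
    """Number of iterations at the end of the run with no new best value."""
    last = best_values[-1]
    count = 0
    for value in reversed(best_values):
        if value != last:
            break
        count += 1
    return count - 1
```

A long flat stretch at the end of the running-best curve, that is, a large value from iterations_since_improvement, corresponds to the levelling-off behaviour to look for in plot b).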

 

4. Report

  • In the output directory, MBBC creates a report in pdf (report.pdf) and html (report.html) formats that contains the list of genes remaining after filtering, the lists of genes in each cluster, and the plots.

 

5. Example

  • MBBC includes two example data sets, which can be loaded by clicking
    Example → Example1 (wound.txt): wound.txt contains a 646 by 24 matrix of expression values for 646 genes measured twice at each of 12 time points.
    Example → Example2 (GC_example.txt): GC_example.txt contains a 7640 by 4 matrix of expression values for 7640 genes measured once at each of 4 time points.
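
The dimensions of the two example files follow from the data file layout described in Section 1 (an n by r*p matrix with each replicate stored in p consecutive columns). The short Python sketch below checks that arithmetic. The helper names parse_row, replications, and column_index are hypothetical and are not part of MBBC.

```python
def parse_row(line):
    """Parse one space-delimited line of the data file into numbers."""
    return [float(token) for token in line.split()]


def replications(n_columns, n_time_points):
    """Number of replicates r implied by r*p columns and p time points."""
    if n_time_points <= 0 or n_columns % n_time_points != 0:
        raise ValueError("columns must be a positive multiple of time points")
    return n_columns // n_time_points


def column_index(replicate, time_point, n_time_points):
    """Zero-based column holding a given replicate and time point.

    Replicate 0 occupies columns 0..p-1, replicate 1 columns p..2p-1, etc.
    """
    return replicate * n_time_points + time_point


# wound.txt: 24 columns, 12 time points
print(replications(24, 12))
# GC_example.txt: 4 columns, 4 time points
print(replications(4, 4))
```

With these definitions, the second replicate of wound.txt starts at zero-based column column_index(1, 0, 12), immediately after the 12 columns of the first replicate.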


 
Reference