MBBC (Model-Based Bayesian Clustering)
MBBC v2.0 is based on Booth et al. (2007). This program requires
pre-installation of Ox (installation guide in oxinstall.html, installation program
available at www.doornik.com/download
oxcons.html, and documentation at www.doornik.com/ox/)
and R (http://www.r-project.org/).
Acrobat Reader (http://www.adobe.com/products/acrobat/readstep2.html)
is not required, but it is recommended for viewing the high-quality pictures
in the MBBC report and other related documents.
1. Input arguments
- R program: MBBC detects the location of R.exe automatically.
If MBBC cannot find R.exe on your computer, you may specify it by clicking
the browse button at the end of this line.
- Ox program: MBBC detects the location of oxl.exe
automatically. If MBBC cannot find oxl.exe on your computer, you may specify it
by clicking the browse button at the end of this line.
- Output Directory: Users may specify the directory in which the
analysis results (picture files, cluster memberships of genes in the optimal
partition, report) are saved. By default,
“…/MBBC2.0/output” is used.
- Data file: The data file should contain an n
by r*p matrix, with numbers delimited by spaces. Here, n
is the number of genes, r is the number of replicates, and p
is the number of time points. The first replicate set of gene expression
measures is in the first p columns, the second is in the next p
columns, and so on.
- Number of Time Points,
Number of Genes, and Number of Replications: Once the data file is
specified, MBBC sets Number of Time Points, Number of
Genes, and Number of Replications automatically. By default, MBBC sets Number of Time
Points to the number of columns and Number of Replications to 1.
If the user adjusts Number of Time Points, Number of Replications is
automatically recalculated by dividing the number of columns by Number of Time
Points.
- Registration: Users
may specify how to register each profile. The options are Centering,
Standardization, and No Registration.
- Number of Remaining Genes
after Filtering: A one-way ANOVA model with time as the predictor is fitted
to each profile. MBBC then ranks the genes by F-value and keeps the most
time-varying ones, up to the number specified in Number of Remaining
Genes after Filtering. The MCMC algorithm clusters these remaining gene
profiles.
- Number of Iterations:
You may specify the number of MCMC iterations used to search for the
optimal partition. In practice, the MCMC algorithm may not find the optimal
partition within a reasonable time, because many genes are considered for
clustering even after filtering.
However, the MCMC algorithm can find a suboptimal partition that is
close enough to the optimum from a practical perspective. See Booth et al. (2007) for
a detailed discussion. To determine a proper number of iterations, it is
recommended to run MBBC with a small number of iterations first and examine the
simulation history plots.
- Log(m): The tuning
parameter of Crowley's
prior, log(m), affects the number of clusters in the optimal partition;
the higher the value of log(m), the larger the number of clusters.
- Initial Partition:
Users may specify an initial partition using the kmeans algorithm, a uniformly
random partition, or a file that contains a column vector of group
identification numbers. When the kmeans algorithm is used, the
number of clusters must be specified. Users may choose a number from 1 to
the number of genes, or set it to 10% of the number of genes.
- Smoothing Parameters:
Lambda1 and Lambda2 are the smoothing parameters for the gene-specific and
gene-time-specific random effects, respectively. Users can specify the two smoothing
parameters or ask MBBC to estimate them. The estimation algorithm is briefly introduced in Booth et al.
(2007), and details are in notes-lambda.pdf or note_lambda.png.
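The effect of log(m) on the number of clusters can be illustrated numerically. The sketch below assumes that Crowley's prior has the product-partition form Gamma(m) * m^c / Gamma(n + m) * prod Gamma(n_k), where c is the number of clusters, n_k are the cluster sizes, and n is the number of genes. This form, the function name log_crowley_prior, and the two partitions are assumptions made for illustration; they are not taken from the MBBC source code.

```python
import math


def log_crowley_prior(cluster_sizes, log_m):
    """Log prior of a partition under the ASSUMED form
    Gamma(m) * m**c / Gamma(n + m) * prod(Gamma(n_k))."""
    m = math.exp(log_m)
    n = sum(cluster_sizes)   # total number of genes
    c = len(cluster_sizes)   # number of clusters
    return (math.lgamma(m) - math.lgamma(n + m) + c * log_m
            + sum(math.lgamma(size) for size in cluster_sizes))


# Hypothetical partitions of 12 genes: one cluster versus four clusters of 3.
one_cluster = [12]
four_clusters = [3, 3, 3, 3]

for log_m in (-5.0, 0.0, 5.0):
    diff = (log_crowley_prior(four_clusters, log_m)
            - log_crowley_prior(one_cluster, log_m))
    print(f"log(m) = {log_m:+.1f}: log prior(four) - log prior(one) = {diff:.2f}")
```

Under this assumed form, each unit increase in log(m) raises the log prior of the four-cluster partition relative to the one-cluster partition by 3, the difference in the number of clusters, which agrees with the behaviour described for Log(m).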
2. Buttons on MBBC window
- Load and View Data: Users must
load a data file by clicking this button before clicking “Search for
the Optimal Partition”.
- Filter Data: The user may reduce
the number of genes to be clustered by clicking this button.
- Search for the Optimal Partition: Once
this button is clicked, the MCMC algorithm runs and searches for the optimal
partition.
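The registration and filtering steps that precede the search can be sketched as follows. This is a minimal illustration, not MBBC's implementation: the function names are hypothetical, each gene is assumed to be stored as r replicate profiles of length p, registration is assumed to use the mean and standard deviation over all r*p values of a gene, and the F-value is the standard one-way ANOVA statistic with time point as the factor. The sketch requires at least two replicates; how MBBC handles a single replicate is not covered here.

```python
import statistics


def register(gene, method="centering"):
    """Register one gene given as r replicate profiles, each of length p."""
    values = [x for replicate in gene for x in replicate]
    if method == "none":
        shift, scale = 0.0, 1.0
    elif method == "centering":
        shift, scale = statistics.fmean(values), 1.0
    elif method == "standardization":
        # Fall back to 1.0 so a constant profile does not divide by zero.
        shift, scale = statistics.fmean(values), statistics.stdev(values) or 1.0
    else:
        raise ValueError(f"unknown registration method: {method}")
    return [[(x - shift) / scale for x in replicate] for replicate in gene]


def anova_f(gene):
    """One-way ANOVA F-value with time point as the factor."""
    r, p = len(gene), len(gene[0])
    if r < 2:
        raise ValueError("this sketch needs at least two replicates")
    grand = statistics.fmean([x for rep in gene for x in rep])
    time_means = [statistics.fmean([rep[t] for rep in gene]) for t in range(p)]
    ss_between = r * sum((m - grand) ** 2 for m in time_means)
    ss_within = sum((rep[t] - time_means[t]) ** 2
                    for rep in gene for t in range(p))
    if ss_within == 0.0:
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (p - 1)) / (ss_within / (p * (r - 1)))


def filter_genes(genes, keep):
    """Indices of the `keep` genes with the largest F-values, in input order."""
    ranked = sorted(range(len(genes)),
                    key=lambda i: anova_f(genes[i]), reverse=True)
    return sorted(ranked[:keep])


# Made-up data: 3 genes, 2 replicates, 3 time points.
genes = [
    [[1.0, 1.1, 0.9], [1.1, 0.9, 1.0]],    # nearly flat over time
    [[0.0, 5.0, 10.0], [0.2, 5.1, 9.8]],   # strongly time-varying
    [[1.0, 2.0, 1.0], [2.0, 1.0, 2.0]],    # noisy, no time trend
]
print("kept gene indices:", filter_genes(genes, keep=1))
```

Because the F-value is unchanged by shifting or rescaling a gene's values, the ranking in this sketch does not depend on which registration option is applied first.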
3. Plots
- Whole Data: After registration, a
plot of all profiles is drawn on the View Data panel. Because all pictures are
saved as png files, users can easily edit the plots with paintbrush
(mspaint.exe).
- Remaining Data after Filtering:
After filtering, the subset of profiles from the Whole Data plot that remains is shown.
- Simulation History:
a) The Objective Function plot shows the value of the objective function for
each simulated partition. Users should examine whether the simulation
converges in distribution.
b) The Highest Value of Objective Function plot shows, for each iteration, the
highest value of the objective function found up to that iteration. If the
curve approaches a certain value and does not increase any
further over a large number of iterations, users may conclude that the optimal
partition, or a suboptimal partition that is close enough to the optimal one
in a practical sense, has been found.
c) The Number of Clusters plot shows the number of clusters for each simulated
partition.
d) The second Number of Clusters plot is similar to the Highest Value of Objective
Function plot, except that it shows the number of clusters instead of
the value of the objective function.
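The relationship between plots a) and b), and between c) and d), can be sketched from a simulation trace. The trace values and function names below are made up for illustration, and the reading of plot d) as the number of clusters in the best partition found so far is an assumption.

```python
def best_so_far(objective_values, cluster_counts):
    """Running best objective value and the cluster count of that partition."""
    best_values, best_counts = [], []
    best_value, best_count = float("-inf"), None
    for value, count in zip(objective_values, cluster_counts):
        if value > best_value:  # a strictly better partition was simulated
            best_value, best_count = value, count
        best_values.append(best_value)
        best_counts.append(best_count)
    return best_values, best_counts


def iterations_since_improvement(best_values):
    """Number of trailing iterations without an increase in the best value."""
    last_improvement = 0
    for i in range(1, len(best_values)):
        if best_values[i] > best_values[i - 1]:
            last_improvement = i
    return len(best_values) - 1 - last_improvement


# Hypothetical trace of six iterations (plots a and c).
objective = [-120.0, -95.5, -101.2, -90.3, -90.3, -93.0]
clusters = [10, 8, 9, 7, 6, 7]

best_values, best_counts = best_so_far(objective, clusters)
print("highest value so far:", best_values)          # plot b
print("clusters in best partition:", best_counts)    # plot d, as assumed
print("iterations without improvement:",
      iterations_since_improvement(best_values))
```

A long run without improvement in a real trace is the flat stretch of curve b) that the text above uses as a stopping heuristic.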
4. Report
- In the output directory, MBBC creates a report in pdf (report.pdf) and html
(report.html) formats. The report contains the list of genes remaining after
filtering, the list of genes in each cluster, and the plots.
5. Example
- MBBC
contains two example data sets, which can be loaded by clicking
Example → Example1 (wound.txt): wound.txt contains a 646 by 24
matrix of expression values for 646 genes measured twice at 12 time
points.
Example → Example2 (GC_example.txt): GC_example.txt contains a 7640
by 4 matrix of expression values for 7640 genes measured once at 4
time points.
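The shapes of the example files follow the data file layout described under Input arguments: each row holds one gene, and its r*p values are r consecutive blocks of p time points. The sketch below shows one way to read such a file outside MBBC. The function names and the miniature sample are hypothetical; the only assumption about the format is that it is space-delimited, as stated above.

```python
def split_replicates(row, n_time_points):
    """Split one gene's r*p values into r replicate profiles of length p."""
    if n_time_points <= 0 or len(row) % n_time_points != 0:
        raise ValueError(
            "row length must be a positive multiple of the number of time points")
    return [row[start:start + n_time_points]
            for start in range(0, len(row), n_time_points)]


def parse_data(text, n_time_points):
    """Parse space-delimited text into genes[gene][replicate][time]."""
    genes = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            row = [float(token) for token in line.split()]
            genes.append(split_replicates(row, n_time_points))
    return genes


# Made-up miniature file: 2 genes, 2 replicates, 3 time points (2 by 6 matrix).
sample_text = """0.1 0.5 0.9 0.2 0.4 1.0
1.3 1.1 0.7 1.2 1.0 0.8
"""
genes = parse_data(sample_text, n_time_points=3)
print(len(genes), "genes,", len(genes[0]), "replicates,",
      len(genes[0][0]), "time points")

# The same arithmetic applied to the shapes quoted for the example files.
print("wound.txt replicates:", 24 // 12)
print("GC_example.txt replicates:", 4 // 4)
```

For wound.txt, 24 columns at 12 time points give 2 replicates; for GC_example.txt, 4 columns at 4 time points give 1 replicate, matching the descriptions above.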
Reference
- Joo, Y., Booth, J., Namkoong, Y., and Casella, G. (2007) Model-Based
Bayesian Clustering (MBBC), Technical Report, University of Florida.
- Booth, J., Casella, G., and Hobert, J. (2007) Clustering Using Objective Functions and
Stochastic Search. To appear in Journal of the Royal Statistical Society B.
Available at http://www.stat.ufl.edu/~casella/Papers/clustering07.pdf.