Treebic 1.0 - Software package for hierarchical biclustering. Source code for the C++ implementation of a Gibbs sampler.


Citing the package:
J. Caldas and S. Kaski.
Hierarchical Generative Biclustering for MicroRNA Expression Analysis.
Proceedings of the 14th International Conference on Research in Computational Molecular Biology (RECOMB), 2010.

Contact: José Caldas, jose.caldas@tkk.fi

=================
Table of contents
=================
1 - Preprocessing the data matrix
2 - Compiling
3 - Running the software
3.1 - Command-line options
3.2 - Output
4 - Change log
=================

1 - Preprocessing the data matrix

Please perform gene-wise standardization (i.e. standardize each gene, or row, of the expression matrix) before running treebic, so that the data better match the model assumptions.
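
The following is a minimal sketch of one common form of gene-wise standardization, assuming that "standardization" means centering each gene (row) to zero mean and scaling it to unit sample standard deviation. The function name standardizeRows and the handling of constant rows are illustrative assumptions; they are not part of treebic.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Standardize each row (gene/feature) of a features-by-clients matrix in place:
// subtract the row mean and divide by the row's sample standard deviation.
// Rows with zero variance are only centered, to avoid division by zero
// (this handling is an assumption, not something the package prescribes).
void standardizeRows(std::vector<std::vector<double>>& expr) {
    for (auto& row : expr) {
        const std::size_t n = row.size();
        if (n < 2) continue;  // too few values to estimate a standard deviation

        double mean = 0.0;
        for (double x : row) mean += x;
        mean /= static_cast<double>(n);

        double ss = 0.0;  // sum of squared deviations from the mean
        for (double x : row) ss += (x - mean) * (x - mean);
        const double sd = std::sqrt(ss / static_cast<double>(n - 1));

        for (double& x : row) {
            x -= mean;
            if (sd > 0.0) x /= sd;
        }
    }
}
```

The standardized matrix would then be written out in the features * clients layout expected by --exprFn.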

2 - Compiling

Type 'make' at the command prompt. Note that treebic requires the GNU Scientific Library (GSL).


3 - Running the software

See the example/ folder for an example of how to run treebic and what its output looks like.


3.1 - Command-line options

Treebic has the following *mandatory* command-line options:

--nClients - Number of clients in the model.
--nFeatures - Number of features in the model.
--treeDepth - The depth of the tree. A depth of one makes the tree have only a root node.
--burnIn - Number of iterations in the initial burn-in sampling phase.
--nIter - Number of iterations after the burn-in phase.
--nIterGamma - Number of Gibbs iterations to use when sampling gamma through an auxiliary variable scheme. Since sampling gamma is fast compared to the rest of the model, a safe, relatively high value such as 100 can be used.
--clientInitType - Way to initialize the tree structure. If 0, the tree is initialized by putting each client in its own unique leaf. If 1, the tree is initialized by sampling from the nested Chinese restaurant process prior. 
--alpha - Hyperparameter for the Beta prior related to edge lengths.
--beta - Hyperparameter for the Beta prior related to edge lengths.
--aGamma - Hyperparameter related to the Gamma prior for the random variable gamma.
--bGamma - Hyperparameter related to the Gamma prior for the random variable gamma.
--aV - Hyperparameter related to the Gamma prior for the variance variable in each group.
--bV - Hyperparameter related to the Gamma prior for the variance variable in each group.
--exprFn - Path to the expression data file. The file should have one "feature" in each row and one "client" in each column (i.e. genes * conditions). It should contain only the expression data, i.e. no row or column labels; these go in the separate label files.
--clientsFn - Path to client labels file. Should include one label per row.
--featuresFn - Path to feature labels file. Should include one label per row.
--resultsDir - Directory where the results files are stored (see section 3.2).
--nFeatureScans - Number of times each feature is sampled in a single Gibbs iteration. Set to 5 in the original publication.
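
Since --nFeatures and --nClients must agree with the layout of the file given to --exprFn, a quick shape check before a long run can be useful. The following is a minimal sketch of such a check. It assumes that values are separated by whitespace (tabs or spaces), which this README does not specify; the function name hasExpectedShape is hypothetical and not part of treebic.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Returns true if the stream holds exactly nFeatures non-blank rows,
// each with exactly nClients whitespace-separated numeric values.
// Assumption: the expression file is whitespace-delimited.
bool hasExpectedShape(std::istream& in, std::size_t nFeatures, std::size_t nClients) {
    std::string line;
    std::size_t rows = 0;
    while (std::getline(in, line)) {
        // Skip blank lines (e.g. a trailing newline at the end of the file).
        if (line.find_first_not_of(" \t\r") == std::string::npos) continue;

        std::istringstream fields(line);
        std::size_t cols = 0;
        double value;
        while (fields >> value) ++cols;

        // If parsing stopped before the end of the line, a non-numeric token
        // (such as a label) was found; otherwise check the column count.
        if (!fields.eof() || cols != nClients) return false;
        ++rows;
    }
    return rows == nFeatures;
}
```

In practice the function would be called on a std::ifstream opened on the expression file, with the same values that are passed to --nFeatures and --nClients.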


3.2 - Output

Treebic generates the following tab-delimited results files:

gamma.txt - Each column contains the full sampling process for gamma (the value in the last row of each column is the one that is retained for the next Gibbs sampler iteration). The number of columns equals the total number of Gibbs sampler iterations.
log_prob.txt - The log-probability of each part of the model (gamma, tree structure, features, and expression data), along with the number of nodes in the tree and the acceptance rate for both features and clients. Technically, the acceptance rate of a Gibbs sampler is always one when it is viewed as a special case of the Metropolis-Hastings method. Here, "acceptance rate" means the fraction of sampled variables whose sampled value differs from the value they had before.
tree_clients.txt - Maps each client to a leaf node.
tree_edges.txt - Each row maps a parent node to a leaf node and provides the corresponding estimated edge length.
tree_features.txt - Each row maps a feature to a node number, indicating that the feature switches from 0 to 1 in that node. Each feature may appear in more than one row, indicating that it switches from 0 to 1 in more than one node.
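
To illustrate the "acceptance rate" reported in log_prob.txt, the following minimal sketch computes the fraction of variables whose value changed between two snapshots. It only illustrates the definition given above: the function name changedFraction and the integer-vector representation are assumptions, and this README does not describe how treebic accumulates the quantity internally.

```cpp
#include <cstddef>
#include <vector>

// Fraction of variables whose value after sampling differs from the value
// they held before. Compares positions present in both vectors and
// returns 0 if there is nothing to compare.
double changedFraction(const std::vector<int>& before, const std::vector<int>& after) {
    const std::size_t n = before.size() < after.size() ? before.size() : after.size();
    if (n == 0) return 0.0;

    std::size_t changed = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (before[i] != after[i]) ++changed;
    }
    return static_cast<double>(changed) / static_cast<double>(n);
}
```

A value near zero under this definition means that almost every variable was resampled to the value it already had.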


4 - Change log

Version 1.11, 2/9/10:
  - Added instructions to README file on how to preprocess the data matrix prior to running treebic. 

Version 1.1, 12/7/10: 
  - Removed inconsistencies in how to provide the input data matrix. The format should be features * clients, i.e. genes * conditions.
