drCCA software package

drCCA: dimensionality reduction with Canonical Correlation Analysis

Brief summary: drCCA is a simple, efficient, and completely linear data fusion tool based on canonical correlation analysis.

Integrating multiple information sources is an increasingly important task in bioinformatics. Understanding, predicting, or even just efficiently exploring cellular mechanisms requires information from several sources, such as gene expression, protein concentrations, or transcription factor binding. Combining data sources, which can in general have very different representations, is a non-trivial problem, but even partial solutions are useful. Combining several sources is also advantageous when all of them are of the same data type, because it reduces noise, which is often a significant issue in biological experiments that have high dimensionality but relatively few samples.

We offer a simple tool for combining several data sources with co-occurring samples into a single vectorial data set of low dimensionality. The method is motivated by bioinformatics applications, but it is generally usable for data fusion tasks in other fields as well. The method aims to retain the variation that is shared between the original data sources, while reducing the dimensionality by ignoring variation that is specific to any one of the data sources alone. Such source-specific variation is assumed to be either noise or at least less interesting, as it relates to a phenomenon not visible in the other sources even though they contain measurements of the exact same objects.

The drCCA method uses generalized canonical correlation analysis to perform a linear projection of the collection of data sets. Because the method is completely linear, it is fast to compute even for large data sets, making genome-wide fusion possible. The package includes regularization and tools for automatically selecting the final dimensionality of the combined data set.
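To make the idea concrete, here is a minimal sketch of one common way to compute a generalized CCA projection: whiten each data set separately, concatenate the whitened sets, and keep the leading principal components of the result. It is written in Python with NumPy for illustration only and is not the code of the R package. The function names `whiten` and `gcca_fuse`, the small ridge term `reg`, and the toy data are all assumptions made for this example; the package's own regularization and dimensionality selection are not reproduced here.

```python
import numpy as np

def whiten(X, reg=1e-8):
    """Center X (samples x features) and transform it to (near) identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    cov += reg * np.eye(cov.shape[0])  # assumed ridge term, keeps the inverse stable
    evals, evecs = np.linalg.eigh(cov)
    # Symmetric whitening matrix: cov^(-1/2)
    return Xc @ (evecs * evals ** -0.5) @ evecs.T

def gcca_fuse(datasets, n_components):
    """Fuse data sets that share the same samples (rows) into one low-dimensional set."""
    Z = np.hstack([whiten(X) for X in datasets])
    cov = Z.T @ Z / (Z.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1][:n_components]  # largest eigenvalues first
    return Z @ evecs[:, order], evals[order]

# Toy data (hypothetical): two sources measuring the same 500 objects.
# Both contain one shared signal s plus independent, source-specific noise.
rng = np.random.default_rng(0)
n = 500
s = rng.normal(size=(n, 1))
X1 = s @ np.array([[1.0, -0.5, 0.8, 0.3]]) + 0.5 * rng.normal(size=(n, 4))
X2 = s @ np.array([[0.7, 1.2, -0.4]]) + 0.5 * rng.normal(size=(n, 3))

fused, evals = gcca_fuse([X1, X2], n_components=2)
corr = abs(np.corrcoef(fused[:, 0], s[:, 0])[0, 1])
print("fused shape:", fused.shape)
print("leading eigenvalues:", evals)
print("|correlation| of first component with shared signal:", corr)
```

In this formulation every direction within a single whitened data set has unit variance, so an eigenvalue clearly above 1 points to variation that is correlated across the sources. In the toy example only the first component should stand out, since only one signal is shared.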

Publication:

More information on the algorithm can be found in the following publication:

Abhishek Tripathi, Arto Klami and Samuel Kaski. Simple integrative preprocessing preserves what is shared in data sources. BMC Bioinformatics, 2008, 9:111. (Open Access)

If you use the package, please cite the above paper.

Documentation:

The package includes documentation in HTML format.

Package:

The package is now included in the Dependency modeling toolkit, which is the recommended way of obtaining the code. The links below are provided for backwards compatibility.

Copyright

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

Downloading and installing

Download drCCA 1.0 [drCCA_1.0.tar.gz].
Instructions for installing packages in R can be found in the R documentation. The latest version of R is recommended, but the package should also work with older versions.

Support

If you have any comments or bug reports on the package, contact Abhishek Tripathi.
