Multivariate multi-way analysis of multi-source data, 'multiWayCCA'
Copyright (C) 2010-2011 Tommi Suvitaival and Ilkka Huopaniemi

LICENCE

This file is part of multiWayCCA.

multiWayCCA is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

multiWayCCA is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with multiWayCCA. If not, see .

CITING

If you use the software, please cite the following publication:
Ilkka Huopaniemi, Tommi Suvitaival, Janne Nikkilä, Matej Oresic, and Samuel Kaski. Multivariate multi-way analysis of multi-source data. Bioinformatics, 26:i391–i398, 2010. (ISMB 2010).

CONTACT INFORMATION

Multivariate multi-way analysis of multi-source data, 'multiWayCCA'.
Tommi Suvitaival, tommi.suvitaival@tkk.fi
Ilkka Huopaniemi, ilkka.huopaniemi@tkk.fi
Aalto University School of Science
Department of Computer and Information Science
Helsinki Institute for Information Technology HIIT
http://research.ics.tkk.fi/mi/

DOCUMENTATION 02.02.2011

INTRODUCTION

This package includes:
-source code for the R implementation of the multi-way, multi-source analysis model
-a script for carrying out the analysis

A description of the model is available at http://bioinformatics.oxfordjournals.org/cgi/content/full/26/12/i391?ijkey=N4W0U0CjpBGUYzx&keytype=ref .
This package is available at http://www.cis.hut.fi/projects/mi/software/multiWayCCA/ .

The implementation is based on Arto Klami's Bayesian CCA implementation in R. Inference is done with Gibbs sampling.

The R platform is available at http://www.r-project.org/ .
The package has been tested on a Linux platform with R version 2.12.0.

THE SCRIPT FILE

1-to-4-way, 2-source analysis
-script file multiWayCCA-110201-example.R

Steps in preparing the package for new experiments
1. either provide your data in the format of the example, or write your own procedure for loading the data
2. set the sampling parameters to suit your needs

IMPORTING DATA

The package assumes that the user provides a data set of two sources ('dataXfile', 'dataYfile') and a set of covariates ('covariatesFile'). The sources and covariates are assumed to have matched samples, but the variables may differ between the sources. The current implementation supports 1 to 4 covariates.

The data is assumed to be provided in csv-formatted files, where samples are arranged as columns and variables as rows. In the covariate file, the rows correspond to positive integer-valued covariates instead of ordinary variables. The number of columns must therefore be equal in all three data tables. The model assumes that the samples are matched and in the same order across all three tables.

The first row and the first column of each file are assumed to contain the sample names and the variable names, respectively. By default, the cells of the csv table are separated by the character ',' and the decimal separator for the values is '.'.

Toy data

The package includes an example 2-way, 2-source data set (files dataX.csv, dataY.csv and covariates.csv in the data subfolder). The example data contains 506 samples (columns), with 200 variables (rows) in source X and 10 variables (rows) in source Y. The first row and the first column of the data tables contain the sample names and the variable names, respectively.

In this toy data, only covariates 'a' and 'b' have values above 1, which in effect means that only those two covariates are active. As all samples have value 1 for covariates 'c' and 'd', the effects of those covariates are not estimated in the example analysis.
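
The file layout described above can be checked before running an analysis. The following is a minimal sketch, not part of the package: the helper 'readTable()' and the three tiny tables written to a temporary folder are hypothetical stand-ins, and only the standard R function 'read.table()' is assumed.

```r
# Hypothetical helper (not part of multiWayCCA): read one csv table with
# samples as columns, variables as rows, and names in the first row/column.
readTable <- function(file) {
  as.matrix(read.table(file, header = TRUE, row.names = 1,
                       sep = ",", dec = "."))
}

# Write three tiny made-up tables to a temporary folder as stand-ins.
tmp <- tempfile("toy")
dir.create(tmp)
writeLines(c(",s1,s2,s3", "x1,0.5,1.2,2.0", "x2,1.1,0.3,0.9"),
           file.path(tmp, "dataX.csv"))
writeLines(c(",s1,s2,s3", "y1,3.0,2.5,4.1"),
           file.path(tmp, "dataY.csv"))
writeLines(c(",s1,s2,s3", "a,1,2,2", "b,1,1,2", "c,1,1,1", "d,1,1,1"),
           file.path(tmp, "covariates.csv"))

X <- readTable(file.path(tmp, "dataX.csv"))
Y <- readTable(file.path(tmp, "dataY.csv"))
covariates <- readTable(file.path(tmp, "covariates.csv"))

# All three tables must have the same samples (columns) in the same order.
stopifnot(identical(colnames(X), colnames(Y)),
          identical(colnames(X), colnames(covariates)))

# Covariates must be positive integers; level 1 is the base level.
stopifnot(all(covariates >= 1), all(covariates == round(covariates)))

# A covariate is active only if some sample has a level above 1.
activeCovariates <- rownames(covariates)[apply(covariates, 1, max) > 1]
print(activeCovariates)
```

In this made-up example, covariates 'c' and 'd' are constant at level 1, mirroring the way the toy data pads unused covariates.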
User's data

The user may provide their own data sets by defining the names of the data files to be imported in the script ('dataXfile', 'dataYfile' and 'covariatesFile').

Covariates may have positive integer values (1, 2, ...). The base level of a covariate is 1, and on this level no covariate effects are estimated. In the model, the other levels are compared to this base level. In the results, the covariates are named 'a', 'b', 'c' and 'd'. If the covariate data includes fewer than four covariates, the user should still provide the missing covariates, with value 1 set for all samples (see the toy data for an example).

If the user does not want to provide the names of samples and/or variables, this can be taken into account by setting the arguments 'header=FALSE' (samples) and 'row.names=NULL' (variables) of the function 'read.table()' in the script.

PARAMETERS

In the script file, the user may set the following parameters:

'path' defines the root folder of the package. The package will not work unless the user sets it correctly. When the path is set correctly, the source code R files will be found in the subfolder 'path/sourcecode/'.

'runId' is meant to be a unique name for each analysis. Results of the analysis are saved into the folder 'path/results/runId'. The script file is assumed to be named 'runId.R' and to be found at 'path/scripts/'. The user is recommended to follow this convention, as the script file will then be correctly copied to the results directory for later reference.

'dataXfile' and 'dataYfile' are the names of the files for sources X and Y. 'covariatesFile' is the name of the file for the covariates. The script assumes that the files are found in the subfolder 'path/data/'.

'NburnIn' is the length of the burn-in phase for the MCMC chain.

'Niterfinal' is the number of Gibbs samples to be drawn after the burn-in for estimating the underlying posterior distribution.

'nXlat' and 'nYlat' are the numbers of view-specific latent components.
The data variables will be clustered into the number of clusters defined by these values. However, clustering is optional for the model. The user can switch clustering off for either source by setting the corresponding parameter value to 'NA' (without the quotation marks). It should be kept in mind, though, that the dimensionality reduction, carried out here by clustering, is essential for the analysis of high-dimensional data. Clustering should be switched off only if the number of variables in the source is considerably smaller than the number of samples.

'takeLog' decides whether a log transform is computed for the data sources X and/or Y before the actual analysis ('TRUE'/'FALSE').

'doPlotting' decides whether the results are saved both numerically and visually ('TRUE') or whether only numerical samples are saved into an RData file ('FALSE').

'sampleEff' is a list of parameters which determine which of the ANOVA-type covariate effects are estimated by the model ('TRUE'/'FALSE'). 'A' is the main effect of covariate 'a', 'AB' the interaction effect of covariates 'a' and 'b', 'ABC' the interaction effect of covariates 'a', 'b' and 'c', etc. In the case of missing covariates, the corresponding effects are automatically left out of the model.

PACKAGE AND DATA LOADING

After the parameters section in the script, the source code files of the package are loaded into R using the 'loadSource()' function. This part should not be edited by the user.

The data files are imported using the standard 'read.table()' function of R. If the user needs to change this procedure for importing the data, it can be done by replacing the calls to 'read.table()'.

SAMPLING

The last line of the script calls the main function of the package. The data and all the parameters listed above are provided as arguments to the function.

The first stage of sampling is the burn-in, which lets the sampler converge to the proper posterior distribution.
The user should make sure that the burn-in is long enough. The burn-in is followed by additional sampling from the posterior distribution. The posterior samples are returned by the function and, in the script, saved into the list 'result'. The result is also saved into the .RData file 'path/results/runId/results.RData', where it can be found in the list named 'posterior'.

RESULTS

Results of the analysis are written into the subfolder 'path/results/runId/' if the parameter 'doPlotting' is set to 'TRUE'. Each analysis produces figures, of which the most important are the series of estimated covariate effects (eff-...png), the projections onto source-specific latent variables (Wx.png and Wy.png), and the lists of cluster assignments for the variables (Vx.txt and Vy.txt).

For the effect plots, the type of the effect is encoded in the file name. For instance, 'eff-ABC-a2b3.png' is the interaction effect of covariates 'a', 'b' and 'c', when 'a' and 'b' are fixed to levels 2 and 3, respectively. The interaction effect is then plotted across all levels of covariate 'c'. When any of the covariates has value 1, the effect is automatically at value zero (the base line). These effects are plotted as well, though, to give the user a clear picture of which effects are estimated and which are not.

The plots provide a 95 % posterior interval for the latent variables of the model. The lines in a boxplot, in order from bottom to top, are the 2.5, 25, 50, 75 and 97.5 % quantiles of the posterior distribution estimated by the sampler.

When the clustering option is on, the plots 'X-cor-...png' and 'Y-cor-...png' provide correlation matrices of the variables in the clusters, computed using the samples in the category encoded in the file name. For instance, 'X-cor-a2b3c2.png' shows the correlation matrices computed using samples that have covariate levels 'a=2', 'b=3' and 'c=2'. The plots also show the number of variables in each cluster and the average correlation within the clusters over the sampling procedure.
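
The five boxplot lines described above correspond to posterior quantiles, which can be computed with the standard R function 'quantile()'. The following is a minimal sketch under stated assumptions: the vector 'samples' is simulated here as a stand-in, because the internal layout of the 'posterior' list is not described in this document.

```r
set.seed(1)

# Simulated stand-in for the Gibbs samples of one latent effect
# (assumption: real samples would be taken from the 'posterior' list).
samples <- rnorm(1000, mean = 0.8, sd = 0.3)

# Quantiles drawn in the boxplots, from the bottom line to the top line.
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
q <- quantile(samples, probs = probs)
print(q)

# The outermost two quantiles delimit the 95 % posterior interval.
interval95 <- unname(q[c(1, 5)])
print(interval95)
```

With real output, the same call would be applied separately to the samples of each effect and level.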