PeakANOVA: Stronger findings from mass spectral data through multi-peak modeling
Readme file

Copyright 2013 Tommi Suvitaival
Email: tommi.suvitaival@aalto.fi

This file is part of PeakANOVA.

PeakANOVA is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

PeakANOVA is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with PeakANOVA. If not, see .


CITING

If you use this software, please cite the following publication:

Tommi Suvitaival, Simon Rogers, and Samuel Kaski. Stronger findings from mass spectral data through multi-peak modeling. Submitted.


CONTACT INFORMATION

Tommi Suvitaival
tommi.suvitaival@aalto.fi
Helsinki Institute for Information Technology HIIT
Department of Information and Computer Science
Aalto University

Simon Rogers
simon.rogers@glasgow.ac.uk
School of Computing Science
University of Glasgow

Samuel Kaski
samuel.kaski@aalto.fi
Helsinki Institute for Information Technology HIIT
Department of Information and Computer Science
Aalto University
and
Helsinki Institute for Information Technology HIIT
Department of Computer Science
University of Helsinki

http://research.ics.aalto.fi/mi/


DOCUMENTATION

INTRODUCTION

This package includes:
-Source code for the R implementation of the PeakANOVA model
-An example script for generating and analysing simulated data

The R software environment is available at http://www.r-project.org/ .

The package has been tested on a Linux PC running R version 3.0.1.


EXAMPLE SCRIPT

File name: 'peakANOVA-example_script.R'

Description
-An example script for running the PeakANOVA analysis.

Functionality
1) Load the source code files
2) Set the default parameters of the model
3) Generate simulated data
4) Normalize the intensity data
5) Infer the clusters
6) Infer the covariate effects on the clusters
7) Compute an alternative approach
8) Print results

Instructions
-In the script, set 'path.source' to match the file path of the package source code.
-Once the path is set correctly, the script runs automatically and produces the example results with the default settings.


FUNCTIONS

Introduction
-The functions are listed in the order in which they appear in the example script file.


loadSourcePeakAnova()

Description
-A function for loading the source files of the package.

Arguments
-path: File path of the source code.

Value
-TRUE if successful.


getDefaultParamPeakAnova()

Description
-A function for generating the default parameter setting for the PeakANOVA model.

Arguments
-None

Value
-param: A list of parameters for the PeakANOVA model.


generateSimulatedDataPeakAnova()

Description
-A function for generating simulated data for the PeakANOVA model.

Arguments
-effects.covariate.a: Effects of covariate 'a'. A vector of length K with real values, where K is the number of clusters (compounds). The length of the vector determines the number of clusters in the generated data.
-N.variables.per.cluster: Number of variables (peaks) in each cluster. A positive integer value.
-N.samples.per.category: Number of samples in each ANOVA category (i.e., the number of samples that share the same level of covariate 'a'). A positive integer value.
-sigma: Noise variance level. A positive real value.
-p.spike.gen.inside: Probability of a missing value in a peak shape correlation matrix, for a pair of peaks in the same cluster. A real value between 0 and 1.
-p.spike.gen.outside: Probability of a missing value in a peak shape correlation matrix, for a pair of peaks in different clusters. A real value between 0 and 1.
-shapes.beta.gen.inside: Parameters of the beta distribution that defines the likelihood of an observed value in a peak shape correlation matrix, for a pair of peaks in the same cluster. A vector of length 2 with non-negative real values.
-shapes.beta.gen.outside: Parameters of the beta distribution that defines the likelihood of an observed value in a peak shape correlation matrix, for a pair of peaks in different clusters. A vector of length 2 with non-negative real values.

Value
-covariates: Covariate levels of samples. A list of vectors 'a', 'b' and 'c' corresponding to covariates with the same names. Each vector is of length N with positive integer values, where N is the total number of samples.
-Q: Peak shape correlations data. An array of real values between -1 and 1 or missing values (NA). The array has dimensions NxPxP, where N is the total number of samples and P is the total number of variables (peaks).
-Q.dbeta.log: Logarithmic likelihood of observed peak shape correlations data. A list of two matrices 'inside' and 'outside' containing the log-likelihoods of peak shape correlations in the same and in different clusters, respectively, summed over all samples. Both matrices are real-valued and are of dimensionality PxP, where P is the total number of variables (peaks).
-V.true: Ground-truth clustering of variables (peaks). A matrix with values 0 and 1 with one non-zero value on each row indicating the cluster assignment. The dimensionality of the matrix is PxK, where P is the total number of variables and K is the number of clusters.
-X: The intensity (peak height) data. A matrix with dimensions PxN, where P is the total number of variables and N is the total number of samples.


normalizeDataByControlPopulation()

Description
-A function for normalizing (i.e., Z-transforming) the intensity data based on the control samples.

Arguments
-X: The intensity (peak height) data.
A matrix of real values or missing values with dimensionality PxN, where P is the total number of variables and N is the total number of samples.
-covariates: Covariate levels of samples. A list of vectors 'a', 'b' and 'c' corresponding to covariates with the same names. Each vector is of length N with positive integer values, where N is the total number of samples. Control samples are defined as the samples fulfilling the condition 'a==1', 'b==1' and 'c==1'.
-log.transform: Indicator of whether X is transformed into logarithmic space before the normalization. Either 'FALSE', or a positive real value giving the base of the logarithm (as in log(x,base)).
-zero.mean: Indicator of whether the variable-specific mean of the control population is subtracted from the data ('TRUE') or not ('FALSE').
-unit.scale: Indicator of whether the data are divided by the variable-specific scale (variance) of the control population ('TRUE') or not ('FALSE').

Value
-normalization: Normalization parameter vectors 'mean' and 'scale'. Both vectors are of length P, where P is the total number of variables.
-X: The normalized intensity data. A matrix with the same dimensionality as the argument X.


clusterPeakAnova()

Description
-A function for clustering variables (peaks) by their similarity in shapes.

Arguments
-Q: Peak shape correlations data. An array of real values between -1 and 1 or missing values (NA). The array has dimensions NxPxP, where N is the total number of samples and P is the total number of variables (peaks). Provide either Q.dbeta.log or Q.
-Q.dbeta.log: Logarithmic likelihood of observed peak shape correlations data. A list of two matrices 'inside' and 'outside' containing the log-likelihoods of peak shape correlations in the same and in different clusters, respectively, summed over all samples. Both matrices are real-valued and are of dimensionality PxP, where P is the total number of variables (peaks).
Provide either Q.dbeta.log or Q.
-param: A list of parameters created by the function getDefaultParamPeakAnova().

Value
-V.ls: The least-squares clustering computed over the Gibbs samples. A matrix with values 0 and 1 with one non-zero value on each row indicating the cluster assignment. The dimensionality of the matrix is PxK, where P is the total number of variables and K is the inferred number of clusters.
-V.vec: Gibbs samples of the clustering. A matrix with dimensionality SxP, where S is the number of Gibbs samples and P is the total number of variables. Each row 's' of the matrix is an indicator vector of cluster assignments of the P variables. Returned only if 'param$saveVvec=TRUE'.
-association.avg: Association matrix of the clustered variables computed as an average over the Gibbs samples. A matrix of real values between 0 and 1, indicating the probability of association of pairs of variables. The dimensionality of the matrix is PxP, where P is the total number of variables.


multiWayDR()

Description
-A function for inferring covariate effects from the intensity data given a clustering of variables (peaks).

Arguments
-data: A list of intensity data and covariates. 'X' is the intensity data matrix of real values with dimensionality PxN, where P is the total number of variables and N is the total number of samples. 'covariates' is a list of covariate indicator vectors 'a', 'b' and 'c', each of length N. The vectors have positive integer values, and each value is matched to the column with the same index in 'X'.
-param: A list of parameters created by the function getDefaultParamPeakAnova().

Value
-posterior: A list of Gibbs samples from the model.
-eff: A list of inferred covariate effects with arrays 'A', 'B' and 'C' corresponding to the covariates 'a', 'b' and 'c' (each returned if the corresponding covariate has values above 1 and if param$sampleEff${A,B,C}=TRUE).
The arrays have dimensionality SxKxL{a,b,c}, where S is the number of Gibbs samples, K is the number of clusters and L{a,b,c} are the numbers of levels of covariates 'a', 'b' and 'c', respectively. Additionally, interaction effects 'AB', 'AC', 'BC' and 'ABC' may be returned if the corresponding value in the list 'param$sampleEff' is 'TRUE'.
-effects: Same covariate effects as in the list 'eff', but all saved into a single array. The array has dimensionality LxKxS, where L is the total number of covariate levels and their interaction levels, K is the number of clusters and S is the number of Gibbs samples. The correspondence between the covariate levels and the L rows of the array is given by the list 'design$ind$tab'.
-sigma: A matrix of the inferred variance parameters. The matrix has dimensionality PxS, where P is the number of variables and S is the number of Gibbs samples.
-design:
  -mat: An indicator matrix of the covariate levels that matches the array 'posterior$effects'. The matrix has dimensionality NxL, where N is the total number of samples and L is the total number of covariate levels and their interaction levels.
  -ind$tab: A mapping from the covariate levels to the rows of the array 'posterior$effects'.


preComputeQlikeSum()

Description
-A function for computing the likelihood of peak shape correlations.
-Useful for pre-computing the likelihoods to make the clustering step significantly faster.

Arguments
-Q: Peak shape correlations data. An array of real values between -1 and 1 or missing values (NA). The array has dimensions NxPxP, where N is the total number of samples and P is the total number of variables (peaks).
-p.spike.inside: Probability of a missing value in a peak shape correlation matrix, for a pair of peaks in the same cluster. A real value between 0 and 1.
-p.spike.outside: Probability of a missing value in a peak shape correlation matrix, for a pair of peaks in different clusters. A real value between 0 and 1.
-shapes.beta.inside: Parameters of the beta distribution that defines the likelihood of an observed value in a peak shape correlation matrix, for a pair of peaks in the same cluster. A vector of length 2 with non-negative real values.
-shapes.beta.outside: Parameters of the beta distribution that defines the likelihood of an observed value in a peak shape correlation matrix, for a pair of peaks in different clusters. A vector of length 2 with non-negative real values.

Value
-Q.dbeta.log: Logarithmic likelihood of observed peak shape correlations data. A list of two matrices 'inside' and 'outside' containing the log-likelihoods of peak shape correlations in the same and in different clusters, respectively, summed over all samples. Both matrices are real-valued and are of dimensionality PxP, where P is the total number of variables (peaks).
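

ILLUSTRATION (NOT PART OF THE PACKAGE)

The following is a minimal sketch of how a spike-and-beta log-likelihood of the kind described for preComputeQlikeSum() could be summed over samples. It is not the package's implementation. The function name sketchQLogLikSum() is hypothetical, the linear mapping of correlations from [-1, 1] to [0, 1] before evaluating the beta density is an assumption, and the parameter values in the usage example are arbitrary rather than package defaults.

```r
# Hypothetical helper for illustration only; not a PeakANOVA function.
# Q:           N x P x P array of correlations in [-1, 1]; NA marks a missing value.
# p.spike:     assumed probability that an entry is missing.
# shapes.beta: the two shape parameters of the beta distribution.
sketchQLogLikSum <- function(Q, p.spike, shapes.beta) {
  # Assumption: map correlations linearly from [-1, 1] to [0, 1],
  # the support of the beta distribution.
  u <- (Q + 1) / 2
  # Log-likelihood of an observed entry: it is not missing, and its mapped
  # value follows the beta distribution.
  ll.observed <- log(1 - p.spike) +
    dbeta(u, shapes.beta[1], shapes.beta[2], log = TRUE)
  # A missing entry contributes the log-probability of the spike instead.
  ll <- ifelse(is.na(Q), log(p.spike), ll.observed)
  # Sum over the samples (first dimension) to obtain a P x P matrix.
  apply(ll, c(2, 3), sum)
}

# Usage with small random data.
set.seed(1)
N <- 3
P <- 4
Q <- array(runif(N * P * P, min = -1, max = 1), dim = c(N, P, P))
Q[sample(length(Q), 10)] <- NA
Q.dbeta.log <- list(
  inside = sketchQLogLikSum(Q, p.spike = 0.1, shapes.beta = c(5, 1)),
  outside = sketchQLogLikSum(Q, p.spike = 0.5, shapes.beta = c(1, 1))
)
```

The resulting list has the same layout as the documented 'Q.dbeta.log' value: two PxP matrices named 'inside' and 'outside'. Whether the numbers agree with those of preComputeQlikeSum() depends on the assumptions stated above.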