Research of the Probabilistic Machine Learning Group

We develop new methods for probabilistic modeling, Bayesian inference and machine learning. Our current focus areas are learning from multiple data sources, Gaussian processes, data visualization, retrieval of relevant data, machine learning for user interaction, personalized medicine, and brain signal analysis and neuroinformatics. Almost all of this work involves developing new models and methods; in addition, we work on a few specific problems in Bayesian inference methods and theory.

Bayesian inference methods and theory

We develop Bayesian inference methods and theory, especially for model assessment and selection and for approximate inference (e.g. Laplace approximation, expectation propagation, variational inference, Monte Carlo, and ABC). Furthermore, we develop and apply Bayesian inference methods within several specific model families and applications, as described below.
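As a toy illustration of the Monte Carlo side of this toolbox, a random-walk Metropolis-Hastings sampler can be written in a few lines of NumPy. This is a generic sketch, not code from our software; the standard-normal target is chosen only so the result is easy to check.

```python
import numpy as np

def log_target(x):
    """Unnormalized log density of the target, here a standard normal."""
    return -0.5 * x**2

def metropolis_hastings(log_p, n_samples=20000, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x' = x + step*eps,
    accept with probability min(1, p(x') / p(x))."""
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.normal()
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples[i] = x
    return samples

samples = metropolis_hastings(log_target)
```

The empirical mean and standard deviation of the chain approach 0 and 1, the moments of the target, as the number of samples grows.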

Representative Publications

Learning from multiple data sources

Analysis of multiple, partially connected data sources inspires new directions of research on data analysis methods. We have generalized factor analysis to groups of variables (group factor analysis, GFA), data fusion or integration to kernelized matrix factorizations, and simultaneous factorization of multiple connected matrices and tensors. A central challenge is the large-p, small-n setting (high dimensionality, small sample size), relevant for instance in genomics. Relevant keywords include multi-view learning and multi-task learning.
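To make the shared-factor idea concrete, here is a deliberately simplified sketch: two synthetic "views" driven by one shared latent factor, recovered by a plain SVD of the concatenated data. Real GFA is a Bayesian model with view-specific factors and sparsity structure; this toy version only illustrates the underlying intuition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)                                # shared latent factor
# two "views" driven by the same factor, plus observation noise
X1 = np.outer(z, rng.normal(size=10)) + 0.1 * rng.normal(size=(n, 10))
X2 = np.outer(z, rng.normal(size=15)) + 0.1 * rng.normal(size=(n, 15))

X = np.hstack([X1, X2])          # naive fusion: concatenate the views
X = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
z_hat = U[:, 0] * S[0]           # leading component recovers z up to sign/scale

corr = abs(np.corrcoef(z, z_hat)[0, 1])
```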

Representative Publications

Gaussian processes

Gaussian processes provide a way to set priors on function space, allowing flexible modeling of non-linearities and interactions in many applications. We develop methods for approximate Bayesian inference in various Gaussian process models and apply these models in several domains, such as survival analysis. We have also developed the widely used Gaussian process software package GPstuff.
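In its simplest form, Gaussian process regression with a squared-exponential kernel reduces to linear algebra. The sketch below (not GPstuff, and with hyperparameters fixed for brevity) computes the posterior mean and covariance at test points:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between 1-D input vectors."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

X = np.linspace(0.0, 5.0, 20)        # training inputs
y = np.sin(X)                        # training targets
X_star = np.array([1.0, 2.5])        # test inputs
noise = 1e-4                         # noise variance / jitter

K = rbf(X, X) + noise * np.eye(len(X))
K_star = rbf(X_star, X)
mean = K_star @ np.linalg.solve(K, y)                               # posterior mean
cov = rbf(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)  # posterior cov
```

With 20 densely spaced observations, the posterior mean interpolates the underlying sine function closely at the test points.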

Representative Publications

Data visualization

Visualization of the mutual similarities of entities in a high-dimensional data set is a central problem in exploratory data analysis and knowledge discovery. It is generally not possible to show all similarity relationships of a high-dimensional data set perfectly on a low-dimensional display; some properties are necessarily lost or misrepresented. To address this problem explicitly, we formulate visualization as a visual information retrieval task and quantify the necessary trade-off in terms of the standard information retrieval measures, precision and recall. Various extensions exist, in particular fast implementations for large data sets.
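The retrieval view can be made concrete: treat each point's k nearest neighbors in the original space as the relevant items, its k nearest neighbors on the display as the retrieved items, and measure their overlap. A small sketch, with a trivial coordinate projection standing in for a real embedding method:

```python
import numpy as np

def knn_indices(X, k):
    """Indices of each point's k nearest neighbors (excluding itself)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)[:, :k]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # high-dimensional data
Y = X[:, :2]                     # crude 2-D "display": drop 8 coordinates
k = 10

relevant = knn_indices(X, k)     # true neighbors in the data space
retrieved = knn_indices(Y, k)    # neighbors as shown on the display
# with equal neighborhood sizes, precision and recall coincide
precision = np.mean([len(set(relevant[i]) & set(retrieved[i])) / k
                     for i in range(len(X))])
```

A perfect display would give precision 1; the gap below 1 quantifies exactly the similarity structure the projection misrepresents.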

Representative Publications

Retrieval of relevant data

Large repositories of genome-wide measurement data inspire the research question of how to systematically relate different data sets. Re-use of data sets increases the statistical power of novel studies and opens up the possibility of placing biological results in the context of previous studies. To complement the keyword search functionality most repositories provide for retrieving similarly annotated studies, we have developed machine learning methods that relate studies through their actual measurement data, along with visualization tools for exploring and interpreting the results. In the REx project (Retrieval of Relevant Experiments), relevance is defined by a model of biology that is both data- and knowledge-driven. The principles are, of course, not restricted to biology.
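The core retrieval step can be sketched as ranking repository studies by the similarity of their measurement profiles to a query study. Here plain correlation stands in for the model-based relevance measure, and the study names and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical repository: each study summarized as a measurement profile
profiles = {f"study{i}": rng.normal(size=50) for i in range(5)}
# the query: a noisy replicate of study3
query = profiles["study3"] + 0.1 * rng.normal(size=50)

def retrieve(query, profiles):
    """Rank studies by Pearson correlation with the query profile."""
    scores = {name: np.corrcoef(query, p)[0, 1] for name, p in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = retrieve(query, profiles)
```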

Representative Publications

Machine learning for user interaction

We have recently developed a technique called interactive intent modeling that allows humans to direct exploratory search. The technique has been implemented in the real-world search engine SciNet, which anticipates the user's search intents and visualizes them on a novel "Intent Radar" display. This work is part of HIIT's augmented search, research, and knowledge work initiative. More generally, we are interested in new user interaction principles that combine machine learning with HCI.
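A stripped-down version of the underlying exploration-exploitation loop can be sketched with Thompson sampling over candidate intents, updated from simulated relevance feedback. This is a generic illustration of learning from interaction, not SciNet's actual intent model; the intent names and click rates are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
intents = ["neural networks", "kernel methods", "visualization"]
true_relevance = np.array([0.8, 0.3, 0.2])   # hypothetical per-intent click rates
wins = np.ones(3)                            # Beta(1, 1) prior per intent
losses = np.ones(3)
shown = np.zeros(3, dtype=int)

for _ in range(500):
    theta = rng.beta(wins, losses)           # sample a belief about each intent
    i = int(np.argmax(theta))                # surface the most promising intent
    shown[i] += 1
    if rng.uniform() < true_relevance[i]:    # simulated user feedback (a click)
        wins[i] += 1
    else:
        losses[i] += 1

top_intent = intents[int(np.argmax(shown))]
```

The loop quickly concentrates on the intent the simulated user actually responds to, while still occasionally probing the alternatives.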

Representative Publications

Personalized medicine

Underlying personalized medicine is an interesting data-analysis problem: based on a large number of measured variables, predict which treatments would be effective for a given patient. What makes the problem statistically hard is that the number of relevant samples is very small. Similar problems more generally underlie digital health and computational molecular biology. We address them with probabilistic modeling of multiple data sources.
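The statistical difficulty can be illustrated with the simplest possible remedy, ridge regularization, on synthetic data with far fewer samples than variables. This is only a sketch of why prior information matters in this regime, far from a full probabilistic multi-source model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                     # far fewer samples than measured variables
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Ridge regression in its dual (kernel) form, convenient when p > n."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

w_ridge = ridge(X, y, lam=10.0)    # regularized estimate
w_minnorm = ridge(X, y, lam=1e-8)  # ~ minimum-norm interpolator, no shrinkage
```

With n < p, any unregularized fit interpolates the data exactly; the prior (here a simple ridge penalty, more generally a structured probabilistic model) is what makes the problem well-posed.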

Representative Publications

Brain signal analysis and neuroinformatics

We develop new machine learning methods for analyzing brain signals measured under naturalistic conditions. We develop models stemming from the same basic principles for the analysis of EEG, MEG, and fMRI data, working primarily within the Bayesian modeling framework. Examples of our current work include extracting statistical dependencies between brain activity measurements and rich feature representations of the stimulus, studying statistical dependencies between the brain activity of several subjects, and modeling brain dynamics.
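Extracting a shared component between brain measurements and stimulus features is, in its simplest linear form, canonical correlation analysis. A NumPy sketch on synthetic data (the actual models are Bayesian and considerably richer):

```python
import numpy as np

def first_canonical_corr(X, Y, reg=1e-6):
    """First canonical correlation between two views (rows = time points)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    A = np.linalg.cholesky(Cxx)              # Cxx = A A^T
    B = np.linalg.cholesky(Cyy)              # Cyy = B B^T
    M = np.linalg.solve(A, Cxy) @ np.linalg.inv(B).T  # whitened cross-covariance
    return np.linalg.svd(M, compute_uv=False)[0]

rng = np.random.default_rng(0)
t = 500
s = rng.normal(size=t)                               # shared stimulus-driven signal
brain = np.outer(s, rng.normal(size=8)) + rng.normal(size=(t, 8))
stim = np.outer(s, rng.normal(size=5)) + rng.normal(size=(t, 5))

rho = first_canonical_corr(brain, stim)
```

Because both views carry the same underlying signal, the first canonical correlation is high despite channel-level noise.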

Representative Publications