
News

Our paper "Learning Conditional Independence Structure For High-Dimensional Uncorrelated Vector Processes" has been accepted for publication at ICASSP 2017 in New Orleans

Our paper "Smooth Graph Signal Recovery Via Efficient Laplacian Solvers" has been accepted for publication at ICASSP 2017 in New Orleans

A preprint of our paper "Sparse Label Propagation" is now available at ArXiv under arXiv:1612.01414 [cs.LG]

Alex Jung is an invited speaker at this year's Stochastic Sauna. If you are interested in a refreshing mix of stochastics and sauna, register at https://math.aalto.fi/en/research/stochastics/sauna2016/

A preprint of our paper "Scalable Semi-Supervised Learning over Networks using Nonsmooth Convex Optimization" is now available at ArXiv under arXiv:1611.00714 [cs.LG]

Modern technology is generating data at an unprecedented scale. This big data [1] can be considered the fuel of the modern information society. However, in order to leverage this flood of data, we have to extract the useful information, which manifests itself in certain patterns. While lacking a precise definition, big data problems typically share three main characteristics [1, 2]: (i) large volume, (ii) high velocity and (iii) wide variety of the datasets. Algorithms confronted with big data consume different types of resources, e.g., hardware cost, communication bandwidth and energy. Our interest is in the fundamental limits and tradeoffs, as well as efficient methods, for machine learning from big data under constraints on the above-mentioned algorithmic resources.

We will focus on learning problems arising from big data over networks, i.e., massive datasets with an intrinsic graph (or network) structure. A graph consists of nodes and (possibly directed) edges between them. Such an intrinsic network or graph structure is present in a wide range of big data applications [3] (a toy graph-signal sketch follows the list below):

chip design
internet
bioinformatics
social networks
universe
material science
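
To make the notion of data over networks concrete, here is a minimal sketch of a graph signal. It is purely our own toy illustration: the random network, the node values and the networkx/numpy tooling are assumptions for demonstration, not part of the project.

```python
# Toy illustration: a "graph signal" assigns a value to every node of a network.
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(n=10, p=0.3, seed=0)   # placeholder random network
x = np.random.randn(G.number_of_nodes())        # one observation per node

# The graph Laplacian L = D - A measures how smoothly the signal varies over
# the edges: x^T L x is small when connected nodes carry similar values.
L = nx.laplacian_matrix(G).toarray().astype(float)
smoothness = x @ L @ x
print(f"Laplacian quadratic form (signal smoothness): {smoothness:.3f}")
```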

Another key characteristic, besides the graph structure, is that the massive datasets occurring in many applications are generated by processes with only very few degrees of freedom. Recordings of the human voice can be modelled concisely by the action of a few vocal folds, and images of human faces can be represented efficiently by modelling the action of only a few facial muscles. This intrinsic low-dimensional structure of seemingly high-dimensional objects is captured by sparse models, which represent high-dimensional data using only a small number of parameters or degrees of freedom. Signal processing based on sparse models has proven extremely powerful within the recent framework of compressed sensing (CS).
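The following minimal sketch illustrates the compressed-sensing idea of recovering a sparse vector from far fewer linear measurements than its ambient dimension. The dimensions, the random measurement matrix and the scikit-learn LASSO solver are our own assumptions, chosen only for illustration.

```python
# Illustrative compressed-sensing sketch: l1-penalized least squares (LASSO)
# recovers a k-sparse vector from m << n linear measurements.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5                              # ambient dim, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)      # random measurement matrix
y = A @ x_true                                    # compressed measurements

x_hat = Lasso(alpha=0.01, max_iter=10000).fit(A, y).coef_
print("recovery error:", np.linalg.norm(x_hat - x_true))
```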

In order to construct machine learning algorithms that achieve the fundamental tradeoffs for learning from big data over networks, we will develop a theory of compressed graph signal processing (CGSP) for sparse graph signals defined over complex networks. For the implementation of learning methods arising from the theory of CGSP, we will also adopt the paradigm of algorithmic weakening [4]. This paradigm rests on the observation that, as the sample size grows, simpler (less complex) but less accurate algorithms tend to outperform more sophisticated methods, since the latter are too slow to capitalize on the entire dataset [4]. Our goal is therefore to derive architectures that allow for a flexible tradeoff between computational resources and learning accuracy. This tradeoff will be enabled by constructing hierarchies of methods ordered by their computational resource requirements and learning accuracy.
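The sketch below conveys our reading of such a hierarchy in the simplest possible setting; it is not project code. The least-squares problem, the gradient-descent solver and the iteration budgets are assumptions: the budget orders the solvers from cheap-and-rough to expensive-and-accurate.

```python
# Sketch of the algorithmic-weakening idea: a hierarchy of least-squares
# solvers ordered by computational cost, realized here by the number of
# gradient-descent iterations allowed.
import numpy as np

def gd_least_squares(A, y, iters):
    """Fewer iterations: faster but less accurate; more iterations: the opposite."""
    step = 1.0 / np.linalg.norm(A, ord=2) ** 2    # safe step size (1 / spectral norm^2)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x -= step * A.T @ (A @ x - y)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 50))
y = A @ rng.standard_normal(50)

for budget in (5, 50, 500):                       # hierarchy: weakest to strongest solver
    x_hat = gd_least_squares(A, y, iters=budget)
    print(budget, np.linalg.norm(A @ x_hat - y))  # residual shrinks as the budget grows
```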

A key challenge in big data applications is that computational resources are limited relative to the volume and velocity of the data. Characterizing the effect of constraints on computational resources is highly non-trivial, and the available results fall short in several regards: they have not been tailored to sparse graph signal models, and they do not apply to general, fully distributed learning algorithms in which communication between the processing nodes takes place over several rounds or iterations. In particular, the specific learning problems considered so far do not include sparse graph signal models. Moreover, prior work presupposes the existence of a broadcast channel ("public blackboard"), which is unrealistic for typical applications facing big data over networks. The study of fully decentralized learning methods for graph-structured data (big data over networks), also taking into account constraints on link quality and bitrates, therefore requires new tools.

As the main analytic tool, we propose a unifying viewpoint from network information theory: we interpret resource- and data-access-constrained learning from big data over networks as an abstract communication problem taking place over a graphical network. The constraints on resources will be phrased in terms of channel capacities of equivalent "observation, communication and computational channels", which represent data-access limitations (e.g., sequential online learning) and computational resources (runtime, communication overhead). This approach allows us to use network information theory to characterize fundamental limits and tradeoffs for learning from big data over networks.
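As a toy illustration of fully decentralized processing under a communication budget (our own example, not the proposed CGSP methodology), the sketch below runs consensus averaging over a small network in which each node only exchanges messages with its neighbours; the communication graph, weights and round budgets are assumptions. The number of rounds plays the role of a constrained communication resource: more rounds yield a more accurate consensus.

```python
# Toy decentralized averaging: accuracy vs. number of communication rounds.
import networkx as nx
import numpy as np

G = nx.cycle_graph(20)                              # placeholder communication network
A = nx.adjacency_matrix(G).toarray().astype(float)
deg = A.sum(axis=1)
W = np.eye(len(deg)) - 0.4 * (np.diag(deg) - A)     # simple consensus weight matrix

x = np.random.default_rng(2).standard_normal(20)    # local observations, one per node
target = x.mean()                                   # quantity the nodes try to agree on

state = x.copy()
for rounds in range(1, 51):
    state = W @ state                               # one round of neighbour-only exchange
    if rounds in (1, 10, 50):
        print(rounds, np.max(np.abs(state - target)))  # disagreement shrinks with rounds
```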