Below is a short overview of the Deep Learning Summer School 2016 in Montreal, together with the high-impact papers mentioned there.

Tips for training NNs:

- Prefer random search to grid search over hyperparameters: grid search repeats many experiments for every value of an unimportant parameter. In practice, Bayesian optimization does not work better than random search.
- Tuning regularization has much less benefit than tuning the learning rate, etc.
- Early stopping on the validation set comes at low cost and should be used. Halve the learning rate each time the validation error stops decreasing.
- Don't neglect dropout: bigger regularized models are more powerful than smaller ones.
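The halving rule above can be sketched as a small schedule function; the validation-error sequence, the `patience` threshold, and the stopping cutoff below are hypothetical choices for illustration, not values from the lectures:

```python
# Sketch of early stopping with learning-rate halving.  In real training,
# `val_errors` would come from evaluating the model after each epoch.
def early_stopping_schedule(val_errors, lr=0.1, patience=1, min_lr=1e-4):
    """Halve the learning rate whenever the validation error stops
    decreasing; stop once the learning rate falls below min_lr."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for err in val_errors:
        if err < best:
            best = err
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                lr /= 2.0          # validation error stopped decreasing
                bad_epochs = 0
        history.append(lr)
        if lr < min_lr:
            break                  # early stop
    return history

# Errors plateau after epoch 3, so the learning rate is halved repeatedly.
print(early_stopping_schedule([1.0, 0.8, 0.7, 0.7, 0.71, 0.72]))
# -> [0.1, 0.1, 0.1, 0.05, 0.025, 0.0125]
```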

High gradients are sensitive to noise; we want to keep gradient norms below 1. However, the vanishing-gradient phenomenon can also occur. When the gradient is large, don't trust it: use gradient norm clipping (Mikolov's thesis).
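Norm clipping as described above can be sketched in a few lines; the gradient is treated here as a flat list of floats, and the threshold of 1.0 follows the rule of thumb in the text:

```python
import math

# Gradient norm clipping: if the gradient norm exceeds a threshold,
# rescale the whole gradient so its norm equals that threshold.
def clip_grad_norm(grad, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad

# Norm 5.0 exceeds the threshold, so the gradient is rescaled to norm 1.0;
# a small gradient passes through unchanged.
print(clip_grad_norm([3.0, 4.0]))
print(clip_grad_norm([0.3, 0.4]))
```

The direction of the gradient is preserved; only its magnitude is capped, which is what makes the update trustworthy when the loss surface has sharp cliffs.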

- How to Construct Deep Recurrent Neural Networks (presenting architectures with several layers in an RNN) and Architectural Complexity Measures of Recurrent Neural Networks (discussing three components of RNN structure: recurrent depth, feedforward depth, and the recurrent skip coefficient, and how we can benefit from recurrent skip connections)
- A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion (given a context and previous search queries, predict the most probable query)

Yoshua said that forward propagation could be used in the future instead of back-propagation.

Pooling is used for feature invariance plus a larger receptive field. Depth of the network is the key. Towards the deeper layers, invariance to both vertical and horizontal translations improves. Anneal the learning rate and take small batches. It is better to take a bigger model and regularize it well than to use a smaller model.
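The pooling idea above can be illustrated with a minimal 2x2 max-pooling pass over a toy feature map (plain lists rather than a real tensor library):

```python
# Minimal 2x2 max pooling on a 2-D list: each output value is the max of
# a 2x2 window, halving the map size and giving some invariance to
# small translations inside each window.
def max_pool_2x2(x):
    h, w = len(x), len(x[0])
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
print(max_pool_2x2(feature_map))   # -> [[4, 2], [2, 8]]
```

Shifting the strong activation by one pixel within its 2x2 window leaves the pooled output unchanged, which is the invariance the notes refer to.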

The most interesting papers mentioned:
- Going Deeper with Convolutions (Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich). The structure of the Inception model, the winner of ILSVRC 2014, is described there.
- Yet another nonlinearity proposed for CNNs, where the nonlinearity is learned for each layer: Parametric Exponential Linear Unit for Deep Convolutional Neural Networks. (These results suggest that varying the shape of the activations during training along with the other parameters helps to control vanishing gradients and bias shift, thus facilitating learning.)
- **Residual networks** (everybody talks about them now; used for CNNs, won ILSVRC 2015): Deep Residual Learning for Image Recognition, introducing short-cut connections from the input to deeper layers.

**Segmentation.** Research done by Ross Girshick on segmentation; he now continues this work at Facebook.
- Learning to Segment Object Candidates: detecting fine-grained segmentation, not only borders; scale resistance.
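The short-cut connection of residual networks mentioned above reduces to one line: the layer learns a residual F(x) and the input is added back, y = F(x) + x. A minimal sketch, where `transform` is a hypothetical stand-in for the conv/BN/ReLU stack of a real residual block:

```python
# Core idea of a residual block: y = F(x) + x, with the input added back
# through a short-cut (identity) connection.
def residual_block(x, transform):
    return [t + xi for t, xi in zip(transform(x), x)]

# Even if the transform outputs zeros (e.g. an untrained layer), the block
# still passes the input through unchanged, which is what makes very deep
# stacks of such blocks easy to optimize.
print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))  # -> [1.0, 2.0]
```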

- Classifying scenes from the Places database (places and objects are different things to classify)
- The receptive field increases with depth. The first layers are task-independent, then come object-representation layers, and the last is a classification layer.
- Learning Aligned Cross-Modal Representations from Weakly Aligned Data (signs of objects, their borders, sketches, etc.)
- Visualizing and Understanding Convolutional Networks (how we can detect learning problems in conv nets; a minimum depth is vital for good performance)

- Problems in NLP: sparsity and zero probabilities; we need to introduce smoothing of the probabilities.
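The zero-probability problem above is classically handled by add-one (Laplace) smoothing, one of several smoothing schemes; a minimal sketch with toy counts:

```python
from collections import Counter

# Add-one (Laplace) smoothing: unseen words get a small non-zero
# probability instead of zero, at the cost of slightly discounting
# the probability mass of seen words.
def laplace_prob(word, counts, vocab_size):
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

counts = Counter({"the": 3, "cat": 1})
# "dog" was never seen, but its probability is no longer zero:
print(laplace_prob("dog", counts, vocab_size=4))   # (0+1)/(4+4) = 0.125
print(laplace_prob("the", counts, vocab_size=4))   # (3+1)/(4+4) = 0.5
```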

Lack of generalization: neural language modelling (word2vec-style representations) can generalize to unseen relations.
- The RNN concept realizes non-Markovian modelling, where the probability depends on previous states and can take into account a sequence of previous words.
- In RNNs the biggest problem is the vanishing gradient. Cho says we can introduce shortcut connections.
- All we need in RNNs for now: LSTMs or Gated Recurrent Units (GRUs). Both work well and are easier to train than classical RNNs.
- GRU model: Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation.
- Effective Approaches to Attention-based Neural Machine Translation (global and local attention architectures; applied to German-English translation in both directions)
- Larger-Context Language Modelling with RNN (an improved LSTM capturing context).

This analysis suggests that a larger-context language model improves on the unconditional language model by capturing the theme of a document better and more easily.
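The GRU cell mentioned in the list above can be written out for a single scalar unit following the Cho et al. (2014) equations; the weight values below are hypothetical illustrations, not trained parameters:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Scalar sketch of one GRU step: an update gate z, a reset gate r, a
# candidate state h_tilde, and an interpolation between old and new state.
def gru_step(x, h, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz * x + Uz * h)              # update gate
    r = sigmoid(Wr * x + Ur * h)              # reset gate
    h_tilde = math.tanh(W * x + U * (r * h))  # candidate state
    return z * h + (1.0 - z) * h_tilde        # new hidden state

# Run a toy input sequence through the cell with made-up weights.
h = 0.0
for x in [1.0, -0.5, 0.2]:
    h = gru_step(x, h, Wz=0.5, Uz=0.3, Wr=0.4, Ur=0.2, W=0.8, U=0.6)
print(h)
```

The gates let the cell decide how much of the previous state to keep, which is what makes GRUs (like LSTMs) easier to train than classical RNNs: the state can flow through many steps without the gradient vanishing as quickly.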