Markus Breitenbach
I'm sciencing as fast as I can
Posts

Post has shared content
"'Machine Learning for Systems and Systems for Machine Learning', Dean 2017 NIPS slides: http://learningsys.org/nips17/assets/slides/dean-nips17.pdf

Aside from revealing some details on the enormous neural processing power of the second generation of TPUs (each pod of 64 TPU2s will deliver 11,500 teraflops, compared to, say, a 1080ti's ~13 teraflops), Jeff Dean lays out incredibly ambitious plans to use reinforcement learning & neural nets for pretty much everything in Google's databases, compilers, OSes, networking, datacenters, computer chips... A toy sketch of the "learned flag" idea follows the quoted slide excerpts below.

> Anywhere We're Using Heuristics To Make a Decision!
>
> - Compilers: instruction scheduling, register allocation, loop nest parallelization strategies, ...
> - Networking: TCP window size decisions, backoff for retransmits, data compression, ...
> - Operating systems: process scheduling, buffer cache insertion/replacement, file system prefetching, ...
> - Job scheduling systems: which tasks/VMs to co-locate on same machine, which tasks to pre-empt, ...
> - ASIC design: physical circuit layout, test case selection, ...
>
> Anywhere We've Punted to a User-Tunable Performance Option! Many programs have huge numbers of tunable command-line flags, usually not changed from their defaults (`--eventmanager_threads=16 --bigtable_scheduler_batch_size=8 --mapreduce_merge_memory=134217728` `--lexicon_cache_size=1048576 --storage_server_rpc_freelist_size=128` ...)
>
> Meta-learn everything. ML:
>
> - learning placement decisions
> - learning fast kernel implementations
> - learning optimization update rules
> - learning input preprocessing pipeline steps
> - learning activation functions
> - learning model architectures for specific device types, or that are fast for inference on mobile device X, learning which pre-trained components to reuse, ...
>
> Computer architecture/datacenter networking design:
>
> - learning best design properties by exploring design space automatically (via simulator)
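
To make the "replace heuristics and tunable flags with learned decisions" idea concrete, here's a toy sketch (mine, not from the slides): an epsilon-greedy bandit that learns a value for a made-up batch-size flag from measured throughput instead of shipping a hard-coded default. The flag, candidate values, and throughput function are all hypothetical.

```python
import random

# Hypothetical: learn a value for a tunable flag (e.g. a batch size)
# from observed throughput, instead of hard-coding a default.
CANDIDATE_BATCH_SIZES = [4, 8, 16, 32, 64]

def measure_throughput(batch_size):
    """Stand-in for a real measurement (requests/sec, items/sec, ...)."""
    # Pretend 16 is secretly optimal and add noise.
    return 100 - (batch_size - 16) ** 2 / 10 + random.gauss(0, 2)

def epsilon_greedy_tuner(n_trials=500, epsilon=0.1):
    totals = {b: 0.0 for b in CANDIDATE_BATCH_SIZES}
    counts = {b: 0 for b in CANDIDATE_BATCH_SIZES}
    for _ in range(n_trials):
        if random.random() < epsilon or not any(counts.values()):
            choice = random.choice(CANDIDATE_BATCH_SIZES)   # explore
        else:                                               # exploit best average so far
            choice = max(CANDIDATE_BATCH_SIZES,
                         key=lambda b: totals[b] / counts[b] if counts[b] else float('-inf'))
        reward = measure_throughput(choice)
        totals[choice] += reward
        counts[choice] += 1
    return max(CANDIDATE_BATCH_SIZES, key=lambda b: totals[b] / max(counts[b], 1))

if __name__ == "__main__":
    print("learned flag value:", epsilon_greedy_tuner())
```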

Post has shared content
Issues in deep learning reproducibility and research: the achievements of deep learning & deep reinforcement learning cannot be denied - the methods work, and work very well - but it's unclear how well researchers are doing at introducing & evaluating improvements. It seems like a lot of research merely shows improvements that are due to chance, that hold only on a small subset of problems, or that beat weak baselines such as poorly-hyperparameter-optimized standard NN architectures. Unfortunately, I don't see this improving much soon, because most of the countermeasures (training repeatedly with many random seeds, large-scale hyperparameter optimization for all considered models, ablation experiments, etc.) require ruinous amounts of GPU power at the moment - researchers not working at AmaGoogBookSoft have to choose between working on a big model/dataset and effectively training just 1 model to SOTA, or training a lot of smaller non-SOTA but rigorously evaluated models; neither choice is good...
3 relevant papers on these issues in RL, language modeling, and GANs:

"Deep Reinforcement Learning that Matters", Henderson et al 2017 https://arxiv.org/abs/1709.06560

"In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted."

"On the State of the Art of Evaluation in Neural Language Models", Melis et al 2017: https://arxiv.org/abs/1707.05589

"Ongoing innovations in recurrent neural network architectures have provided a steady influx of apparently state-of-the-art results on language modelling benchmarks. However, these have been evaluated using differing code bases and limited computational resources, which represent uncontrolled sources of experimental variation. We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset."

"Are GANs Created Equal? A Large-Scale Study", Lucic et al 2017 https://arxiv.org/abs/1711.10337

"Generative adversarial networks (GAN) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the original one."

On this one I should note that the WGAN authors argue the comparison is unfair to WGAN, because it neglects what they see as WGAN's real advantage: stability across architectures (i.e. when shifting between various CNN archs, resnet or otherwise, RNNs, etc.), rather than being more stable to train on a fixed architecture or systematically dominating the other GANs.
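
The methodological takeaway is easy to operationalize: give every GAN the same hyperparameter/restart budget and report the whole score distribution, not the single best run. A rough sketch, where `train_gan_and_score` is a hypothetical function returning, say, FID for one (model, hyperparameters, seed) combination and the model names are placeholders.

```python
import random
import statistics

MODELS = ["gan_nsgan", "gan_wgan", "gan_lsgan"]   # placeholder names

def evaluate_under_budget(train_gan_and_score, budget_per_model=20, seed=0):
    """Give every model the same number of (hyperparameter, restart) trials
    and summarize the whole score distribution, not just the best run."""
    rng = random.Random(seed)
    report = {}
    for model in MODELS:
        scores = []
        for trial in range(budget_per_model):
            hparams = {"lr": 10 ** rng.uniform(-5, -3), "beta1": rng.uniform(0.0, 0.9)}
            scores.append(train_gan_and_score(model, hparams, seed=trial))
        report[model] = {
            "best": min(scores),                   # lower FID is better
            "median": statistics.median(scores),
            "stdev": statistics.stdev(scores),
        }
    return report

if __name__ == "__main__":
    # Fake scorer so the sketch runs end-to-end.
    fake = lambda model, hp, seed: 30 + 1e4 * abs(hp["lr"] - 2e-4) + seed % 5
    print(evaluate_under_budget(fake, budget_per_model=10))
```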

Post has attachment
Google's new speech recognition model, based on a Listen-Attend-Spell (LAS) end-to-end architecture.
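
For intuition, a stripped-down sketch of the LAS shape in PyTorch (mine, not Google's model; the pyramidal listener and all the production tricks are omitted): a BiLSTM "listener" encodes audio frames, and an attention-equipped LSTM "speller" emits output symbols one at a time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLAS(nn.Module):
    """Minimal Listen-Attend-Spell shape: BiLSTM listener + attention + LSTM speller.
    (The real model uses a pyramidal listener and many other refinements.)"""
    def __init__(self, feat_dim=40, hidden=128, vocab_size=30):
        super().__init__()
        self.listener = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.query = nn.Linear(hidden, 2 * hidden)            # speller state -> attention query
        self.speller = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.out = nn.Linear(hidden + 2 * hidden, vocab_size)

    def forward(self, feats, targets):
        enc, _ = self.listener(feats)                         # (B, T, 2H) acoustic encodings
        B, H = feats.size(0), self.speller.hidden_size
        h, c = feats.new_zeros(B, H), feats.new_zeros(B, H)
        logits = []
        for t in range(targets.size(1)):                      # teacher-forced decoding
            scores = torch.bmm(enc, self.query(h).unsqueeze(2)).squeeze(2)               # (B, T)
            context = torch.bmm(F.softmax(scores, dim=1).unsqueeze(1), enc).squeeze(1)   # (B, 2H)
            h, c = self.speller(torch.cat([self.embed(targets[:, t]), context], dim=1), (h, c))
            logits.append(self.out(torch.cat([h, context], dim=1)))
        return torch.stack(logits, dim=1)                     # (B, U, vocab_size)

if __name__ == "__main__":
    model = TinyLAS()
    feats = torch.randn(2, 50, 40)              # (batch, audio frames, filterbank features)
    targets = torch.randint(0, 30, (2, 12))     # (batch, output characters)
    print(model(feats, targets).shape)           # torch.Size([2, 12, 30])
```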

Post has attachment
Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee
Alireza Aghasi, Afshin Abdi, Nam Nguyen, Justin Romberg
(Submitted on 16 Nov 2016 (v1), last revised 23 Nov 2017 (this version, v4))

We introduce and analyze a new technique for model reduction for deep neural networks. While large networks are theoretically capable of learning arbitrarily complex models, overfitting and model redundancy negatively affect the prediction accuracy and model variance. Our Net-Trim algorithm prunes (sparsifies) a trained network layer-wise, removing connections at each layer by solving a convex optimization program. This program seeks a sparse set of weights at each layer that keeps the layer inputs and outputs consistent with the originally trained model. The algorithms and associated analysis are applicable to neural networks operating with the rectified linear unit (ReLU) as the nonlinear activation. We present both parallel and cascade versions of the algorithm. While the latter can achieve slightly simpler models with the same generalization performance, the former can be computed in a distributed manner. In both cases, Net-Trim significantly reduces the number of connections in the network, while also providing enough regularization to slightly reduce the generalization error. We also provide a mathematical analysis of the consistency between the initial network and the retrained model. To analyze the model sample complexity, we derive the general sufficient conditions for the recovery of a sparse transform matrix. For a single layer taking independent Gaussian random vectors of length N as inputs, we show that if the network response can be described using a maximum number of s non-zero weights per node, these weights can be learned from O(s log N) samples.
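
A rough surrogate for the per-layer idea (not the paper's exact convex program, which constrains the post-ReLU responses): refit each layer with an L1 penalty so the weights go sparse while reproducing the original layer's responses on a batch of its actual inputs.

```python
import numpy as np
from sklearn.linear_model import Lasso

def relu(z):
    return np.maximum(z, 0.0)

def prune_layer(X, W, alpha=0.01):
    """Refit a dense layer W (in_dim x out_dim) as a sparse W_new such that
    relu(X @ W_new) stays close to relu(X @ W) on the sample inputs X.
    A Lasso surrogate for Net-Trim's convex program, not the paper's exact formulation."""
    Z = X @ W                         # original pre-activations
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    lasso.fit(X, Z)                   # multi-output Lasso, one column per unit
    return lasso.coef_.T              # sklearn stores coefficients as (out, in)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 64))
    W = rng.normal(size=(64, 32)) * (rng.random((64, 32)) < 0.2)   # sparse ground truth
    W_new = prune_layer(X, W, alpha=0.05)
    kept = np.mean(W_new != 0)
    err = np.abs(relu(X @ W_new) - relu(X @ W)).mean()
    print(f"nonzero fraction: {kept:.2f}, mean output error: {err:.3f}")
```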

Post has shared content
Notes from NIPS 2017

1) John Platt's talk on energy, fusion, and the next 100 years of human civilization. Definitely worth a watch! He does a great job of framing the problem and building up to his research group's focus. I'm still optimistic that renewables can provide more than the proposed 40% of the world's energy.

2) Kate Crawford's talk on bias in ML. As with Joelle Pineau's talk on reproducibility, Kate's talk comes at an excellent time to get folks in the community to think deeply about these issues as we build the next generation of tools and systems.

3) Joelle Pineau's talk (no public link yet available) on reproducibility during the Deep RL Symposium.

4) Ali Rahimi's test-of-time talk that caused a lot of buzz around the conference (the "alchemy" piece begins at the 11-minute mark). My takeaway is that Ali is calling for more rigor in our experimentation, methods, and evaluation (and not necessarily just more theory). In light of the findings presented in Joelle's talk, I feel compelled to agree with Ali (at least for Deep RL, where experimental methods are still in the process of being defined). In particular, I think with RL we should open up to other kinds of experimental analysis beyond just "which algorithm got the most reward on task X", and consider other diagnostic tools to understand our algorithms: when did it converge? how suboptimal is the converged policy? how well did it explore the space? how often did an algorithm find a really bad policy? why? where does it fail and why? (a toy diagnostics sketch follows these notes). Ali and Ben just posted a follow-up to their talk that's worth a read.

5) The Hierarchical RL workshop! This event was a blast, in part because I love this area and find there to be so many open foundational questions, but also because the speaker lineup and poster collection were fantastic. When videos become available I'll post links to some of my highlights, including the panel (see the end of my linked notes above for a rough transcript of the panel).
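
The diagnostics sketch promised under item 4, with invented curves and thresholds: given a stack of per-seed learning curves, you can read off an approximate convergence step, the suboptimality of the converged policy against a known optimum, and the fraction of seeds that end up with a really bad policy.

```python
import numpy as np

def rl_diagnostics(curves, optimal_return, bad_threshold, tol=0.05, window=10):
    """curves: (n_seeds, n_steps) array of per-step average returns.
    Returns a few diagnostics beyond 'which algorithm got the most reward'."""
    curves = np.asarray(curves, dtype=float)
    final = curves[:, -window:].mean(axis=1)          # converged performance per seed

    # Approximate convergence step: first step within tol of the seed's own final value.
    conv_steps = []
    for seed_curve, seed_final in zip(curves, final):
        close = np.abs(seed_curve - seed_final) <= tol * max(abs(seed_final), 1e-8)
        conv_steps.append(int(np.argmax(close)) if close.any() else len(seed_curve))

    return {
        "mean_final_return": float(final.mean()),
        "suboptimality": float(optimal_return - final.mean()),
        "median_convergence_step": float(np.median(conv_steps)),
        "fraction_bad_policies": float(np.mean(final < bad_threshold)),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake learning curves: many seeds converge near 0.9, some plateau low.
    steps = np.linspace(0, 1, 200)
    curves = np.stack([min(0.9, rng.uniform(0.2, 1.0)) * (1 - np.exp(-5 * steps))
                       + rng.normal(0, 0.02, 200) for _ in range(20)])
    print(rl_diagnostics(curves, optimal_return=1.0, bad_threshold=0.5))
```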

Post has shared content
Neural nets can replace other algorithms
The dominance of neural networks continues. In this paper you can read about how learned models can outperform and replace classic data structures like database indexes. As the authors conclude: "In summary, we have demonstrated that machine learned models have the potential to provide significant benefits over state-of-the-art database indexes, and we believe this is a fruitful direction for future research."
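
The core trick is easy to caricature: instead of a B-tree, learn a model from key to position in the sorted array, then fix up the prediction with a bounded local search. A toy sketch with a linear model standing in for the paper's recursive model index (my code, not theirs):

```python
import bisect
import numpy as np

class ToyLearnedIndex:
    """Fit key -> position with least squares, then correct the prediction by
    searching within the observed maximum error. A caricature of the learned-index
    idea, not the paper's recursive model index."""
    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        positions = np.arange(len(self.keys))
        self.a, self.b = np.polyfit(self.keys, positions, deg=1)   # position ≈ a*key + b
        preds = self.a * self.keys + self.b
        self.max_err = int(np.ceil(np.max(np.abs(preds - positions)))) + 1

    def lookup(self, key):
        guess = int(self.a * key + self.b)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Bounded binary search inside the model's error window.
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        return i if i < len(self.keys) and self.keys[i] == key else None

if __name__ == "__main__":
    keys = np.cumsum(np.random.default_rng(0).integers(1, 10, size=100000))
    idx = ToyLearnedIndex(keys)
    print(idx.lookup(float(keys[1234])), 1234)   # should agree
```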

Post has attachment
Long Text Generation via Adversarial Training with Leaked Information
Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang
(Submitted on 24 Sep 2017 (v1), last revised 8 Dec 2017 (this version, v2))

Automatically generating coherent and semantically meaningful text has many applications in machine translation, dialogue systems, image captioning, etc. Recently, by combining with policy gradient, Generative Adversarial Nets (GAN) that use a discriminative model to guide the training of the generative model as a reinforcement learning policy have shown promising results in text generation. However, the scalar guiding signal is only available after the entire text has been generated and lacks intermediate information about text structure during the generative process. As such, it limits its success when the length of the generated text samples is long (more than 20 words). In this paper, we propose a new framework, called LeakGAN, to address the problem for long text generation. We allow the discriminative net to leak its own high-level extracted features to the generative net to further help the guidance. The generator incorporates such informative signals into all generation steps through an additional Manager module, which takes the extracted features of current generated words and outputs a latent vector to guide the Worker module for next-word generation. Our extensive experiments on synthetic data and various real-world tasks with Turing test demonstrate that LeakGAN is highly effective in long text generation and also improves the performance in short text generation scenarios. More importantly, without any supervision, LeakGAN would be able to implicitly learn sentence structures only through the interaction between Manager and Worker.
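
A stripped-down sketch of one Manager/Worker generation step (shapes and names are mine, not the authors' code): the discriminator's leaked features drive a Manager LSTM that emits a goal vector, and the Worker LSTM combines the goal with the previous word to score the next word.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyGeneratorStep(nn.Module):
    """One generation step of a LeakGAN-style generator (a sketch, not the authors' code):
    the Manager consumes leaked discriminator features and produces a goal vector;
    the Worker combines the goal with the previous word into next-word logits."""
    def __init__(self, vocab_size=5000, feat_dim=64, hidden=128, goal_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.manager = nn.LSTMCell(feat_dim, hidden)
        self.to_goal = nn.Linear(hidden, goal_dim)
        self.worker = nn.LSTMCell(hidden, hidden)
        self.worker_out = nn.Linear(hidden, vocab_size * goal_dim)
        self.vocab_size, self.goal_dim = vocab_size, goal_dim

    def forward(self, leaked_features, prev_word, m_state, w_state):
        mh, mc = self.manager(leaked_features, m_state)        # Manager sees leaked features
        goal = F.normalize(self.to_goal(mh), dim=1)            # (B, goal_dim) goal direction
        wh, wc = self.worker(self.embed(prev_word), w_state)   # Worker sees the previous word
        O = self.worker_out(wh).view(-1, self.vocab_size, self.goal_dim)
        logits = torch.bmm(O, goal.unsqueeze(2)).squeeze(2)    # goal-weighted word scores
        return logits, (mh, mc), (wh, wc)

if __name__ == "__main__":
    step, B = LeakyGeneratorStep(), 4
    zeros = lambda: (torch.zeros(B, 128), torch.zeros(B, 128))
    leaked = torch.randn(B, 64)                  # pretend: discriminator's penultimate features
    prev = torch.randint(0, 5000, (B,))
    logits, m_state, w_state = step(leaked, prev, zeros(), zeros())
    print(logits.shape)                          # torch.Size([4, 5000])
```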

Post has attachment
Solving internal covariate shift in deep learning with linked neurons
Carles Roger Riera Molina, Oriol Pujol Vila
(Submitted on 7 Dec 2017)

This work proposes a novel solution to the problem of internal covariate shift and dying neurons using the concept of linked neurons. We define the neuron linkage in terms of two constraints: first, all neuron activations in the linkage must have the same operating point. That is to say, all of them share input weights. Secondly, a set of neurons is linked if and only if there is at least one member of the linkage that has a non-zero gradient with regard to the input of the activation function. This means that for any input to the activation function, there is at least one member of the linkage that operates in a non-flat and non-zero area. This simple change has profound implications in the network learning dynamics. In this article we explore the consequences of this proposal and show that by using this kind of units, internal covariate shift is implicitly solved. As a result, the use of linked neurons allows training arbitrarily large networks without any architectural or algorithmic trick, effectively removing the need for re-normalization schemes such as Batch Normalization, which leads to halving the required training time. It also solves the problem of the need for standardized input data. Results show that units using the linkage not only effectively solve the aforementioned problems, but are also a competitive alternative to the state of the art, with very promising results.
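
As I read it, one way to instantiate the linkage constraint (my interpretation, not the authors' code) is a CReLU-style pair: two units share the same input weights and apply ReLU to z and to -z, so for any nonzero pre-activation at least one member of the pair has a non-zero gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinkedReLULinear(nn.Module):
    """Linear layer whose units come in linked pairs sharing input weights:
    each pre-activation z yields both relu(z) and relu(-z), so at least one
    member of every pair has a non-zero gradient whenever z != 0.
    One plausible (CReLU-like) reading of 'linked neurons', not the paper's code."""
    def __init__(self, in_features, out_pairs):
        super().__init__()
        self.linear = nn.Linear(in_features, out_pairs)     # weights shared within each pair

    def forward(self, x):
        z = self.linear(x)
        return torch.cat([F.relu(z), F.relu(-z)], dim=-1)   # (..., 2 * out_pairs)

if __name__ == "__main__":
    layer = LinkedReLULinear(16, 8)
    x = torch.randn(4, 16)
    layer(x).sum().backward()
    # Gradient reaches every shared weight: no unit in a pair can 'die' alone.
    print((layer.linear.weight.grad != 0).all())
```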
