Profile

Mat Kelcey
Works at Google
Lives in San Francisco Bay Area
321 followers | 306,076 views

Stream

Mat Kelcey

Shared publicly  - 
 
Neural machine translation, a recently proposed approach to machine translation based purely on neural networks, has shown promising results compared to existing approaches such as phrase-based statistical machine translation. Despite its recent success, neural machine translation is limited in its handling of larger vocabularies, as both training complexity and decoding complexity grow in proportion to the number of target words. In this paper, we propose a method, based on importance sampling, that allows us to use a very large target vocabulary without increasing training complexity. We show that decoding can be done efficiently even with a model having a very large target vocabulary, by selecting only a small subset of the whole target vocabulary. Models trained with the proposed approach are empirically found to outperform baseline models with a small vocabulary as well as the LSTM-based neural machine translation models. Furthermore, when we use an ensemble of a few models with very large target vocabularies, we achieve state-of-the-art translation performance (measured by BLEU) on English->German translation and performance almost as high as the state-of-the-art English->French translation system.
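For a rough sense of the core trick, here is a minimal numpy sketch of scoring only a sampled subset of a large target vocabulary instead of the full softmax. The vocabulary size, dimensions, uniform proposal, and all variable names are illustrative assumptions, not the paper's setup.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, n_sampled = 50_000, 64, 500        # stand-in target vocab size, hidden size, sampled candidates

    W = rng.normal(scale=0.01, size=(V, d))  # output embedding for every target word
    h = rng.normal(size=d)                   # decoder hidden state at one time step
    target = 42                              # index of the correct next word

    # Draw a small candidate set that always contains the target word.
    candidates = np.unique(np.concatenate(([target], rng.integers(0, V, n_sampled))))
    logits = W[candidates] @ h               # scores only for the sampled subset
    # With a uniform proposal, the importance-sampling correction (subtracting the log
    # proposal probability) is the same constant for every candidate, so it cancels
    # inside the softmax and is omitted here.
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[np.where(candidates == target)[0][0]])
    print(f"approximate cross-entropy over {len(candidates)} sampled words: {loss:.3f}")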

Mat Kelcey

Shared publicly  - 
 
We present the first massively distributed architecture for deep reinforcement learning. This architecture uses four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed store of experience. We used our architecture to implement the Deep Q-Network algorithm (DQN). Our distributed algorithm was applied to 49 Atari 2600 games from the Arcade Learning Environment, using identical hyperparameters. Our performance surpassed non-distributed DQN in 41 of the 49 games and also reduced the wall-clock time required to achieve these results by an order of magnitude on most games.
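A toy, single-process sketch of the four components named above: actors that generate behaviour, learners that train from stored experience, a shared value function, and a shared experience store. The linear Q-function, the random stand-in environment, and all names and constants are illustrative assumptions, not the paper's distributed implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    N_STATE, N_ACTION, GAMMA, LR = 8, 4, 0.99, 0.01

    theta = rng.normal(scale=0.1, size=(N_STATE, N_ACTION))  # shared value-function parameters
    replay = []                                               # shared store of experience

    def q_values(s):
        return s @ theta                                      # linear stand-in for the Q-network

    def actor_step():
        # An actor generates new behaviour and appends the transition to the shared store.
        s = rng.normal(size=N_STATE)
        a = int(np.argmax(q_values(s))) if rng.random() > 0.1 else int(rng.integers(N_ACTION))
        r, s_next = float(rng.normal()), rng.normal(size=N_STATE)
        replay.append((s, a, r, s_next))

    def learner_step():
        # A learner samples stored experience and applies a Q-learning update to the shared parameters.
        s, a, r, s_next = replay[rng.integers(len(replay))]
        td_error = r + GAMMA * np.max(q_values(s_next)) - q_values(s)[a]
        theta[:, a] += LR * td_error * s

    for step in range(1000):
        for _ in range(4):      # stands in for parallel actors
            actor_step()
        for _ in range(4):      # stands in for parallel learners
            learner_step()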

Mat Kelcey

Shared publicly  - 
 
We replace the Hidden Markov Model (HMM) traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created from a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates comparable to state-of-the-art HMM-based decoders on the TIMIT dataset.
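A small numpy sketch of the content-based attention step described above: the decoder state scores each encoder position, the scores are normalised with a softmax, and the context is the weighted sum of encoder states. The additive scoring form and all dimensions are assumptions for illustration, not the paper's exact layer sizes.

    import numpy as np

    rng = np.random.default_rng(0)
    T, enc_dim, dec_dim, att_dim = 50, 256, 128, 64   # input length and layer widths (illustrative)

    H = rng.normal(size=(T, enc_dim))                 # bi-directional encoder states, one per frame
    s = rng.normal(size=dec_dim)                      # current decoder (RNN) state
    W_h = rng.normal(scale=0.01, size=(enc_dim, att_dim))
    W_s = rng.normal(scale=0.01, size=(dec_dim, att_dim))
    v = rng.normal(scale=0.01, size=att_dim)

    scores = np.tanh(H @ W_h + s @ W_s) @ v           # one relevance score per input position
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                              # attention weights over the input
    context = alpha @ H                               # context used to emit the next phoneme
    print(context.shape)                              # (256,)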

Mat Kelcey

Shared publicly  - 
 
This paper introduces Grid Long Short-Term Memory, a network of LSTM cells arranged in a multidimensional grid that can be applied to vectors, sequences or higher dimensional data such as images. The network differs from existing deep LSTM architectures in that the cells are connected between network layers as well as along the spatiotemporal dimensions of the data. It therefore provides a unified way of using LSTM for both deep and sequential computation. We apply the model to algorithmic tasks such as integer addition and determining the parity of random binary vectors. It is able to solve these problems for 15-digit integers and 250-bit vectors respectively. We then give results for three empirical tasks. We find that 2D Grid LSTM achieves 1.47 bits per character on the Wikipedia character prediction benchmark, which is state-of-the-art among neural approaches. We also observe that a two-dimensional translation model based on Grid LSTM outperforms a phrase-based reference system on a Chinese-to-English translation task, and that 3D Grid LSTM yields a near state-of-the-art error rate of 0.32% on MNIST.
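A condensed numpy sketch of a single 2D grid block along the lines described above: the hidden vectors arriving along the two dimensions are concatenated, and a separate LSTM transform along each dimension produces that dimension's new hidden and memory vectors. The sizes, initialisation, and omission of any input projection are illustrative simplifications, not the paper's exact formulation.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64                                              # hidden size per grid dimension

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_transform(H, m, W, b):
        # One LSTM transform: H is the concatenated incoming hidden state, m the incoming memory.
        i, f, o, g = np.split(W @ H + b, 4)
        m_new = sigmoid(f) * m + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(m_new)
        return h_new, m_new

    # Separate parameters for each of the two grid dimensions (e.g. depth and time).
    W1, b1 = rng.normal(scale=0.1, size=(4 * d, 2 * d)), np.zeros(4 * d)
    W2, b2 = rng.normal(scale=0.1, size=(4 * d, 2 * d)), np.zeros(4 * d)

    h1, m1 = rng.normal(size=d), np.zeros(d)            # state arriving along dimension 1
    h2, m2 = rng.normal(size=d), np.zeros(d)            # state arriving along dimension 2

    H = np.concatenate([h1, h2])                        # hidden vectors are shared across dimensions
    h1_new, m1_new = lstm_transform(H, m1, W1, b1)      # new state emitted along dimension 1
    h2_new, m2_new = lstm_transform(H, m2, W2, b2)      # new state emitted along dimension 2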

Mat Kelcey

Shared publicly  - 
 
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
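The vocabulary-expansion step mentioned above can be pictured as a linear mapping between embedding spaces; the sketch below fits such a map with least squares on the words the two vocabularies share, then pushes an unseen word through it. The dimensions and the random stand-in embeddings are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_shared, d_w2v, d_enc = 5_000, 300, 620           # shared words, pretrained dim, encoder dim

    X = rng.normal(size=(n_shared, d_w2v))             # pretrained vectors for the shared words
    Y = rng.normal(size=(n_shared, d_enc))             # encoder embeddings for the same words

    # Least-squares fit of Y ~ X @ M, i.e. an unregularised linear map between the spaces.
    M, *_ = np.linalg.lstsq(X, Y, rcond=None)

    unseen_word_vec = rng.normal(size=d_w2v)           # pretrained vector for a word not seen in training
    expanded_embedding = unseen_word_vec @ M           # embedding usable by the trained encoder
    print(expanded_embedding.shape)                    # (620,)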

Mat Kelcey

Shared publicly  - 
 
In this paper, we consider the task of learning control policies for text-based games. In these games, all interactions in the virtual world are through text and the underlying state is not observed. The resulting language barrier makes such environments challenging for automatic game players. We employ a deep reinforcement learning framework to jointly learn state representations and action policies using game rewards as feedback. This framework enables us to map text descriptions into vector representations that capture the semantics of the game states. We evaluate our approach on two game worlds, comparing against a baseline with a bag-of-words state representation. Our algorithm outperforms the baseline on quest completion by 54% on a newly created world and by 14% on a pre-existing fantasy game.
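To make the baseline comparison concrete, here is a toy sketch of a bag-of-words state representation with a linear Q-function updated from game rewards; the paper's own model instead learns the state representation jointly with deep reinforcement learning. The tiny vocabulary, action set, and random stand-in rewards and transitions are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = {w: i for i, w in enumerate("you are in a dark room with locked door key".split())}
    actions = ["go north", "open door", "take key"]
    GAMMA, LR, EPS = 0.95, 0.05, 0.1

    theta = np.zeros((len(vocab), len(actions)))       # linear Q-function over text features

    def featurise(text):
        x = np.zeros(len(vocab))
        for w in text.lower().split():
            if w in vocab:
                x[vocab[w]] += 1.0                     # bag-of-words state representation
        return x

    def step(state_text):
        x = featurise(state_text)
        q = x @ theta
        a = int(np.argmax(q)) if rng.random() > EPS else int(rng.integers(len(actions)))
        reward = float(rng.normal())                   # stand-in for the game's quest reward
        next_x = featurise("you are in a dark room")   # stand-in for the next description
        td_error = reward + GAMMA * np.max(next_x @ theta) - q[a]
        theta[:, a] += LR * td_error * x               # Q-learning update from the game reward
        return actions[a], reward

    for _ in range(100):
        step("you are in a dark room with a locked door")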

Mat Kelcey

Shared publicly  - 
 
We consider the task of generative dialogue modeling for movie scripts. To this end, we extend the recently proposed hierarchical recurrent encoder-decoder neural network and demonstrate that this model is competitive with state-of-the-art neural language models and backoff n-gram models. We show that its performance can be improved considerably by bootstrapping the learning from a larger question-answer pair corpus and from pretrained word embeddings.

Mat Kelcey

Shared publicly  - 
 
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
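A compact numpy sketch of the update the abstract summarises: exponential moving averages of the gradient and squared gradient, bias correction, and a per-parameter scaled step. The noisy quadratic objective is an illustrative assumption; the default hyper-parameter values follow the paper.

    import numpy as np

    def adam(grad_fn, theta, steps=1000, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = np.zeros_like(theta)                       # first-moment estimate
        v = np.zeros_like(theta)                       # second-moment estimate
        for t in range(1, steps + 1):
            g = grad_fn(theta)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)               # bias correction
            v_hat = v / (1 - beta2 ** t)
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    # Minimise a noisy quadratic as a stand-in for a stochastic objective.
    rng = np.random.default_rng(0)
    grad = lambda x: 2 * (x - 3.0) + rng.normal(scale=0.1, size=x.shape)
    print(adam(grad, np.zeros(5), steps=5000))         # moves towards 3.0

AdaMax, mentioned at the end of the abstract, replaces the square-rooted second-moment term with a running infinity norm of the gradients.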

Mat Kelcey

Shared publicly  - 
 
Users may strive to formulate an adequate textual query for their information need. Search engines assist the users by presenting query suggestions. To preserve the original search intent, suggestions should be context-aware and account for the previous queries issued by the user. Achieving context awareness is challenging due to data sparsity. We present a probabilistic suggestion model that is able to account for sequences of previous queries of arbitrary lengths. Our novel hierarchical recurrent encoder-decoder architecture allows the model to be sensitive to the order of queries in the context while avoiding data sparsity. Additionally, our model can produce suggestions for rare, or long-tail, queries. The produced suggestions are synthetic and are sampled one word at a time, using computationally cheap decoding techniques. This is in contrast to current synthetic suggestion models relying upon machine learning pipelines and hand-engineered feature sets. Results show that our model outperforms existing context-aware approaches in a next-query prediction setting. In addition to query suggestion, our model is general enough to be used in a variety of other applications.

Mat Kelcey

Shared publicly  - 
 
We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol. The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) question answering and to language modeling. For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn TreeBank and Text8 datasets our approach demonstrates slightly better performance than RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.
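A short numpy sketch of the multiple-hop idea: the query embedding attends over stored memory vectors with a softmax, reads out a weighted sum, adds it to the query, and repeats for a few hops. Using a single pair of memory matrices shared across hops, and these dimensions, are illustrative simplifications rather than the paper's exact weight-tying schemes.

    import numpy as np

    rng = np.random.default_rng(0)
    n_mem, d, hops = 20, 64, 3

    A = rng.normal(scale=0.1, size=(n_mem, d))          # input-side memory embeddings m_i
    C = rng.normal(scale=0.1, size=(n_mem, d))          # output-side memory embeddings c_i
    u = rng.normal(size=d)                              # embedded question / controller state

    for _ in range(hops):
        scores = A @ u
        p = np.exp(scores - scores.max())
        p /= p.sum()                                    # attention over the memory slots
        o = p @ C                                       # read-out from the output memory
        u = u + o                                       # controller state for the next hop

    print(u.shape)                                      # final state, fed to the answer softmax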
note: just a renaming of "weakly supervised memory networks"

Mat Kelcey

Shared publicly  - 
 
Stochastic Gradient Descent (SGD) is one of the most widely used techniques for online optimization in machine learning. In this work, we accelerate SGD by adaptively learning how to sample the most useful training examples at each time step. First, we show that SGD can be used to learn the best possible sampling distribution of an importance sampling estimator. Second, we show that the sampling distribution of a SGD algorithm can be estimated online by incrementally minimizing the variance of the gradient. The resulting algorithm - called Adaptive Weighted SGD (AW-SGD) - maintains a set of parameters to optimize, as well as a set of parameters used to sample training examples. We show that AW-SGD yields faster convergence in three different applications: (i) image classification with deep features, where the sampling of images depends on their labels, (ii) matrix factorization, where rows and columns are not sampled uniformly, and (iii) reinforcement learning, where the optimized and exploration policies are estimated at the same time.
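A small numpy sketch of the general recipe: draw examples from a non-uniform sampling distribution, correct the gradient with importance weights, and adapt the distribution towards examples with large gradients as a crude proxy for minimising gradient variance. The least-squares objective, the weight clipping, and all constants are illustrative assumptions, not the paper's exact AW-SGD updates.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 1000, 10
    X = rng.normal(size=(N, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # toy least-squares problem

    w = np.zeros(d)                            # parameters being optimised
    scores = np.ones(N)                        # unnormalised per-example sampling scores
    LR, SCORE_LR = 0.005, 0.1

    for t in range(50_000):
        q = scores / scores.sum()              # current sampling distribution over examples
        i = rng.choice(N, p=q)
        g = (X[i] @ w - y[i]) * X[i]           # gradient of the squared error on example i
        weight = min(10.0, 1.0 / (N * q[i]))   # importance weight, clipped here for stability
        w -= LR * weight * g
        # Move this example's score towards its gradient magnitude, with a small floor
        # so every example remains sampleable.
        scores[i] = max(1e-3, (1 - SCORE_LR) * scores[i] + SCORE_LR * np.linalg.norm(g))

    print(np.mean((X @ w - y) ** 2))           # training MSE after adaptive sampling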

Mat Kelcey

Shared publicly  - 
 
Most tasks in natural language processing can be cast into question answering (QA) problems over language input. We introduce the dynamic memory network (DMN), a unified neural network framework which processes input sequences and questions, forms semantic and episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN can be trained end-to-end and obtains state-of-the-art results on several types of tasks and datasets: question answering (Facebook's bAbI dataset), sequence modeling for part-of-speech tagging (WSJ-PTB), and text classification for sentiment analysis (Stanford Sentiment Treebank). The model relies exclusively on trained word vector representations and requires no string matching or manually engineered features.
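A simplified numpy illustration of the iterative attention described above: each pass scores the input facts against the question and the current memory, forms an episode from the attended facts, and updates the memory from that episode. The soft attention, the ReLU memory update, and all sizes are loose stand-ins for the DMN's gated episodic memory, not the paper's equations.

    import numpy as np

    rng = np.random.default_rng(0)
    n_facts, d, passes = 10, 64, 3

    F = rng.normal(size=(n_facts, d))                  # encoded input facts
    q = rng.normal(size=d)                             # encoded question
    m = q.copy()                                       # episodic memory, initialised with the question
    W = rng.normal(scale=0.1, size=(d, 3 * d))

    for _ in range(passes):
        scores = F @ q + F @ m                         # attention conditioned on question and prior memory
        g = np.exp(scores - scores.max())
        g /= g.sum()
        episode = g @ F                                # soft summary of the attended facts
        m = np.maximum(0.0, W @ np.concatenate([m, episode, q]))   # memory update for the next pass

    print(m.shape)                                     # final memory fed to the answer module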
Mat's Collections
People
Have them in circles
321 people
Gaurav Arora, Emma Gilmour, Mark Mansour, David Collins, Regina Ruby Bernice Harris, Arseniy Potapov, Sean Taylor, Mark Ryall, Adrian CB
Work
Occupation
Software Engineer
Skills
Machine learning, natural language processing, information retrieval, distributed systems.
Employment
  • Google
    Software Engineer, present
  • Wavii
    Software Engineer
  • Amazon Web Services
    Software Engineer
  • Lonely Planet
    Software Engineer
  • Sensis
    Software Engineer
  • Distra
    Software Engineer
  • Nokia
    Software Engineer
  • Australian Stock Exchange
    Software Engineer
Basic Information
Gender
Decline to State
Story
Tagline
data nerd wannabe
Introduction
I work in the Machine Intelligence group at Google building as-large-as-I-can-get neural networks for knowledge extraction.
Places
Map of the places this user has lived
Currently
San Francisco Bay Area
Previously
seattle - melbourne - calgary - london - sydney - hobart
Links
Contributor to
Mat Kelcey's +1's are the things they like, agree with, or want to recommend.
Clive Barker - Google Play
market.android.com

Clive Barker is an English author, film director, video game designer and visual artist best known for his work in both fantasy and horror f

Aphex Twin - Music on Google Play
market.android.com

Richard David James, best known by his stage name Aphex Twin, is a British electronic musician and composer. He has been described by The Gu

Chess Tactics Pro (Puzzles)
market.android.com

Get better at chess with this large collection of chess puzzles for all levels! This tactic trainer lets you practice in 3 different modes :

Google Search
market.android.com

Google Search app for Android: The fastest, easiest way to find what you need on the web and on your device.* Quickly search the web and you

NetHack
market.android.com

This is an Android port of NetHack: a classic roguelike game originally released in 1987. Main features: * User-friendly interfa

Improving Photo Search: A Step Across the Semantic Gap
googleresearch.blogspot.com

Posted by Chuck Rosenberg, Image Search Team Last month at Google I/O, we showed a major upgrade to the photos experience: you can now easil

Machine Learning - Stanford University
ml-class.org

A bold experiment in distributed education, "Machine Learning" will be offered free and online to students worldwide during the fa

Game Theory
www.game-theory-class.org

Game Theory is a free online class taught by Matthew Jackson and Yoav Shoham.

Probabilistic Graphical Models
www.pgm-class.org

Probabilistic Graphical Models is a free online class taught by Daphne Koller.

RStudio
rstudio.org

News. RStudio v0.94 Available (6/15/2011). RStudio v0.94 is now available. In this release we've made lots of enhancements based on the

Hadoop 0.20.205.0 API
hadoop.apache.org

Frame Alert. This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web

Shapecatcher.com: Unicode Character Recognition
shapecatcher.com

You need to find a specific Unicode Character? With Shapecatcher.com you can search through a database of characters by simply drawing your

Duncan & Sons Automotive Service Center
plus.google.com

Duncan & Sons Automotive Service Center hasn't shared anything on this page with you.

Natural Language Processing
www.nlp-class.org

Natural Language Processing is a free online class taught by Chris Manning and Dan Jurafsky.

name value description hadoop.tmp.dir /tmp/hadoop-${user.name} A ...
hadoop.apache.org

name, value, description. hadoop.tmp.dir, /tmp/hadoop-${user.name}, A base for other temporary directories. hadoop.native.lib, true, Should

Apache OpenNLP Developer Documentation
incubator.apache.org

Written and maintained by the Apache OpenNLP Development Community. Version 1.5.2-incubating. Copyright © , The Apache Software Foundation.

ggplot.
had.co.nz

ggplot. An implementation of the grammar of graphics in R. Check out the documentation for ggplot2 - the next generation. ggplot is an imple

ChainMapper (Hadoop 0.20.1 API)
hadoop.apache.org

public class ChainMapper; extends Object; implements Mapper. The ChainMapper class allows to use multiple Mapper classes within a single Map

Neural net language models - Scholarpedia
www.scholarpedia.org

A language model is a function, or an algorithm for learning such a function, that captures the salient statistical characteristics of the d

tech stuff by mat kelcey
www.matpalm.com

my nerd blog. latent semantic analysis via the singular value decomposition (for dummies). semi supervised naive bayes. statistical synonyms
