"Mastering the game of Go with deep neural networks and tree search", Silver et al 2016 ( https://storage.googleapis.com/deepmind-data/assets/papers/deepmind-mastering-go.pdf / https://www.dropbox.com/s/vbv639tavdza2l3/2016-silver.pdf )
"The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state- of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
Monte Carlo tree search (MCTS) 11,12 uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function 12 . The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves 13 . These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play 13–15 .
Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example, image classification 17 , face recognition 18 , and playing Atari games 19 . They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localized representations of an image 20 . We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network. We train the neural networks using a pipeline consisting of several stages of machine learning (Fig. 1). We begin by training a supervised learning (SL) policy network p σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work 13,15 , we also train a fast policy p π that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network p ρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network v θ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.
We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4% at date of submission 24 (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Fig. 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy p π (a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network.
Reinforcement learning of policy networks
The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL) 25,26 . The RL policy network p ρ is identical in structure to the SL policy network, and its weights ρ are initialized to the same values, ρ = σ. We play games between the current policy network p ρ and a randomly selected previous iteration of the policy network. Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. We use a reward function r(s) that is zero for all non-terminal time steps t < T. The outcome z t = ± r(s T ) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing. Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes expected outcome 25
When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi 14 , a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi.
The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting. The problem is that successive positions are strongly correlated, differing by just one stone, but the regression target is shared for the entire game. When trained on the KGS data set in this way, the value network memorized the game outcomes rather than generalizing to new positions, achieving a minimum MSE of 0.37 on the test set, compared to 0.19 on the training set. To mitigate this problem, we generated a new self-play data set consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated. Training on this data set led to MSEs of 0.226 and 0.234 on the training and test set respectively, indicating minimal overfitting. Figure 2b shows the position evaluation accuracy of the value network, compared to Monte Carlo rollouts using the fast rollout policy p π ; the value function was consistently more accurate. A single evaluation of v θ (s) also approached the accuracy of Monte Carlo rollouts using the RL policy network p ρ , but using 15,000 times less computation.
Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.
In addition, we included the open source program GnuGo, a Go program using state-of-the-art search methods that preceded MCTS. All programs were allowed 5 s of computation time per move.
The results of the tournament (see Fig. 4a) suggest that singlemachine AlphaGo is many dan ranks stronger than any previous Go program, winning 494 out of 495 games (99.8%) against other Go programs. To provide a greater challenge to AlphaGo, we also played games with four handicap stones (that is, free moves for the opponent); AlphaGo won 77%, 86%, and 99% of handicap games against Crazy Stone, Zen and Pachi, respectively. The distributed version of AlphaGo was significantly stronger, winning 77% of games against single-machine AlphaGo and 100% of its games against other programs.
Finally, we evaluated the distributed version of AlphaGo against Fan Hui, a professional 2 dan, and the winner of the 2013, 2014 and 2015 European Go championships. Over 5–9 October 2015 AlphaGo and Fan Hui competed in a formal five-game match. AlphaGo won the match 5 games to 0 (Fig. 6 and Extended Data Table 1). This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go—a feat that was previously believed to be at least a decade away 3,7,31 .
During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov 4 ; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network—an approach that is perhaps closer to how humans play. Furthermore, while Deep Blue relied on a handcrafted evaluation function, the neural networks of AlphaGo are trained directly from gameplay purely through general-purpose supervised and reinforcement learning methods.
MCTS has previously been combined with a policy that is used to narrow the beam of the search tree to high-probability moves 13 ; or to bias the bonus term towards high-probability moves 48 . MCTS has also been combined with a value function that is used to initialize action values in newly expanded nodes 16 , or to mix Monte Carlo evaluation with minimax evaluation 49 . By contrast, AlphaGo’s use of value functions is based on truncated Monte Carlo search algorithms 8,9 , which terminate rollouts before the end of the game and use a value function in place of the terminal reward. AlphaGo’s position evaluation mixes full rollouts with truncated rollouts, resembling in some respects the well-known temporal-difference learning algorithm TD(λ). AlphaGo also differs from prior work by using slower but more powerful representations of the policy and value function; evaluating deep neural networks is several orders of magnitude slower than linear representations and must therefore occur asynchronously. The performance of MCTS is to a large degree determined by the quality of the rollout policy. Prior work has focused on handcrafted patterns 50 or learning rollout policies by supervised learning 13 , reinforcement learning 16 , simulation balancing 51,52 or online adaptation 30,53 ; however, it is known that rollout-based position evaluation is frequently inaccurate 54 . AlphaGo uses relatively simple rollouts, and instead addresses the challenging problem of position evaluation more directly using value networks.
Symmetries. In previous work, the symmetries of Go have been exploited by using rotationally and reflectionally invariant filters in the convolutional layers 24,28,29 . Although this may be effective in small neural networks, it actually hurts performance in larger networks, as it prevents the intermediate filters from identifying specific asymmetric patterns 23 . Instead, we exploit symmetries at run-time by dynamically transforming each position s using the dihedral group of eight reflections and rotations, d 1 (s), ..., d 8 (s). In an explicit symmetry ensemble, a mini-batch of all 8 positions is passed into the policy network or value network and computed in parallel. For the value network, the output values are simply averaged, v θ ( s ) = 1 ∑ 8 j = 1 v θ ( d j ( s )) . For the policy network, the planes of output probabilities 8 are rotated/reflected back into the original orientation, and averaged together to provide an ensemble prediction, p σ (⋅| s ) = 1 ∑ 8 j = 1 d − j 1 ( p σ (⋅| d j ( s ))) ; this approach 8 was used in our raw network evaluation (see Extended Data Table 3). Instead, APV-MCTS makes use of an implicit symmetry ensemble that randomly selects a single rotation/reflection j ∈ [1, 8] for each evaluation. We compute exactly one evaluation for that orientation only; in each simulation we compute the value of leaf node s L by v θ (d j (s L )), and allow the search procedure to average over these evaluations. Similarly, we compute the policy network for a single, randomly selected rotation/reflection, d − j 1 ( p σ (⋅| d j ( s )))
Updates were applied asynchronously on 50 GPUs using DistBelief 61 ; gradients older than 100 steps were discarded. Training took around 3 weeks for 340 million training steps.
Policy network: reinforcement learning. We further trained the policy network by policy gradient reinforcement learning 25,26 . Each iteration consisted of a minibatch of n games played in parallel, between the current policy network p ρ that is being trained, and an opponent p ρ − that uses parameters ρ − from a previous iteration, randomly sampled from a pool of opponents, so as to increase the stability of training. Weights were initialized to ρ = ρ − = σ. Every 500 iterations, we added the current parameters ρ to the opponent pool. Each game i in the mini-batch was played out until termination at step T i , and then scored to determine the outcome z i t = ± r ( s T i ) from each player’s perspective. The games were then replayed to
i
∂ log p ( a i | s i )
ρ t t
determine the policy gradient update, ∆ρ = α ∑ n i = 1 ∑ T t = 1
( z i t − v ( s i t )) ,
n
∂ ρ
i
25
using the REINFORCE algorithm with baseline v ( s t ) for variance reduction. On the first pass through the training pipeline, the baseline was set to zero; on the second pass we used the value network v θ (s) as a baseline; this provided a small performance boost. The policy network was trained in this way for 10,000 minibatches of 128 games, using 50 GPUs, for one day. Value network: regression. We trained a value network v θ ( s ) ≈ v p ρ ( s ) to approximate the value function of the RL policy network p ρ . To avoid overfitting to the strongly correlated positions within games, we constructed a new data set of uncorrelated self-play positions. This data set consisted of over 30 million positions, each drawn from a unique game of self-play. Each game was generated in three phases by randomly sampling a time step U ~ unif{1, 450}, and sampling the first t = 1,... U − 1 moves from the SL policy network, a t ~ p σ (·|s t ); then sampling one move uniformly at random from available moves, a U ~ unif{1, 361} (repeatedly until a U is legal); then sampling the remaining sequence of moves until the game terminates, t = U + 1, ... T, from the RL policy network, a t ~ p ρ (·|s t ). Finally, the game is scored to determine the outcome z t = ±r(s T ). Only a single training example (s U+1 , z U+1 ) is added to the data set from each game. This data provides unbiased samples of the value function v p ρ ( s U + 1 ) = E [ z U + 1 | s U + 1 , a U + 1, ... T ~ p ρ ] . During the first two phases of generation we sample from noisier distributions so as to increase the diversity of the data set. The training method was identical to SL policy network training, except that the parameter update was based on mean squared error between the predicted values and the observed rewards,
∂ v ( s k )
α
θ
k
k
∆θ = m
∑ m
k = 1 ( z − v θ ( s )) ∂ θ . The value network was trained for 50 million mini-batches of 32 positions, using 50 GPUs, for one week.
[3 weeks + 0.14 weeks + 1 weeks = 4.14 weeks]
[looks like to train the entire thing took 4.14 weeks using 50 GPUs. so about 3.84 years with 1 GPU? Although you do get a big speedup from doing everything locally without the overhead of networking]
All programs received a maximum of 5 s computation time per move; ...With the exception of distributed AlphaGo, each computer Go program was executed on its own single machine, with identical specifications, using the latest available version and the best hardware configuration supported by that program (see Extended Data Table 6). In Fig. 4, approximate ranks of computer programs are based on the highest KGS rank achieved by that program; however, the KGS version may differ from the publicly available version.
The match against Fan Hui was arbitrated by an impartial referee. Five formal games and five informal games were played with 7.5 komi, no handicap, and Chinese rules. AlphaGo won these games 5–0 and 3–2 respectively (Fig. 6 and Extended Data Table 1). Time controls for formal games were 1 h main time plus three periods of 30 s byoyomi. Time controls for informal games were three periods of 30 s byoyomi. Time controls and playing conditions were chosen by Fan Hui in advance of the match; it was also agreed that the overall match outcome would be determined solely by the formal games. To approximately assess the relative rating of Fan Hui to computer Go programs, we appended the results of all ten games to our internal tournament results, ignoring differences in time controls."
More links:
https://www.youtube.com/watch?v=SUbqykXVx0A
https://googleblog.blogspot.com/2016/01/alphago-machine-learning-game-go.html
The games: http://www.usgo.org/news/2016/01/alphago-beats-pro-5-0-in-major-ai-advance/
https://news.ycombinator.com/threads?id=Inufu DeepMind employee comments
http://www.technologyreview.com/news/546066/googles-ai-masters-the-game-of-go-a-decade-earlier-than-expected/
http://www.bloomberg.com/news/articles/2016-01-27/google-computers-defeat-human-players-at-2-500-year-old-board-game
Even if Lee Sedol wins against AlphaGo in the March $1m tournament in Seoul, Go is nearly done, whether Google or Facebook; within 5 years, more likely within 1. So I don't find the title hyperbolic or one jot short of the truth. Remember that MCTS scales well with increasing hardware, and they have all the additional training time and tweaks since they finalized the numbers in order to write the paper (the match was in October, so anywhere up to 120 days ago). The NN approaches have been developed for, what, the first CNN paper came out a year or so ago? Progress has been lightning quick; it was not that many months ago that /r/machinelearning was laughing at how the FB NN couldn't handle ladders. Remember also how it went with chess: it was not long between regularly beating professional to beating Kasparov.
#go #ai #reinforcementlearning
"The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state- of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
Monte Carlo tree search (MCTS) 11,12 uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function 12 . The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves 13 . These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play 13–15 .
Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example, image classification 17 , face recognition 18 , and playing Atari games 19 . They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localized representations of an image 20 . We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network. We train the neural networks using a pipeline consisting of several stages of machine learning (Fig. 1). We begin by training a supervised learning (SL) policy network p σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work 13,15 , we also train a fast policy p π that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network p ρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network v θ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.
We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4% at date of submission 24 (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Fig. 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy p π (a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network.
Reinforcement learning of policy networks
The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL) 25,26 . The RL policy network p ρ is identical in structure to the SL policy network, and its weights ρ are initialized to the same values, ρ = σ. We play games between the current policy network p ρ and a randomly selected previous iteration of the policy network. Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. We use a reward function r(s) that is zero for all non-terminal time steps t < T. The outcome z t = ± r(s T ) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing. Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes expected outcome 25
When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi 14 , a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi.
The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting. The problem is that successive positions are strongly correlated, differing by just one stone, but the regression target is shared for the entire game. When trained on the KGS data set in this way, the value network memorized the game outcomes rather than generalizing to new positions, achieving a minimum MSE of 0.37 on the test set, compared to 0.19 on the training set. To mitigate this problem, we generated a new self-play data set consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated. Training on this data set led to MSEs of 0.226 and 0.234 on the training and test set respectively, indicating minimal overfitting. Figure 2b shows the position evaluation accuracy of the value network, compared to Monte Carlo rollouts using the fast rollout policy p π ; the value function was consistently more accurate. A single evaluation of v θ (s) also approached the accuracy of Monte Carlo rollouts using the RL policy network p ρ , but using 15,000 times less computation.
Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.
In addition, we included the open source program GnuGo, a Go program using state-of-the-art search methods that preceded MCTS. All programs were allowed 5 s of computation time per move.
The results of the tournament (see Fig. 4a) suggest that singlemachine AlphaGo is many dan ranks stronger than any previous Go program, winning 494 out of 495 games (99.8%) against other Go programs. To provide a greater challenge to AlphaGo, we also played games with four handicap stones (that is, free moves for the opponent); AlphaGo won 77%, 86%, and 99% of handicap games against Crazy Stone, Zen and Pachi, respectively. The distributed version of AlphaGo was significantly stronger, winning 77% of games against single-machine AlphaGo and 100% of its games against other programs.
Finally, we evaluated the distributed version of AlphaGo against Fan Hui, a professional 2 dan, and the winner of the 2013, 2014 and 2015 European Go championships. Over 5–9 October 2015 AlphaGo and Fan Hui competed in a formal five-game match. AlphaGo won the match 5 games to 0 (Fig. 6 and Extended Data Table 1). This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go—a feat that was previously believed to be at least a decade away 3,7,31 .
During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov 4 ; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network—an approach that is perhaps closer to how humans play. Furthermore, while Deep Blue relied on a handcrafted evaluation function, the neural networks of AlphaGo are trained directly from gameplay purely through general-purpose supervised and reinforcement learning methods.
MCTS has previously been combined with a policy that is used to narrow the beam of the search tree to high-probability moves 13 ; or to bias the bonus term towards high-probability moves 48 . MCTS has also been combined with a value function that is used to initialize action values in newly expanded nodes 16 , or to mix Monte Carlo evaluation with minimax evaluation 49 . By contrast, AlphaGo’s use of value functions is based on truncated Monte Carlo search algorithms 8,9 , which terminate rollouts before the end of the game and use a value function in place of the terminal reward. AlphaGo’s position evaluation mixes full rollouts with truncated rollouts, resembling in some respects the well-known temporal-difference learning algorithm TD(λ). AlphaGo also differs from prior work by using slower but more powerful representations of the policy and value function; evaluating deep neural networks is several orders of magnitude slower than linear representations and must therefore occur asynchronously. The performance of MCTS is to a large degree determined by the quality of the rollout policy. Prior work has focused on handcrafted patterns 50 or learning rollout policies by supervised learning 13 , reinforcement learning 16 , simulation balancing 51,52 or online adaptation 30,53 ; however, it is known that rollout-based position evaluation is frequently inaccurate 54 . AlphaGo uses relatively simple rollouts, and instead addresses the challenging problem of position evaluation more directly using value networks.
Symmetries. In previous work, the symmetries of Go have been exploited by using rotationally and reflectionally invariant filters in the convolutional layers 24,28,29 . Although this may be effective in small neural networks, it actually hurts performance in larger networks, as it prevents the intermediate filters from identifying specific asymmetric patterns 23 . Instead, we exploit symmetries at run-time by dynamically transforming each position s using the dihedral group of eight reflections and rotations, d 1 (s), ..., d 8 (s). In an explicit symmetry ensemble, a mini-batch of all 8 positions is passed into the policy network or value network and computed in parallel. For the value network, the output values are simply averaged, v θ ( s ) = 1 ∑ 8 j = 1 v θ ( d j ( s )) . For the policy network, the planes of output probabilities 8 are rotated/reflected back into the original orientation, and averaged together to provide an ensemble prediction, p σ (⋅| s ) = 1 ∑ 8 j = 1 d − j 1 ( p σ (⋅| d j ( s ))) ; this approach 8 was used in our raw network evaluation (see Extended Data Table 3). Instead, APV-MCTS makes use of an implicit symmetry ensemble that randomly selects a single rotation/reflection j ∈ [1, 8] for each evaluation. We compute exactly one evaluation for that orientation only; in each simulation we compute the value of leaf node s L by v θ (d j (s L )), and allow the search procedure to average over these evaluations. Similarly, we compute the policy network for a single, randomly selected rotation/reflection, d − j 1 ( p σ (⋅| d j ( s )))
Updates were applied asynchronously on 50 GPUs using DistBelief 61 ; gradients older than 100 steps were discarded. Training took around 3 weeks for 340 million training steps.
Policy network: reinforcement learning. We further trained the policy network by policy gradient reinforcement learning 25,26 . Each iteration consisted of a minibatch of n games played in parallel, between the current policy network p ρ that is being trained, and an opponent p ρ − that uses parameters ρ − from a previous iteration, randomly sampled from a pool of opponents, so as to increase the stability of training. Weights were initialized to ρ = ρ − = σ. Every 500 iterations, we added the current parameters ρ to the opponent pool. Each game i in the mini-batch was played out until termination at step T i , and then scored to determine the outcome z i t = ± r ( s T i ) from each player’s perspective. The games were then replayed to
i
∂ log p ( a i | s i )
ρ t t
determine the policy gradient update, ∆ρ = α ∑ n i = 1 ∑ T t = 1
( z i t − v ( s i t )) ,
n
∂ ρ
i
25
using the REINFORCE algorithm with baseline v ( s t ) for variance reduction. On the first pass through the training pipeline, the baseline was set to zero; on the second pass we used the value network v θ (s) as a baseline; this provided a small performance boost. The policy network was trained in this way for 10,000 minibatches of 128 games, using 50 GPUs, for one day. Value network: regression. We trained a value network v θ ( s ) ≈ v p ρ ( s ) to approximate the value function of the RL policy network p ρ . To avoid overfitting to the strongly correlated positions within games, we constructed a new data set of uncorrelated self-play positions. This data set consisted of over 30 million positions, each drawn from a unique game of self-play. Each game was generated in three phases by randomly sampling a time step U ~ unif{1, 450}, and sampling the first t = 1,... U − 1 moves from the SL policy network, a t ~ p σ (·|s t ); then sampling one move uniformly at random from available moves, a U ~ unif{1, 361} (repeatedly until a U is legal); then sampling the remaining sequence of moves until the game terminates, t = U + 1, ... T, from the RL policy network, a t ~ p ρ (·|s t ). Finally, the game is scored to determine the outcome z t = ±r(s T ). Only a single training example (s U+1 , z U+1 ) is added to the data set from each game. This data provides unbiased samples of the value function v p ρ ( s U + 1 ) = E [ z U + 1 | s U + 1 , a U + 1, ... T ~ p ρ ] . During the first two phases of generation we sample from noisier distributions so as to increase the diversity of the data set. The training method was identical to SL policy network training, except that the parameter update was based on mean squared error between the predicted values and the observed rewards,
∂ v ( s k )
α
θ
k
k
∆θ = m
∑ m
k = 1 ( z − v θ ( s )) ∂ θ . The value network was trained for 50 million mini-batches of 32 positions, using 50 GPUs, for one week.
[3 weeks + 0.14 weeks + 1 weeks = 4.14 weeks]
[looks like to train the entire thing took 4.14 weeks using 50 GPUs. so about 3.84 years with 1 GPU? Although you do get a big speedup from doing everything locally without the overhead of networking]
All programs received a maximum of 5 s computation time per move; ...With the exception of distributed AlphaGo, each computer Go program was executed on its own single machine, with identical specifications, using the latest available version and the best hardware configuration supported by that program (see Extended Data Table 6). In Fig. 4, approximate ranks of computer programs are based on the highest KGS rank achieved by that program; however, the KGS version may differ from the publicly available version.
The match against Fan Hui was arbitrated by an impartial referee. Five formal games and five informal games were played with 7.5 komi, no handicap, and Chinese rules. AlphaGo won these games 5–0 and 3–2 respectively (Fig. 6 and Extended Data Table 1). Time controls for formal games were 1 h main time plus three periods of 30 s byoyomi. Time controls for informal games were three periods of 30 s byoyomi. Time controls and playing conditions were chosen by Fan Hui in advance of the match; it was also agreed that the overall match outcome would be determined solely by the formal games. To approximately assess the relative rating of Fan Hui to computer Go programs, we appended the results of all ten games to our internal tournament results, ignoring differences in time controls."
More links:
https://www.youtube.com/watch?v=SUbqykXVx0A
https://googleblog.blogspot.com/2016/01/alphago-machine-learning-game-go.html
The games: http://www.usgo.org/news/2016/01/alphago-beats-pro-5-0-in-major-ai-advance/
https://news.ycombinator.com/threads?id=Inufu DeepMind employee comments
http://www.technologyreview.com/news/546066/googles-ai-masters-the-game-of-go-a-decade-earlier-than-expected/
http://www.bloomberg.com/news/articles/2016-01-27/google-computers-defeat-human-players-at-2-500-year-old-board-game
Even if Lee Sedol wins against AlphaGo in the March $1m tournament in Seoul, Go is nearly done, whether Google or Facebook; within 5 years, more likely within 1. So I don't find the title hyperbolic or one jot short of the truth. Remember that MCTS scales well with increasing hardware, and they have all the additional training time and tweaks since they finalized the numbers in order to write the paper (the match was in October, so anywhere up to 120 days ago). The NN approaches have been developed for, what, the first CNN paper came out a year or so ago? Progress has been lightning quick; it was not that many months ago that /r/machinelearning was laughing at how the FB NN couldn't handle ladders. Remember also how it went with chess: it was not long between regularly beating professional to beating Kasparov.
#go #ai #reinforcementlearning
http://biz.chosun.com/site/data/html_dir/2016/01/28/2016012800398.html (Korean) says March 8-15.38w