AlphaZero

A brief walkthrough of the algorithm and a small toy implementation — for a neuro-AI course presentation.

Below: how AlphaZero learns Go, chess, and shogi from nothing; what each component contributes; and what happened when we trained it ourselves — on Connect Four and on 9 × 9 Go.


2 games

Connect Four & 9 × 9 Go, one algorithm

28,500

tournament games, Elo-ranked with CIs

50 %

vs a perfect solver — the theoretical ceiling

1 GPU

consumer hardware, end to end


What it is

Takeaway. AlphaZero is one neural network plus one search procedure, taught by self-play, that masters Go, chess, and shogi from scratch.

One network — a residual ConvNet with two output heads: a policy head (a probability over moves) and a value head (a single scalar for who's winning).
One search — Monte Carlo Tree Search guided by the policy prior and grounded by the value estimate. The search produces a sharper move distribution than the raw network.
Self-play loop — the network plays itself; the search distribution becomes the policy training target; the eventual game outcome becomes the value training target. No human games are ever shown to it.
Same algorithm, three games — chess, shogi, Go: the same loop and, apart from the exploration noise and the learning-rate schedule, the same hyperparameters. Beats Stockfish, Elmo, and AlphaGo Zero within hours of training.

Inputs

17 binary planes

stones × 8 history + colour to move

Outputs

policy + value

$p(a\,|\,s)$ over actions, scalar $v(s)\in[-1,+1]$

Training signal

self-play only

no human games, no rollouts, no features

Where to start

Documentary · 1 h 30 m

AlphaGo — The Movie

DeepMind's documentary on the 2016 match against Lee Sedol. Where the story starts.

Article · DeepMind blog · Oct 2017

AlphaGo Zero: Starting from scratch

DeepMind's plain-English summary of the no-human-data result.

The story behind AlphaZero (long read, ~6 min)

The wall

In 1997, IBM's Deep Blue defeated chess world champion Garry Kasparov. The frontier of human-vs-machine games moved east — to Go.

Go is a 4,000-year-old strategy game from China. The rules fit on a postcard, but the game is far harder to play well than chess. In a typical chess position, a player has about 35 legal moves to consider. In Go, it's around 250. Go has more legal board positions than there are atoms in the observable universe.

Brute force — the strategy that worked for chess — was hopeless. But the deeper problem was evaluation. In chess, a static rule like "queen worth 9, rook worth 5" gives a useful first guess at who is winning. In Go, no comparable rule exists. A position is good or bad based on patterns and influence and territory that strong players describe as "intuition" — and intuition has historically been the one thing machines couldn't fake.

Through the 1990s and into the 2000s, the best Go programs were stuck at amateur level, well short of professionals. Most experts thought it would take another decade before a computer could beat a top professional. Some thought it might never happen.

The breakthrough

The first real progress came from a French researcher named Rémi Coulom. In 2006 he introduced Monte Carlo Tree Search to Go. Instead of trying to enumerate all positions, MCTS played out random games from the current position and used the win rates to estimate which moves were good. It was crude — but it was the first algorithm that didn't need expert evaluation rules to play decent Go. Programs jumped from amateur to strong amateur level within a few years.

The next leap came in 2014–2015, when researchers showed that convolutional neural networks — the same architectures used for image recognition — could be trained on millions of human Go games to predict expert moves. The policy networks were uncanny: they could suggest moves a strong amateur might play, with no handcrafted rules.

DeepMind combined these two ideas. AlphaGo used a CNN trained on 30 million human moves to bias an MCTS search, with an additional value network to estimate who was winning. In October 2015 it defeated Fan Hui, the European Go champion, 5–0 in an even match — the first time a computer had beaten a professional Go player on a level playing field.

But Fan Hui wasn't at the very top of the rankings. The next step would be to play one of the legends of the game.

Seoul, March 2016

The match was held in Seoul over five games, with a million-dollar prize. The opponent: Lee Sedol, an 18-time world champion regarded by many as the strongest player of his generation. The match was watched live by over 280 million people, most of the Korean public among them.

Game 1: AlphaGo won. The crowd was stunned, but the game looked like one a strong human might have lost; commentators thought Lee had underestimated his opponent.

Game 2 was different. On move 37, AlphaGo played a shoulder hit on the fifth line — a move so unusual that the live commentator initially thought it was a mistake. In four decades of professional Go, no top player had played that move in that position. Within twenty moves, it became clear the move was brilliant. AlphaGo won.

Game 3: AlphaGo won again. The series was 3–0 in a best-of-five.

Game 4: Lee, playing white, found a move of his own. Move 78 — described afterwards as a "divine move," the kind of inspired strike that wins championship games. AlphaGo's evaluation collapsed; the system began playing moves that even amateurs could see were errors. Lee won. It was the only game AlphaGo would lose to a top human player. Ever.

Game 5: AlphaGo won. Final score 4–1.

In interviews afterwards, Lee reflected: "AlphaGo's moves were not always the moves a human would play, but they were good." The DeepMind team — engineers, researchers, Demis Hassabis — were jubilant but visibly emotional. They had not been certain they would win.

From learning Go to teaching itself

In May 2017, AlphaGo played the (then) world number one, Ke Jie. 3–0. After the match, DeepMind retired AlphaGo from competitive play.

Then came the surprises. In October 2017, DeepMind published AlphaGo Zero. It dropped human games entirely. It dropped the handcrafted rollouts. A single neural network learning from scratch via self-play, guided by a simpler MCTS. In three days of training it surpassed the original AlphaGo. In forty days it reached strength beyond anything DeepMind had built before.

Paper · Nature 550, 354–359

Mastering the game of Go without human knowledge

Silver et al. The original AlphaGo Zero paper. The curve in the thumbnail is Figure 6 — strength climbing past Lee-Sedol-era AlphaGo, from random play in just a few days.

Two months later, in December 2017, came AlphaZero: the same algorithm — same hyperparameters, same loop — applied to chess, shogi, and Go. Within hours, AlphaZero was defeating Stockfish (the strongest chess engine), Elmo (the strongest shogi engine), and AlphaGo Zero at its own game.

The chess world reaction was one of disorientation. AlphaZero played a sacrificial, positional, intuition-heavy style that experienced grandmasters described as "alien" — beautiful, but unlike how computer engines had ever played. Magnus Carlsen would later say AlphaZero had taught him things about chess he had never considered. Garry Kasparov wrote that AlphaZero "approaches the game of chess in the same way that humans do."

Paper · Science 362, 1140–1144

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

Silver et al. The AlphaZero paper. The thumbnail evokes Figure 1 — three training runs, one algorithm, three games beaten in hours.

Aftermath

In November 2019, Lee Sedol retired from professional Go. In a press conference he said: "Even if I become the number one, there is an entity that cannot be defeated."

That entity is the subject of this explainer.

The loop

Takeaway. Self-play makes data, training updates the network, the new network plays better — repeat. Each cycle is one iteration.

Self-play — the current network plays itself many times, choosing each move with MCTS.
Buffer — every position $(s, \pi, z)$ is stored: state, MCTS visit distribution, eventual outcome.
Train — minibatches from the buffer; loss has a value term, a policy term, and L2.
Evaluate — the new network is checked against the previous best (AGZ) or simply adopted (AZ).

Fig. 1. The outer loop. One pass through the four steps is called an iteration. AlphaGo Zero ran this loop over ~4.9 million self-play games; our own runs (§9–§11) use 6–40 iterations of 500–5,000 games each.

The network

Takeaway. A neural network that looks at the board and gives back two things — its best guess for where to play, and its best guess for who's winning.

What kind of network?

It's a convolutional neural network (ConvNet): the same family of network used for image recognition. A ConvNet slides small filters over a grid, looking for local patterns — corners, edges, textures in an image; eyes, threats, connected groups in a Go board. Stacked deep, it builds higher-level features from those local patterns.

AlphaZero uses a residual ConvNet (or "ResNet"): a deep ConvNet where each block adds its input back to its output. The skip connection lets gradients flow cleanly through dozens of layers, which is what makes deep networks trainable.

What goes in

The board is fed to the network as a stack of binary grids. Each grid is the same size as the board — 9 × 9 for our Go variant — and each cell holds either a 0 or a 1.

AlphaGo Zero used 17 such grids per position, encoding:

8 planes — my stones. Where my stones are now, and where they were on each of the previous 7 turns (so 8 total snapshots).
8 planes — opponent's stones. Same, for the other colour.
1 plane — colour to move. All 1s if black is to move, all 0s if white is.
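The plane layout above can be sketched in plain Python. This is a minimal illustration, not the authors' encoder: the cell encoding (0 = empty, 1 = black, 2 = white), the function name `encode_planes`, and the grouping of planes (all "my stones" planes first, then the opponent's) are assumptions made here for readability. The colour plane follows the AlphaGo Zero convention of 1 for black to move.

```python
from typing import List

Board = List[List[int]]  # assumed encoding: 0 = empty, 1 = black, 2 = white


def encode_planes(history: List[Board], to_move: int,
                  size: int = 9, steps: int = 8) -> List[Board]:
    """Build 2 * steps + 1 binary planes from past boards (most recent last)."""
    opponent = 2 if to_move == 1 else 1
    empty = [[0] * size for _ in range(size)]

    # Most recent position first; pad with empty boards early in the game.
    recent = list(reversed(history[-steps:]))
    recent += [empty] * (steps - len(recent))

    def mask(board: Board, colour: int) -> Board:
        # 1 where a stone of the given colour sits, 0 elsewhere.
        return [[1 if cell == colour else 0 for cell in row] for row in board]

    planes = [mask(b, to_move) for b in recent]       # 8 planes: my stones
    planes += [mask(b, opponent) for b in recent]     # 8 planes: opponent's stones
    fill = 1 if to_move == 1 else 0                   # colour plane: 1 = black to move
    planes.append([[fill] * size for _ in range(size)])
    return planes


board = [[0] * 9 for _ in range(9)]
board[2][2] = 1                                       # one black stone
planes = encode_planes([board], to_move=2)            # white to move
```

With a single black stone on the board and white to move, the stone shows up in the first opponent plane and the colour plane is all zeros.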

Why the history? Because of the ko rule: in Go, you may not play a move that recreates an earlier board position. The network can't enforce that without seeing what the recent past looked like.

The shape 9 × 9 × 17 is specific to 9 × 9 Go. For 19 × 19 Go the planes are 19 × 19; for chess the planes encode pieces instead (one plane per piece type per colour, plus a few planes for castling rights, the move clock, and so on). The algorithm stays the same; only the input encoding is game-specific.

Our own implementation (§11) trims this to four planes — my stones, opponent stones, whose turn, and a bias plane, with no history stack. We can afford that because our rules engine masks illegal moves (including ko) before the policy is ever sampled, so the network doesn't have to learn legality from history planes. The trade-off: our network cannot condition on the recent past the way AlphaGo Zero's could.

What comes out

The trunk's last layer feeds into two short "heads" that each produce a different kind of answer.

Policy head — a probability distribution over all 82 actions (81 board points + the pass action). It's saying: "Given this position, I'd estimate B4 is good with probability 0.35, E5 with 0.10, D3 with 0.02, …" The probabilities sum to 1 and reflect the network's intuition about good moves.

Value head — a single scalar between −1 and +1. It's saying: "From this position, I expect to …" +1 = "win for sure", −1 = "lose for sure", 0 = "even", +0.4 = "slightly winning". This is the network's intuition about who's ahead.

Why one shared trunk?

Two heads, but one shared body of layers. Features that help rank moves are largely the same features that help judge positions: knowing that a group with two eyes is alive matters for value, and for policy it means no urgent move is needed there. Forcing both heads through the same trunk acts as an implicit regularizer; the AlphaGo Zero paper found that splitting it into separate policy and value networks reduced playing strength.

Layer specification

Trunk: small input convolution → N residual blocks (3×3 conv → batch-norm → ReLU → 3×3 conv → batch-norm → skip → ReLU). AlphaGo Zero used 20 blocks of 256 filters.
Policy head: 1×1 conv → flatten → linear → softmax over 82 actions.
Value head: 1×1 conv → flatten → hidden linear → linear → tanh.
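The last step of each head can be sketched without any deep-learning library: a softmax turns the policy head's raw scores into probabilities, and a tanh squashes the value head's raw score into [−1, +1]. The function names and the legality mask are assumptions for illustration; the mask mirrors the idea, mentioned earlier, of zeroing out illegal moves before sampling.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], legal: Sequence[bool]) -> List[float]:
    """Softmax over legal actions only; illegal actions get probability 0.

    Assumes at least one action is legal (in Go, passing always is).
    """
    top = max(x for x, ok in zip(logits, legal) if ok)   # subtract max for stability
    exps = [math.exp(x - top) if ok else 0.0 for x, ok in zip(logits, legal)]
    total = sum(exps)
    return [e / total for e in exps]


def value_from_logit(x: float) -> float:
    """Squash the value head's raw output into [-1, +1]."""
    return math.tanh(x)


# 82 actions with equal raw scores, the last one masked as illegal.
p = masked_softmax([0.0] * 82, [True] * 81 + [False])
```

Here every legal action ends up with probability 1/81 and the masked action with 0.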

If neural networks are new to you, start here

Video · 3Blue1Brown · 19 min

But what is a Neural Network?

The canonical visual introduction to feedforward networks. Watch this first if the words "neuron" or "layer" feel abstract.

Video · 3Blue1Brown · 21 min

Gradient descent, how neural networks learn

How a network's parameters are nudged by the loss function — the mechanism behind every "training step".

Video · 3Blue1Brown · 11 min

Backpropagation, intuitively

How the gradient is computed, layer by layer. The reason "training" is feasible at all.

Paper · He et al. 2016 · CVPR

Deep Residual Learning for Image Recognition

The residual-block architecture used in AlphaZero's trunk.

Fig. 2. Network architecture: shared residual trunk, two short heads. Only the layers inside each head are unshared.

MCTS, with a prior

Takeaway. The network alone can only guess. Search lets the agent test its guesses by simulating moves forward — and the network makes the search dramatically more efficient by telling it where to look and how to score what it finds.

What is Monte Carlo Tree Search?

MCTS is a way to explore the future of a game by simulating it. Starting from the current board, it builds a tree of possible continuations and uses the results of those simulations to decide which move is actually best right now.

The name "Monte Carlo" is historical: classical MCTS picks random move sequences from a position, plays them out to the end (a rollout), and uses the random win/lose outcomes to estimate which moves are good. The randomness — like the dice tables of the Monte Carlo casino — gives the algorithm its name.

AlphaZero drops the random rollouts. When the search reaches a position it hasn't seen, the value network reads the position directly and says "I think this is +0.4 — slightly winning." No coin-flipping. Much faster, much more accurate. The policy network is used too, to suggest where the search should look first.

One simulation, four phases

Per move, AlphaZero runs hundreds of these simulations: 800 in the AlphaZero paper, ~600 in our runs. Each one is the same four-step recipe:

Phase 1

Select

Start at the root (current board). At each node, descend to the child that scores highest under Q + U — repeat until you hit a leaf (a node the search hasn't expanded yet).

Phase 2

Expand

The leaf isn't the end of the game, so we add one new child node for every legal move available there.

Phase 3

Evaluate

The expanded leaf is sent through the network. The network returns a value V (who's winning, in [−1, +1]) and a policy prior P (a probability for each child move).

Phase 4

Backup

Walk back up the path you came down. At each node on the path: add 1 to the visit count N, and update the running mean value Q with the new V, flipping V's sign at each level because the two players alternate.

Fig. 3. One MCTS simulation, four phases. The whole tree is in memory; bold edges and filled nodes show what's active in that phase, dashed/outlined elements are new. After ~600 of these simulations, the root's visit counts give the move the agent will play.
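The Backup phase is the easiest one to get subtly wrong, so here is a minimal sketch of it. It assumes each node stores its statistics from the point of view of the player to move at that node, and that the leaf value uses the same convention; the `Node` class and `backup` function are illustrative names, not taken from any particular library. Under this convention a parent that compares its children has to negate each child's Q.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    N: int = 0       # visit count
    W: float = 0.0   # summed value, from the view of the player to move here
    Q: float = 0.0   # running mean, W / N


def backup(path: List[Node], leaf_value: float) -> None:
    """Propagate one leaf evaluation up the path (root first, leaf last)."""
    v = leaf_value
    for node in reversed(path):
        node.N += 1
        node.W += v
        node.Q = node.W / node.N
        v = -v       # one level up, it is the other player's turn


root, child, leaf = Node(), Node(), Node()
backup([root, child, leaf], 0.4)
```

After this single backup the leaf holds Q = +0.4, its parent Q = −0.4, and the root Q = +0.4: the same evaluation, seen from alternating sides.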

How a child is picked: the PUCT rule

The interesting phase is Select. At every node, the algorithm has to pick which child to descend into. AlphaZero uses a formula called PUCT (Predictor + UCT) that scores each child by adding a "winning so far" term Q to an "exploration bonus" U:

$$ a^* = \arg\max_a \big[\, Q(s,a) \,+\, U(s,a) \,\big], \qquad U(s,a) = c_\text{puct} \cdot P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} $$

Reading the terms one at a time:

$Q(s,a)$ — the exploit term. The average value seen so far when the search picked move a from state s. If a led to good positions in previous simulations, Q is high. "This has worked."
$P(s,a)$ — the network's policy prior. Probability that the network's policy head assigned to move a. Without this, the search would have to consider all 82 actions equally — most of which are obviously bad. "The network thinks this is promising."
$N(s,a)$ — the visit count. How many simulations have already gone through this move. Appearing as a denominator, it makes already-explored moves less attractive — pushing the search to try things it hasn't seen yet.
$\sqrt{\sum_b N(s,b)}$ — total visits over all moves b tried from this state (in effect the parent's visit count), taken under a square root. Standard UCT scaling; grows slowly with experience.
$c_\text{puct}$ — exploration constant. Bigger = more exploration. Typical values are 1–2.

So U is big when the network said the move looked good (high P) but we haven't tried it much (low N). Q is big when we've tried the move and it worked. PUCT picks the child where Q + U is biggest — exploiting what's worked while occasionally trying the network's untested suggestions.
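As a rough sketch, the PUCT rule is a few lines of Python. The dictionaries, the move labels, the default `c_puct = 1.5`, and the tie-break on the prior are assumptions for illustration; Q is taken to be already expressed from the selecting player's point of view, as in the formula.

```python
import math
from typing import Dict


def puct_scores(Q: Dict[str, float], P: Dict[str, float], N: Dict[str, int],
                c_puct: float = 1.5) -> Dict[str, float]:
    """Return Q + U for every child move, following the PUCT formula."""
    sqrt_total = math.sqrt(sum(N.values()))
    return {a: Q[a] + c_puct * P[a] * sqrt_total / (1 + N[a]) for a in P}


def select(Q: Dict[str, float], P: Dict[str, float], N: Dict[str, int],
           c_puct: float = 1.5) -> str:
    """Pick the child with the highest score, breaking ties by the prior."""
    scores = puct_scores(Q, P, N, c_puct)
    return max(scores, key=lambda a: (scores[a], P[a]))


# Hypothetical priors for three opening moves, with nothing visited yet.
P = {"3-3": 0.15, "4-4": 0.20, "tengen": 0.05}
Q = {a: 0.0 for a in P}
N = {a: 0 for a in P}
first = select(Q, P, N)          # no visits yet: the prior decides

# Pretend the first simulation went through "4-4" and came back with +0.05.
N["4-4"], Q["4-4"] = 1, 0.05
second = select(Q, P, N)         # the exploration bonus of "4-4" has dropped
```

The tie-break matters on the very first simulation: with no visits at all the square-root term is zero, so every score is equal and only the prior separates the moves.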

What the network buys the search

The policy prior P replaces uniform exploration over the 82 legal actions. Without it, the search would waste most of its budget on terrible moves (first-line edge moves in the opening, etc.).
The value V replaces classical rollouts. Without it, every leaf would have to be evaluated by playing the rest of the game out randomly — noisy, slow, and in Go often misleading.
Together: the search is guided by learned intuition and grounded by learned judgment. It looks where it should and scores what it finds.

What the search hands back

After all 600 simulations, the root has accumulated a visit count $N(s_0, a)$ for every legal move. The move distribution played by the agent is:

$$ \pi(a) \;\propto\; N(s_0, a)^{1/\tau} $$

where τ is a temperature (more on that in the next section). This π is the search-improved policy. Crucially, it is almost always better than the network's raw policy p, because the search has had a chance to look ahead and discard moves that looked good in p but turned out badly in simulation. That's why π — not p — is what the agent plays, and what the policy head is trained against.
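The conversion from visit counts to π can be sketched as follows. The function name and the cut-off used to treat a tiny τ as the greedy limit are assumptions; the sketch also assumes at least one move has been visited.

```python
from typing import List, Sequence


def visits_to_policy(counts: Sequence[int], tau: float) -> List[float]:
    """Turn root visit counts into pi(a) proportional to N(a)^(1/tau)."""
    if tau < 1e-3:
        # Limit tau -> 0: all mass on the most-visited move (first one on ties).
        best = max(range(len(counts)), key=lambda i: counts[i])
        return [1.0 if i == best else 0.0 for i in range(len(counts))]
    # Divide by the largest count first so big exponents cannot overflow.
    top = max(counts)
    weights = [(n / top) ** (1.0 / tau) for n in counts]
    total = sum(weights)
    return [w / total for w in weights]


proportional = visits_to_policy([300, 200, 100], tau=1.0)
sharper = visits_to_policy([300, 200, 100], tau=0.5)
greedy = visits_to_policy([300, 200, 100], tau=0.0)
```

Lowering τ moves probability mass toward the most-visited move; at τ = 0 the choice is deterministic.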

A small worked example

Imagine the agent is about to make its very first move on an empty board. The network's policy head outputs probabilities — let's say it's slightly biased toward 3-3 and 4-4 corner points (the kinds of moves that tend to win, from training):

$P($corner 3-3$) = 0.15, \quad P($corner 4-4$) = 0.20, \quad P($centre tengen$) = 0.05, \quad …$

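MCTS starts simulating. The first simulation can't use Q (nothing has Q yet; N is 0 everywhere). Strictly, the formula then gives U = 0 for every child as well, so implementations typically break the tie in favour of the prior (or count the parent's own visit), and PUCT in effect reduces to "follow the network's prior." It picks 4-4. The search descends into 4-4, expands it, and the network's reading of the resulting position is worth V = +0.05 from the root player's point of view (slightly winning). That +0.05 is backed up. Now root's child 4-4 has $Q = 0.05$ and $N = 1$.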

The next simulation re-scores all children. 4-4's exploration bonus dropped (because N is now 1). So PUCT might now pick 3-3 instead, expand it, get a value back, etc. Over 600 simulations, moves that consistently lead to high-value positions accumulate visits; moves that lead to losses get visited a few times and then ignored. The visit counts at the end are sharper and better-informed than the raw policy prior was.

Watch the search in action — and try building it

Video · John Levine · 16 min

Monte Carlo Tree Search

The clearest visual MCTS walkthrough on YouTube. Uses tic-tac-toe to show all four phases. Used in many university courses.

Video · DeepMind · 14 min

AlphaGo Zero: discovering new knowledge

A short overview from DeepMind on how AGZ bootstraps and what it discovers along the way.

Video · David Silver lecture · 1 h 45 m

Deep Reinforcement Learning and AlphaGo

DeepMind's lead author walks through MCTS, the policy/value architecture, and how it all comes together. Long but worth it.

Video · Robert Förster · 24 min

AlphaZero — Paper Walkthrough

A focused, accessible walkthrough of the 2018 AlphaZero paper. Reads the algorithm and the loss alongside annotated figures.

Code · GitHub repo + 4-video series

AlphaZero From Scratch (Robert Förster)

A clean, well-commented PyTorch implementation paired with a tutorial series. The best place to read the algorithm in code.

Self-play

Takeaway. The network plays games against itself. Every position is stored with the move distribution the search produced; when the game ends, every position is also stamped with who won. That's the training data.

How does it learn anything when it starts random?

This is the most natural question. The network is initialized randomly — its policy is uniform-ish noise, its value head outputs garbage. MCTS will run, but with a useless network it just stumbles around the tree. The first self-play games are basically random.

But notice what's still real: the final win/lose outcome z. When the game ends and the board is scored, somebody won. That fact does not depend on the network's beliefs at all. It is ground truth.

The network is then trained: "in this position you predicted $v = +0.1$, but the player to move actually lost ($z = -1$). Adjust." Repeated over many positions, this signal alone is enough to start shaping the value head. The network slowly learns that "my group has two eyes" correlates with winning, and that "my stones have no liberties" correlates with losing.

Once the value head is even slightly useful, MCTS becomes useful: it stops descending into obviously-losing positions. The search-improved distribution π is now meaningfully better than the raw policy p, so training p → π improves the policy. Better policy → better search → better π → better policy. The loop bootstraps from nothing.

The whole system is being trained on signals that it generated itself, with one external anchor: who won the game. That anchor is small, but it is non-negotiable, and over millions of games it is enough.

One self-play game, step by step

Load the current network. Start from the empty board.
At each move, run MCTS for ~600 simulations from $s_t$.
Form $\pi_t \propto N(s_t,\cdot)^{1/\tau}$ from the root visit counts.
Sample $a_t \sim \pi_t$, play it, append $(s_t, \pi_t, \text{player}_t)$ to the buffer.
When both players pass, score the board. Assign $z_t \in \{-1, 0, +1\}$ to every stored tuple, from that player's perspective.
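The last step, stamping every stored tuple with the outcome from that player's perspective, might look like this. The tuple layout follows the steps above; the function name and the convention that `winner == 0` means a draw are assumptions for the sketch.

```python
from typing import List, Tuple, TypeVar

S = TypeVar("S")          # whatever type represents a game state
Policy = List[float]


def stamp_outcomes(examples: List[Tuple[S, Policy, int]],
                   winner: int) -> List[Tuple[S, Policy, int]]:
    """Replace the player id in each (state, pi, player) tuple with the outcome z.

    winner is the winning player's id, or 0 for a draw (assumed encoding).
    """
    stamped = []
    for state, pi, player in examples:
        if winner == 0:
            z = 0
        else:
            z = 1 if winner == player else -1
        stamped.append((state, pi, z))
    return stamped


# Two positions from one game: player 1 moved first, then player 2.
game = [("s0", [1.0], 1), ("s1", [1.0], 2)]
after_white_win = stamp_outcomes(game, winner=2)
after_draw = stamp_outcomes(game, winner=0)
```

The same game yields z = −1 for positions where the eventual loser was to move and z = +1 where the eventual winner was.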

Temperature schedule

$\tau = 1$ for the first ~30 moves — sample proportional to visit counts, explore openings.
$\tau \to 0$ afterwards — play the most-visited move deterministically.

Dirichlet noise at the root

Before each move's MCTS, perturb the root prior:

$$ P_\text{root}(a) \;=\; (1-\varepsilon)\, P(a) \;+\; \varepsilon\, \eta_a, \quad \boldsymbol{\eta} \sim \mathrm{Dirichlet}(\alpha) $$

$\varepsilon = 0.25$.
$\alpha$ scales inversely with the typical number of legal moves: $0.03$ for 19×19 Go, ~$0.15$ for 9×9 Go, $0.3$ for chess.
Noise only at the root — lower nodes get explored by the search itself; the root needs an outside kick to try moves outside the prior's top picks.
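A minimal standard-library sketch of the root perturbation follows. A symmetric Dirichlet sample is drawn by normalising independent Gamma draws; the function names, the default α = 0.15, and the uniform fallback for the (practically impossible) all-zero draw are assumptions made here.

```python
import random
from typing import List, Optional, Sequence


def sample_dirichlet(alpha: float, k: int, rng: random.Random) -> List[float]:
    """Symmetric Dirichlet(alpha) sample via normalised Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:                 # numerical underflow guard
        return [1.0 / k] * k
    return [d / total for d in draws]


def add_root_noise(prior: Sequence[float], alpha: float = 0.15, eps: float = 0.25,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Mix the network prior with Dirichlet noise: (1 - eps) * P + eps * eta."""
    rng = rng or random.Random()
    noise = sample_dirichlet(alpha, len(prior), rng)
    return [(1 - eps) * p + eps * n for p, n in zip(prior, noise)]


uniform_prior = [1.0 / 82] * 82
noisy = add_root_noise(uniform_prior, rng=random.Random(0))
```

Because both the prior and the noise sum to 1, the mixture is still a probability distribution, and no move's prior drops below 75 % of its original value.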

On self-play and reinforcement learning

Video · David Silver · 1 h 28 m

RL Course — Lecture 1: Introduction

First lecture of David Silver's classic RL course at UCL. Background for everything that happens in self-play.

Book · Sutton & Barto

Reinforcement Learning: An Introduction

The standard reference. Chapter 16 covers AlphaGo and AlphaGo Zero as case studies; everything before it explains the machinery.

Loss

Takeaway. One loss with three terms. The value head learns to predict the outcome; the policy head learns to copy the search.

$$ L \;=\; (z - v)^2 \;-\; \boldsymbol{\pi}^{\!\top} \log \boldsymbol{p} \;+\; c\,\|\boldsymbol{\theta}\|^2 $$

$(z - v)^2$ — value MSE. The value head learns the eventual outcome.
$-\boldsymbol{\pi}^{\!\top}\log\boldsymbol{p}$ — policy cross-entropy. The policy head learns the MCTS visit distribution.
$c\,\|\boldsymbol{\theta}\|^2$ — L2 regularization ($c \approx 10^{-4}$).

p seeds the search; the search returns a better π; training p → π distils that improvement back. Next iteration's p is stronger, so the search built on it is stronger too. This is policy iteration in neural form.
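To make the three terms concrete, here is the loss for a single position in plain Python. The function name and the flat list of weights `theta` are illustrative assumptions; a real training step averages the first two terms over a minibatch.

```python
import math
from typing import Sequence


def alphazero_loss(z: float, v: float, pi: Sequence[float], p: Sequence[float],
                   theta: Sequence[float], c: float = 1e-4) -> float:
    """(z - v)^2 - pi . log p + c * ||theta||^2 for one position."""
    value_term = (z - v) ** 2
    # Skip actions with pi = 0 so that 0 * log(0) contributes nothing.
    policy_term = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2_term = c * sum(w * w for w in theta)
    return value_term + policy_term + l2_term


# The search put all its visits on move 0; the network was undecided.
loss = alphazero_loss(z=1.0, v=0.5, pi=[1.0, 0.0], p=[0.5, 0.5], theta=[0.0])
```

In this example the value term is 0.25 and the policy term is ln 2, the cost of hedging between two moves when the search was certain.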

Pseudocode

Takeaway. Four pieces (outer loop, self-play game, MCTS search, training step), each fitting on one screen. Read them in order as a connected whole.

# Outer loop — runs for N iterations
network ← random_init()
best    ← network
buffer  ← empty()        # replay buffer of (s, π, z)

for iteration in 1..N:

    # 1. self-play with the current best network
    for g in 1..games_per_iter:
        examples ← play_self_play_game(best)
        buffer.extend(examples)

    # 2. train a candidate on the buffer
    candidate ← network.copy()
    for step in 1..train_steps:
        batch ← buffer.sample(batch_size)
        train_step(candidate, batch)        # defined in the training-step block below

    # 3. evaluate: AGZ gates by 55% arena win-rate; AZ adopts unconditionally
    if arena(candidate, best) ≥ 0.55:
        best ← candidate

    network ← candidate

What this does. Each iteration generates new self-play games with the current best network, trains a candidate on them, and (in AGZ) only adopts the candidate if it clearly wins an arena. The buffer keeps the last ~500k positions, so each training step sees a mix of recent and older data.
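A bounded replay buffer of this kind can be sketched with a `deque`. The class name, the default capacity, and uniform sampling without replacement are assumptions for illustration, not a description of any particular implementation.

```python
import random
from collections import deque
from typing import Any, Iterable, List, Optional


class ReplayBuffer:
    """Fixed-capacity store of training examples; the oldest entries fall off."""

    def __init__(self, capacity: int = 500_000) -> None:
        self.data: deque = deque(maxlen=capacity)

    def extend(self, examples: Iterable[Any]) -> None:
        """Append new examples, evicting the oldest once capacity is reached."""
        self.data.extend(examples)

    def sample(self, batch_size: int, rng: Optional[random.Random] = None) -> List[Any]:
        """Draw a uniform minibatch without replacement."""
        source = rng if rng is not None else random
        pool = list(self.data)
        return source.sample(pool, min(batch_size, len(pool)))


buf = ReplayBuffer(capacity=3)
buf.extend(range(5))          # 0 and 1 are evicted; 2, 3, 4 remain
batch = buf.sample(2, rng=random.Random(0))
```

Copying the deque to a list on every sample is fine for a sketch; a production buffer would use an indexable ring buffer instead.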

# One self-play game
def play_self_play_game(network):
    state    ← initial_state()
    examples ← []
    move_num ← 0

    while not state.terminal():
        # run MCTS to get visit counts at this state
        N ← mcts_search(state, network)         # defined in the MCTS block below

        # temperature schedule: explore early, exploit late
        if move_num < 30:
            π ← N / sum(N)                 # τ = 1: proportional to visit counts
        else:
            π ← one_hot(argmax(N))         # τ → 0: all mass on the most-visited move

        # store the training example (outcome filled in later)
        examples.append((state, π, state.player))

        # sample and play an action
        a     ← sample(π)
        state ← state.play(a)
        move_num += 1

    # game ended — stamp every example with z from that player's view
    winner ← state.score()                 # a player, or none for a draw
    for (s, π, player) in examples:
        z ← 0 if winner is none else (+1 if winner == player else −1)
        yield (s, π, z)

What this does. Plays one game from empty board to terminal. At each move: MCTS produces visit counts, those become the search-improved policy π, an action is sampled. When the game ends, every stored position is stamped with the actual outcome z.

# MCTS search — the heart of AlphaZero
def mcts_search(root_state, network):
    root ← Node(root_state)
    # root prior + Dirichlet noise for exploration
    p, v ← network(root_state)
    root.P ← (1−ε)·p + ε·dirichlet(α)
    root.expand()

    for _ in 1..num_simulations:        # typically 600–800
        node, path ← root, [root]

        # PHASE 1: SELECT — descend by PUCT until leaf
        while node.expanded():
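            total ← Σ_b node.children[b].N
            a ← argmax_a [ −node.children[a].Q + c·node.P[a] · √total / (1 + node.children[a].N) ]
            # a child's Q is stored from the child's side to move, hence the minus sign
            node ← node.children[a]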
            path.append(node)

        # PHASE 2 + 3: EXPAND + EVALUATE
        if not node.state.terminal():
            p, v ← network(node.state)
            node.P ← p
            node.expand()
        else:
            v ← node.state.terminal_value()

        # PHASE 4: BACKUP — propagate v up the path
        for n in reverse(path):
            n.N += 1
            n.W += v
            n.Q  = n.W / n.N
            v    = −v       # flip sign every level (opponent's turn)

    return root.visit_counts()

What this does. Runs num_simulations simulations from the current root. Each simulation: pick a path by PUCT, expand the leaf, get a value from the network, back it up along the path. The return value is just the visit count per child of the root, which is what self-play turns into π.

# One training step on a minibatch from the buffer
def train_step(network, batch):
    # batch contains (states, target_policies π, target_values z)
    states, π_target, z_target = batch

    # forward pass — network reads positions, predicts p and v
    p, v = network(states)

    # the AlphaZero loss has three terms
    value_loss  = mean((z_target − v)²)        # MSE on outcome
    policy_loss = −mean(sum(π_target · log(p)))  # cross-entropy on visit distribution
    reg_loss    = c · sum(θ² for θ in network.parameters())

    loss = value_loss + policy_loss + reg_loss

    # standard backprop + SGD update
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    return loss.item()

What this does. One gradient step. The value head is pulled toward the actual outcome z; the policy head is pulled toward the search-improved distribution π. L2 keeps weights bounded. The loss is the only place the three components of the system meet each other.

Self-play dominates wall-clock: each game requires hundreds of network forward passes per move.
Training is fast — the network is small and the gradient step is standard.
The arena gate (AGZ only) prevents bad updates from corrupting future self-play data.

AGZ → AZ

Takeaway. The 2018 AlphaZero paper changes four Go-specific choices in AGZ so the same algorithm works on chess and shogi. Otherwise identical.

	AlphaGo Zero (2017)	AlphaZero (2018)
Data augmentation	8 Go symmetries (rotate, reflect)	None — chess and shogi aren't symmetric
Best-network	Frozen until a candidate wins ≥55% arena	Continuously updated, no gate
Outcome	$z \in \{-1, +1\}$	$z \in \{-1, 0, +1\}$ (draws matter in chess)
Hyperparameters	Tuned for Go (Bayesian optimisation)	Reused across chess, shogi, Go (except exploration noise and learning-rate schedule)

Everything else — network shape, loss, MCTS-with-prior, self-play — is unchanged.
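The AGZ augmentation in the first table row can be sketched as follows: each training position is expanded into its 8 rotations and reflections, with the policy target transformed the same way as the board. The function names are illustrative, and the sketch treats the policy as a square grid; the pass probability, which no symmetry affects, would be carried along unchanged.

```python
from typing import List, Tuple

Grid = List[List[float]]


def rotate90(g: Grid) -> Grid:
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def flip(g: Grid) -> Grid:
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in g]


def symmetries(board: Grid, policy: Grid) -> List[Tuple[Grid, Grid]]:
    """All 8 rotations/reflections of a (board, policy) pair, transformed together."""
    out = []
    b, p = board, policy
    for _ in range(4):
        out.append((b, p))
        out.append((flip(b), flip(p)))
        b, p = rotate90(b), rotate90(p)
    return out


# A tiny asymmetric 3 x 3 position: all 8 images are distinct.
board = [[1, 2, 0], [0, 0, 0], [0, 0, 0]]
pairs = symmetries(board, board)
```

Chess and shogi have no such symmetries (pawns move in one direction, castling is side-specific), which is why AlphaZero drops this step.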

From AlphaZero to today

Article · DeepMind blog · Dec 2018

AlphaZero: shedding new light on chess, shogi, and Go

DeepMind's plain-English writeup of the generalization to three games.

Paper · Nature 588, 604–609 (2020)

Mastering Atari, Go, chess and shogi by planning with a learned model

MuZero — the successor that also learns a model of the game's dynamics from scratch.

Our project

What follows is our toy implementation. The algorithm above is AlphaZero as described in the paper. Below is what happened when we trained a small version of it ourselves and measured what each component was doing.

What we actually ran

Framework: AlphaZero.jl v0.5.5 — Jonathan Laurent's open-source AlphaZero implementation in Julia. We use the in-tree Connect Four game module as our test bed.
Games: Connect Four first — it ships with AlphaZero.jl and serves as the controlled test bed where a perfect solver exists. Then 9 × 9 Go, on a GameInterface we wrote ourselves (src/Go9.jl, ~500 lines of legal-move + ko + capture + scoring code; §11). Same algorithm, same pipeline, two very different games.
Network: ResNet, 5 blocks × 128 filters, 1.67 M parameters for Connect Four. Dual policy + value heads. (The Go networks in §11 are smaller.)
Hardware & budget: One NVIDIA RTX 4070 SUPER (12 GB). Connect Four reached its first landmark after 6 iterations (2000 self-play games and 600 MCTS sims per move, ~3.5 h) and was then resumed to 15 iterations for the full run. The 9 × 9 Go runs took ~45–60 min per iteration.
What we measure: three things. (a) a training curve — for each iteration, the network plus search is played against a 1000-rollout classical MCTS opponent, and we record the win rate. (b) lesions — at the end of training, we degrade the agent one component at a time and measure how badly it does. (c) tournaments — round-robins crossing training stage, network size, search budget and lesions, turned into Elo ratings with confidence intervals (§11).

The numbers and figures below are the actual measurements from this run. Where a particular lesion variant has not yet been measured, the chart marks it explicitly.

Training curve

Takeaway. Real data from our 6-iteration Connect Four run. After one iteration the full AlphaZero player already beats classical MCTS ~62% of the time; after six iterations it wins 96.5% and the bare policy network (no MCTS) reaches 46%.

For each iteration's checkpoint, we played 256 games against a fixed classical opponent — a Monte Carlo Tree Search with 1000 random rollouts per move. The opponent never changes, so the curve below is the actual strength of our agent over training time.
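With 256 games per checkpoint, each point on the curve carries some sampling noise. One common way to quantify it is a Wilson score interval on the win rate; the sketch below is a generic illustration and not necessarily the method behind the confidence intervals reported elsewhere in this write-up. The win count in the usage line is hypothetical.

```python
import math
from typing import Tuple


def win_rate_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 gives roughly 95 %)."""
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return centre - half, centre + half


# Hypothetical checkpoint that wins exactly half of its 256 games.
low, high = win_rate_interval(wins=128, games=256)
```

At a 50 % win rate over 256 games the interval is about ±6 percentage points wide, so differences of a few points between neighbouring checkpoints should not be over-read.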

Win rate vs. 1000-rollout MCTS opponent

256 games per iteration. Black = full AlphaZero (network + 600 MCTS sims). Grey = bare network, no search.

Fig. 4. Actual training trajectory from our run (Connect Four · 6 iterations · ~3.5 h on one RTX 4070 SUPER). The full AlphaZero player ramps fast; the bare network learns more slowly but reaches roughly even play against an MCTS opponent that uses 1000 random rollouts.

What the curve shows

The full system learns fast. One iteration of training takes the AlphaZero player from 8.6 % to 62 % win-rate against a fixed MCTS opponent.
The network alone learns slower. Without search, the policy head improves over time but trails the full system by roughly 50 percentage points. That gap is what MCTS is buying us at inference.
Both curves rise together. The network's quality and the search's effectiveness aren't independent — better policy priors make MCTS more efficient; better-search-derived training targets make the policy better. This is the policy-iteration loop described in §06 in action.

Note on the y-axis: this first run plots win rate against a fixed external opponent (classical MCTS with 1000 rollouts) rather than Elo, because we had not yet configured per-iteration weight saving. Every later run did save its checkpoints — and §11 reports exactly the AlphaZero-paper-style Elo: round-robin tournaments between checkpoints, network sizes, search budgets and lesions, on both games, with confidence intervals.

Vs. the perfect solver (Pascal Pons)

The MCTS-rollouts opponent above is a strong baseline, but it is not optimal. To know how close to actual perfect play the agent gets, we benchmark it against the Pons solver — a famous open-source Connect Four solver that uses iterative deepening alpha-beta with a precomputed opening book, and which has been used as the de-facto perfect baseline for Connect Four AI research for nearly a decade.

One subtlety. Connect Four is a first-player win: if both sides play perfectly, white always wins. So the theoretical maximum win rate any agent can achieve against perfect play, over alternating-colour matches, is 50 % — winning all 50 as-white games and losing all 50 as-black games. The Pons opponent never makes a mistake, so as-black we lose every game; the only question is how well we play as-white.
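The ceiling is a one-line piece of arithmetic. As a minimal sketch (the function name is illustrative and not part of AlphaZero.jl), the overall win rate of an alternating-colour match is the average of the two per-colour win rates:

```python
def alternating_win_rate(win_rate_first: float, win_rate_second: float) -> float:
    """Overall win rate when half the games are played as each colour."""
    return 0.5 * win_rate_first + 0.5 * win_rate_second


# Best case against a perfect opponent in a first-player-win game:
# win every game as first player, lose every game as second player.
ceiling = alternating_win_rate(1.0, 0.0)
print(f"theoretical ceiling: {ceiling:.0%}")
```

Any as-white loss pulls the result below this ceiling; nothing can push it above.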

Win rate vs. the Pons perfect solver, by training iteration

100 games per iteration, colours alternated. Dashed line at 50 % is the theoretical ceiling (since Connect Four is a first-player win). Iters 0–6 not shown — we did not save those network weights.

Fig. 4b. Real measurement against the Pons perfect solver. 100 games per iteration, alternating colours. Iter 10 was the first iteration to hit the theoretical ceiling (winning all 50 as-white games against perfect play), and the network plateaued there — exactly matching the fact that no iteration after iter 10 passed our arena gate during training.

What this tells us

By iter 10, our 1.67-million-parameter network plays Connect Four optimally as the first player. Against a solver that knows every position perfectly, we win every single game in which we move first. Connect Four is a first-player win, so 50 % is the strongest possible result on the alternating-colour benchmark.
The plateau is real, not a measurement artifact. The arena gate stopped accepting candidates after iter 10 — at the time we thought this was just noise. The Pons benchmark confirms: there is genuinely no room left to improve. The training did its job, then correctly stopped.
~10 iterations is enough. Connect Four is small, so this number doesn't transfer to bigger games (the original AlphaZero paper trained for 700,000 mini-batch steps per game). But it does illustrate cleanly what "convergence" looks like in this kind of self-play loop.

Lesions

Takeaway. Three of the four lesions (L1, L3, L4) are measured against the perfect Pons solver; the sims sweep (L2) is measured against the 1000-rollout MCTS opponent. Removing search costs 29 points (50 → 21). Replacing the value head with classical rollouts is worse than removing search entirely: it costs all 50 points (50 → 0). Smaller capacity costs only 5 points.

The classic AlphaZero ablation question is: how much of the agent's strength comes from each component? We can answer it by degrading the agent one piece at a time and measuring the cost. Except for the sims sweep (L2), which uses the iter-6 network, we use the same iter-10 baseline network throughout (the first checkpoint to reach the ceiling in Fig. 4b), and play 100 games per arena against two opponents: RandomPlayer (uniform-random moves) and the Pons perfect solver.

#	Lesion	What it removes	Implementation
L1	Policy-only	MCTS at inference	`NetworkPlayer` at τ = 0 (argmax of policy head, no search)
L2	Sims sweep	Variable search budget	Baseline player with MCTS sims ∈ {2, 4, 16, 64, 256, 600}
L3	No value head	Learned value	Custom hybrid MCTS oracle: keep network policy, replace V with classical MC rollout (game played out with random moves)
L4	Small net	Network capacity	Retrain Connect Four from scratch with 2 blocks × 32 filters (93 k params, vs 1.67 M baseline)
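To make the L1 row concrete, here is a minimal sketch of move selection straight from the policy head, with no search. It is not AlphaZero.jl's `NetworkPlayer`; the function name, the list-based policy and the example probabilities are assumptions for illustration. At τ = 0 it takes the argmax over legal moves; at τ > 0 it samples in proportion to p^(1/τ).

```python
import random


def select_move(policy, legal_moves, temperature=0.0, rng=random):
    """Pick a move from raw policy-head probabilities, without any search."""
    # Keep only the probabilities of legal moves (illegal ones are masked out).
    weights = {a: policy[a] for a in legal_moves}
    if temperature == 0.0:
        # tau = 0: deterministic argmax of the policy head.
        return max(weights, key=weights.get)
    # tau > 0: sample proportionally to p^(1/tau); choices() normalises weights.
    scaled = [weights[a] ** (1.0 / temperature) for a in legal_moves]
    return rng.choices(legal_moves, weights=scaled, k=1)[0]


# Hypothetical policy over the 7 Connect Four columns.
policy = [0.05, 0.10, 0.20, 0.40, 0.12, 0.08, 0.05]
print(select_move(policy, list(range(7))))        # centre column is legal
print(select_move(policy, [0, 1, 2, 4, 5, 6]))    # centre column is full
```

The second call shows why the legal-move mask matters: the argmax must be taken over legal moves only, otherwise the player would try to drop into a full column.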

Headline result: win rate vs perfect play

This is the chart that holds the news. Same network (iter 10) where applicable; same opponent (Pons); same 100-game sample size with alternating colours. The dashed red line at 50 % is the theoretical ceiling (Connect Four is a first-player win, so the best any agent can do over alternating colours is win all 50 as-white games and lose all 50 as-black games).

Win rate vs Pons (perfect solver), by lesion variant

iter-10 network where applicable. 100 games per variant, colours alternated. Ceiling = 50 %.

Variant	Win rate vs Pons
Baseline (full)	50 %
L4, small net	45 %
L1, policy-only	21 %
L3, no value head	0 %
Random play (anchor)	~0 %

Each bar is normalized to the 50 % ceiling — the bar's full width represents the theoretical maximum.

Fig. 5. Real lesion measurements vs the perfect solver. Four bars, four stories: L4 (capacity ablation) barely moves — the small network plays nearly as well as the big one. L1 (search ablation) loses about half the agent's ability: the network alone is mediocre. L3 (value-head ablation) is the surprise — replacing the learned value with classical MC rollouts completely destroys the agent's play against perfect opposition, even though the policy prior is intact.

vs Random — saturated for everyone

Against a uniform-random opponent, all four variants (baseline, L1, L3, L4) win 100 / 100 games. Random is too weak to discriminate; you need a strong opponent like Pons to see the lesion costs. We mention the random-opponent number only to confirm that all variants are at least beating pure noise — the question is how they fare against perfect play.

Sims sweep (measured)

The classic AlphaZero compute curve plots strength against the number of MCTS simulations per move. We ran our iter-6 network against the same 1000-rollout MCTS opponent from the training curve, varying the trained agent's own MCTS budget from 2 to 600 simulations.

Win rate vs. 1000-rollout MCTS, by trained agent's MCTS budget

Same iter-6 network, 30 games per sims value. x-axis on log₂ scale.

Fig. 6. Measured sims sweep, vs the same 1000-rollout MCTS opponent used for the training curve. Strength rises roughly log-linearly from 4 to 256 simulations per move: the canonical AlphaZero compute curve, reproduced on Connect Four. The slight dip at 600 sims (96.7 % vs 100 % at 256) is statistical noise from the small sample (30 games per point); the underlying win rate has plateaued near 100 %. Note also the surprise at sims = 2: tiny search can do worse than no search at all (the bare network at iter 6 wins 45.7 % against the same opponent, see Fig. 4), because two simulations add noise to the visit distribution without giving meaningful look-ahead.
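How unremarkable is one loss in 30 games? A rough check, assuming each game is an independent coin flip with a fixed true win rate (the candidate rates below are hypothetical, chosen only for illustration), is the binomial probability of seeing 29 or fewer wins:

```python
from math import comb


def prob_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Chance of dropping at least one game out of 30 for a few assumed true win rates.
for true_rate in (0.95, 0.98, 0.99):
    print(f"true win rate {true_rate:.2f}: "
          f"P(at most 29/30 wins) = {prob_at_most(29, 30, true_rate):.3f}")
```

Even an agent that wins 99 % of its games in the long run drops at least one game in a 30-game sample more than a quarter of the time, so 29/30 next to 30/30 carries little information.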

For reference, the same iter-6 network with sims ≥ 4 beats RandomPlayer 100 % of the time (with sims = 2 it still wins 88 %; see results/sims_sweep.csv). Random play is too weak to differentiate the search budget once it's non-trivial.

What each lesion teaches us

L4 — capacity matters surprisingly little. The small network (93 k parameters, 18× smaller than the baseline) plays Connect Four almost as well as the big one: 45 % vs Pons compared to the baseline's 50 % ceiling. Connect Four is small enough that a tiny network can still represent the relevant patterns; the AlphaZero loop converges what it can. For a bigger game (9 × 9 Go), we expect this gap to widen.
L1 — search is doing a lot of work, but not all of it. The bare policy network wins 21 % vs Pons — that's not nothing (random is 0 %), but it's far below the baseline's 50 %. About 30 percentage points come from MCTS at inference time, even with a perfectly good policy prior to start from. The network has learned to play Connect Four, but search is the second half of the algorithm.
L3 — the headline negative result. The learned value head is irreplaceable. Drop just the value head — keep the policy prior, keep 600 MCTS simulations, keep everything else — and the agent wins zero games out of 100 against Pons. Including all 50 as-white games, where the agent has the inherent first-player advantage. Classical Monte-Carlo rollouts at the leaves are so noisy in Connect Four that they actively mislead a search guided by the correct policy. L3 is worse than L1. A trained value head is doing something fundamentally different and more useful than "many random simulations averaged together" — exactly the AlphaZero paper's claim.
L2 — strength scales smoothly with search budget. From 2 to 256 sims, win rate rises roughly log-linearly (the canonical AlphaZero compute curve, Fig 6). The trained network amplifies its strength predictably with more thinking time. Below ~4 sims, search becomes worse than no search — too few simulations distort the policy distribution without giving useful look-ahead.

Ranked by Pons-benchmark cost

The cleanest one-line summary: when you compare against perfect play, the lesions rank like this from most-damaging to least:

Rank	Lesion	Win rate vs Pons	Cost from baseline
1	L3 — no value head (MC rollouts at leaves)	0 %	−50 pp (complete collapse)
2	L1 — policy-only (no MCTS)	21 %	−29 pp
3	L4 — small network	45 %	−5 pp
—	Baseline	50 %	(ceiling)

9 × 9 Go

Takeaway. Stage two: we wrote a 9 × 9 Go rule engine for AlphaZero.jl, trained two network sizes from scratch on one GPU, and ranked 14 agent variants in a 9,500-game tournament. Training depth dominates everything (≈ +1,000 Elo from iteration 0 to 20). The Connect Four lesion ordering replicates. And a two-act surprise: at iteration 20 the 2.2× smaller network leads the pool — until doubling the training flips the order back to the big net (the epilogue below).

What we built

AlphaZero.jl knows nothing about Go. Everything the framework needs — legal moves, captures, ko, scoring — lives in one file we wrote, src/Go9.jl (~500 lines), implementing the GameInterface contract:

Rules: standard placement and capture; suicide illegal; simple ko (a move may not recreate the board position from just before the opponent's last move); two consecutive passes end the game; Tromp-Taylor area scoring.
Network input: a 9 × 9 × 4 tensor — own stones, opponent stones, a whose-turn plane, and a bias plane. All 8 board symmetries are exposed for training-data augmentation, exactly as AlphaGo Zero did.
Three engineering choices that matter when reading the numbers below: komi = 0 (we keep the game symmetric for early training rather than compensating the second player; standard 9×9 komi is 5.5–7.5); simple ko rather than positional superko (rare long cycles are cut by a cap instead); and a 162-move safety cap (2 × board area), after which the board is scored as it stands. The cap binds in 12–13 % of games involving weak agents (untrained nets, classical rollout-MCTS) and essentially never between trained agents. Cap-terminated boards with large neutral regions are also the only realistic source of "draws" at komi 0.
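Tromp-Taylor area scoring is compact enough to sketch. The Python below is an illustration of the rule, not the code in src/Go9.jl: the board representation (strings of 'B', 'W' and '.') and the function name are assumptions. Each player scores one point per stone plus one point per empty intersection in regions that touch only that player's colour; regions touching both colours, or neither, are neutral.

```python
def tromp_taylor_score(board):
    """Return (black_points, white_points) under Tromp-Taylor area scoring.

    board: list of equal-length strings using 'B', 'W' and '.' (empty).
    """
    rows, cols = len(board), len(board[0])
    score = {"B": 0, "W": 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                score[cell] += 1          # a stone counts for its owner
            elif (r, c) not in seen:
                # Flood-fill this empty region and record bordering colours.
                region_size, borders, stack = 0, set(), [(r, c)]
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    region_size += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols:
                            neighbour = board[ny][nx]
                            if neighbour == ".":
                                if (ny, nx) not in seen:
                                    seen.add((ny, nx))
                                    stack.append((ny, nx))
                            else:
                                borders.add(neighbour)
                # Only regions reaching exactly one colour are territory.
                if len(borders) == 1:
                    score[borders.pop()] += region_size
    return score["B"], score["W"]


toy_board = [
    ".B.W.",
    ".B.W.",
    ".B.W.",
    ".B.W.",
    ".B.W.",
]
print(tromp_taylor_score(toy_board))
```

In the toy position the middle column touches both colours and is neutral, so the two sides tie. That is the mechanism behind the komi-0 "draws" mentioned above: a board scored as it stands with large neutral regions can come out level.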

Training two sizes from scratch

	Large net	Small net
Architecture	ResNet, 5 blocks × 64 filters	ResNet, 2 blocks × 32 filters
Parameters	756 k	336 k (2.2× smaller)
Self-play	500 games / iteration · 200 MCTS sims / move · Dirichlet noise ε = 0.25, α = 0.15
Learning	Adam, lr 5·10⁻⁴ · L2 10⁻⁴ · replay buffer 100 k → 300 k samples
Arena gate	80 games / iteration, permissive −10 % threshold (early candidates keep flowing)
Iterations trained	40 (tournament pool froze at 20; the epilogue uses 40)	40 (same)

Both runs train on the same single GPU at roughly 45–60 min per iteration. Rule-engine sanity checks the framework can't do for us — ko handling, suicide masks, symmetry/policy consistency, scoring — are covered by scripts/test_go9_integration.jl and scripts/smoke_test_go.jl.
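The Dirichlet-noise entry in the Self-play row refers to the standard AlphaZero exploration trick: at the root of each self-play search, the network's prior is mixed with a Dirichlet sample. A minimal sketch using the table's ε = 0.25 and α = 0.15 follows; the function name and the example priors are hypothetical, and this is not AlphaZero.jl's implementation.

```python
import random


def add_dirichlet_noise(priors, epsilon=0.25, alpha=0.15, rng=None):
    """Return (1 - epsilon) * priors + epsilon * Dirichlet(alpha) noise."""
    rng = rng or random.Random()
    # A Dirichlet sample is a set of independent Gamma(alpha, 1) draws, normalised.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    noise = [g / total for g in gammas]
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


priors = [0.5, 0.3, 0.2]                      # hypothetical root priors
noisy = add_dirichlet_noise(priors, rng=random.Random(0))
print([round(x, 3) for x in noisy])
```

A small α concentrates the noise on a few moves, so each self-play game gets a nudge towards one or two moves the prior might otherwise ignore, while every move keeps at least 75 % of its original prior.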

Does training work? Three independent yardsticks

Against its own past. Iteration 20 beats iteration 0 by 50 : 0. In a five-way round-robin over iterations {0, 5, 10, 15, 20} (200 sims, 30 games per pair) every later checkpoint beats every earlier one; the narrowing 15-vs-20 margin (16 : 14) shows improvement flattening by iteration 20.
Against classical search. The trained large net (200 sims/move) beats a vanilla MCTS using 1,000 random rollouts per move 30 : 0, and RandomPlayer 20 : 0.
Search still scales. With the iter-20 net held fixed, every higher search budget wins its round-robin pairing: 16 < 64 < 200 < 800 < 3,200 sims (e.g. 3,200 beats 800 by 21 : 9). Unlike Connect Four — which our agent plays optimally by iteration 10 — 9 × 9 Go is nowhere near saturated: more thinking keeps helping.

The grand tournament — 14 agents, 9,500 games

The factorial design crosses 2 network sizes × 3 training stages (iter 0 / 10 / 20) × 2 search budgets (50 / 200 sims), plus the two lesions from §10 applied to the strongest large net: L1 (raw policy, no MCTS) and L3 (MCTS whose leaf values come from random rollouts instead of the value head). All 91 pairs play 100 games with colours strictly alternated, plus a 400-game per-colour calibration set. Ratings come from a Bradley-Terry-Davidson model (draws allowed) with a global first-mover-advantage term, fit jointly on all 9,500 games; uncertainty from 400 bootstrap refits of the entire tournament.
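The agent, pair and game counts follow directly from the design. A short sanity check (variable names are illustrative):

```python
from itertools import product
from math import comb

sizes = ["large", "small"]
stages = [0, 10, 20]            # training iterations
budgets = [50, 200]             # MCTS simulations per move
lesions = ["L1", "L3"]          # applied to the strongest large net

factorial_agents = list(product(sizes, stages, budgets))
n_agents = len(factorial_agents) + len(lesions)
n_pairs = comb(n_agents, 2)                 # every agent meets every other once
tournament_games = n_pairs * 100            # 100 games per pairing
total_games = tournament_games + 400        # plus the per-colour calibration set

print(n_agents, n_pairs, tournament_games, total_games)
```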

Bradley-Terry-Davidson Elo, with 95 % bootstrap CIs (weakest = 0)

14 agents · 9,100 tournament games + 400 calibration games · komi 0 · colours alternated · low budget = 50, high = 200 MCTS simulations per move

Agent	Elo
Small net · 20 training rounds · high MCTS budget	1157
Large net · 20 training rounds · high MCTS budget	1090
Large net · 20 training rounds · low MCTS budget	868
Small net · 20 training rounds · low MCTS budget	839
Small net · 10 training rounds · high MCTS budget	821
Large net · 10 training rounds · high MCTS budget	800
L1, search removed (raw Large net, 20 rounds)	722
L3, random-rollout values (Large net, 20 rounds)	660
Large net · 10 training rounds · low MCTS budget	641
Small net · 10 training rounds · low MCTS budget	566
Large net · untrained · high MCTS budget	101
Small net · untrained · high MCTS budget
Small net · untrained · low MCTS budget
Large net · untrained · low MCTS budget

Fig. 7. Final Elo ranking; whiskers are 95 % bootstrap confidence intervals (400 refits of the whole tournament; the two weakest agents' intervals extend below 0 and are clipped at the axis; exact values in results/tournament_go9_ranking_v2.csv). The untrained-net agents sit ~1,000 Elo below their trained versions. The fitted first-mover advantage is +6 Elo [−46, +54], indistinguishable from zero (see the colour note below). A matplotlib rendering of this chart lives at figures/tournament_go9_ranking_v2.png; the full pairwise matrix follows as Fig. 8.
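For readers who have not met the rating model, here is a deliberately simplified sketch of how pairwise results become Elo numbers. It fits a plain Bradley-Terry model with the classic minorisation-maximisation update and anchors the weakest agent at 0, as in Fig. 7. It omits the Davidson draw term, the first-mover term and the bootstrap used for the real ranking, and the three-agent win table is hypothetical.

```python
import math


def fit_bradley_terry(wins, iterations=500):
    """wins[i][j] = games agent i won against agent j. Returns strengths."""
    n = len(wins)
    strength = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Games played against j, weighted by the current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        scale = n / sum(updated)              # fix the arbitrary overall scale
        strength = [s * scale for s in updated]
    return strength


def to_elo(strength):
    """Convert strengths to Elo with the weakest agent at 0."""
    weakest = min(strength)
    return [400.0 * math.log10(s / weakest) for s in strength]


# Hypothetical 3-agent pool, 100 games per pairing, no draws.
wins = [
    [0, 60, 85],
    [40, 0, 70],
    [15, 30, 0],
]
elo = to_elo(fit_bradley_terry(wins))
print([round(e) for e in elo])
```

Each rating depends on every result in the pool rather than on one head-to-head score, which is why a joint fit can differ from the naive Elo gap of a single pairing.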

Fig. 8. The full pairwise picture behind Fig. 7, as a 14-by-14 win-rate matrix for the 9 × 9 Go tournament: win + ½-draw rate of each row agent against each column agent (100 games per cell), rows and columns sorted by Elo. The clean red-above-diagonal / blue-below structure means the ranking is essentially transitive, with no rock-paper-scissors pockets anywhere in the pool. (File: figures/tournament_go9_heatmap_v2.png.)

Four findings, with uncertainty

Training depth dominates. Holding size and search fixed, going from iteration 0 to 20 is worth between +830 and +1,090 Elo depending on the size × sims cell. The single largest gap anywhere in the ranking is the +465 Elo [+421, +516] cliff separating the weakest trained agent from the strongest untrained one. Network size and search budget shuffle agents within a training stage; training moves them between stages.
Search budget is the second axis. At iteration 20, 200 sims vs 50 sims is worth +222 Elo [+188, +258] for the large net (+318 for the small net). This matches the round-robin result above: on Go, unlike Connect Four, the compute dial keeps paying.
The capacity surprise. The 2.2× smaller network is the strongest agent in the pool: +67 Elo [+31, +104] over the large net at the same iteration and search budget, P(stronger) = 1.00 — the ordering never reversed in 400 bootstrap refits. Strikingly, the ordering flips at low search: at 50 sims the large net leads by +30 Elo [−3, +63] (P = 0.96). So the bigger net evaluates single positions slightly better, but the small net profits more from search — consistent with its cheaper, possibly better-calibrated value estimates compounding over 200 simulations. One honest caveat: we trained one run per size. "The small net is at least as strong at this scale" is solid within this tournament; the stronger causal claim ("the large net is over-parameterised for 9×9") would need replicate training runs with different seeds — and in fact the catch-up epilogue below resolves it: with twice the training, the order flips back.
The lesions replicate on a harder game. L1 (remove search, keep the trained policy) lands at 722 [693, 748], a 368 Elo cost, equivalent to throwing away roughly half the training (it ranks between the iter-10 large net at 50 and 200 sims). L3 (keep search, replace the learned value head with random rollouts) is worse than no search at all: 660 [637, 686], i.e. 62 Elo below L1 (difference −62 [−94, −26]). On Connect Four L3 collapsed to 0 % against the perfect solver; on Go the penalty shows as a deficit rather than a collapse, since there is no perfect opponent here to punish every misled line, but the ordering L3 < L1 < full agent is identical. A trained value head is not interchangeable with "average many random playouts". (With one instructive exception: in Connect Four pool play the order flips, because short-game rollouts carry real signal against imperfect opponents; see the Connect Four tournament below.)

Was colour handled fairly? (komi = 0)

Two safeguards, one direct measurement. Design: within every 100-game pairing the colours strictly alternate (each agent plays exactly 50 games as first player and 50 as second), so no entry in the ranking ever compares agents at unequal colour exposure. Measurement: a separate 400-game calibration set recorded outcomes per colour; across it, the first player won 201 decided games and the second 198 (1 draw). Fitting a global first-mover term jointly on all 9,500 games gives +6 Elo [−46, +54], indistinguishable from zero. This is itself informative: 9×9 Go is not formally solved, but at komi 0 perfect play is widely believed to be a first-player win (that is why real Go pays the second player 5.5–7.5 points of komi), so the absence of any measurable first-mover edge says our agents are still far from the regime where the first move can be converted reliably. If training is pushed further, this number should grow, and komi should be raised to ~7 before then.

Epilogue — double the training, and the big net passes

The capacity surprise begged a question: is the large net too big for 9 × 9 Go, or merely under-fed at 500 self-play games per iteration? The Connect Four flip below suggested the latter. So we doubled the training — both nets from iteration 20 to 40 (≈33 GPU-hours) — and ran a focused per-colour benchmark: each iteration-40 net against the other, against its own iteration-20 self, and against the other's iteration-20 self. 100 games per pairing, colours alternated.

Pairing	Result	Reading
Large 40 vs Small 40	63 : 37	The order flips back. Joint-fit gap +79 Elo [+29, +135], P(stronger) = 1.00
Large 40 vs Large 20	88 : 12	≈ +288 Elo — the big net was still climbing steeply
Small 40 vs Small 20	70 : 30	+134 Elo [+89, +183] — half the large net's gain: the small net is nearing its ceiling
Large 40 vs Small 20	71 : 29	clears the former champion decisively
Small 40 vs Large 20	77 : 23	… though the small net also clears the old runner-up

Refit jointly over all 10,000 Go games, the final ranking puts large net, iteration 40, at 1,374 Elo [1,328, 1,431], ahead of small net, iteration 40, at 1,295 [1,258, 1,346] — with the iteration-20 ordering (small first) preserved below them. The capacity story completes: at matched, modest data the small net wins; give the big net twice the data and it overtakes, while the small net's returns flatten. Capacity isn't wasted on 9 × 9 Go — it just has to be paid for in self-play games. (The first-mover advantage, refit on 500 more per-colour games, tightens to −9 Elo [−36, +18]: still zero. And in 500 games between trained agents there was not a single draw — consistent with draws being a weak-agent, move-cap artifact.)
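The head-to-head scores in the table above can be turned into rough Elo gaps with the standard logistic conversion, and a per-pairing bootstrap gives a feel for their uncertainty. The sketch below is a simplification, not the joint Bradley-Terry-Davidson refit behind the quoted figures: it looks at one pairing in isolation, assumes no draws, and its function names are illustrative. Because it ignores the rest of the pool, its point estimate for 63 : 37 (about +92 Elo) need not match the joint-fit gap of +79.

```python
import math
import random


def elo_gap(wins: int, losses: int) -> float:
    """Head-to-head Elo gap implied by a win/loss record (no draws)."""
    return 400.0 * math.log10(wins / losses)


def bootstrap_elo_gap(wins, losses, refits=400, seed=0):
    """Point estimate plus a 95 % percentile bootstrap interval."""
    rng = random.Random(seed)
    games = [1] * wins + [0] * losses
    gaps = []
    for _ in range(refits):
        sample = rng.choices(games, k=len(games))   # resample with replacement
        w = sum(sample)
        l = len(sample) - w
        if w == 0 or l == 0:
            continue                                # gap undefined for a sweep
        gaps.append(elo_gap(w, l))
    gaps.sort()
    low = gaps[int(0.025 * len(gaps))]
    high = gaps[int(0.975 * len(gaps)) - 1]
    return elo_gap(wins, losses), low, high


point, low, high = bootstrap_elo_gap(63, 37)        # Large 40 vs Small 40
print(f"{point:+.0f} Elo, 95 % interval [{low:+.0f}, {high:+.0f}]")
```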

Fig. 9. The full 16-agent Elo ranking after the catch-up runs, with the large net at iteration 40 on top. The top four bars alternate sizes (large 40, small 40, small 20, large 20), which is the crossover in one image: more capacity loses at low data and wins at high data. (File: figures/tournament_go9_iter40_ranking_v2.png; per-pairing data: results/go9_catchup.csv.)

And when, exactly, did it catch up? Every training round was checkpointed, so we can ask directly: matched-stage duels every five rounds (100 games each) trace the whole arc. Untrained, the two nets are statistically even. Through rounds 10–20 the small net leads — the large net wins only 41–42 % of games. At rounds 25–30 they are back to even (50 %, 47 %). By round 35 the large net is ahead (59 %), and by 40 clearly so (63 %). The crossover sits around rounds 25–30 — roughly 12,500–15,000 self-play games before the extra capacity started paying rent. The early dip is the most instructive part: in the low-data regime the extra parameters actively hurt before they help.

Fig. 10. The catch-up, localized. Large-net win share (win + ½·draw) against the small net at the same training stage, high MCTS budget; shaded band = Wilson 95 % CI, 100–200 games per point. The curve crosses 50 % around rounds 25 to 30. The U-shape is the capacity story in one curve: even at random initialisation, behind through mid-training, level by round 25, clearly ahead from round 35. (Data: results/go9_crossover.csv plus matched pairings from the tournament, calibration and catch-up files.)
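The Wilson band in Fig. 10 can be reproduced with a few lines. The sketch below applies the Wilson score interval to the win shares quoted above for rounds 35 and 40. It assumes 100 games per point and treats the win share as a plain success count (ignoring the half-point draws), so it illustrates the method rather than recomputing the figure.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95 %)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


for training_round, large_wins in ((35, 59), (40, 63)):
    low, high = wilson_interval(large_wins, 100)
    print(f"round {training_round}: {large_wins} % [{low:.3f}, {high:.3f}]")
```

Under those assumptions the round-35 interval still brushes 50 % while the round-40 interval clears it, which matches the "ahead" versus "clearly ahead" wording.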

The same tournament on Connect Four — and two reversals

For symmetry we ran the identical design back on Connect Four: 2 network sizes (1.67 M vs 93 k parameters) × 3 training stages × 3 search budgets {64, 600, 2,400 sims} + L1 + L3, giving 20 agents, 190 pairs and 19,000 games, with per-colour outcomes recorded for 114 pairs. Full ranking and pairwise matrix: figures/tournament_c4_ranking_v2.png, figures/tournament_c4_heatmap_v2.png. Read next to the Go results, four things stand out:

Capacity flips back. On Connect Four the big net wins everywhere: at iteration 15 it leads the small net by ~270–315 Elo at every search budget (2,944 vs 2,675 at 2,400 sims; 2,886 vs 2,582 at 600; 2,699 vs 2,384 at 64). One reading consistent with both games: C4 training fed the network 5,000 self-play games per iteration, ten times Go's 500. Capacity pays off when self-play data is plentiful relative to model size, and Go's larger net may simply be under-trained for its parameter count rather than "too big". (Testing that would mean longer Go runs or richer self-play budgets.)
First-mover advantage: +192 Elo [+179, +207] — about a 75 % win probability between equal players, draws aside — versus +6 Elo [−46, +54] on Go. Connect Four is a proven first-player win and our near-optimal agents convert that advantage relentlessly; our Go agents are nowhere near strong enough to exploit the first move at komi 0. Same model, same measurement, two regimes — and a live demonstration of why strict colour alternation inside every pairing is non-negotiable.
The lesion ordering inverts in pool play. Against the perfect solver, L3 (random-rollout values) was catastrophic (0 %) while L1 (no search) salvaged 21 %. Inside the all-imperfect tournament pool the order flips: L3 (1,610) towers over L1 (1,202) by +408 Elo [+357, +463] — 81 : 8 in their head-to-head. Short Connect Four rollouts carry genuine signal, enough to out-calculate imperfect opponents; they only turn fatal against an opponent that punishes every rollout-induced error. On 9 × 9 Go, where games run to 162 moves over 81 points, rollouts are so uninformative that L3 falls below L1 even in pool play (above). So the value head's irreplaceability is regime-dependent: it matters most when games are long (rollouts become noise) or opposition is strong (errors become fatal) — and 9 × 9 Go is already in that regime.
The tournament independently re-detects convergence. Iterations 11 and 15 are statistically tied (+13 Elo [−16, +44] at 2,400 sims) — echoing the Pons-solver finding that the network hit its ceiling around iteration 10 — while iteration 7 → 11 is still worth ~120–150 Elo. Search keeps paying even at the ceiling: 600 → 2,400 sims buys +58 Elo for the converged large net.
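The Elo gaps quoted in this list map to win probabilities through the usual logistic Elo curve on a 400-point scale. A minimal sketch (the function name is illustrative) that reproduces the "about 75 %" figure for the +192 Elo first-mover term and contrasts it with Go's +6:

```python
def expected_score(elo_gap: float) -> float:
    """Expected score of the stronger side for a given Elo gap (draws aside)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


for label, gap in (("Connect Four first mover", 192), ("Go first mover", 6)):
    print(f"{label}: +{gap} Elo -> {expected_score(gap):.1%}")
```

A gap of 0 gives exactly 50 %, and every additional 400 Elo multiplies the odds by ten.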

Fig. 11. The Connect Four Elo ranking of all 20 tournament agents with confidence-interval whiskers, same model and conventions as Fig. 7. The solid blue block on top is the story: the big net leads at every iteration × search cell. Note L3 (green) towering 408 Elo above L1 (purple), the pool-play reversal discussed above, and the fitted first-mover advantage of +192 Elo in the axis label. (Pairwise matrix: figures/tournament_c4_heatmap_v2.png.)

Discussion

Takeaway. AlphaZero's strength is relational — no single component is doing the work alone. The policy proposes, the value evaluates, the search amplifies both.

A decomposition

Fast intuition — the policy head, run alone, is a one-shot pattern matcher. No deliberation, no consequences. In our run, this is the L1 measurement: 21 % win rate against the Pons perfect solver — meaningful skill, but well below the baseline's 50 %.
Slow evaluation — MCTS uses the policy as a starting point and the value head as a leaf evaluator. Spends compute to estimate consequences. The full system reaches the 50 % ceiling against perfect play.
The value head is doing something distinctive. L3 measured this: replace the learned value with classical Monte-Carlo rollouts and the agent collapses to 0 % vs Pons — worse than L1's 21 % with no search at all. The trained value head is not interchangeable with "average over many random simulations". The same ordering holds on 9 × 9 Go (§11), where L3 ranks 62 Elo below L1 — a deficit rather than a collapse, but the same lesson. §11's Connect Four tournament adds the flip side: against imperfect opponents on a short game, rollout values are serviceable — the learned value head earns its keep precisely when games get long or opposition gets strong.
Together these resemble the model-free / model-based split familiar from decision-making research — useful as a thinking tool, not as a brain model.

What the lesions don't tell us

Inference-time ablations only. They don't say which component was necessary for training to succeed — that needs training-time lesions (retraining with a head removed), ideally across multiple random seeds.
Nothing here speaks to the role of the implicit self-play curriculum, which scales game difficulty with the network's strength as training proceeds.
Connect Four is small enough that a strong agent eventually plays optimally. We may be measuring "how much does each component help you reach optimal play?" rather than "how much does each component help you find clever play?" — §11's 9 × 9 Go results are the more discriminating version: there nothing is near optimal, search keeps paying out to 3,200 sims, and the lesion ordering still holds.

Intuition without deliberation is mediocre; deliberation without intuition is slow; AlphaZero shows what happens when the two are designed to teach each other.

Long-form conversations with the people who built it

Podcast · Lex Fridman · 1 h 31 m

David Silver — AlphaGo, AlphaZero, MuZero

The principal designer of the series describes how each system extended the last.

Podcast · Lex Fridman · 1 h 51 m

Demis Hassabis — DeepMind, AI, superintelligence

The DeepMind founder on what AlphaZero changed about how he thinks about general AI.

References

Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
Silver, D., Hubert, T., Schrittwieser, J. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
Silver, D., Huang, A., Maddison, C. J. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Rosin, C. Multi-armed bandits with episode context. AMAI 61, 203–230 (2011) — origin of the PUCT rule.
He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. CVPR (2016) — the residual block.
Laurent, J. AlphaZero.jl. github.com/jonathan-laurent/AlphaZero.jl — the framework used in this work.