There's a wealth of useful information online about the alpha-beta search algorithm - the strictly more efficient and no-less-accurate improvement on minimax for solving zero-sum, perfect information games like chess. Despite plenty of material, I've found most of it lacking in conveying a decent intuition on what the eponymous values `alpha` and `beta` actually represent in the algorithm itself. This is a problem because the many heuristic improvements layered on top of simple alpha-beta in a strong game-playing program tend to require a proper grasp of these concepts. This note attempts to record some useful intuition - both for my own future reference and for you, dear reader, who have arrived here with the aid of His Majesty's Google wondering who or what `alpha` and `beta` really are.
For the sake of completeness, I've also included quite a detailed introduction to minimax and alpha-beta. Scroll down to the Appendix if you need to get up to speed on these first.
Alpha and beta
Before I say anything else, it's worth establishing some conventions. Almost every real implementation of minimax or alpha-beta will use the "negamax framework", wherein there is a single function which calls itself recursively and uses minus signs to flip the perspective of the players. (See Appendix for more on this.) I'll mostly stick to that view here, and so within the confines of any particular execution of the function `alpha_beta`, we should think of whichever player is moving as seeking to maximize over the values returned for each of the child moves. Here's a reference pseudocode implementation. I will refer to the players as Max and Min in the discussion.
```
function alpha_beta(alpha, beta, depth_left) {
    if (depth_left == 0) return evaluate()
    for (move in legal_moves) {
        make_move(move)
        score = -alpha_beta(-beta, -alpha, depth_left - 1)
        unmake_move(move)
        if (score >= beta) return beta   // fail hard beta-cutoff
        if (score > alpha) alpha = score // alpha acts like max in minimax
    }
    return alpha
}
```
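At the root, this would be kicked off with the widest possible window, e.g. `score = alpha_beta(-infinity, +infinity, depth)` - mirroring the invocation shown for the two-function version in the Appendix.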
The first thing to notice is that `alpha` plays a role very much akin to `max` in the minimax algorithm. As we iterate through the possibilities, it tracks what the best move has been so far.
Looking at the diagram above, let's imagine running `alpha_beta` at the root node, which eventually must return the true value for the whole tree. Let's think through how the values of `alpha` and `beta` change as the iteration proceeds over the three possible moves from left to right. I'll write the values as a tuple (alpha, beta) for ease. At the top of the function, this being a root node, we have (-∞, +∞). First, we try the leftmost child. This subtree eventually returns a score which exceeds the current value of `alpha`, so we update the bottom of the range to match. Next, the second move is examined and this ends up returning a score which again beats the current `alpha`. We update the range once more.
Now let's consider what happens in the call to `alpha_beta` when we drop down into the third possible move. We will call `alpha_beta(-beta, -alpha, depth_left - 1)`, meaning the Min node starts with the window (-beta, -alpha) - everything negated and swapped. (Remember that Min is maximizing over the negations of the real scores shown.) The first move Min looks at returns a score (in her negated perspective) at or above the top of her window. And this immediately causes a "fail hard beta-cutoff" - an early return from `alpha_beta` on account of one of the examined moves exceeding `beta`. Min found a move which was over the top of the whole range she had been given to work with.
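To make the bookkeeping concrete, here's a small runnable Python sketch of the same process. The tree shape mirrors the walkthrough (three moves at the root), but the leaf values are my own invented stand-ins, since the diagram isn't reproduced here:

```python
# Minimal negamax alpha-beta over a toy tree of nested lists.
# Leaf values are invented for illustration only.
INF = float("inf")

def alpha_beta(node, alpha, beta, depth=0):
    print("  " * depth + f"window = ({alpha}, {beta})")
    if isinstance(node, (int, float)):  # leaf: static evaluation
        return node
    for child in node:
        # Negate and swap the window for the opponent's perspective.
        score = -alpha_beta(child, -beta, -alpha, depth + 1)
        if score >= beta:
            print("  " * depth + "beta cutoff!")
            return beta                 # fail-hard beta-cutoff
        if score > alpha:
            alpha = score               # alpha acts like max in minimax
    return alpha

# Three moves at the root; the third subtree triggers a cutoff.
tree = [[3, -2], [5, 1], [0, 7]]
print("root value:", alpha_beta(tree, -INF, INF))
```

Running this prints the (alpha, beta) window at every node visited, and you can watch the third subtree's first move trip the cutoff exactly as described above.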
So here's how we can think about `beta`'s function in the algorithm. The variable `beta` has the meaning:

> Beta: somewhere in my ancestor line, my opponent has an opportunity to play a move where they can get `beta` or less. (My opponent is minimizing relative to me.) So if I can find a strong move that gives me better than `beta` in this node, I might as well stop looking. My opponent would never come down this line and give me the opportunity to do that, when they could pick that other move further up instead.
Back to the diagram. Let's now imagine that the first child Min looked at had returned a lower value (in her perspective) - one inside her window - rather than causing a cutoff. In this case, the beta cutoff wouldn't have happened. Instead, this move would have raised her `alpha`, narrowing her window. With no beta cutoff occurring, she must continue to look at the next child node.
At the child node, Max receives the negated and swapped version of that narrowed window. This is the first time we have seen `alpha_beta` called with a window where both bounds are finite. As it happens, in this example Max immediately gets a beta cutoff from the first move: it returns a score exceeding his upper bound `beta`, and so he wouldn't need to bother looking at the next move. This makes sense according to our interpretation above. Max can say "there is an ancestor node where my opponent can pick a better option for herself than coming down here, now that I know how much this node must be worth to me at minimum; in fact, that ancestor node is my immediate parent - she can simply pick her first move instead". Next, imagine that we had travelled a couple of nodes down the left-hand branch. The window flips around twice and so gets back to the same orientation. From this, we can see that `beta` really is tracking the best our opponent can do at any node in the ancestor tree - not just the immediate parent.
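That double flip is easy to check mechanically: negating and swapping a window twice returns the original. In Python (purely illustrative):

```python
def flip(window):
    """The swap-and-negate applied to (alpha, beta) at each recursive call."""
    a, b = window
    return (-b, -a)

w = (3, 7)
assert flip(flip(w)) == w  # two plies down, the window is restored
```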
So we've looked in a bit of detail at beta-cutoffs and hopefully have a better understanding of what `beta` represents. What about `alpha`? Naturally it's closely related. In fact, because `alpha` and `beta` keep swapping roles at each recursive call - when we step down a ply - we can start to deduce how they are related. First note that `alpha`, playing the role of `max` in minimax by keeping track of the best child so far, is exactly what we want to give to successive child nodes as their `beta`. As discussed above, we need to inform those child nodes of our best option higher up. In other words, what we think of as `alpha` in our own node will influence `beta` at all the odd plies below us, where it is our opponent's turn to move.
> Alpha: somewhere in my ancestor line, I have an opportunity to play a move where I can get `alpha`. So even if I can't do any better than `alpha` at this node, I still want to make sure that I pass on the value of `alpha` to child nodes in my subtree, because they need to know about it. In particular, nodes at my opponent's turn need to know about `alpha` because it will be the benchmark for a fail-hard cutoff for them. (In a negamax framework, my `alpha` will be their `beta`, so they'll think of it as a beta-cutoff.)
Taking this thinking further, we can actually see that `alpha` is representing the best outcome we are assured of - not just in this node, but in the entire tree searched so far. Essentially this is how the `alpha_beta` algorithm improves on `minimax`. In `minimax`, each node tracks its own best outcome in the variable `max`; with alpha-beta, each node gets full information about the best outcome available for each player so far in the whole tree.
We can keep going in this vein and think of `beta` in similar terms. From our perspective as the player to move in a given node, `beta` represents the best outcome available for our opponent in the whole tree searched thus far. From our perspective, `beta` is coming down from above, while our own `alpha` is steadily increasing as we discover more favourable outcomes. Note that this interpretation of `alpha` and `beta` as a steadily narrowing window is easiest to imagine in a convention where Max is maximizing and Min is minimizing over the same set of values (i.e. not the "negamax framework", where the values keep swapping places and being negated).
Some vocabulary
There's a clutch of frequently used vocabulary associated with the alpha-beta algorithm which is much easier to understand now that we have a more intuitive grasp of the fundamentals. Here's a little glossary:
- Principal Variation - the single thread of moves through the game tree which represents optimal play by both sides. This is effectively the desired output from minimax, alphabeta etc. There is an algorithm called Principal Variation Search (PVS), which attempts to be an even more efficient improvement on alpha-beta, by focusing on the Principal Variation and searching other nodes with a deliberately narrow window, promoting faster cutoffs. A detailed discussion of this algorithm is outside the scope of this note, but there's more on Wikipedia and Chess Programming Wiki.
- Fail-high - an alternative name for a beta-cutoff. Represents a situation where we discover that a node is too good for the player whose turn it is, meaning we can eschew searching further moves. Note that it would be an alpha cutoff for the minimizing player if we weren't in a negamax framework, which is another reason it has a distinct name.
- Fail-low - this refers to the opposite scenario. We sometimes search through every move at a node and discover that none of them improves upon `alpha`. This node now fails to be part of the Principal Variation, but we had to do more searching to learn that. Since the value of this node failed to improve upon `alpha`, it's less good than some other move we already have elsewhere in the ancestor tree. Dropping down into this node from the parent is very good for our opponent - in fact, we have a better choice elsewhere, so there will be a beta-cutoff in our parent node as soon as we return.
- Fail-hard - looking at the pseudocode alpha-beta algorithm above, we can see that if a node fails to improve on `alpha`, we actually return `alpha` rather than the best available score (which was something less than `alpha`). This is known as a "fail-hard framework".
- Fail-soft - in fact, you can make a small change to the algorithm and use a local `max` variable to track the largest value of the node, and return that instead (despite it possibly being less than `alpha`); see the sketch just after this list. This has no effect on the overall accuracy or output of the algorithm. A discussion on TalkChess goes over some of the pros and cons.
- Node Types - the nodes of a game tree can be categorised into three types based on their behaviour during execution of the alpha-beta search algorithm. Here's a quick summary, focusing on intuition, for what each node type means. Note that these terms can be applied both to the idea of what the node is expected (or hoped) to be before searching it, and what it turns out to be after the search returns.
- PV node - nodes which have a score that ends up being inside the window. For example, the leftmost branch of the tree comprises all PV nodes. In fact, the leftmost child of any node is a PV node. Confusingly, the actual Principal Variation returned at the end of the search may not be the same as these. It is merely hoped that the real Principal Variation is searched first because we want to search moves in order of strength to promote cut-offs.
- Cut-node - a cut-node is a node where a beta-cutoff happens. More interestingly, an expected cut-node is one where we hope for a beta cut-off to happen. If we really do manage to achieve good move ordering and search a very strong (PV) node first, all its siblings will be expected cut nodes. We anticipate that these nodes will be able to prune some of their children by virtue of having useful information available from when the PV sibling node was searched first.
- All-node - an all-node is one where all the moves get searched and none of them improves `alpha`. Its parent is necessarily a cut-node.
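Here's the fail-soft change mentioned above, as a minimal Python sketch (negamax framework assumed, with a nested-list toy tree standing in for real move generation):

```python
INF = float("inf")

def alpha_beta_soft(node, alpha, beta):
    """Fail-soft negamax: the return value may fall outside (alpha, beta)."""
    if isinstance(node, (int, float)):
        return node
    best_score = -INF                 # local max, tracked separately
    for child in node:
        score = -alpha_beta_soft(child, -beta, -alpha)
        best_score = max(best_score, score)
        if best_score >= beta:
            return best_score         # fail-soft: return the real score,
                                      # not the clamped beta
        if best_score > alpha:
            alpha = best_score
    return best_score                 # may be below alpha (a "fail low")
```

At the root the result is the same as the fail-hard version; the difference is only in how informative the returned value is when the node fails high or low.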
Summary
- The alpha-beta window is tracking already-known information about optimal decisions the two players can make in this node, and ancestor nodes higher up the tree. Because we already know things about what they will be inclined to do higher up, we can narrow down the moves that are interesting to us in this subtree.
- Alpha and beta are telling us about the optimal choices that Max and Min can make higher up the tree. If it's Max's turn, Max can already do as well as `alpha` somewhere higher up. Min can already do as well as `beta` somewhere higher up. As soon as Max finds a move in this position which is better than `beta`, we'd be wasting time looking at this node any further. Max would choose something at least as good as `beta` here, but Min, somewhere above this node, would make a choice that is more favourable for her. So who cares how high this node's value might be - as soon as it's higher than `beta`, it's definitely not PV.
- Although we are maximizing and would like the highest score possible, `beta` constrains us. It says, "don't waste your time looking at crazy good moves here, because your opponent is not going to let you arrive at a node where you have such good options when they can make a more favourable choice higher up." So `beta` tells us how high it is worth looking at a node before it becomes unreasonably good.
Appendix
The Minimax principle
In a game like chess - I focus on chess throughout, because I know it best, but everything applies to many other games - we each have perfect information about the state of the board and the possible moves available to us. The game is zero-sum: if I win, you lose and vice versa. To work out the best action in a situation like this, you say something like "well, I can go here, here or here. But if I go here, he can checkmate me, and if I go here, he can also checkmate me. But if I go here, he gets three possibilities and no matter which of those he plays... I can checkmate him!" This "I go here, you go there" thinking is formalised in the game theoretic concept called minimax. The basic principle is that one player is attempting to maximize their own score while the other minimizes their own score (which is at all times the negation of the first player's score because of the zero-sumness). A diagram, borrowed from Wikipedia, should make this clear.
In this diagram, we have two players, Max and Min, and a range of numeric values across the leaf nodes at the bottom. We should interpret these as:
- +∞: Max wins
- -∞: Min wins
- A positive value: Max has the advantage
- A negative value: Min has the advantage
Side note: in theory, a search tree for a game like chess has only three values - loss, draw, and win for white. Every game must end at a leaf node with one of these three outcomes. In practice, the game tree is far too massive to search exhaustively, so we assign real-number scores to nodes when we have gone as deep down the tree as we can. These scores are inexact - a heuristic assessment of who appears to be doing best, based on centuries of accumulated chess knowledge.
Imagine that only the bottom row of the diagram is filled in with numbers. We want to work out what Max should do at the very top of the tree. The way to analyse this is to start at the bottom and steadily work up. Starting at the left of the first row up, we see that if this position were reached, it would be Min's turn and she would be trying to pick the least bad of the two leaf values below. The node (i.e. position) is therefore already worth the smaller of those two values, because Min, acting rationally in that position, will pick the move taking the board there. Continue in this way along the row. Then move a row up and do the same again, this time from the perspective of Max, who wishes to pick the maximum available from the alternatives he has available. Working up the tree we eventually get a value for the initial position, and in the diagram we conclude that Min is winning. Max has two options, both bad, and he should play the second option - the least bad of the two.
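Here's that bottom-up procedure as a few lines of runnable Python, with nested lists standing in for the tree (the leaf values are invented, not those from the Wikipedia diagram):

```python
# Bottom-up minimax over a toy tree of nested lists.
def minimax(node, maximizing):
    if isinstance(node, (int, float)):   # leaf: heuristic score
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Max at the root, Min one ply down, Max again at the next ply.
tree = [[[10, 5], [-3, 7]], [[2, 8], [-6, -4]]]
print(minimax(tree, True))  # max(min(10, 7), min(8, -4)) = 7
```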
Here's a typical implementation of minimax in pseudocode.
```
function maxi(depth) {
    if (depth == 0) return evaluate()
    max = -infinity
    for (move in legal_moves) {
        make_move(move)
        score = mini(depth - 1)
        unmake_move(move)
        if (score > max) max = score
    }
    return max
}

function mini(depth) {
    if (depth == 0) return evaluate()
    min = +infinity
    for (move in legal_moves) {
        make_move(move)
        score = maxi(depth - 1)
        unmake_move(move)
        if (score < min) min = score
    }
    return min
}
```
We have two implementations of the same idea, one for the maximizer, one for the minimizer. The function works recursively. If we reach a search depth of 0, we statically evaluate the current node and return that. Otherwise, we iterate through each available move in this position and call the 'opposite function' on the resulting position. As we get values back from each child subtree, we maintain a `max` variable (or `min` in the `mini` case), which tracks the most compelling option so far. At the end, we return whatever was the value of the best move for the current player. Everything propagates up the tree in this way and eventually the game is solved.
Note that in real life, nobody implements the algorithm in this duplicative manner. Instead one can write a single function with a judicious smattering of minus signs to produce the same result. This is obviously cleaner and more DRY, but somewhat harder to parse at first blush.
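For instance, a single-function version might look like this in Python (a sketch over the same kind of nested-list toy tree; leaf values are read from the perspective of the player to move at that leaf):

```python
def negamax(node):
    if isinstance(node, (int, float)):
        return node
    # Negating each child's value flips it into our own perspective,
    # so both players can simply maximize.
    return max(-negamax(child) for child in node)

print(negamax([[3, -2], [5, 1], [0, 7]]))  # -> 1
```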
Improving efficiency with alpha-beta
Minimax works perfectly, and in fact it returns us an exhaustive knowledge of the best move in every position on the whole game tree. But for games like chess (or indeed anything meatier than tic-tac-toe), the game tree is far too large to actually analyse in this way. Chess has a branching factor of about 35 and an average game length of c. 80 plies, so 35^80 (roughly 10^123) is a rough estimate for the size of the tree. This is many more nodes than atoms in the universe. Even searching just a few ply down and using heuristic evaluations at leaf nodes presents us with an intractably large computational headache. The first big insight on our path to a strong game-playing engine is alpha-beta pruning, which notices that we can significantly reduce the number of nodes we need to search by pruning large subtrees out of the tree at zero cost to overall accuracy.
Once again, a diagram courtesy of Wikipedia:
Let's again imagine traversing this tree from left to right. We'll descend down the leftmost branch all the way to a leaf. We would then step back up, drop down the next leaf branch, and conclude that the parent node (a Min node) has the value of the smaller of those two leaves. We then step back up and descend down the sibling, to another Min node. We iterate through its children and discover a value smaller than the first Min node's. And this is where we can make a nifty observation. The value we just found is the child of a Min node, so this Min node must be worth at most that value (for Min would never voluntarily pick something larger than she has to). But wait! The parent of the Min node is a Max node, and we already discovered earlier that it has a child of larger value - the first Min node. So in that node, Max is guaranteed to pick the first Min node or something higher. Combining these two facts, we've already seen enough to know that any further searching of the second Min node is pointless wasted time - it can't influence the value of the Max node above! Hence, in the diagram, we see the final child of the Min node pruned out. We don't need to search it at all! Try changing that leaf to any other number at all and discover that whatever you set it to, the node two plies higher (and indeed the whole tree above that) is unaffected.
So the key to alpha-beta is to be smart about using bounds as we search the tree - we don't typically need to know the exact value of every single node in order to solve the whole tree. Often we can just bound the value of a node and use that information alone at nodes higher up the tree.
A final point to make about the alpha-beta algorithm is the critical importance of move ordering. Pruning happens when we find strong moves in a position early. In fact, the biggest reduction in overall nodes searched vs. minimax happens when we search moves in order of decreasing strength. Now, clearly, the whole point of the search process is to discover the strength of the moves, so this observation seems to put the cart before the horse. The practical implication is that it is critical to use heuristics to sort the moves in an estimated order of best to worst before traversing them in the tree. The alpha-beta algorithm then becomes more of an efficient proof that the presumed best move really is best. Most chess-playing programs will spend considerable time on heuristic analyses to guess good moves before searching them. This is time well spent when huge swathes of the search tree are pruned out, saving the algorithm from visiting tens of millions of nodes.
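To get a feel for how much ordering matters, here's a small Python experiment on an invented random tree. It cheats by using full search as an ordering oracle - a luxury only a toy example can afford; real engines must estimate the ordering with heuristics - and counts nodes visited under best-first versus worst-first ordering:

```python
import random

INF = float("inf")

def negamax(node):
    """Plain negamax value of a nested-list tree (our ordering oracle)."""
    if isinstance(node, (int, float)):
        return node
    return max(-negamax(child) for child in node)

def sort_tree(node, best_first):
    """Order every node's children by true strength."""
    if isinstance(node, (int, float)):
        return node
    kids = [sort_tree(child, best_first) for child in node]
    # A child is best for us when its own-perspective value is lowest.
    kids.sort(key=negamax, reverse=not best_first)
    return kids

def alpha_beta(node, alpha, beta, counter):
    counter[0] += 1                      # count every node visited
    if isinstance(node, (int, float)):
        return node
    for child in node:
        score = -alpha_beta(child, -beta, -alpha, counter)
        if score >= beta:
            return beta
        if score > alpha:
            alpha = score
    return alpha

def build(depth, branching=4):
    return (random.uniform(-100, 100) if depth == 0
            else [build(depth - 1, branching) for _ in range(branching)])

random.seed(0)
tree = build(5)
for label, best_first in [("best-first", True), ("worst-first", False)]:
    counter = [0]
    alpha_beta(sort_tree(tree, best_first), -INF, INF, counter)
    print(f"{label} ordering: {counter[0]} nodes visited")
```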
The alpha-beta algorithm
So how do you implement this? Where do our friends `alpha` and `beta` enter the fray? Here's some pseudocode:
```
function alpha_beta_max(alpha, beta, depth_left) {
    if (depth_left == 0) return evaluate()
    for (move in legal_moves) {
        make_move(move)
        score = alpha_beta_min(alpha, beta, depth_left - 1)
        unmake_move(move)
        if (score >= beta) return beta   // fail-high beta-cutoff
        if (score > alpha) alpha = score // alpha acts like max in minimax
    }
    return alpha
}

function alpha_beta_min(alpha, beta, depth_left) {
    if (depth_left == 0) return evaluate()
    for (move in legal_moves) {
        make_move(move)
        score = alpha_beta_max(alpha, beta, depth_left - 1)
        unmake_move(move)
        if (score <= alpha) return alpha // alpha-cutoff
        if (score < beta) beta = score   // beta acts like min in minimax
    }
    return beta
}
```
One would invoke this at the root node by calling:

```
score = alpha_beta_max(-infinity, +infinity, 10)
```
The algorithm is clearly very similar to the minimax algorithm we started with. Now, however, we are passing around these `alpha` and `beta` variables. Evidently, they are tracking those bounds of interest and, at certain times, we return early from the function - a so-called "fail-high cutoff".
Negamax version
For completeness, here's the algorithm modified to the "negamax" form I alluded to above. We don't really want a different function for each perspective when we can achieve the same thing with some minus signs. This approach can be a bit confusing at first, but it's certainly cleaner.
```
function alpha_beta(alpha, beta, depth_left) {
    if (depth_left == 0) return evaluate()
    for (move in legal_moves) {
        make_move(move)
        score = -alpha_beta(-beta, -alpha, depth_left - 1)
        unmake_move(move)
        if (score >= beta) return beta   // fail hard beta-cutoff
        if (score > alpha) alpha = score // alpha acts like max in minimax
    }
    return alpha
}
```
Hopefully by comparing with the previous version, you can see that this is really the same thing. Everything is now done as if both players are maximizers. Whenever it is Min's turn, she works with the negation of the real score, which makes her a maximizer. Slightly less obviously, our slippery pals `alpha` and `beta` get swapped back and forth in the recursive call to `alpha_beta`. This is because they are being used like a range or window on the number line, and when we negate everything, the top end of the range becomes the bottom and vice versa. For the main discussion about what the variables `alpha` and `beta` represent, return to the top.
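Finally, if you'd like something executable, here's a faithful Python transcription of the negamax pseudocode above, with a nested-list toy tree standing in for a real position and move generator (the tree and its values are invented for illustration):

```python
INF = float("inf")

def alpha_beta(node, alpha, beta):
    """Negamax alpha-beta over nested lists; a leaf is a static evaluation
    from the perspective of the player to move at that leaf."""
    if isinstance(node, (int, float)):
        return node
    for child in node:
        # Negate and swap the window when handing it to the opponent.
        score = -alpha_beta(child, -beta, -alpha)
        if score >= beta:
            return beta                  # fail-hard beta-cutoff
        if score > alpha:
            alpha = score                # alpha acts like max in minimax
    return alpha

# Invoke at the root with the widest possible window.
tree = [[3, -2], [5, 1], [0, 7]]
print(alpha_beta(tree, -INF, INF))       # -> 1
```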