Introducing Markov chains
by John Beamer
November 26, 2007
In less than a week you’ll be able to get your grubby mitts on the finest offseason baseball publication. Yes, it’s the 2008 Hardball Times Annual. I can safely say that the 2008 edition of our Annual is the best incarnation yet. There are a ton of great articles from the regulars here at THT, as well as some of the best names on the web such as Bill James, Tom Tango and John Dewan. It is a must buy.
There is also one other reason to buy the Annual. Over the years I have been (haphazardly) developing a Run Modeler that works out the run value of each base-out state, the linear weight value of each event and run frequency distributions, among other things. This forms the basis of an article in the book, and in the spirit of sharing we have decided to let everyone who buys the book download a copy of the Run Modeler gratis.
For the more technical among you the Run Modeler is based on Markov chains, takes into account both batting and non-batting plays, is written in Excel, and weighs in at a very healthy 20MB (the full-blown custom LI version actually tips the scales at 200MB!). And, if I say so myself, it is an awesome application. For the bandwidth shy among you it compresses to 4MB when zipped.
The rest of this column outlines how the Markov works; why it knocks the socks off other run estimators like Runs Created and BaseRuns; and some pitfalls of my particular implementation.
Markov Chains, BaseRuns, Runs Created
At its most simple the game of baseball can be boiled down to a few, trivial parameters: number of outs, bases occupied and the probability of the hitter getting to either first, second, third or home. Based on these parameters it is possible to describe precisely any game of baseball.
Funnily enough these are the same parameters that a Markov chain uses to estimate runs. Given the start states, the end states and the probability of moving between them, a Markov chain can be used to describe a game of baseball in ANY run environment. More common run modellers such as Runs Created or BaseRuns try to estimate these parameters with a series of equations. To understand why the Markov works we need to understand run creation in more depth.
Tom Tango wrote the seminal work on this subject a few years ago, and I strongly encourage you to read his three-part series on creating runs. For those with less time here is a quick summary. Run scoring in baseball has two components: getting on base, and moving over. However, a model like Runs Created isn't founded on such tight logic and falls over fairly easily.
Consider a really tough hitting environment, say one hit a game; not many runs are going to be created. In fact, unless that one hit is a home run it will take a lot of innings to score a run. That environment is obviously an extreme, but it causes problems for Runs Created. This is because Runs Created only works in an OBP range of around .200 to .400. Even approaching those extremes gives nonsensical results. Consider the case when there are 20 hits per nine innings. Runs Created tells us we’ll get 11 runs per game; in fact, the answer is nine runs per game—a 20% difference.
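To see how easily Runs Created falls over outside its comfort zone, here is a minimal sketch. It assumes the basic published form of the formula, (H + BB) x TB / (AB + BB); the column doesn't say which version of RC is meant, and the function name is purely illustrative.

```python
def runs_created_basic(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB).

    Assumption: this is the basic version of Bill James's formula,
    not necessarily the variant used elsewhere in this column.
    """
    opportunities = at_bats + walks
    if opportunities == 0:
        return 0.0  # no plate appearances, no runs
    return (hits + walks) * total_bases / opportunities


# One plate appearance, one home run: an OBP of 1.000, far outside the
# .200 to .400 range where the formula behaves.
print(runs_created_basic(hits=1, walks=0, total_bases=4, at_bats=1))  # 4.0
```

A lone home run can only ever put one run on the board, yet the formula credits four. That is the kind of nonsense you get once the inputs stray from a normal major league environment.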
The reason is that there is little logic behind the RC formula. Bill James was experimenting with run models and serendipitously stumbled across RC, which magically worked rather well in the major league context. However, transport it to another environment, like softball, and it is immensely flawed. So, what is the solution?
BaseRuns (BsR) was invented some years ago by David Smyth and creates a logical framework around run scoring so that it works in most environments.
The basic tenet of BsR is that the equation models run scoring. The identity is:
BsR = batters reaching base x runner score rate + HR
Unlike RC, we can see that BsR models the game far more accurately. If we analyze our 20-hit game under the lens of BaseRuns we get close to nine runs per game (8.8 to be exact). The main reason that BaseRuns beats Runs Created is that it treats the home run as a separate event. In a high-run environment the home run is less valuable than it is in a low-run environment because runs are a less precious commodity. For instance, in a game where there are no outs, the home run is equal in value to a single. Any run modeler worth its salt should treat home runs separately from other events.
Deriving the actual BsR equation in terms of hitting events is a little tricky, often involving trial and error, but good implementations do exist. BaseRuns is without question the simplest accurate run model, but that doesn't mean it is perfect. Tom Tango has shown that the BaseRuns equation breaks down in certain circumstances, for instance in the .500 to .800 OBP levels, and also in very low walk-only run environments. Also, the fact that the derivation is largely down to trial and error isn't wholly satisfactory.
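The no-outs claim is easy to check with a few lines of code. The sketch below assumes the commonly published BaseRuns structure, in which the runner score rate is B / (B + C), with B an advancement factor and C the number of outs. That structure is not spelled out in this column, and the advancement factor is left as a plain input rather than committing to any particular set of coefficients.

```python
def base_runs(baserunners, advancement, outs, home_runs):
    """BsR = batters reaching base * runner score rate + HR.

    Assumptions (not from the column): the score rate is B / (B + C),
    where B is `advancement` and C is `outs`; `baserunners` excludes
    home run hitters, who are counted separately.
    """
    denominator = advancement + outs
    score_rate = advancement / denominator if denominator else 0.0
    return baserunners * score_rate + home_runs


# A game with no outs: every runner eventually scores, so the rate is 1.
ten_singles = base_runs(baserunners=10, advancement=5.0, outs=0, home_runs=0)
nine_singles_one_hr = base_runs(baserunners=9, advancement=5.0, outs=0, home_runs=1)
print(ten_singles, nine_singles_one_hr)  # 10.0 10.0
```

Swapping a single for a home run changes nothing when nobody ever makes an out, which is why the home run has to sit outside the score-rate term.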
That leads us on to the perfect run modeller, namely a Markov model. Let’s delve into the analytical engine to see how it works.
Baseball is a finite-state Markov process. As hinted at earlier we define a start state and end state in terms of bases and outs (heck, we could make it more granular and include count if we wanted to) and work out the probability of transitioning between various start and end states. Mathematically it involves a shed load of matrix manipulation and multiplication, hence the 20MB behemoth.
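To give a feel for the size of the problem, here is a rough sketch of the state space. It assumes the usual layout of 24 base-out states (eight base occupancies times zero, one or two outs) plus one absorbing "inning over" state; the names STATES, INDEX and T are illustrative and are not taken from the Run Modeler.

```python
from itertools import product

OUTS = (0, 1, 2)
BASES = list(product((False, True), repeat=3))  # (first, second, third)

# 24 live base-out states plus one absorbing "inning over" state.
STATES = [(bases, outs) for outs in OUTS for bases in BASES] + ["END"]
INDEX = {state: i for i, state in enumerate(STATES)}

# Transition matrix: row = start state, column = end state.
# Left empty here; once filled in, every row must sum to 1.
T = [[0.0] * len(STATES) for _ in STATES]
T[INDEX["END"]][INDEX["END"]] = 1.0  # after three outs the inning stays over

print(len(STATES), "states;", len(STATES) ** 2, "transition probabilities")
```

Adding the count, or a separate matrix for each of nine batters, multiplies that grid quickly.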
I won’t bore you with the technical details of Markov chains, as they can be found elsewhere on the web. It is probably useful to walk through a simple example to help explain the logic.
Imagine a hitter, Sammy Single (I know, you can’t make these names up), who only hits singles and has a .300 batting average. At the start of an inning the bases are empty and we know that he makes either a hit or an out. That means from the one start state (0 on, 0 out) there are two end states (on first, 0 out; 0 on, 1 out). There is a 30% chance of the former (a hit) and a 70% chance of the latter (an out).
Take the first end state, where there is a man on first and 0 out. If Sammy Single comes to bat again then a number of things can happen (ignoring errors and steals):
- Batter makes an out and runner stays where he is
- Batter makes an out and runner advances
- Batter makes a hit and runner advances to second
- Batter makes a hit and runner advances to third
- Batter grounds into a double play
Based on the probability of the batter making a hit and the location of the hit (whether it is in the infield or not), we can very accurately work out the probability of each end state. If we repeat this exercise for all possible start and end states and do some fancy mathematical trickery we can work out the probability of each base-out state being occupied, the number of runs that score, on average, from each base-out state, and the linear weight value of a series of offensive events.
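The Sammy Single walk-through can be sketched in a few lines. This is a deliberately stripped-down toy, not the Run Modeler: it assumes a lineup of nine Sammy Singles, that every single moves each runner up exactly one base, and that runners never advance on outs and there are no double plays. Those advancement rules are my simplifying assumptions for the sketch; the real model uses play-by-play advancement rates.

```python
from collections import defaultdict

P_HIT = 0.300  # Sammy Single's batting average, from the example above


def single(bases):
    """Assumed advancement: a single moves every runner up exactly one base."""
    first, second, third = bases
    runs = 1 if third else 0  # only a runner on third scores
    return (True, first, second), runs


def expected_runs_per_inning(p_hit, max_plate_appearances=200):
    """Push the base-out probability distribution forward one batter at a time."""
    # State = (bases, outs); bases is a (first, second, third) tuple of booleans.
    dist = {((False, False, False), 0): 1.0}  # inning starts empty, 0 out
    expected_runs = 0.0
    for _ in range(max_plate_appearances):
        nxt = defaultdict(float)
        for (bases, outs), prob in dist.items():
            # Hit: runners advance one base, possibly scoring a run.
            new_bases, runs = single(bases)
            nxt[(new_bases, outs)] += prob * p_hit
            expected_runs += prob * p_hit * runs
            # Out: runners hold (assumption). The third out ends the inning,
            # so that probability mass leaves the chain (absorbing state).
            if outs + 1 < 3:
                nxt[(bases, outs + 1)] += prob * (1 - p_hit)
        dist = nxt
    return expected_runs


print(round(expected_runs_per_inning(P_HIT), 3))  # 0.117
```

Under these assumptions a run only scores on the fourth single of an inning, so a team of .300 singles hitters manages barely a run a game. Replacing the `single` function with probabilities for each of the outcomes listed above is what turns the toy into a proper transition matrix.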
The THT Markov
Okay, so I’ve impressed you with the intricacy of the Markov engine, but the more important question is: what can you do with it?
There are two versions of the Markov model included with the book allowing differing levels of control and refinement. There is a simple version, where you can change batter inputs and some very basic base running events, and the more complex version where you can precisely specify base advancements on hits and outs, outs on base (pick-offs) and errors. I also provide some fine tuning options to further increase the accuracy of the model—more on these later.
At its most basic, the THT Markov model allows the user to look at run environments based on either a team or a single player. For instance, there is a facility to specify nine different batters so that the run environment of a team can be analyzed. Or, there is the option to look at a single batter. This allows us to see, say, how a team of nine Albert Pujolses would stack up against one Pujols and eight other Cardinals.
As I hinted above, the other component of the model is to flex the base running game. For instance, if you want to see the run impact of a sacrifice fly, say, you can adjust the frequency with which runners take a base on an out and how successful they are. Or if you want to assess the run impact of pick-offs or errors you can change those parameters accordingly.
So is it perfect?
By definition no model is perfect—however, I believe that this Markov model is the most accurate run model publicly available and it offers the user an unparalleled degree of flexibility. There is only one other Markov application on the web and that is Tom Tango’s simple Markov. Incidentally, if you replicate his set-up parameters in the THT model (using the simple base running model) you get identical results, which is a good check.
So, what are the potential flaws in the THT Markov model? There are several. First, an assumption is that there is no situational hitting, which is probably the biggest source of error. What do we mean by this? Well, we know that the LWTS of an IBB is less than that of a BB because an IBB tends to be issued in less threatening situations. The same is true of normal walks. A hurler is more likely to pitch around Alex Rodriguez with the bases empty than the bases loaded. All walks are not issued equally.
Similarly, if the bases are loaded the hitter will probably try to shorten up his swing and make contact rather than aiming for the fences, which is what he might do with the bases empty. At the very last minute I did add in some tweaking by situation (for instance the user can tweak where and when walks are issued), but it is by no means perfect.
Second, in order to keep the lid on complexity a ton of minor offensive events have not been included, for instance balks and hit by pitches. These events occur relatively infrequently and modelling them in would not make that much difference to the results.
Third, building in GIDP and base stealing is a challenge and requires a creative shortcut. GIDP can only happen in certain base-out states. In my implementation it only happens with fewer than two outs and a runner on first, although it is possible to model other GIDP outs with the other base running parameters. Anyway, GIDP opportunities depend on the other offensive events. For instance, a lot of home runs imply fewer GIDP opportunities.
To calculate this number we need to run the Markov, but the Markov itself changes according to GIDP so we get ourselves in a tiz (the iteration is divergent). There is of course a short cut, which involves looking at the number of singles and walks (a proxy for number of times there is a runner on first) but for extreme GIDP environments this is a source of error. The same is true of base stealing, although the approximation is slightly different. In the implementation there are manual overrides if the user wishes to put in the analytically more precise value.
Finally, related to the point above, the model produces silly output if you specify nonsensical inputs. For instance, if you have a team that has 20 hits and 20 GIDP then that screws the model. Why? Because 20 hits and 20 GIDP implies that hits are not distributed randomly, which is a core model assumption. The model does work in the most extreme environments as long as the assumption that events are independent holds.
All on-base events such as pick-offs, out on base, stolen base distribution, error distribution, advances on hits and advances on outs are based on real 2000-2005 play-by-play data.
One question I often get is why did I develop this mother in Excel and not in some fancy programming language?
Originally this started as a very small project and as time went on it grew and grew. At various points I have thought about developing a proper program but I don’t really have the time. Also, I believe Excel gives me much more flexibility over how the model works. Because you can see the transition matrix on screen it is a lot easier to add in new features (modelling in errors, for example) than it would be in code. Also, it is much easier to capture data from the web and slap it in Excel than it is to input it into a proprietary program (at least for me it is).
Bring on the 2008 Annual
If you've managed to read to this point I'm seriously impressed. Delving into Markov chains is about as geeky as you can get in baseball. The good news is that you have to delve no more. Next week make sure you get your copy of the 2008 THT Annual and you can play with your own perfect run modeler.
References and Resources
Thanks to Mark Pankin and Tom Tango for all their work on run estimators and Markov chains. Without their work this project would not have been possible.
John is an unashamed glory supporter having followed the Atlanta Braves since 1991. He blogs the Braves at Chop-n-Change. He welcomes comments, criticisms and suggestions via e-mail.