Bill James famously had his “favorite toy,” a sort of career projection formulation aimed at seeing the chances that someone had of reaching a certain milestone in a counting stat. And that’s sorta neat. But it’s not my favorite toy. My favorite toy is something I call the Lineup-based Out Projector (LOP), a set of formulas I use to predict the distribution of when outs are made in a game.
What I mean is this: There’s a 30 percent chance the first inning ends after your third hitter, 50 percent after your fourth, 12 percent after your fifth, etc. (I just made these particular numbers up.) And my toy does this not just the first inning, but for the whole game. So, if you give me a lineup, my favorite toy will give you the probability breakdown of when each inning will end. Which is a bit of a novelty really, except … you can use that information to do some pretty gnarly things.
Obligatory mathy section
All we need to do this are the expected on-base percentages of the nine players in your lineup, as well as the order in which the players will bat. Let’s define some variables. I’ll use p to denote a batter’s OBP, and p1 to denote the OBP of the first guy in the lineup, p2 for the second guy’s OBP, etc. I’ll call the probability of the nth guy making an out on any given plate appearance qn (that’s 1-pn, which is just 1-OBP).
Okay, so what we want to do is start at the beginning. What’s the distribution of who records the first out of the game? Well, at first glance, that seems really simple. The probability that it’s the first guy is just q1, that it’s the second guy is p1*q2 (or, the first batter’s OBP times the second batter’s out-making percentage), that it’s the third guy is p1*p2*q3, etc.
Except that if we do this all the way down, you might notice that the probabilities don’t quite add up to 1. The reason for this is that there is some probability (an exceedingly small one with typical levels of offense) that every single guy in your lineup will get on base, no outs. Or that they will do it twice, or three times, or… it goes on to infinity actually.
That’s okay though. We can deal with the infinity. The thing to realize is that the chances of having 10 guys in a row reach base and then have the 11th guy make an out are exactly the same as having one guy get on and then the next one make an out … except we need to multiply by the OBPs of all nine batters in your lineup, since you’ve batted around once. And having 19 guys get on and then the next batter record an out is also basically the same thing, but now we have to multiply by all nine of the on-base percentages twice.
So let’s define some more variables here. Let L be the product of all the OBPs, i.e. L = p1*p2*p3…*p9. Then, to account for batting around any integer number of times from 0 to infinity, we need to sum L^0+L^1+L^2… If we remember our calculus (yeah, thought you’d never use it again, didn’t you?), this sum works out to just be 1/(1-L), which is a number I call S.
Now, because L is the product of nine probabilities that were decently small to begin with, we get a value for S that’s barely more than 1, but then, how likely is it that we have nine straight guys reach base? Anyway, if we multiply the simple probabilities we calculated above (i.e. the q1, the p1*q2, the p1*p2*q3, etc.) by S, we get the precise values. And we can check ourselves by summing up the probabilities of the out being recorded at every position in the lineup: if we’ve done it right, it will come out to one.
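If you’d rather see that as code than as a spreadsheet, here is a minimal Python sketch of the first-out calculation. The function name and the OBP values are made up for illustration; only the formulas (L, S, and the p1*q2-style products) come from the description above.

```python
def first_out_distribution(obp, start=0):
    """Chance that each lineup slot makes the next out, starting with slot `start`."""
    n = len(obp)
    L = 1.0
    for p in obp:
        L *= p                   # L = p1*p2*...*p9
    S = 1.0 / (1.0 - L)          # S = L^0 + L^1 + L^2 + ...
    dist = [0.0] * n
    reached = 1.0                # chance every batter so far has reached base
    for k in range(n):
        slot = (start + k) % n
        dist[slot] = reached * (1.0 - obp[slot]) * S
        reached *= obp[slot]
    return dist


# Hypothetical OBPs for lineup slots 1 through 9 (not any real team).
lineup_obp = [0.340, 0.330, 0.380, 0.360, 0.330, 0.320, 0.310, 0.300, 0.290]

dist = first_out_distribution(lineup_obp)
for slot, prob in enumerate(dist, start=1):
    print(f"first out made by hitter {slot}: {prob:.4f}")
print("sum of probabilities:", round(sum(dist), 10))
```

The last line is the self-check: the nine probabilities should sum to one, whichever slot you start from.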
After we get the probability distribution for where the first out occurs starting with the first batter in the lineup, we want to duplicate the process starting with every other batter in the lineup as well. This is because the next thing we want to do is combine these distributions with each other to work out what happens after the second out.
The probability that the second out happens after, say, the third player in the lineup is the probability that the first out happened after the first player and (i.e. multiplied by) the probability that starting with the second player, the next out would happen after the third player, OR (i.e. plus) the probability that the first out happens after the second player and starting with the third player, the next out happens after the third player, or the probability that the first happens after the third, and starting on the fourth the next out happens after the third (meaning the lineup bats all the way around), etc. etc. etc.
We end up with a sum of nine products, each being the multiplication of two of the probabilities we found for the first out up above. Then we want to do this starting in every possible lineup position and ending in every possible position, which gives 81 probabilities total, just like for the first out.
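The sum of nine products is easier to read as a loop. Below is a rough Python sketch, with hypothetical function names and made-up OBPs, that builds the 81 first-out probabilities as a 9-by-9 table and then chains the table with itself to get the two-out table.

```python
def first_out_matrix(obp):
    """one[s][e]: chance that, starting with slot s, slot e makes the next out."""
    n = len(obp)
    L = 1.0
    for p in obp:
        L *= p
    S = 1.0 / (1.0 - L)
    one = [[0.0] * n for _ in range(n)]
    for s in range(n):
        reached = 1.0
        for k in range(n):
            e = (s + k) % n
            one[s][e] = reached * (1.0 - obp[e]) * S
            reached *= obp[e]
    return one


def combine(first, second):
    """Play out `first`, then start `second` with the batter after the last out."""
    n = len(first)
    out = [[0.0] * n for _ in range(n)]
    for s in range(n):                 # slot that starts the sequence
        for m in range(n):             # slot that makes the last out of `first`
            nxt = (m + 1) % n          # so the next batter up is m + 1
            for e in range(n):         # slot that makes the last out of `second`
                out[s][e] += first[s][m] * second[nxt][e]
    return out


# Hypothetical OBPs for lineup slots 1 through 9.
lineup_obp = [0.340, 0.330, 0.380, 0.360, 0.330, 0.320, 0.310, 0.300, 0.290]

one_out = first_out_matrix(lineup_obp)
two_outs = combine(one_out, one_out)

# Chance the second out is made by the third hitter, starting from the top.
print(f"{two_outs[0][2]:.4f}")
```

Each row of `two_outs` is a full distribution for one starting slot, so every row should still sum to one.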
Next, we combine the one out and the two out distribution in the same way to get the distribution for three outs—one inning. Then we can combine a one-inning distribution with another one inning distribution to get the distributions for two innings, one inning with two innings to get three innings, etc. etc. until we’re satisfied—I personally have it done through nine innings, though at this point it would take about five minutes to get it through 20.
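Here is how that stacking might look in Python, as a sketch rather than a copy of my spreadsheet. The helper names and the OBPs are assumptions for illustration, and like the LOP itself the sketch counts only outs made by batters (no double plays or outs on the bases).

```python
def first_out_matrix(obp):
    """one[s][e]: chance that, starting with slot s, slot e makes the next out."""
    n = len(obp)
    L = 1.0
    for p in obp:
        L *= p
    S = 1.0 / (1.0 - L)
    one = [[0.0] * n for _ in range(n)]
    for s in range(n):
        reached = 1.0
        for k in range(n):
            e = (s + k) % n
            one[s][e] = reached * (1.0 - obp[e]) * S
            reached *= obp[e]
    return one


def combine(first, second):
    """Play out `first`, then start `second` with the batter after the last out."""
    n = len(first)
    return [[sum(first[s][m] * second[(m + 1) % n][e] for m in range(n))
             for e in range(n)] for s in range(n)]


# Hypothetical OBPs for lineup slots 1 through 9.
lineup_obp = [0.340, 0.330, 0.380, 0.360, 0.330, 0.320, 0.310, 0.300, 0.290]

one_out = first_out_matrix(lineup_obp)
two_outs = combine(one_out, one_out)
inning = combine(one_out, two_outs)        # three outs = one inning

# by_inning[i] covers innings 1 through i + 1.
by_inning = [inning]
for _ in range(8):
    by_inning.append(combine(by_inning[-1], inning))

# Who makes the last out of the first inning, starting from the leadoff man?
for slot, prob in enumerate(by_inning[0][0], start=1):
    print(f"first inning ends on hitter {slot}: {prob:.4f}")
```

Row 0 of each table is the one you usually care about, since the game starts with the leadoff hitter. Going to 20 innings is just a longer loop.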
I plugged in a pretty average-looking lineup to see what’s going on. First, I found pretty quickly that there are really big probability spikes for the first out, smaller through two, and the more outs you add, the more the probability spreads out. You’d expect that if you went enough innings into the future, the probabilities would spread out to pretty much be even. But if you play around with it a little, you see that this predictability effect is more pronounced the less balanced your lineup is in general, and the further each of your OBPs is from .5 in particular.
Another interesting tidbit you can grab from the LOP—I think this is potentially the most useful thing—is to figure out where your innings are starting. By adding up the probabilities of the previous innings ending with the previous batter (plus one for the first inning in the case of your leadoff hitter), you get the expected number of innings started by each position in the lineup.
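A sketch of that bookkeeping in Python, again with hypothetical names and made-up OBPs: a slot leads off an inning whenever the slot before it made the last out of the previous inning, and the leadoff hitter gets one extra for the first inning.

```python
def first_out_matrix(obp):
    """one[s][e]: chance that, starting with slot s, slot e makes the next out."""
    n = len(obp)
    L = 1.0
    for p in obp:
        L *= p
    S = 1.0 / (1.0 - L)
    one = [[0.0] * n for _ in range(n)]
    for s in range(n):
        reached = 1.0
        for k in range(n):
            e = (s + k) % n
            one[s][e] = reached * (1.0 - obp[e]) * S
            reached *= obp[e]
    return one


def combine(first, second):
    """Play out `first`, then start `second` with the batter after the last out."""
    n = len(first)
    return [[sum(first[s][m] * second[(m + 1) % n][e] for m in range(n))
             for e in range(n)] for s in range(n)]


def innings_led_off(obp, innings=9):
    """Expected number of innings started by each lineup slot."""
    n = len(obp)
    one = first_out_matrix(obp)
    inning = combine(one, combine(one, one))
    leads = [0.0] * n
    leads[0] = 1.0                          # leadoff hitter starts the first inning
    through = inning                        # distribution through the innings so far
    for _ in range(innings - 1):
        for last_out in range(n):
            leads[(last_out + 1) % n] += through[0][last_out]
        through = combine(through, inning)
    return leads


# Hypothetical OBPs for lineup slots 1 through 9.
lineup_obp = [0.340, 0.330, 0.380, 0.360, 0.330, 0.320, 0.310, 0.300, 0.290]

leads = innings_led_off(lineup_obp)
for slot, value in enumerate(leads, start=1):
    print(f"hitter {slot} leads off {value:.2f} innings")
```

The nine values should add up to nine, one for each inning.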
It seems to me that there should be some way to make this very useful in working out how to optimally construct lineups for getting runs across, though precise ways of doing this are proving somewhat elusive to me. In any case, there are some interesting things you can see for typical lineups. As you’d expect, the batter in the one-hole leads off the most innings, but you might not expect that he tends to get somewhere around a full extra inning leading off compared to everyone else. The guys in the two- and three-holes lead off the fewest innings by a good bit, and the four and five hitters tend to be getting the most chances to start innings after the first.
But the most straightforward application of the LOP is to work out how many of each kind of event your team is getting in the course of the game. To do this, you need to use the LOP to work out how many plate appearances you expect from each batter, and that requires more math.
Set up a new variable called X that will be equal to the number of PAs that the last hitter in the lineup gets over the course of a game. The eighth hitter will obviously get X PAs, plus one extra PA every time he ends the game (a probability that we can get from the LOP). The seventh hitter will get what the eighth guy got plus one for every time HE ends the game, and so on. We also know the OBPs for each hitter, so if we multiply each hitter’s OBP by the PAs we expect for that hitter, we get the expected number of times he reached base, and more importantly, if we multiply (1-OBP) by the expected number of PAs for each hitter, we get the number of outs we expect him to make.
So in order to find X, keep playing around with it until you get the number where the sum of all players’ expected outs hits 27. Once you get that, you have the number of PAs you expect for every player, and you can use his profile to work out how many of those PAs are home runs, doubles, walks, outs, etc.
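If you don’t feel like fiddling, note that total expected outs is a straight-line function of X, so you can also solve for X directly instead of by trial and error. The Python sketch below does that; the direct solve is my shortcut here, not something the spreadsheet does, and the function names and OBPs are made up for illustration.

```python
def first_out_matrix(obp):
    """one[s][e]: chance that, starting with slot s, slot e makes the next out."""
    n = len(obp)
    L = 1.0
    for p in obp:
        L *= p
    S = 1.0 / (1.0 - L)
    one = [[0.0] * n for _ in range(n)]
    for s in range(n):
        reached = 1.0
        for k in range(n):
            e = (s + k) % n
            one[s][e] = reached * (1.0 - obp[e]) * S
            reached *= obp[e]
    return one


def combine(first, second):
    """Play out `first`, then start `second` with the batter after the last out."""
    n = len(first)
    return [[sum(first[s][m] * second[(m + 1) % n][e] for m in range(n))
             for e in range(n)] for s in range(n)]


def expected_pas(obp, innings=9):
    """Return (X, PAs per slot), where X is the last slot's expected PAs."""
    n = len(obp)
    one = first_out_matrix(obp)
    inning = combine(one, combine(one, one))
    game = inning
    for _ in range(innings - 1):
        game = combine(game, inning)
    game_end = game[0]                      # chance each slot makes the final out
    # extra[k]: chance slot k bats once more than the last slot does, i.e. the
    # game ends on slot k or later, but not on the last slot.
    extra = [sum(game_end[k:n - 1]) for k in range(n)]
    q = [1.0 - p for p in obp]
    outs = 3 * innings
    # sum(q[k] * (X + extra[k])) = outs, solved for X.
    x = (outs - sum(qk * ek for qk, ek in zip(q, extra))) / sum(q)
    return x, [x + e for e in extra]


# Hypothetical OBPs for lineup slots 1 through 9.
lineup_obp = [0.340, 0.330, 0.380, 0.360, 0.330, 0.320, 0.310, 0.300, 0.290]

x, pas = expected_pas(lineup_obp)
print(f"X = {x:.3f}, team PAs = {sum(pas):.2f}")
```

Multiplying each slot’s PAs by that hitter’s rate of homers, doubles, walks and so on then gives the expected count of each event.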
This data can be useful in all manner of applications, but of course the most straightforward is to predict the number of runs you’re going to score in a game. For this, you want to use whatever run modeler you think is most appropriate. For example, you can use this to calculate your TEAM wOBA (which is different from just averaging the wOBA of all the players, because you have to weight for the number of PAs each hitter gets!), and then convert that to runs per PA, and multiply by the total PAs of your team to get an estimate of runs scored.
(I want to take a second here to note that this isn’t going to necessarily predict the highest amount of run-scoring for putting your highest wOBA guys earlier in the lineup, if lower wOBA guys have higher OBPs; runs scored is based off wOBA*PA, not just wOBA). Or you can use it for your favorite version of BaseRuns or Runs Created or whatever other model you’ve cooked up for yourself or just like to use.
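The weighting step for team wOBA is simple enough to show in a few lines of Python. Everything below is illustrative: the function name, the wOBA values and the PA values are all made up, and the conversion from wOBA to runs per PA is left to whichever run modeler you prefer.

```python
def team_woba(woba, pas):
    """PA-weighted team wOBA: each hitter counts in proportion to his expected PAs."""
    total_pa = sum(pas)
    return sum(w * pa for w, pa in zip(woba, pas)) / total_pa


# Hypothetical per-slot wOBAs and expected PAs (slots 1 through 9).
slot_woba = [0.330, 0.320, 0.400, 0.370, 0.340, 0.320, 0.310, 0.300, 0.290]
slot_pas = [4.80, 4.70, 4.58, 4.47, 4.36, 4.25, 4.14, 4.04, 3.95]

weighted = team_woba(slot_woba, slot_pas)
simple = sum(slot_woba) / len(slot_woba)
print(f"weighted team wOBA: {weighted:.4f}")
print(f"simple average:     {simple:.4f}")
```

The two numbers differ because the top of the order bats more often, which is the whole reason to weight.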
Now, I do have some misgivings about this. A big issue with many of these systems is that they treat baseball as sequencing-neutral, i.e. you’re just as likely to have a home run and then a walk as you are to have a walk and then a home run. Now some of this gets masked, because usually there’s some part or parts of the system that are determined empirically, and whatever that is, it’s reflective of how real teams really construct their lineups and other un-homogeneities. But at some level, they still treat sequencing as neutral, and the whole point of what we did here with the LOP is that sequencing is NOT neutral; we have a batting order, and that gives us more data, and we should use that!
Potentially, we can create a new model off the data you get here too, since we know not only how many of each kind of event we’re getting, but also from what spot in the order we are getting it. This is the ultimate goal, and I’ve done a little bit of work on it, but it’s only a very little amount, as it’s very difficult to start from scratch, and I don’t have anything to test it on. Still, I’m reasonably confident that it’s possible.
To come up with an example, I just picked a random team lineup—in this case, I grabbed the Blue Jays’ lineup from June 3, 2011 as a good, not particularly special American League team—and plugged in each player’s season stats as my projection. Again, if you’re looking to do accurate analysis, you’d want to use some kind of super-accurate projection for each hitter, based on the park, the opposing pitcher, etc. But for the purposes of the example, the point will get across anyway.
The first thing to do after getting all the guys’ stats in place is to figure out how many PAs everybody gets. I always start with a guess around four, since league average OBP generally gets you somewhere around there.
Fiddling around with the numbers until I got 27 total outs, I found that the last guy in the lineup (Jayson Nix) would have expected to come up a hair over 3.95 times per game, with the overall PAs of the team totaling just under 39.5. This, of course, means 12.5 guys reaching base, of which 3.73 walked, 5.61 singled, 1.85 doubled, 0.24 tripled, and 1.08 homered. Of the nine innings, we expect 1.89 to be led off by the leadoff hitter, Yunel Escobar. The next-most are the 0.99 innings started by the five-hitter, J.P. Arencibia. The two- and three-hitters, Corey Patterson and Jose Bautista, get the fewest chances to lead off, coming in at 0.77 and 0.79, respectively. Having tried this out on a few other teams, I can say that these results are pretty typical.
How about a National League team? I checked out the Cubs’ lineup from Aug. 1, 2011. The pitcher’s spot ends up coming up only around 3.79 times per game here (you can already see the effects of the lower offense right there), and with only around 37.9 PAs for the whole team, we can also see that there are about 1.6 fewer base-runners for these Cubs than the Jays had. Every single positive offensive event drops off from the Blue Jays, too, from the .08 fewer home runs we expect to the 1.17 fewer walks (singles, doubles and triples drop off much more in line with the homers).
We also see an exaggeration of the effect of the leadoff guy leading off innings, as we’d expect Starlin Castro to be due up first in 1.95 of the nine innings, more than one inning above every other player except the number five hitter, Marlon Byrd, and the four-hitter, Carlos Pena, who clocked in at second with just over one time leading off per game. Again, we see the second and third hitter in the lineup starting innings significantly less often than everyone else, clocking in at only 0.78 and 0.74 times leading off, respectively.
Of course, I would be remiss to not mention the limitations of the LOP. The first and most important thing is that it is a slave to the projections you feed in for each player’s OBP. You need this projection to be for the particular game you’re projecting, not the whole season or a career. Let me repeat that, you need to use GAME projections rather than SEASON projections.
This means you want to adjust for ballpark and defense and weather and all those other little things, but most importantly for the opposing pitcher. The reason that this is important is that season projections work on the aggregate, but we’re not projecting the aggregate here. Thus, if you have some enormous platoon split for a hitter and he gets on at a .250 clip against righties but a .400 clip against lefties, that might aggregate out to an OBP of .300 for the season (if he faces twice as many righties as lefties). But in an individual game, a 30 percent chance to end after batter one, 20 percent after two, and 50 percent after three versus a 45/15/40 will definitely not average out to ending most often after the second batter!
There are some other issues as well. Changing probabilities over the course of the game messes the model up—so things like changing temperature and the times-through-the-order effect, but most importantly pitching changes, are going to cause errors. You can potentially get around these by recalculating the remainder of the game using new probabilities for the last parts, i.e. if the pitcher changes after the sixth inning, I do the first six innings based on my projections against the starter, then when I would go to combine innings by adding in the next one, I add in a new inning that’s based on the reliever’s stats instead.
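As a rough Python sketch of that workaround: build one inning table from the OBPs you project against the starter, another from the OBPs against the reliever, and chain six of the first with three of the second. All names and numbers here are hypothetical, and the sketch assumes you already know the change comes after the sixth.

```python
def first_out_matrix(obp):
    """one[s][e]: chance that, starting with slot s, slot e makes the next out."""
    n = len(obp)
    L = 1.0
    for p in obp:
        L *= p
    S = 1.0 / (1.0 - L)
    one = [[0.0] * n for _ in range(n)]
    for s in range(n):
        reached = 1.0
        for k in range(n):
            e = (s + k) % n
            one[s][e] = reached * (1.0 - obp[e]) * S
            reached *= obp[e]
    return one


def combine(first, second):
    """Play out `first`, then start `second` with the batter after the last out."""
    n = len(first)
    return [[sum(first[s][m] * second[(m + 1) % n][e] for m in range(n))
             for e in range(n)] for s in range(n)]


def inning_table(obp):
    """Three-out (one inning) table for a given set of OBPs."""
    one = first_out_matrix(obp)
    return combine(one, combine(one, one))


def repeat(table, times):
    """Chain the same table with itself `times` times."""
    total = table
    for _ in range(times - 1):
        total = combine(total, table)
    return total


# Hypothetical game projections against the starter and against the reliever.
obp_vs_starter = [0.340, 0.330, 0.380, 0.360, 0.330, 0.320, 0.310, 0.300, 0.290]
obp_vs_reliever = [0.320, 0.310, 0.350, 0.340, 0.310, 0.300, 0.290, 0.280, 0.270]

first_six = repeat(inning_table(obp_vs_starter), 6)
last_three = repeat(inning_table(obp_vs_reliever), 3)
full_game = combine(first_six, last_three)

for slot, prob in enumerate(full_game[0], start=1):
    print(f"game ends on hitter {slot}: {prob:.4f}")
```

If the two OBP lists are identical, this gives the same answer as chaining nine innings of a single table, which makes for a handy sanity check.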
Of course, this is a bit more work to do and rather problematically assumes you know when the pitching change will happen, but I guess simulators have the same problem. The LOP also doesn’t handle double plays very well at all. You can theoretically get things like caught stealing to work by rolling them into outs, but it’s a bit hard, once again, to correctly situationalize things.
But if you understand its limitations, the LOP can be another nice tool in your arsenal to tackle analysis with, and a really fun toy to play around with.
References & Resources
Click here to download an Excel file of the math.