About Those Stats…
by Dave Studeman
May 02, 2004
The Hardball Times is mostly about good baseball writing. But just for fun, we like to throw in numbers and graphs too. This is baseball, after all, where words and numbers go together like a superstar and his entourage. Can't have one without the other.
So we formed a partnership with an excellent baseball statistics company, Baseball Info Solutions, to bring you stats on a regular basis. The folks at Baseball Info Solutions score every major league game, feed it into their database, and send it to us every night. It's not "play by play" data, but it's great stuff.
When publishing these stats on the Web, we like to follow a few simple rules:
- Present the best, most insightful stats -- those that best tell the story
- When possible, use stats that are easy to understand and explain
- Present the stats in a concise manner that's easy to interpret
- Experiment, take suggestions, keep what works and add new ideas
With these rules in mind, here's a review of the stats we've published so far (using examples through games of April 30th).
For our first example, let's look at one of the few bright spots up in Seattle:
PLAYER     TEAM     RSAA  IP    ERA   RA    FIP   DER   LD%   K/9  BB/9  HR/9  SLG
Garcia F.  Seattle  11    35.3  2.29  2.29  0.09  .727  .161  6.4  2.3   0.5   .331
Freddy Garcia has rebounded nicely so far this year, even though he's 0-1. It's not Garcia's fault that the Mariners have only scored 11 runs in his five starts. In fact, we think won/loss record is so irrelevant, we haven't even bothered to post it yet.
The pitchers on each team are ranked according to RSAA, which stands for Runs Saved Above Average. This may be a new stat to you, but it's been around for a while and it's relatively simple to understand. It's the difference between the runs a league-average pitcher would have allowed in the same number of innings and the runs this pitcher actually allowed. If the number is positive, he's pitched better than the league average -- and vice versa for negative numbers. Garcia actually leads the league in RSAA at 11.
I believe this stat was "invented" by Pete Palmer a long time ago (he called it "Pitching Runs"), but it's now published by Lee Sinins and others, and it's a very effective way to measure a pitcher's impact on his team. You should note that it's based on total runs allowed, not just earned runs allowed, and we adjust it for ballpark factors so you can legitimately compare pitchers across the league (see the park factor section below).
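As a rough illustration of the idea, here is a minimal Python sketch of an RSAA-style calculation. The function name, the way the park factor is applied (as a simple multiplier on the league rate) and the league rate of 5.0 runs per nine innings are all assumptions made for the example, not the exact method behind our published numbers. The nine runs allowed are back-calculated from Garcia's 2.29 RA over 35.3 innings.

```python
def runs_saved_above_average(runs_allowed, innings, league_ra9, park_factor=1.0):
    """Runs an average pitcher would allow in these innings, minus actual runs.

    league_ra9 is the league's total runs allowed per nine innings.
    park_factor is applied as a plain multiplier (an assumption for this sketch).
    """
    expected_runs = league_ra9 * park_factor * innings / 9.0
    return expected_runs - runs_allowed


# Hypothetical inputs: 9 runs in 35.3 innings against a placeholder league rate of 5.0
garcia_rsaa = runs_saved_above_average(runs_allowed=9, innings=35.3, league_ra9=5.0)
print(round(garcia_rsaa, 1))
```

A positive result means the pitcher has been better than average; a pitcher who allows more runs than the league rate predicts comes out negative.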
Innings pitched, ERA and RA (total runs allowed per nine innings) are all pretty straightforward. Although Freddy Garcia has not given up any unearned runs, some pitchers have given up a lot of them. Count me among those who believe that pitchers should be held at least somewhat accountable for the unearned runs they allow. By presenting both stats together, we allow you to see the whole picture.
FIP stands for Fielding-Independent Pitching, and it's my own personal favorite pitching stat. It was inspired by Voros McCracken's DIPs work (which is extremely insightful but complicated), and invented by Tangotiger (who found the simple math behind the concept). The idea is to take those things that a pitcher is solely accountable for (and that fielders don't affect) and add them together in terms of their overall run impact.
The formula is:
(13 times HR plus three times (walks and HBP) minus two times strikeouts)/(Innings Pitched)
FIP is essentially that portion of ERA for which a pitcher can be held solely accountable. If you add 3.20 to a pitcher's FIP, you'll get an approximation of what his ERA would be with an "average" defense behind him. For instance, Garcia has a 0.09 FIP, versus a league average of 1.34 (more than one run per game better than the league average). If he had an "average" defense behind him, his ERA would be around 3.29. His actual ERA is 2.29, which indicates that he and his fielders are also doing well with batted balls.
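For readers who like to check the arithmetic, the formula above can be sketched in a few lines of Python. The function names and the sample counting stats are made up for illustration; only the weights (13, 3, -2) and the 3.20 constant come from the text, and the published numbers also include a park adjustment that is not shown here.

```python
def fip(home_runs, walks, hit_by_pitch, strikeouts, innings):
    """Fielding-Independent Pitching, per the formula in the text (no park adjustment)."""
    if innings <= 0:
        raise ValueError("innings must be positive")
    return (13 * home_runs + 3 * (walks + hit_by_pitch) - 2 * strikeouts) / innings


def era_with_average_defense(fip_value, constant=3.20):
    """Approximate ERA with an average defense: FIP plus 3.20."""
    return fip_value + constant


# Hypothetical pitcher: 2 HR, 9 BB, 1 HBP, 25 K in 35 innings
print(round(fip(2, 9, 1, 25, 35.0), 2))

# Garcia's published 0.09 FIP, converted to an ERA-like number
print(round(era_with_average_defense(0.09), 2))
```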
Speaking of batted balls, the next stat is DER, which stands for Defense Efficiency Ratio. This is the proportion of playable batted balls that the fielders successfully turned into outs. The exact formula I use is:
(Batters faced minus hits, walks, HBP and strikeouts)/(batters faced minus home runs, walks, HBP and strikeouts)
For team DER, I subtract 60% of errors from the top number, in order to estimate the number of batters that the fielders allowed to reach base via an error.
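Here is the same DER formula as a short Python sketch, including the optional 60% error adjustment for team totals. The function and parameter names and the sample line are hypothetical; the arithmetic follows the formula as written above.

```python
def der(batters_faced, hits, walks, hit_by_pitch, strikeouts, home_runs,
        errors=0, error_share=0.6):
    """Defense Efficiency Ratio: share of playable batted balls turned into outs.

    For an individual pitcher, leave errors at 0. For team DER, pass the team's
    error count; 60% of it is subtracted from the numerator, as in the text.
    """
    outs_on_batted_balls = (batters_faced - hits - walks - hit_by_pitch
                            - strikeouts - error_share * errors)
    playable_balls = batters_faced - home_runs - walks - hit_by_pitch - strikeouts
    if playable_balls <= 0:
        raise ValueError("no playable batted balls")
    return outs_on_batted_balls / playable_balls


# Hypothetical pitcher line: 140 BF, 28 H, 9 BB, 1 HBP, 25 K, 2 HR
print(round(der(140, 28, 9, 1, 25, 2), 3))

# Same totals treated as a team line with 5 errors
print(round(der(140, 28, 9, 1, 25, 2, errors=5), 3))
```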
FIP and DER tell you a lot about the sources of a pitcher's success and failure. He may do the "pitcher-only" thing well, but his fielders might be letting him down. Or vice versa. Or both. Together, FIP and DER paint a fairly complete picture.
However, it's misleading to say that DER is just a reflection of the team's fielders. DER is a function of a lot of things: fielders, ballparks, groundball/flyball ratios and line drives allowed (not to mention plain old luck). So to help paint the picture a bit more, we've added a stat called LD% (line drive percent).
In general, Baseball Info Solutions classifies all in-play batted balls as groundballs, flyballs and line drives (they also classify bunts separately). These are just what they sound like: groundballs hit the ground, flyballs tend to leave the bat on an upward path, and line drives come straight off the bat. These are not indications of how hard the ball is hit, but the trajectory of the batted ball. For instance, most home runs are flyballs, not line drives.
Still, a line drive is more likely to become a hit than a groundball or a flyball. To show what I mean, take a look at the LD% and DER of the White Sox's four top pitchers:
See, as the LD% goes up, the DER goes down. I chose these four pitchers because they've pitched in the same ballparks with the same fielders, have pitched at least 20 innings and have similar GB/FB ratios. If any of these factors were different, the graph would change. LD% is an important part of the equation, but not the only part.
Taking a look at Garcia, his DER is relatively high (league average is .685) and his LD% is relatively low (league average is .180). So he deserves some credit for that high DER. Also, his GB/FB ratio is exactly 1.0, which means that he's a bit of a flyball pitcher, which also helps raise his DER.
The final stats are for informational purposes: each of the three FIP factors is broken out per nine innings, allowing you to see each pitcher's relative FIP strength. Finally, we've thrown in each pitcher's Slugging Percentage Allowed, just because it's hard to find this stat anywhere else. Let us know if you find it useful.
On to batting statistics. For this example, let's look at a guy who has gotten off to a very good start:
Player      Team         RC  RC/G  PA  OUTS  BA    OBP   SLG   ISO   GPA   P/PA  LD%   BA/RISP
Lo Duca P.  Los Angeles  19  11    85  48    .416  .447  .545  .130  .363  3.66  .286  .471
Paul Lo Duca has been key to the Dodgers' fast start, and these stats show it. He leads the team in Runs Created, BA, GPA and several other stats. But I'm getting ahead of myself.
Remember how I said that we prefer simple stats? Well, I guess there's an exception to every rule, and Runs Created is ours. Bill James invented Runs Created over 20 years ago as a pretty simple calculation, but it has grown much more complicated since. I won't lay out the entire formula, but suffice it to say that it includes OBP, SLG, stolen bases, hitting in the clutch and all the various ways of making outs (including double plays and times caught stealing). It's also been adjusted for ballpark factors. As you can see, Runs Created has evolved into a complicated, but fairly complete, assessment of a batter.
Having said that, this choice of a "run estimator" is just a tad controversial. A lot of work has been done on run estimators by guys like Tangotiger, MGL, David Smyth and US Patriot, among others. For this edition of stats, I chose Runs Created because it is well known, it's a part of Win Shares (more on that in a moment), and I'm most familiar with it. But I have a feeling you'll be seeing other run estimators from us in the future.
RC/G is Runs Created per 27 outs made by the batter. It takes 27 outs to complete a game, and that's the theory behind this stat -- it's the theoretical number of runs a lineup would score per game if this batter took every turn at bat. This is also a controversial statistic, though a familiar one. Watch for different versions of "run estimator rate" stats as the year goes on, too.
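The scaling itself is simple, even if Runs Created is not. A minimal Python sketch (the function name is made up) using Lo Duca's line of 19 Runs Created and 48 outs:

```python
def runs_created_per_game(runs_created, outs, outs_per_game=27):
    """Scale a batter's Runs Created to a full game's worth of outs."""
    if outs <= 0:
        raise ValueError("outs must be positive")
    return runs_created * outs_per_game / outs


# Lo Duca: 19 RC and 48 outs, from the batting line above
lo_duca_rc_per_game = runs_created_per_game(19, 48)
print(round(lo_duca_rc_per_game))
```

Rounded to the nearest run, that agrees with the 11 shown in his RC/G column.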
PA is the number of times the batter has appeared at the plate, and Outs is outs -- including outs due to hitting into double plays and being caught stealing. Just looking at the ratio of outs to PAs is a useful way to evaluate a batter.
Next, you've got your rate stats: BA (Batting Average), OBP (On-Base Percentage), SLG (Slugging Percentage) and ISO (Isolated Power, which is SLG minus BA). These are all useful stats, and they highlight different strengths of each batter. For instance, Lo Duca's primary value is his high BA (second in the league); he hasn't taken a lot of walks, and his ISO is below the league average of .165.
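Two of these are easy to compute yourself. The Python sketch below (function names are hypothetical) derives ISO and the outs-per-PA ratio from Lo Duca's line. Because it starts from the rounded BA and SLG shown in the table, the ISO it produces can differ from the published .130 in the last digit.

```python
def isolated_power(slugging, batting_average):
    """ISO: slugging percentage minus batting average."""
    return slugging - batting_average


def outs_per_plate_appearance(outs, plate_appearances):
    """Share of plate appearances that ended up costing the team an out."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    return outs / plate_appearances


# Lo Duca: .545 SLG, .416 BA, 48 outs in 85 PA
print(round(isolated_power(0.545, 0.416), 3))
print(round(outs_per_plate_appearance(48, 85), 3))
```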
Just for fun, we've decided to eschew OPS for now (you can probably just figure it out in your head) and run with a different stat called GPA, or Gross Production Average.
The math behind GPA is simple:
(1.8*OBP plus SLG)/4
The math accomplishes two things. It weights OBP more strongly than SLG (because OBP is worth more; ask Paul DePodesta for details) and it converts the stat into a scale very similar to BA, in which .200 stinks, .300 is very, very good and .400 is almost unheard of. We think this makes it a simple, useful stat, and we make it more useful by adjusting it for ballpark factors.
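In Python, the unadjusted version of that formula looks like the sketch below (the function name is made up). Plugging in Lo Duca's .447 OBP and .545 SLG gives a raw figure that is lower than the .363 in his stat line; the published number is adjusted for ballpark factors, and that adjustment is not reproduced here.

```python
def gross_production_average(obp, slg, obp_weight=1.8):
    """Raw GPA: (1.8 * OBP + SLG) / 4, before any ballpark adjustment."""
    return (obp_weight * obp + slg) / 4


# Lo Duca's OBP and SLG from the batting line above
print(round(gross_production_average(0.447, 0.545), 3))
```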
Over at Baseball Prospectus, they have an article about OPS, in which they make the argument that you need to go beyond OPS to get a good grasp of a batter's contribution. We agree, and we think GPA does that, similar to Baseball Prospectus' EqA.
The last three stats on the batting line are P/PA (pitches seen per plate appearance), LD% (computed in the same manner as in the pitching section -- most relevant for interpreting batters whose primary value is in their Batting Average) and BA/RISP, which is Batting Average with Runners in Scoring Position. If you believe in clutch hitting, this is a better way to measure it than RBIs, because it is not dependent on guys getting on base in front of the batter. As you can see, Lo Duca has been hitting a lot of line drives (consistent with the high BA), and batting very well with runners in scoring position.
I've always liked the way Bill James described Park Factors in his "Win Shares" book: "If you think you know how to do it, you don't understand the problem." Park Factors are a complicated issue, and we've taken a slightly different approach to them.
One-year park factors are unreliable, in my opinion. Did you know that Colorado pitchers allowed more runs on the road last year than at Coors? Is there really any good reason for that, other than an outright fluke? I don't know, but I doubt it. So we've computed multi-year park factors for each park (up to five years for those that have been around that long) and regressed the park factors to the mean, based on how much data we have. In other words, we take the long view, with large sample sizes.
Okay, we didn't do it ourselves. We based our park factors on a spreadsheet that US Patriot has kindly made available at his site. Thank you, Patriot.
Next, we have applied those park factors to each team, based on the actual ballparks that team has played in so far. So each team's park factor is specific and dynamic throughout the season. With today's interleague games and unbalanced schedules, we think this is the best approach to park factors. Hopefully, it makes our adjusted stats (Runs Created, GPA, RSAA and FIP) a bit more valid.
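To make the two steps concrete, here is a loose Python sketch. Everything in it is an assumption for illustration: the shrinkage rule (years of data divided by years plus a constant), the constant itself, the games-weighted average and the sample numbers. It is not the calculation in US Patriot's spreadsheet, just one simple way the ideas of regressing to the mean and weighting by schedule could be expressed.

```python
def regress_to_mean(raw_factor, years_of_data, shrink_years=2.0, mean=1.0):
    """Pull a raw park factor toward the mean; less data means a stronger pull.

    shrink_years is a hypothetical tuning constant, not a published value.
    """
    weight = years_of_data / (years_of_data + shrink_years)
    return mean + weight * (raw_factor - mean)


def team_park_factor(games_by_park, park_factors):
    """Average the park factors of the parks a team has played in, weighted by games."""
    total_games = sum(games_by_park.values())
    if total_games == 0:
        raise ValueError("team has not played any games")
    weighted_sum = sum(games * park_factors[park]
                       for park, games in games_by_park.items())
    return weighted_sum / total_games


# Hypothetical park with a raw factor of 1.20 and only two years of data
print(round(regress_to_mean(1.20, years_of_data=2), 2))

# Hypothetical team: 12 games in park A (0.95) and 8 games in park B (1.10)
print(round(team_park_factor({"A": 12, "B": 8}, {"A": 0.95, "B": 1.10}), 3))
```

Because the games played in each park change every night, a factor computed this way moves throughout the season, which is the "dynamic" behavior described above.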
I've been asked if we'll be calculating in-season Win Shares, and the answer is yes. Give me a couple of weeks to work out the kinks in the system.
And we're working to make our stats section even better. In addition to the full tables we've already got, we'll have leaderboards, so you can find the players whose performances are most outstanding (or most embarrassing, if that's your cup of tea).
We'll also have a "geek's paradise" report, listing many of the more esoteric stats that you can't always find elsewhere; we realize that most readers don't want to wade through all of that stuff, but some do (and don't worry, we will explain all of those measures in the same detail you find here). And we'll have all of this updated every night, so you can have completely up-to-date numbers with your morning cup of coffee.
References and Resources
There are so many good sources of baseball statistics on the Web, I don't even know where to start. Here are just a few ideas:
Major League Baseball's official site is full of stats and definitions.
Our friends at Batter's Box have a nice review of statistical websites.
And Baseball Prospectus has been running a fine series on the basics of statistics.
Dave was called a "national treasure" by Rob Neyer. Seriously. Comments about this article can be sent to him through the miracle of e-mail.