Any press, as they say, is good press.
Recently I had a chance to get my hands on the new Baseball Prospectus book, Extra Innings, which, like their previous book, Baseball Between the Numbers, has a series of chapters, each of which deals with a central question.
This isn’t a book review, just an analysis of one chapter. Veteran BP writer (and recent Bleacher Report addition) Steven Goldman penned a chapter titled “How can we evaluate managers?” This really gets my attention, as I wrote a book titled Evaluating Baseball’s Managers. The similarity in titles between my book and his chapter isn’t a simple coincidence. This chapter is largely a response to my book.
The short version is that Goldman disagrees with much of my book. Though he states that I wrote “a fine book” that does an excellent job depicting the characteristics of the various managers it profiles, there is a fundamental disagreement between my book and his chapter. Ultimately, I wrote a book titled Evaluating Baseball’s Managers, and Goldman is quite a bit more pessimistic about our ability to do just that: evaluate baseball managers. So yeah, there are points of contention.
Criticisms I agree with
I’ll be the first to admit Goldman makes some valid critical points. Three in particular should be noted. First, Goldman notes that I misconstrued the conclusion made by former Prospectus writer (and current Tampa Bay Rays brain) James Click in a study he did for the Prospectus book, Baseball Between the Numbers, that examined a manager’s ability to improve individual batter performance. I stated that Click says managers have no impact on it.
Goldman corrects me, saying that Click’s study couldn’t show that managers affected player performance, but that’s not the same thing as saying no such effect exists. This distinction has been noted in sabermetrics since at least as far back as Bill James’ “Underestimating the Fog,” but I botched it.
That said, looking at it again, I notice that Click slipped a bit and also said, “managers show no consistent ability to improve batter performance.” Yeah, his main conclusion is what Goldman states, but that line sounds more like it doesn’t exist than we can’t prove it exists.
Second, Goldman justifiably calls me out on some mistakes made in my Casey Stengel section. I argued that Stengel’s Yankees were very unsentimental in their treatment of aging players, moving them out once they could. However, Goldman notes that some of the moves I cited as examples were actually made despite, not because of, Stengel. Fair enough, good point. Should I ever update the book, I’ll remember this.
Third, Goldman notes one thing I all but state on my own in my book: I engage in confirmation bias. I entered this study with an assumption that managers have an impact on player performance and present some (admittedly not conclusive) statistical evidence of it based on some data called the Birnbaum Database.
Goldman quotes me stating, “I do not believe in limiting myself to mathematical rationales. This evidence [the Birnbaum Database] beautifully corresponds to long-lasting and widely held notions that managers can and do have an impact on player performance. I therefore accept it.”
Goldman flatly states: “This is textbook definition of confirmation bias; Jaffe excepts (sic) the results because they conform his beliefs, not because they are illustrative of reality.” I plead guilty. But Goldman presses his point. It ain’t either/or. The results could both conform to my beliefs and be illustrative of reality.
Less surprisingly, I have areas of disagreement with Goldman. Let’s look at some main issues/themes of disagreement.
Many of the disputes between Goldman and me can be seen as micro versus macro. Some of the arguments Goldman makes against my book make a lot of sense on the micro level. But the problem is, my book was trying to focus more on the macro level.
Let’s look at probably the most important dispute, when Goldman takes on the Birnbaum Database. First, let’s back up and explain the Birnbaum Database, which inspired my book. It’s an attempt to project how a player should have performed in any given season based on his real-life performances in the two preceding and two succeeding years.
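To make the idea concrete, here is a minimal sketch of that kind of projection. It is not the actual Birnbaum Database formula: the unweighted average, the function name `project_season`, and the sample player are all assumptions made for illustration, and the real method may weight seasons or adjust for playing time.

```python
from statistics import mean

def project_season(rates_by_year, year):
    """Estimate a player's expected rate in `year` from the two seasons
    before and the two seasons after it.

    Assumption: a plain unweighted mean of whichever neighboring
    seasons exist. The real database may work differently.
    """
    neighbors = [year - 2, year - 1, year + 1, year + 2]
    known = [rates_by_year[y] for y in neighbors if y in rates_by_year]
    if not known:
        raise ValueError("no surrounding seasons to project from")
    return mean(known)

# Hypothetical player: runs created per 100 plate appearances, by season.
player = {1970: 12.0, 1971: 13.0, 1972: 16.0, 1973: 14.0, 1974: 13.0}

expected_1972 = project_season(player, 1972)    # mean of 12, 13, 14, 13
difference_1972 = player[1972] - expected_1972  # actual minus expected
```

The gap between the actual season and the projection is what then gets credited to (or charged against) the manager, which is exactly why any single player's gap says so little by itself.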
Goldman criticizes this, and his criticism has a core of truth, especially on the micro level. He is absolutely right when he writes “players exceed or fall below projections for many reasons: injury, a happy new marriage, a nasty divorce, a taste of one of the magical elixirs in the PED cabinet. None of these elements reflect in any way on the manager.”
Any time you’re looking at just one player, any variation in projection tells you about the player—and virtually nothing about the manager. How could it? There are too many other factors muddying the waters; factors including injuries, divorces, marriages, and magic elixirs.
But I noted that in my book. Heck, before daring to introduce any results from the Birnbaum Database, I spend several pages discussing its flaws and limitations. A lot of the issues, especially the ones Goldman raised, can be minimized (albeit never perfectly solved) using sample size.
Let’s look at Earl Weaver. The Birnbaum Database rates his pitchers at +409 runs better than expected and his hitters at +183 runs. That places him among the best managers in history. This is a man who lasted about 2,500 games as a manager. Does it really sound reasonable to presume that his players were just that much more happily married than those on all opposing clubs? Doubtful. Was 1970s Baltimore some early and unknown haven for PED usage? Color me skeptical.
Sample size, it comes in handy. There’s a reason why the book focuses on managers with longer careers. They’re the ones where the numbers are less distorted by outside factors.
Look, these outside factors Goldman notes still distort things. The numbers on Weaver aren’t perfect. Nor are they for Tony La Russa or John McGraw or anyone. My book never claims they’re perfect. It’s just that imperfect isn’t a synonym for useless. Goldman’s comments are valid with regard to individual players, but the larger the sample size, the less valid those comments become.
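The sample-size point can be put in rough numbers. This is a back-of-the-envelope sketch, not a calculation from the book: it assumes each player-season carries the same amount of non-manager noise and that the noise is independent from one player-season to the next, and the 10-run figure is made up.

```python
import math

def noise_in_average(per_player_sd, n_player_seasons):
    """Standard error of an average of n independent, equally noisy values."""
    return per_player_sd / math.sqrt(n_player_seasons)

# Hypothetical: 10 runs of marriage/divorce/injury-style noise per player-season.
PER_PLAYER_SD = 10.0

one_player = noise_in_average(PER_PLAYER_SD, 1)     # a single player-season
one_roster = noise_in_average(PER_PLAYER_SD, 25)    # one 25-man roster, one year
long_career = noise_in_average(PER_PLAYER_SD, 400)  # hundreds of player-seasons
```

Under those assumptions the noise in the average shrinks with the square root of the sample, so it never reaches zero but it does get small. The independence assumption is the part to watch: anything that hits a manager's players systematically would not wash out this way.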
For that matter, even with the increased sample sizes of a longer career, there are still times I have serious reservations with regard to the numbers. In fact, most of the managerial commentary on Don Zimmer and Terry Francona focuses on how I disagree with the Birnbaum Database.
One last note along these lines. Goldman also throws in injuries as a sign of how players might miss their projections. Yeah, on the micro level, that’s true. But what happens if, say, a manager has a whole ton of pitchers go down in injured tatters on his watch? Doesn’t that tell us something about a manager? If not, a hell of a lot of people owe Dusty Baker (among other managers) their apologies. For that matter, if a manager has a track record of players staying healthier than normal under his watch, that can be a sign he knows how to take care of them.
Mind you, the Birnbaum Database isn’t the only time sample size issues emerge. Near the end of the chapter, Goldman presents several examples of how it’s difficult to evaluate managers, and much of this is also dependent on small sample sizes. Goldman mentions how former Orioles manager Hank Bauer once engineered a trade for Billy Williams only to be vetoed by his bosses. Goldman also mentions Earl Weaver and the one year his players took control of in-game management on their own, and the team got better.
Look, if you start at the micro level trying to analyze each specific decision or even each individual season, you’ll go crazy trying to evaluate managers. You have to take the long view. I think getting the bigger picture first works better. Then, once you find trends, you can use specific examples to characterize the manager, but don’t start with specifics. If you do, you’ll never see the forest for the trees.
Department of huh?
Sometimes I didn’t quite get what the criticism was, or a passage read like a criticism but I couldn’t see how it was one.
Let’s give an example, going back to what Goldman says about Click and me. Some background: Click wrote a chapter on managers in Baseball Between the Numbers that featured a one-page study I mentioned earlier on a manager’s impact on players. The rest of the chapter focuses on other matters such as baseball strategy and in-game decisions.
First, Goldman quotes my objections to Click’s study and then offers his response. After noting that I misstated Click’s conclusion, Goldman writes:
The key to making an argument about any subject, be it managers or murderers, is to present evidence. In the case of managers, a skipper who possessed the skill to consistently alter some attribute of his clubs would manifest that ability consistently and in a way we could document. While many managers do tend to mold their teams in certain characteristic ways over time (something that Jaffe’s book excels in demonstrating), there is no evidence that over time managers can have more than a small positive influence on the outcome of a given contest or season in terms of his on-field impact, be it through his tactical choices or in some Svengali-like effect that hypnotizes batters or pitchers to perform in a way that was dramatically different than they might have otherwise. This is distinct from how a manager might positively influence a club through the way he shapes the work environment or psychologically affects certain players, but these are impossible to pin down statistically.
There’s a lot in that quote that made me scratch my head. First—and this is easy to miss—the focus actually shifts at the outset. Please remember that just before this section, Goldman was recounting my dispute with Click’s study on managerial impact on individual batters. Here, immediately after implying I lack evidence (more on that in a second), Goldman talks about in-game strategy (“his tactical choices”).
Wait, the issue I had with Click’s study wasn’t about tactics, but about impact on individual batters. That’s coaching, not tactics. There’s a lot on tactics in Click’s chapter, but that wasn’t what the dispute was about.
Well, that bit about tactical choices was just a clause in a longer sentence. Yeah, but the rest of the sentence, the part about “some Svengali-like effect,” left me wondering, too. Specifically, it seemed like the very next sentence contradicted it. When I read the Svengali line, I figured Goldman was doubting that managers can have any meaningful impact on their players behind the scenes, either in terms of coaching or psychologically. However, Goldman’s very next sentence says managers can have a positive psychological impact.
But there’s still one big “huh?” left from near the top of the quote. What’s this about lacking evidence? Actually, I provided some evidence from the Birnbaum Database that managers can impact player performance. I’ll be the first to admit it’s not conclusive, but proof and evidence are different words.
I divided all baseball games into those managed by men who lasted 2,000 or more games, 1,000-1,999 games, 500-999 games, or 499 or fewer games. Then I ran those four groups through the Birnbaum Database and saw if the results looked more like luck or managerial skill. If they were luck, you’d expect the managers who lasted over 2,000 games to have the most average score. After all, luck should even out over time, and a minimum of 2,000 games is a long time. If it’s skill, you’d expect the 2,000s to be best. The results provided evidence of managerial skill to improve (or worsen) player performance.
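The shape of that test can be sketched in a few lines. Everything below is illustrative: the managers, their scores, the per-162-games scale, and the function names are made up, and the book's actual scoring may be normalized differently. Only the four career-length groups and the luck-versus-skill logic come from the description above.

```python
from collections import defaultdict
from statistics import mean

def career_bucket(games):
    """Assign a manager to one of the four career-length groups."""
    if games >= 2000:
        return "2,000+"
    if games >= 1000:
        return "1,000-1,999"
    if games >= 500:
        return "500-999"
    return "0-499"

def bucket_means(managers):
    """Average score per career-length group."""
    scores = defaultdict(list)
    for _name, games, score in managers:
        scores[career_bucket(games)].append(score)
    return {bucket: mean(values) for bucket, values in scores.items()}

def reading(means):
    """Luck predicts the longest group sits nearest zero; skill predicts it is best."""
    longest = means["2,000+"]
    if longest == max(means.values()):
        return "skill"
    if abs(longest) == min(abs(m) for m in means.values()):
        return "luck"
    return "unclear"

# Hypothetical data: (name, career games, runs vs. expectation per 162 games).
managers = [
    ("A", 2500, 8.0), ("B", 2100, 4.0),
    ("C", 1500, 2.0), ("D", 1200, 0.0),
    ("E", 700, -1.0), ("F", 600, -3.0),
    ("G", 300, -6.0), ("H", 100, -2.0),
]

means = bucket_means(managers)
verdict = reading(means)
```

With these invented numbers the longest-tenured group comes out on top, which is the pattern skill would predict. A real version would also want some measure of how far apart the groups are relative to the noise, since a simple ranking cannot tell a large gap from a trivial one.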
So, immediately after discussing my dispute with Click, Goldman 1) says I don’t provide evidence (even though I do have some non-conclusive evidence), 2) brings up in-game tactics (which aren’t related to the debate), 3) doesn’t seem to think managers meaningfully help players when it comes to the human side of managing, but 4) then says they can meaningfully help individual players when it comes to the human side of managing. Huh?
There are other parts that just struck me as off. In fact, the time Goldman first mentioned my book seemed a bit off. Goldman said I wrote my book in response to Click’s Prospectus work.
Wait, what? That’s news to me. Phil Birnbaum first presented his work at a 2005 SABR convention, and that was my big inspiration, not Click. I’d already finished my first wave of research before coming across Click. Frankly, Click’s study was always something of a weird adjunct to work I’d already done. I knew I had to discuss it, but it was off my focus.
There’s one last bit dealing with Click. (Maybe this is too much on Click, but since Goldman implies Click inspired my book, I feel the need to clarify some points.) In my book, I noted Click’s study and commented: “I have an admitted bias: I believe managers matter. To convince me otherwise would take more than an equation, no matter how brilliant its math. I need a clear and coherent argument based on thoughts instead of double regression studies and metrics. It takes words, not numbers, to convince me otherwise.”
Goldman calls me out on this, saying “Jaffe tries to have it both ways, insisting that managers are about human interactions and not equations, but then offering his own equations in defense of managers.”
Here’s the thing: I wasn’t opposed to a mathematical formula for managers, just that I’d like more than just a formula by itself. Click just spent a page presenting his study, stating the results from his R-squared test and moving on. If you’re going to convince me that Click’s right, I need more than the math. It’s not either/or. It’s not math or an explanation, but math and an explanation. Click only gives the math, so I have trouble being convinced.
In my work, I tried to have both: give numbers, and give numbers I could explain and understand. To be fair, reading what Goldman quotes makes it sound like I was saying it’s either/or.
There are other points I could mention that don’t deal with the Click section, but some of it is ticky-tack stuff, and this section has gone on long enough. The Click stuff had me the most bewildered.
Dealing with uncertainty
One theme I found particularly jolting and that caught me off guard is Goldman’s belief that managers do have an impact on the way players perform. He even goes so far as to flatly state: “The human element of managing is everything” (italics in the original).
First, I agree. Second, the main appeal of the Birnbaum Database for me was that it put some (imperfect) numbers on the above. But that’s why I was so caught off guard. Steve Goldman is not the first person to criticize my work or methodology. But most others who do it would never write the sentence above about the human element.
I’ve come across two main responses to my book. Some refuse to believe the human element matters unless I can offer inarguable proof it does. I can’t, so they have no interest in my work. Others agree that it matters, and those people generally go along with the Birnbaum Database. With Goldman, we have someone who agrees that the human side of managing is vital but wants nothing to do with the Birnbaum Database.
Ultimately, I think it boils down to how much uncertainty you’ll tolerate. You can never perfectly isolate managers, but I think the Birnbaum Database gives a good idea of what impact they have (given a large enough sample size).
Look, Steve Goldman is a terrific writer who has consistently done excellent work for years. He makes some good points, but ultimately I still stand behind Evaluating Baseball’s Managers.
References & Resources
Steve Goldman’s chapter, “How Can We Evaluate Managers,” is on pages 232-256 of Extra Innings.
James Click’s chapter, “Is Joe Torre a Hall of Fame Manager,” runs from pages 139-156 in Baseball Between the Numbers. Click’s study, “Improving Individual Batter Performance,” is on pages 152-53 of that chapter.
Almost all the information from my book, Evaluating Baseball’s Managers, that Goldman disputes comes from the first chapter. That’s where the Birnbaum Database is debuted and where I mention Click. The parts on Casey Stengel come in Chapter Seven, on pages 190 to 195.