But I Regress…

by Dave Studeman
January 4, 2007

Do you know that thing that statisticians do called regression analysis? It’s when they look at two (or more) numbers to determine how closely correlated they are. To use a couple of examples I’ve seen recently, education is correlated with health and the presence of a Led Zeppelin bumper sticker is correlated with the likelihood of that vehicle containing a controlled substance like marijuana. I first learned regression analysis back in the days when you had to compute it by hand; now all you need is a computer with Excel. It’s a neat tool, perhaps a bit too easy to use for some.

But something’s always bugged me: why is it called regression analysis? Why isn’t it called correlation analysis? I mean, when you run a regression analysis, the main output is the correlation between the variables, right? So why is it called regression? Huh? Haven’t you wondered the same thing? Even once?

Okay, perhaps you’re not as geeky as I am. But you’ll be happy to know that I think I found the answer while reading a biography of the guy who invented regression analysis, Sir Francis Galton.

Galton was an amazing, quirky guy; one of those classic Victorian gentlemen with lots of time on their hands and lots of things to discover. He traveled the Nile and explored parts of Africa that hadn’t been seen by white men before. He published a book on survival in the wild, parts of which are still included in survival guides. He invented some silly things (one of my favorites: the gumption-reviver machine, which simply dripped water on you until you were thoroughly soaked) and some very important things (weather maps; the system for categorizing fingerprints still used today). Most of all, he counted things.

Galton was an obsessive counter. He determined a precise formula for preparing the perfect cup of tea. He counted beautiful women in different parts of England to deduce his own “beauty map.” And when his cousin, Charles Darwin, invented a little something called evolution, he threw himself into the task of counting hereditary traits.

He was convinced that things like criminal behavior, intelligence and genius were linked to heredity. His beliefs stood in contrast to those of many of his critics, who also cited environment. In fact, it was Galton who coined the phrase “nature/nurture” to describe the argument. Along the way, he decided the best thing to do would be to collect statistics on people and measure them. So he set up shop in a Public Health exhibition and asked people if they would like to be measured (height, armspan, breathing capacity, eyesight, etc.). After a year, he had collected measurements on over 10,000 people.

Statistics was still in its infancy, and Galton certainly didn’t have a computer back then. But he decided to analyze these numbers as best he could. He took the heights of 205 sets of adults and their children and (much to my delight) laid them out in a scatterplot graph. He saw that the points moved together: the taller the parents, the taller the children. However, the points didn’t line up perfectly.

So he drew a line that seemed to best fit the relationship between the points, and measured its slope. The result was two-thirds. As Galton thought it through, he realized that children were, on average, only two-thirds as “extreme” as their parents. He called the remaining one-third “regression.” Actually, he called it “regression to mediocrity,” which we have modified to regression to the mean.
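
To make the slope idea concrete, here is a minimal Python sketch. Galton drew his line by eye; the least-squares fit below is a modern stand-in, not his method. The function names, the 68-inch population mean and the 71-inch parents are made-up values for illustration only.

```python
def least_squares_slope(xs, ys):
    """Slope of the best-fit line through (x, y) points, by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance_x


def predict_child_height(parent_height, population_mean, slope=2 / 3):
    """Keep `slope` of the parents' deviation from the mean; the rest 'regresses'."""
    deviation = parent_height - population_mean
    return population_mean + slope * deviation


# Hypothetical example: parents 3 inches above a 68-inch average.
# With a slope of two-thirds, the child is expected to be about 2 inches above it.
print(round(predict_child_height(71.0, 68.0), 1))
```

A slope of 1 would mean children are exactly as extreme as their parents (no regression); a slope of 0 would mean parental height tells you nothing.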

This was actually a blow to Galton, who wanted to believe that heredity was absolute. But it was a huge step forward for the field of statistics. Galton went on to refine his technique, developing correlation coefficients and lots of other things. But the very first thing he noticed, the thing that the graph showed him, was regression. And that’s why we call it regression analysis. I think.

Regression to the mean is everywhere in baseball. Sophomore slump? Regression to the mean. Seattle’s 93-69 record after going 116-46 in 2001? Regression to the mean. Luke Scott’s Slugging Average in 2007? Regression to the mean.

Let me show you another graph. This graph plots batting average in 2005 and 2006. What I’ve done is split the 2005 batters into quartiles and then plot how those same batters performed in 2006. I used a minimum of 300 at bats in 2005 and included the player in both years if he played in 2006 at all. This is what regression to the mean looks like:

[Graph: batting average by 2005 quartile, 2005 vs. 2006]

As you can see, each one of the four quartiles moves closer to the average (that gray line) in 2006. The first quartile of batters batted .305 in 2005 and .294 in 2006. The lowest quartile batted .245 in 2005 and .263 in 2006. Each group moved closer to the mean.
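
The quartile exercise can be sketched with simulated hitters. Everything in this block is an assumption for illustration, not the data behind the graph: 400 made-up players, a .270 league average, a .015 spread in true talent, 500 at bats per season, and simple unweighted group averages.

```python
import random


def simulate_season(true_avg, at_bats, rng):
    """One season's batting average for a hitter whose true talent is `true_avg`."""
    hits = sum(rng.random() < true_avg for _ in range(at_bats))
    return hits / at_bats


def quartile_means(year1, year2):
    """Rank players by year 1, split into four groups (best first), and
    return a (year-1 mean, year-2 mean) pair for each group."""
    order = sorted(range(len(year1)), key=lambda i: year1[i], reverse=True)
    size = len(order) // 4
    pairs = []
    for q in range(4):
        group = order[q * size:(q + 1) * size]
        mean1 = sum(year1[i] for i in group) / len(group)
        mean2 = sum(year2[i] for i in group) / len(group)
        pairs.append((mean1, mean2))
    return pairs


rng = random.Random(2007)  # fixed seed so the run is repeatable
talents = [rng.gauss(0.270, 0.015) for _ in range(400)]  # assumed talent spread
year1 = [simulate_season(t, 500, rng) for t in talents]
year2 = [simulate_season(t, 500, rng) for t in talents]

for mean1, mean2 in quartile_means(year1, year2):
    print(f"{mean1:.3f} -> {mean2:.3f}")
```

No player's talent changes between the two simulated seasons, yet the top group still drops and the bottom group still rises. The groups were picked partly on luck, and the luck doesn't carry over.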

There is probably some selection bias in that lower quartile. The worst batters played less in 2006, which skews the overall results higher. So regression to the mean isn’t quite as strong as it appears in that lower quartile, but it’s still pretty strong.

What we’re really after is understanding the difference between a player’s “true talent” and the overall league average. The problem is that one year isn’t enough data to establish a player’s true talent. So let’s see what happens when we include two years’ batting averages (2004 and 2005) in the initial quartiles:

[Graph: batting average by 2004-05 quartile, 2004-05 vs. 2006]

If you compare the two graphs, you’ll see that the lines aren’t as steep when you have two years’ worth of data to begin with. In this case, the first quartile moved from .303 in 2004/05 to .295 in 2006, a little less than the one-year sample. The bottom quartile migrated from .252 to .262, a lot less than the one-year sample. If you have more years in your baseline, there is less regression to the mean.
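
One way to see why a longer baseline regresses less is to compare the spread of true talent with the spread of pure luck. The sketch below assumes a simple model that is not from the graphs above: batting average is a coin flip on every at bat, the league average is .270, and true talent varies with a standard deviation of .015. Under those assumptions, the share of an observed deviation you would expect to persist is talent variance divided by total variance.

```python
def retained_fraction(at_bats, talent_sd=0.015, league_avg=0.270):
    """Expected share of a hitter's deviation from the league average that
    persists, given `at_bats` of evidence (simple binomial-luck model)."""
    luck_variance = league_avg * (1 - league_avg) / at_bats
    talent_variance = talent_sd ** 2
    return talent_variance / (talent_variance + luck_variance)


# Roughly one, two and three seasons of at bats (assumed 500 per season).
for at_bats in (500, 1000, 1500):
    print(at_bats, round(retained_fraction(at_bats), 2))
```

The luck term shrinks as at bats pile up, so the retained share climbs toward 1. The exact figures depend entirely on the assumed talent spread.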

Why do I bring this up now? Because lots of people are producing forecasts for the 2007 season, and one of the first things every decent projection system will do is regress a player’s performance to the mean. In fact, there is one system that does nothing other than regress each player’s performance to the major league average as a basis for its 2007 projection. It’s called Marcel, because it’s so simple that even a monkey can do it. (Marcel, from Friends. Get it?)

You can read more about the Marcel system from its current caretaker, Tangotiger. Tango’s specific calculations are laid out in this thread; he essentially takes each player’s previous major league performance and regresses it to the mean. That’s it; no park adjustments, minor league stats or anything like that. The amount by which he regresses each player depends on how long the player has been in the majors. If he’s only been in the majors a year or two, Tango regresses his performance a lot. He also regresses a pitcher’s performance more strongly than a batter’s, because pitchers’ results are typically more random.
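
A common way to regress a player toward the mean is to pad his record with some number of league-average at bats. The sketch below shows that general idea only; it is not Tango’s actual Marcel formula. The function name, the .270 league average and the 600 at bats of “ballast” are placeholder assumptions.

```python
def regress_to_mean(hits, at_bats, league_avg, ballast_ab):
    """Blend a player's record with `ballast_ab` at bats of league-average hitting.
    A larger ballast, or a shorter track record, pulls harder toward the mean."""
    return (hits + league_avg * ballast_ab) / (at_bats + ballast_ab)


LEAGUE_AVG = 0.270  # assumed league batting average
BALLAST_AB = 600    # assumed amount of regression, not Marcel's real constant

# Two hypothetical .300 hitters: one with 300 career at bats, one with 1,500.
newcomer = regress_to_mean(90, 300, LEAGUE_AVG, BALLAST_AB)
veteran = regress_to_mean(450, 1500, LEAGUE_AVG, BALLAST_AB)
print(f"newcomer: {newcomer:.3f}, veteran: {veteran:.3f}")
```

Both players hit .300, but the newcomer’s projection lands closer to the league average because he has less history. Regressing pitchers more strongly would correspond to passing a larger `ballast_ab`.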

Chone/Sean Smith found that Marcel had a .66 correlation with batters’ actual performance last year. The best correlation he found was PECOTA’s, at .74. Nate Silver of Baseball Prospectus has worked tremendously hard to make PECOTA a cutting-edge system and has succeeded. But even his model only gains a smidgen of accuracy over Marcel. That is the power of simple regression to the mean.

You can download the 2007 Marcel projections from Tango’s site. Just for the heck of it, I downloaded them and compared them to each player’s 2006 performance. Here is a list of the batters who are most likely to see an increase in their batting average, based on Marcel and regression to the mean (minimum 300 at bats and a .240 batting average in 2006):

Last      First       06BA     mBA    Diff
Gonzalez  Luis A.     .242    .285    .043
Cantu     Jorge       .249    .281    .032
Izturis   Cesar       .245    .276    .031
Ellis     Mark        .249    .278    .029
Mueller   Bill        .252    .279    .027
Duffy     Chris       .255    .281    .026
Kubel     Jason       .241    .266    .026
White     Rondell     .246    .271    .025
Crisp     Coco        .264    .289    .025
Casey     Sean        .272    .296    .024
Lopez     Javy        .251    .276    .024
Peralta   Jhonny      .257    .280    .024

In general, you won’t see many predicted improvements for first- or second-year players, because there’s not enough history to regress to. But Cleveland fans should feel good about seeing Jhonny Peralta on this list.
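
The Diff column is just the Marcel projection minus the 2006 figure, sorted from largest to smallest. Here is a small sketch using five rows copied from the table above; the variable names are illustrative, and because the published averages are rounded, a recomputed difference can be off by .001 from the printed one.

```python
# (player, 2006 batting average, Marcel 2007 projection), from the table above
rows = [
    ("Ellis, Mark", 0.249, 0.278),
    ("Gonzalez, Luis A.", 0.242, 0.285),
    ("Mueller, Bill", 0.252, 0.279),
    ("Cantu, Jorge", 0.249, 0.281),
    ("Izturis, Cesar", 0.245, 0.276),
]

# Attach the projected change, then sort with the biggest expected gain first.
ranked = sorted(
    ((name, ba06, mba, mba - ba06) for name, ba06, mba in rows),
    key=lambda row: row[3],
    reverse=True,
)

for name, ba06, mba, diff in ranked:
    print(f"{name:<20}{ba06:.3f}  {mba:.3f}  {diff:+.3f}")
```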

Here’s a list of players whose batting averages are most likely to decline next year:

Last      First     06BA     mBA    Diff
Redmond   Mike      .341    .291   -.050
Scott     Luke      .336    .292   -.044
Bard      Josh      .333    .293   -.041
Ozuna     Pablo     .328    .290   -.038
Ward      Daryle    .308    .269   -.038
Cirillo   Jeff      .319    .281   -.038
Jones     Chipper   .324    .286   -.037
Helms     Wes       .329    .293   -.036
Coste     Chris     .328    .294   -.034
Jeter     Derek     .343    .311   -.033

You shouldn’t really be surprised by any of the players on this list. Let’s switch to On-Base plus Slugging Average (OPS). Here’s a list of players most likely to improve next year by regressing to the mean:

Last     First      06OPS    mOPS    Diff
Clark    Tony        .643    .826    .183
Gonzalez Luis A.     .625    .764    .139
Guillen  Jose        .674    .800    .126
LaRue    Jason       .663    .763    .101
Peralta  Jhonny      .708    .803    .095
Lee      Derrek      .842    .934    .092
Cantu    Jorge       .699    .789    .090
Lopez    Javy        .683    .767    .084
Hermida  Jeremy      .700    .782    .082
Niekro   Lance       .673    .754    .082
Crisp    Coco        .702    .783    .081
Varitek  Jason       .725    .806    .080
Navarro  Dioner      .687    .767    .080

Here’s a list of players most likely to decline:

Last       First      06OPS    mOPS    Diff
Scott      Luke       1.047    .872   -.175
Ward       Daryle      .926    .782   -.144
Ross       Dave        .932    .788   -.144
Helms      Wes         .965    .831   -.134
Dye        Jermaine   1.006    .879   -.128
Thome      Jim        1.014    .900   -.114
Beltran    Carlos      .982    .875   -.107
Anderson   Marlon      .866    .765   -.102
Bard       Josh        .926    .826   -.100
Saenz      Olmedo      .927    .828   -.099

Is Marcel saying that each of these players will regress to the mean? Absolutely not. Some of them won’t. But enough of them will regress to the mean to validate the entire approach. Marcel doesn’t predict breakout seasons; by definition, those are nearly unpredictable. It predicts what you can most likely expect from a player.

Projection systems start with regression to the mean, but they differ significantly in what they regress to. Marcel simply regresses to the overall major league average (with one exception for pitchers in the American League), while PECOTA regresses to the average of similar players (based on height, weight and other things). As another example, this thread includes a fine discussion of how to regress players who have only been in the majors a year or two.

Sir Francis Galton would be proud of the way baseball fans and analysts have incorporated regression to the mean in their thinking. I can also think of a few players who could use that gumption-reviver machine.

References & Resources
The biography of Galton is called Extreme Measures: The Dark Visions and Bright Ideas of Francis Galton by Martin Brookes. The New Yorker reviewed the book a couple of years ago.

Correlation and regression analysis were a tremendous contribution to mankind, but Galton’s other legacy is the field of eugenics. Galton envisioned eugenics as a utopian way to build the best human species. In his conception, eugenics was relatively innocent and naive. Adolf Hitler turned eugenics into a nightmare.

I want to credit John Burnson’s 2006 Graphical Pitcher for the graphical inspiration of regression to the mean. John used it to show the extreme regression to the mean of home runs per fly ball among pitchers.


Dave Studeman was called a "national treasure" by Rob Neyer. Seriously. Follow his sporadic tweets @dastudes.
