A few weeks after building the database, I was at the SABR Analytics conference in Phoenix and found I had a problem. A conversation with Mitchel Lichtman, co-author of The Book: Playing the Percentages, turned toward the Rockies, then toward the ideas I had for weather research. He said I should check the data because they might be unreliable. He had heard the weather readings might not even be taken from the stadium itself.
So research into the validity of the data began. In the Retrosheet database, box score weather was recorded in 10 to 30 percent of the games from 1950 to 1987. Beginning in 1988, there were attempts to report it more regularly, with individual years like 1992 reporting a temperature for each game. However, even as recently as 1994, 24.7 percent of major league games didn’t have weather data. In 1996, only six percent of the games didn’t have a temperature, and since 1998, only a Toronto Blue Jays game on April 13, 1999, is missing a box score temperature.
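For anyone who wants to reproduce that kind of coverage check, here is a minimal sketch in Python. It assumes Retrosheet-style event file lines, where each game starts with an `id,` record and the temperature appears in an `info,temp,` record. It also assumes that a temperature of 0 or less means "not recorded," which is worth confirming against Retrosheet's own documentation. The function name and the sample lines are made up for illustration and are not real game records.

```python
from collections import defaultdict


def missing_temp_share(lines):
    """Return {season: fraction of games with no usable temperature}.

    Assumes Retrosheet-style records: "id,<GAMEID>" opens a game and
    "info,temp,<degrees>" carries the box score temperature.
    Assumption: a value of 0 or less means the temperature is unknown.
    """
    games = defaultdict(set)            # season -> all game ids seen
    games_with_temp = defaultdict(set)  # season -> game ids with a temperature
    game_id = None
    season = None
    for raw in lines:
        parts = raw.strip().split(",")
        if parts[0] == "id" and len(parts) >= 2:
            game_id = parts[1]
            # Game ids look like COL199304090: team, then a four-digit year.
            season = int(game_id[3:7])
            games[season].add(game_id)
        elif parts[:2] == ["info", "temp"] and len(parts) >= 3 and game_id:
            value = parts[2]
            if value.lstrip("-").isdigit() and int(value) > 0:
                games_with_temp[season].add(game_id)
    return {
        s: 1 - len(games_with_temp[s]) / len(ids)
        for s, ids in games.items()
    }


# Hypothetical sample lines, not real box score data.
sample = [
    "id,COL199304090", "info,temp,0",
    "id,COL199304100", "info,temp,61",
    "id,COL200704020", "info,temp,55",
]
share = missing_temp_share(sample)
```

Run against a full season of event files, the returned fractions would be the per-year "missing weather" percentages described above.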
I reached out to Mark Pankin from Retrosheet, and he confirmed Retrosheet weather data are taken from the official box score weather. I knew Baseball-Reference also had weather data in its game logs and invaluable Play Index tool, so I reached out to site founder Sean Forman to see if he had some insight, but he also was unaware of where the box score weather readings were taken.
Undaunted, I then contacted the Rockies, who helpfully directed me to the official scorers for the games at Coors Field. They pointed out what looked to me like a weather vane at the top of the center field bleachers and said it had been put there “at least five years ago” by a company called WeatherBug.
I also learned that the weather as reported on the Coors Field scoreboard is actually from Denver International Airport, which is outside the Denver “valley” over 20 miles away. Now I know the scoreboard temperature shown in Coors Field isn’t taken at Coors Field…but the box score weather might still be good.
I reached out to WeatherBug, which is now known as Earth Networks, and got details. Before the 2007 season, MLB and WeatherBug were able to install weather instruments, known as anemometers, in or near every ballpark. Besides temperature, wind speed and direction, the anemometers also take readings on humidity, dew point, and a number of other weather-related factors.
Of the 30 teams, 25 had the anemometers installed within the park. The five parks that did not are Wrigley Field (Chicago Cubs), Guaranteed Rate Field (Chicago White Sox), Citi Field (New York Mets), Citizens Bank Park (Philadelphia Phillies) and AT&T Park (San Francisco Giants). As of now, SunTrust Park, the new home of the Atlanta Braves, does not have readings taken there, though the data were captured at their old home of Turner Field. Below is a list of where the reading is taken from for each team.
| Team | State/Province | Reading location |
|---|---|---|
| Angels | CA | Angel Stadium of Anaheim |
| Astros | TX | Minute Maid Park |
| Blue Jays | ON | Rogers Centre |
| Cubs | IL | The Cubby Bear |
| Mariners | WA | KING5 at SAFECO Field |
| Mets | NY | IS 61 Leonardo da Vinci |
| Orioles | MD | Oriole Park at Camden Yards |
| Phillies | PA | Lincoln Financial Field |
| Rangers | TX | Globe Life Park in Arlington |
| Red Sox | MA | Fenway Park |
| Reds | OH | Great American Ball Park |
| White Sox | IL | St. Jerome School |
I’ll admit I got a chuckle out of the fact the Cubs’ official box score weather comes from atop a tavern while the White Sox get their weather from the roof of a school. The Phillies get their weather from the Eagles’ football stadium across the street, while the Giants and Mets get their weather readings about two miles away with a number of buildings in between that might affect temperature or wind readings.
Overall, the data for 24 of the stadiums since 2007 are reliable, and Turner Field is valid for 2007-2016, in terms of readings being taken using the same instruments installed and maintained by the same company. Prior to 2007, how the data were recorded varies on a park-by-park basis. Most stadiums used airports or nearby military bases since they have precise weather stations that frequently update out of necessity. Just like the Coors Field scoreboard, these weather stations are often used as the “official” city weather in national weather forecasts, including television and radio broadcasts, even if they are located far from the metropolitan area of the city.
Even for one team in one ballpark, though, the weather station can change. The Rockies’ inaugural season at Mile High Stadium in 1993 was played before Denver International Airport opened, so the weather in its box scores would have been taken from a different weather station than the subsequent 1994 season. Similarly, I have to assume the weather sources for other ballparks also have changed multiple times since box scores started reporting weather. Prior to 2007, the box score weather data may be unreliable from year to year without knowing what weather station was used for specific time periods.
However, the temperature information in box scores since 2007 should be pretty similar to the conditions in the ballpark even if the anemometer is not located within the ballpark. Wind information is much more problematic, even if readings are taken from inside the ballpark. As Andrew Perpetua showed in a previous article at The Hardball Times on modelling weather, wind speed and direction measured on the roof of a ballpark may differ from how the wind plays out on the field. Furthermore, wind can change with different ballpark features and may not be constant over time, as physicist David Kagan has also pointed out here at THT.
Keep in mind that the box score weather is taken at the start of the game, so conditions later in the game, such as after the sun sets, can differ, and the size of that change can vary by time of year. It’s also possible for a game that starts with calm winds to turn gusty later on. According to Earth Networks, hourly data with a six-month history are available, with more information possible by special request. However, even if it were possible to obtain weather by the minute, it would be difficult to sync the exact conditions to the split second a ball was batted. The box score data are usable to an extent, but a lot of caveats apply.
Still, the data can suggest some insights and avenues for future research. Since Retrosheet data are available down to the individual batted-ball event, it is possible to look at frequencies of hit type and at walk and strikeout rates by game-time temperature, and one can make some looser assumptions on wind data. As an example, in an article I wrote for Purple Row, I found the Colorado Rockies have a higher winning percentage at Coors Field when the game-time temperature is between 45 and 54 degrees or above 69 degrees, but tend to lose when it falls in between.
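The temperature-band comparison can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not the method used in the Purple Row article: the band edges are taken from the thresholds mentioned above, the function names are hypothetical, and the sample games are invented, so the numbers it produces say nothing about the Rockies.

```python
def temperature_band(temp_f):
    """Map a game-time temperature (degrees Fahrenheit) to a band label."""
    if temp_f < 45:
        return "under 45"
    if temp_f <= 54:
        return "45-54"
    if temp_f <= 69:
        return "55-69"
    return "70 and up"


def win_pct_by_band(games):
    """Return {band: home winning percentage}.

    `games` is an iterable of (game_time_temp_f, home_team_won) pairs.
    """
    tally = {}  # band -> (wins, games played)
    for temp_f, home_won in games:
        label = temperature_band(temp_f)
        wins, total = tally.get(label, (0, 0))
        tally[label] = (wins + int(home_won), total + 1)
    return {label: wins / total for label, (wins, total) in tally.items()}


# Invented sample: (temperature, did the home team win?). Not real results.
sample_games = [
    (40, False),
    (48, True), (52, True),
    (60, False), (65, True),
    (75, True), (82, False),
]
win_pct = win_pct_by_band(sample_games)
```

The same grouping works for hit-type frequencies or walk and strikeout rates: swap the win flag for the event of interest and keep the temperature bands unchanged.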
I also saw that the Rockies tend to get a more extreme benefit from the warmer temperatures in terms of run scoring than their opponents at Coors Field, outscoring their opponents by about a run per game once the game-time temperature gets above 80 degrees. Still more research needs to be done on why that’s the case, but box score weather can give an indication of new areas to explore.
This is just a sampling of some of the ways the data can be analyzed. The box score weather data is also published online through MLB’s At Bat application and anywhere that tracks real-time box scores. This led to a fun game on the recent Rockies home stand, which featured a variety of weather conditions. Once the temperature was announced, I’d guess who was likely to win and how many runs would be scored. Weather being as finicky as it is, it’s definitely a leap of faith to make guesses on an individual game’s outcome, but it’s been a fun one to take.
I hope more people will probe the data now that they have a better idea of its validity, and that doing so will open more avenues of research. Meanwhile, remember that for most ballparks, the game-time box score weather might be more accurate than what’s actually on the scoreboard.