Tuesday, May 6, 2014

Part 1: How to Predict Postseason Success in Baseball



Just how did the Red Sox get past the Rays and Tigers in 2013?

Introduction

“They got hot at the right moment.” “They’re just lucky they peaked in October.” “It was just meant to be.”

These are all things that have been said about recent World Series winners. Ever since Major League Baseball switched to its current three-division system (and even more so since it added a second wild card in 2012), it has become more difficult for teams with the best records to win it all. This is because, probably more than in any other sport, baseball’s playoffs are very different from its regular season.

Baseball’s 162-game regular season is a marathon of endurance and mental toughness. The playoffs, on the other hand, are a sprint, and the winner is often a team that by all traditional metrics (such as wins and winning percentage) is inferior. However, it is extremely difficult to predict when such a team will go on a World Series run. Even though there are several metrics that measure a player’s overall value to his team (such as WAR, or Wins Above Replacement), there are few statistics or groups of statistics that reliably predict postseason success.

Michael Lewis’s Moneyball introduced the importance of on-base percentage (OBP) to many baseball fans, but I have determined through a simple regression analysis that this statistic alone does not correlate with team postseason success. The general consensus among fans, commentators, and analysts is that dominant pitching, particularly starting pitching, is the key to advancing far in the playoffs.

I agree that the most important variable on a playoff team is its starting pitching, but pitching alone doesn’t win you the World Series either. In the 2013 ALCS, the Boston Red Sox beat the Detroit Tigers, a team that had what was considered to be the most dominant starting rotation in baseball. This came after the Red Sox beat another team with excellent pitching, the Tampa Bay Rays, in the previous series. In a sport that has metrics to measure everything from speed on the base paths to the strength of an outfielder’s arm, there is no accepted metric that can accurately and consistently predict postseason success based on regular season performance. My goal was to see if I could find such a measure.

This is not a simple task. In an October 2013 article for ESPN’s Grantland, Rany Jazayerli wrote, “Trying to find the magic formula for postseason success has been the sabermetric community's version of trying to turn lead into gold: Many have tried, but none have entirely succeeded.” I first came up with the idea for this project after angrily watching the New York Yankees, consistently one of the best teams in the league over the past decade, lose again and again in the postseason (often in the division series).

Most fans and analysts pointed to the Yankees’ lack of quality starting pitchers after 2003 as the reason they couldn’t win in the playoffs after winning four of five World Series from 1996 to 2001. However, the Atlanta Braves, led by their dominating pitching trio of Greg Maddux, Tom Glavine and John Smoltz, had even more trouble in the postseason, winning only one World Series title from 1991 to 2005, despite finishing first in their division in every completed season over that span (fourteen straight titles). It amazed me how these teams could consistently dominate their respective divisions and leagues for 162 games, only to come out flat in a five- or seven-game series. It made me wonder if there were hints in a playoff team’s regular season statistics that could predict a successful postseason run.

For this research, I have defined postseason success as “playoff value,” or PV. A PV of 1 means losing in the Division Series, 2 means losing in the Championship Series, 3 means losing in the World Series, and 4 means winning the World Series. To find statistics that can predict postseason success, I ran hundreds of linear regression models with PV as the outcome variable and many different predictors.
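As a minimal sketch, the PV scale above can be written as a simple lookup. The round labels and the function name here are hypothetical, chosen only for illustration; they are not taken from the data set used in this research.

```python
# Hypothetical encoding of the "playoff value" (PV) scale described above.
# The string labels are illustrative names, not fields from the original data.
PV_BY_RESULT = {
    "lost_division_series": 1,
    "lost_championship_series": 2,
    "lost_world_series": 3,
    "won_world_series": 4,
}


def playoff_value(result: str) -> int:
    """Return the PV for a team's final postseason result."""
    try:
        return PV_BY_RESULT[result]
    except KeyError:
        raise ValueError(f"unknown postseason result: {result!r}") from None


# The 2013 American League results mentioned in the introduction.
results_2013 = {
    "Red Sox": "won_world_series",
    "Tigers": "lost_championship_series",
    "Rays": "lost_division_series",
}
pv_2013 = {team: playoff_value(result) for team, result in results_2013.items()}
print(pv_2013)  # {'Red Sox': 4, 'Tigers': 2, 'Rays': 1}
```

Coding the outcome as an ordered number like this is what allows it to be used as the outcome variable in a linear regression.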

Ability to Drive in Runs Without Hitting Home Runs
 
Hypothesis

For my research, I decided to focus mainly on regular season batting statistics of playoff teams from the past ten years (2003-2012). I did this for a few reasons. First, as previously mentioned, it is widely accepted that good pitching beats good hitting in the playoffs. However, I think this only holds true when looking at conventional measures of “good” hitting, such as batting average and runs scored. It could be more important to look at team batting patterns and tendencies instead. My hypothesis is that teams with simpler batting approaches, meaning those that emphasize contact and putting the ball in play and deemphasize over-swinging to try to hit home runs, will be more successful in the postseason. The reasoning is that postseason pitching is so dominant (the number of off days in the postseason means that teams usually use only their three or four best starters) that a team might get only one or two chances a game to start a rally or drive in runs. And because the top pitchers in the playoffs are usually less likely to give up home runs, it is important that teams are able to drive in runs without hitting home runs when the opportunity arises.

Results

I started by using the stepwise regression function in R, with which I predicted PV from the original 38 statistics I gathered. These statistics ranged from simple, such as hits and home runs, to advanced, such as weighted on-base average (wOBA) and weighted runs created plus per 600 plate appearances, to contact-based, such as ground ball percentage and home run to fly ball ratio. The stepwise function took all possible predictors and entered and removed them from the regression model until all predictors in the model had a p-value of less than .1.

The stepwise function gave me the following: PV ~ H + HR + BABIP + GBFB + LDp + HRFB + BUH + Swingp + Contactp. This means that playoff value could be predicted by the combination of hits, home runs, batting average on balls in play, ground ball to fly ball ratio, line drive percentage, home run to fly ball ratio, bunt hits, swing percentage and contact percentage. The summary of this model showed that it was statistically significant, with a p-value of .038.

I was not surprised by a few aspects of the formula. Teams with higher LDp (line drive percentage) and GBFB (ground ball to fly ball ratio) usually have simpler hitting approaches, since higher rates of line drives and ground balls suggest that hitters aren’t over-swinging or trying only to hit home runs. However, it is very difficult to interpret the individual coefficients because of multicollinearity in the model.

This multicollinearity is caused by the high correlation between the variables in this model. For example, teams that have more hits usually also have more home runs and a higher batting average on balls in play. After trying several other models that included variables I thought would be significant (such as contact percentage, line drive percentage and zone contact percentage), I was still unable to find another model that was statistically significant, so I came up with another idea.
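One rough way to spot this kind of overlap is to compute pairwise correlations between predictors and flag the pairs that move together. The sketch below is only an illustration of that idea, not the analysis used here. The function names, the 0.8 cutoff, and the numbers are all assumptions; the toy values are invented for five imaginary teams and are not real statistics.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold.

    The 0.8 default is an arbitrary illustrative cutoff, not a standard.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented values for five imaginary teams; NOT real team statistics.
toy_predictors = {
    "H": [1400, 1450, 1500, 1550, 1600],   # hits
    "HR": [150, 160, 175, 180, 200],       # home runs
    "BUH": [12, 30, 8, 25, 15],            # bunt hits
}

for name_a, name_b, r in correlated_pairs(toy_predictors):
    print(f"{name_a} and {name_b} are highly correlated (r = {r:.2f})")
```

In this toy data only hits and home runs are flagged, which mirrors the example above: when two predictors rise and fall together, a regression cannot cleanly separate their individual effects.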

Be sure to check back tomorrow for Part 2 of Andrew's analysis.
