Sabermetrics: A Science
In this series I will describe the relatively new phenomenon in baseball called sabermetrics. In the 2000s, sabermetrics really started to gain popularity. Front office decision makers and hardcore, modern fans have started to pay attention to these stats.
Why aren't sabermetrics well known by casual fans? Well, old habits die hard. Everyone loves batting average and home runs and ERA. However, these simple stats can be influenced by teammates, the ballpark, and worst of all, luck. I am taking on the responsibility of teaching you about the improved statistical analysis called sabermetrics.
Installment V: PECOTA
To start off this series of blogs, I wrote an introductory article. I basically just said what I would be writing about, nothing special. I remember that in one of the comments on the initial blog, TampaGeek said that he was interested in two stats, one of them being PECOTA. PECOTA is one of the more popular metrics, but I, personally, did not even know what it stood for until I started my research.
Having said that, PECOTA stands for Player Empirical Comparison and Optimization Test Algorithm. Yeah, it sounds pretty freaking complicated. I am not looking forward to the calculation of this metric so far. Interestingly, PECOTA is named after Bill Pecota, who was a journeyman in the MLB for a while. Pecota is basically the poster boy for PECOTA.
Baseball Prospectus is one of the best sites on the internet for your sabermetrics needs, which I am sure you have gained since reading this series. You're welcome. Anyway, BP's Nate Silver developed this metric in 2003. It was introduced in the 2003 book, Baseball Prospectus. Baseball Prospectus has owned PECOTA ever since it was publicly announced; however, Silver managed it until 2009, when Baseball Prospectus took complete control of the projections.
BP has the most information on this formula. In fact, it has a massive glossary for statistics and metrics. There are also categories for this glossary. PECOTA gets its own category. I have a feeling this will have to be a very expansive article. Since 2003, they have made a number of improvements to PECOTA, such as including the state of the current MLB market and including platoon splits between the first and second halves of the season.
Interesting fact: A number of projection metrics have been inspired by PECOTA, such as the NFL's KUBIAK (named after Gary Kubiak), SCHOENE (Russ Schoene) in the NBA and VUKOTA (Mick Vukota) for the NHL.
Sigh. Time for the calculation. Prepare yourself.
"...all that PECOTA is really doing is carrying that to the logical extreme, where there is essentially a separate career path for every player in major league history. The comparability scores are the mechanism by which it picks and chooses from among those career paths."- Nate Silver
Ohhhh, that's what it does. Wait, what? Okay, so what Silver is saying (in English) is that he is trying to logically predict something that is different for every single player: the player's future. It seems pretty hard to do, but PECOTA is generally regarded as the most accurate way to predict a player's future.
Now, the calculation is quite complicated. It involves a large number of factors, but here are the three major categories used:
I.) Comparable Players:
Comparable players are the single biggest determining factor in calculating PECOTA. Baseball Prospectus has a conglomeration of 20,000 seasons since World War II. That's not all. "Comparable players" could almost have its own blog. It has four of its own categories to figure it out.
1.) Production - This is a simple one, the most basic. Like we do here on FanNation, this uses basic statistics such as batting average and slugging percentage for batters and strikeouts per nine innings and WHIP for pitchers.
2.) Experience/Career Duration - The number of seasons the player has played, their age, their number of at bats, their number of innings pitched. Again, this is pretty basic comparative stuff.
3.) Attributes - This is where the PECOTA comparisons differ from FanNation ones because it is trying to predict the future, not decide who is the better player. There is a comparison between height, weight and handedness.
4.) Position/Role - Again, pretty simple way to compare players. It further winnows out dissimilar players by picking out the players who play the same position or pitchers who have the same role (starter, closer, set-up man, etc.).
Finding comparable players is not difficult. Well, for them it isn't, considering the expansive database of around 20,000 seasons. Each player has a unique set of comparables. These are found by the ways shown above, by means of similarity scores.
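For the curious, here is a rough sketch of how a similarity score could work. To be clear, this is NOT Baseball Prospectus' actual formula, which isn't public in this form. The fields, the scale values and the 0-to-100 scoring are all my own assumptions, and the players are made up. It just shows the general idea of the four categories above: filter by position/role, then measure how far apart two seasons are on production, experience and attributes.

```python
import math
from dataclasses import dataclass


@dataclass
class PlayerSeason:
    name: str
    position: str      # position/role, e.g. "SS" or "closer"
    age: int           # experience
    avg: float         # production: batting average
    slg: float         # production: slugging percentage
    height_in: float   # attributes
    weight_lb: float


# Hypothetical scales: the gap in each field that counts as one "unit" apart.
SCALES = {"age": 2.0, "avg": 0.020, "slg": 0.040, "height_in": 3.0, "weight_lb": 20.0}


def similarity(target, other):
    """Score from 0 to 100; 100 means identical on every compared field."""
    if target.position != other.position:
        return 0.0  # position/role acts as a hard filter
    dist = math.sqrt(sum(((getattr(target, f) - getattr(other, f)) / s) ** 2
                         for f, s in SCALES.items()))
    return 100.0 / (1.0 + dist)


def top_comparables(target, pool, n=3):
    """Return the n most similar seasons as (score, season) pairs, best first."""
    scored = [(similarity(target, p), p) for p in pool]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]


# Made-up players, purely for illustration.
target = PlayerSeason("Target", "SS", 27, 0.280, 0.450, 72, 190)
pool = [
    PlayerSeason("Comp A", "SS", 28, 0.275, 0.440, 73, 195),
    PlayerSeason("Comp B", "SS", 33, 0.240, 0.380, 75, 220),
    PlayerSeason("Comp C", "1B", 27, 0.280, 0.450, 72, 190),
]
for score, season in top_comparables(target, pool):
    print(f"{season.name}: {score:.1f}")
```

Notice that Comp C never shows up, even though his numbers are identical to the target's, because he plays a different position. The real system works from a pool of around 20,000 seasons rather than three.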
This probability is not as simple as what we normally think of when we think of "probability". Instead, it is the more complicated probability distribution. A probability distribution does not spit out a single number; it assigns a likelihood to each value across a range of possible outcomes. By doing this, PECOTA creates seven different predicted stat lines, each with its own level of confidence.
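Here is a small sketch of how a range of outcomes can turn into seven stat lines. I am assuming the seven lines sit at the 10th, 25th, 40th, 50th, 60th, 75th and 90th percentiles (the levels are not spelled out here, so treat that as my assumption), and the OPS numbers are invented. The idea is simply to take what a player's comparables did next and read off values at each level.

```python
# Assumed percentile levels for the seven predicted stat lines.
LEVELS = (10, 25, 40, 50, 60, 75, 90)


def percentile(sorted_vals, pct):
    """Linearly interpolated percentile of an already sorted list (pct in 0..100)."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def forecast_lines(comparable_outcomes):
    """Map each percentile level to a projected value."""
    vals = sorted(comparable_outcomes)
    return {level: percentile(vals, level) for level in LEVELS}


# Made-up next-season OPS figures for one player's comparables.
sample_ops = [0.650, 0.702, 0.715, 0.731, 0.748, 0.760, 0.779, 0.801, 0.825, 0.870]
for level, value in forecast_lines(sample_ops).items():
    print(f"{level}th percentile: {value:.3f}")
```

The 50th percentile line is the "most likely" forecast, while the 10th and 90th give you the disaster and breakout scenarios. A real system would presumably weight closer comparables more heavily; this sketch treats them all equally.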
Ah, my favorite part of PECOTA! The use of sabermetrics! You know a stat is successful when it uses some of the most complicated metrics within its own calculation. Silver looked all the way back to 1946 and looked at how players' seasons turned out. He created a very complicated formula based on which stats best predict future performances. For example, Silver, like most knowledgeable baseball statisticians, is aware that earned run average is just a terrible indicator of future performance. He uses better stats like Fielding-Independent Pitching (FIP) and BABIP.
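Since FIP and BABIP came up, here is a quick sketch of how those two are commonly calculated. These are the standard public formulas, not anything specific to PECOTA, and how PECOTA weighs them internally is not shown here. The FIP constant changes a little every season to line FIP up with league ERA; the 3.10 default below is just a typical placeholder, and the pitcher's numbers are made up.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding-Independent Pitching.

    ip must be true innings (e.g. 200 1/3 innings is 200.333, not 200.1).
    constant is a league-wide adjustment that varies by season (assumed 3.10 here).
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def babip(h, hr, ab, k, sf):
    """Batting average on balls in play: hits that stayed in the park
    divided by at bats that ended with the ball in play."""
    balls_in_play = ab - k - hr + sf
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (h - hr) / balls_in_play


# Made-up pitcher season, purely for illustration.
print(f"FIP:   {fip(hr=20, bb=50, hbp=5, k=200, ip=200.0):.2f}")
print(f"BABIP: {babip(h=150, hr=20, ab=550, k=100, sf=5):.3f}")
```

FIP only looks at home runs, walks, hit batters and strikeouts, which are the things a pitcher controls without help from his fielders. That is exactly why it holds up better than ERA from one year to the next.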
You will not be able to calculate PECOTA on your own, unless you somehow have access to Baseball Prospectus' ridiculously expansive player database. Silver has said himself that "it takes a village to run a PECOTA." You simply can't do this by yourself.
What it tells us:
Jeez. This is supposed to be the long section, not calculation. I guess this will just be one of my longer entries. Well, anyway, what PECOTA tells us is pretty obvious. It is a projection metric, therefore its purpose is to predict what a player will do, and it does so quite accurately, especially when compared to other metrics that serve the same purpose.
It is widely accepted as the best projection stat and it is great that Baseball Prospectus owns it, rather than an individual team. It predicts the next five seasons for an aging batter or pitcher, a batter or pitcher in their prime, or even a Minor Leaguer (it has a database of only Minor League batters as well). The fact that it includes things like height and weight is brilliant. Players can get overweight and therefore slow down and smaller players are more prone to breaking down late.
Of course, it is not a foolproof system as it is impossible to accurately predict a season, but it has been spot on in some situations. Here are the projected PECOTA division winners for this season:
NL East: Philadelphia Phillies (85-77) or Atlanta Braves (85-77)
NL Central: Saint Louis Cardinals (86-76)
NL West: Los Angeles Dodgers (86-76)
NL Wild Card: Phillies or Braves
AL East: New York Yankees (93-69) or Boston Red Sox (93-69)
AL Central: Minnesota Twins (83-79)
AL West: Oakland Athletics (85-77)
AL Wild Card: Yankees or Red Sox
These projections change very often, so you will find different projected standings than these out there. Keep in mind that PECOTA projections are not always accurate. Last year, they were off by a pretty large amount, but in '08 they were much closer. Here were the 2009 projections:
Projected AL East:
2. New York (97-65)
3. Tampa Bay (92-70)
4. Toronto (81-81)
Actual AL East:
1. New York (103-59) - 6 games off
2. Boston (95-67) - 3 games off
3. Tampa Bay (84-78) - 8 games off
4. Toronto (75-87) - 6 games off
5. Baltimore (64-98) - 12 games off
Projected AL Central:
4. Kansas City (75-87)
Actual AL Central:
1. Minnesota (87-76) - 8 games off
2. Detroit (86-77) - 8 games off
3. Chicago (79-83) - 5 games off
4. Cleveland (65-97) - 19 games off
5. Kansas City (65-97) - 10 games off
Projected AL West:
2. Los Angeles (79-83)
Actual AL West:
1. Los Angeles (97-65) - 18 games off
2. Texas (87-75) - 14 games off
3. Seattle (85-77) - 15 games off
4. Oakland (75-87) - 7 games off
Projected NL East:
1. New York (93-69)
Actual NL East:
1. Philadelphia (93-69) - 5 games off
2. Florida (87-75) - 13 games off
3. Atlanta (86-76) - 2 games off
4. New York (70-92) - 23 games off
5. Washington (59-103) - 18 games off
Projected NL Central:
3. St. Louis (80-82)
Actual NL Central:
1. St. Louis (91-71) - 11 games off
2. Chicago (83-78) - 12 games off
3. Milwaukee (80-82) - 3 games off
4. Cincinnati (78-84) - 1 game off
5. Houston (74-88) - 8 games off
6. Pittsburgh (62-99) - 3 games off
Projected NL West:
2. Los Angeles (84-78)
3. San Francisco (79-83)
5. San Diego (74-88)
Actual NL West:
1. Los Angeles (95-67) - 11 games off
2. Colorado (92-70) - 14 games off
3. San Francisco (88-74) - 9 games off
4. San Diego (75-87) - 1 game off
5. Arizona (70-92) - 20 games off
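The "games off" figures are just the gap between projected wins and actual wins. Here is a small sketch that recomputes them and averages the misses. It only uses the ten teams for which both a projected and an actual 2009 record appear in the lists above, so the average says nothing about the other twenty teams. The "(AL)" and "(NL)" labels are mine, added to tell the two New Yorks and the two Los Angeleses apart.

```python
# (projected wins, actual wins) for 2009, taken from the lists above.
WINS_2009 = {
    "New York (AL)": (97, 103),
    "Tampa Bay": (92, 84),
    "Toronto": (81, 75),
    "Kansas City": (75, 65),
    "Los Angeles (AL)": (79, 97),
    "New York (NL)": (93, 70),
    "St. Louis": (80, 91),
    "Los Angeles (NL)": (84, 95),
    "San Francisco": (79, 88),
    "San Diego": (74, 75),
}


def games_off(projected_wins, actual_wins):
    """Absolute miss, in wins, between a projection and the real result."""
    return abs(projected_wins - actual_wins)


def mean_games_off(records):
    """Average absolute miss across all teams in the mapping."""
    misses = [games_off(proj, actual) for proj, actual in records.values()]
    return sum(misses) / len(misses)


for team, (proj, actual) in WINS_2009.items():
    print(f"{team}: {games_off(proj, actual)} games off")
print(f"Average miss: {mean_games_off(WINS_2009):.1f} games")
```

For these ten teams the average miss comes out to 10.3 games, which gives a sense of how rough 2009 was.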
Now, I know that I have been touting PECOTA as the most accurate prediction metric, but the projections above seem to prove otherwise. That is exactly why I used 2009. I didn't want it to seem like this was always accurate, because it isn't. However, it did call the Rays' 2008 season and was almost perfect on the Red Sox in '07.
- The difficulty of predicting the outcome of a season, whether on a team scale or for an individual player, is very high and that is why you will never find a flawless prediction system.
- Injuries. PECOTA doesn't account for injuries. Go look at the Mets prediction. Yeah...
- Availability. Although it should be readily available since BP owns it, I have not been able to find the PECOTA for a specific player's season.
- The calculation is very difficult and impossible for just a few people to do on their own.
PECOTA is a cool stat and very interesting to look at. While I think it is good in comparison to other projection metrics, I would never take it as the final say in any prediction. However, when it comes to sleepers, I would trust PECOTA more than my gut. I mean, if it predicted the Rays in '08, that's amazing. Simply amazing. It is not my favorite sabermetric, but, overall, it is fun to look at. The fluctuations in accuracy from year to year are a bit unnerving, but generally, it is close to the truth.