Making Sense of 147 Years of Baseball Stats with SAP Analytics Cloud

Baseball is a sport rich in data and statistics, but despite this surplus of data it’s extremely difficult to make sense of it all and to truly understand what drives success.

To kick off the 2018 baseball season, I thought that I would test drive SAP Analytics Cloud to see if I could make sense of all of this data.

Why the Casual Fan Doesn’t Understand Baseball Statistics

A full understanding of baseball data is often left to the “experts” for the following reasons:

  • Data is always pre-aggregated and summarized: While it’s easy to answer preconceived questions like “Who has the highest batting average?”, it’s difficult to answer new questions or to correlate different questions with one another.
  • Data is siloed and disconnected: It is easy to see a specific set of statistics (a player, a year, or a team), but difficult to compare multiple players across multiple teams across multiple eras.
  • Data is tabular with few visuals: Many sites allow you to sort and pivot, but few provide the ability to visualize the data and spot trends and outliers.

Enter Sean Lahman’s Data Set

Last year, I stumbled on Sean Lahman’s baseball archive. Sean provides a very robust collection of data sets on every imaginable baseball statistic. He provides multiple data sets around teams, players, hitting, pitching, fielding, salaries, awards, parks, playoffs, all-star games, and more.

Baseball Has a Data Modeling Issue

While most wouldn’t view baseball as having a complex data model, it does indeed present many real-life data challenges that most organizations face. Sean’s data set provides 28 different tables that don’t neatly “join” together. There’s a table for Teams with Teams, Years, and Players. There’s also a Batter table with Teams, Years, and Players. This same table is then replicated for Pitching and Fielding — and again for the Post Season, All Star Games, Hall of Famers, yearly award winners, and so on.

Each of these tables creates many-to-many relationships with the others. (That is, many teams span many years and many players, many players play for many teams across many years, many all-star games are played across many years with many participating teams and players, and on and on.) Furthermore, the Batting, Pitching, Fielding, and Post Season tables share very similar fields, like Games, Player, Team, Hits, and Walks, because the same player can appear in several of them: a pitcher (in the NL) can pitch, field, and hit.

Anyway, without getting too techie, I remodeled the data from the 28 original tables down to 12 distinct tables with three main fact tables.
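To make the "doesn't neatly join" problem concrete, here is a minimal Python sketch. It is not the model built in SAP Analytics Cloud; the rows, the field names, and the `join` helper are all made up for illustration. It shows how joining two stat tables on the player alone multiplies rows (and inflates totals) when a player appears for more than one team in a year, while joining on the full player, year, and team key does not.

```python
# Hypothetical rows: one player who played for two teams in the same year.
batting = [
    {"player": "p1", "year": 2017, "team": "AAA", "hits": 50},
    {"player": "p1", "year": 2017, "team": "BBB", "hits": 40},
]
fielding = [
    {"player": "p1", "year": 2017, "team": "AAA", "putouts": 100},
    {"player": "p1", "year": 2017, "team": "BBB", "putouts": 80},
]


def join(left, right, keys):
    """Naive inner join: pair every left row with every right row that matches on keys."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[k] == r[k] for k in keys)
    ]


# Joining on player only pairs each batting row with both fielding rows.
loose = join(batting, fielding, ["player"])
# Joining on the full key keeps one row per player, year, and team.
tight = join(batting, fielding, ["player", "year", "team"])

loose_hits = sum(row["hits"] for row in loose)
tight_hits = sum(row["hits"] for row in tight)
print(len(loose), loose_hits)
print(len(tight), tight_hits)
```

The loose join double counts hits, which is the kind of error a cleaned-up model with explicit keys is meant to prevent.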

What Kind of Questions Can We Answer with This Data?

If we look at all 147 years of baseball data, we can see all of the teams, franchises, and players that have come and gone.

If we drill down, we can quickly see that before the 20th century, game statistics weren’t always properly recorded and the number of games played varied from season to season. It wasn’t until 1961 (American League) and 1962 (National League) that baseball moved to a full 162-game season. You can also see a few dips in games played due to strike-shortened seasons (1981 and 1994) and World War I.

Which Teams Win the Most? Which Lose the Most?

This is not a straightforward question. Winning can mean the most championships, the most postseason appearances, the most regular season wins, or the highest winning percentage (which is fairer to newer franchises like the Angels). So here are all of these statistics in one view.
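As a quick illustration of why winning percentage is kinder to newer franchises than raw win totals, here is a small Python sketch. The franchise names and win-loss totals are invented for the example and are not taken from the data set.

```python
def winning_percentage(wins: int, losses: int) -> float:
    """Share of decided games that were won; ties are ignored."""
    games = wins + losses
    if games == 0:
        return 0.0
    return wins / games


# Hypothetical career totals (wins, losses), made up for illustration only.
franchises = {
    "Old Franchise": (10000, 9000),
    "New Franchise": (4000, 3500),
}

# Rank by winning percentage instead of by total wins.
ranked = sorted(
    franchises,
    key=lambda name: winning_percentage(*franchises[name]),
    reverse=True,
)
print(ranked)
```

The older franchise has far more wins, but the newer one ranks first once the totals are normalized by games played.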

And then we can see the top five franchises by era.

More Interesting Facts about Winners and Losers

While you can’t see it from this graphic, there are a few more interesting nuggets in this data:

  • Winning doesn’t equal championships. In the Longball era (1994-2005), the Atlanta Braves won the most games but won only one championship. The Yankees were second in wins and won four championships. But in the post-steroid era (2006-present), the Yankees won the most games but won only one championship (2009).
  • You need to be dominant for many years to win championships.  Aside from the Florida Marlins and the Chicago Cubs, it is very rare for a team that is not dominant over an era to win a championship.
  • Bad teams don’t often stay bad forever. When we pivot and look at the poorest performers, we can see that many of them don’t stay poor across multiple eras. You can see a few reversals of fortune for some poor-performing franchises, like the Kansas City Royals and Houston Astros.

What Are “Winning” Teams Doing Better than the Others?

One key offensive statistic used to gauge good hitters is OPS (On Base + Slugging). The idea is that the more often you get on base, the better your on-base percentage (OBP). And the more bases you get per at-bat (a double is worth more than a single, a triple more than a double, and a home run more than a triple), the better your slugging percentage (SLG). If you add these two statistics together, you get OPS (On Base + Slugging).
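For anyone who wants to check the arithmetic by hand, here is a minimal Python sketch of OPS using the standard definitions of on-base and slugging percentage. The function and parameter names are my own labels, and the stat line at the bottom is invented for illustration.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base divided by the plate appearances that count toward OBP."""
    denominator = at_bats + walks + hit_by_pitch + sac_flies
    if denominator == 0:
        return 0.0
    return (hits + walks + hit_by_pitch) / denominator


def slugging_percentage(hits, doubles, triples, home_runs, at_bats):
    """Total bases per at-bat: singles count 1, doubles 2, triples 3, home runs 4."""
    if at_bats == 0:
        return 0.0
    singles = hits - doubles - triples - home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def ops(hits, doubles, triples, home_runs, at_bats, walks, hit_by_pitch, sac_flies):
    """On-base percentage plus slugging percentage."""
    return on_base_percentage(
        hits, walks, hit_by_pitch, at_bats, sac_flies
    ) + slugging_percentage(hits, doubles, triples, home_runs, at_bats)


# Hypothetical season line, made up for illustration only.
example_ops = ops(
    hits=150, doubles=30, triples=5, home_runs=25,
    at_bats=500, walks=60, hit_by_pitch=5, sac_flies=5,
)
print(round(example_ops, 3))
```

In this invented line, OBP is 215/570 and SLG is 265/500, so the OPS comes out a little over .900.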

What’s interesting about this statistic is that over the past 147 years: (1) the teams that win the most score the most runs, (2) the most runs are scored by teams with a high OPS, and (3) a high OPS leads to more wins.

You may ask, “Doesn’t pitching win games?” Well, that’s true as well. The two pitching statistics that I read about are WHIP and FIP. WHIP is the average number of walks plus hits allowed per inning pitched. The pattern mirrors the hitting statistic: (1) teams that allow fewer runs win more, (2) teams with a low WHIP allow fewer runs, and (3) teams with a low WHIP win more.
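WHIP is simple enough to sketch in a few lines of Python. This version assumes innings are supplied as outs recorded (three outs per inning), which avoids the ambiguity of fractional-inning notation; the function name and the sample numbers are illustrative only.

```python
def whip(walks: int, hits: int, outs_recorded: int) -> float:
    """Walks plus hits allowed per inning pitched, with innings given as outs."""
    if outs_recorded == 0:
        # No innings pitched: WHIP is undefined, so report infinity.
        return float("inf")
    innings_pitched = outs_recorded / 3
    return (walks + hits) / innings_pitched


# Hypothetical season: 40 walks and 160 hits over 600 outs (200 innings).
example_whip = whip(walks=40, hits=160, outs_recorded=600)
print(example_whip)
```

Lower is better here: a WHIP of 1.0 means the pitcher allows one baserunner per inning on average.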

Is It Better to Have Good Pitchers or Good Hitters?

Obviously, you need both, and teams that are among the leaders in both OPS and WHIP almost always make the postseason. But a team that is in the Top 5 for WHIP is more likely to make the postseason than a team that is in the Top 5 for OPS. Since the start of the long ball era (the last 22 years), 71% of the Top 5 pitching teams have made the postseason, whereas only 58% of the Top 5 hitting teams have. So pitching wins over hitting.
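A comparison like this boils down to ranking teams within each season and counting how many of the top N reached the postseason. Here is a hedged Python sketch of that calculation. The record layout, the function name, and the four-team sample season are assumptions made for the example, so the output will not reproduce the 71% and 58% figures above.

```python
from collections import defaultdict


def top_n_postseason_rate(team_seasons, stat, n=5, lower_is_better=False):
    """Share of teams ranked in the top n for a stat, per year, that made the postseason."""
    by_year = defaultdict(list)
    for row in team_seasons:
        by_year[row["year"]].append(row)

    made = 0
    total = 0
    for rows in by_year.values():
        # Best teams first: lowest values for WHIP, highest values for OPS.
        ranked = sorted(rows, key=lambda r: r[stat], reverse=not lower_is_better)
        for row in ranked[:n]:
            total += 1
            made += int(row["postseason"])
    return made / total if total else 0.0


# Hypothetical single season with four teams, made up for illustration only.
sample = [
    {"year": 2017, "team": "A", "ops": 0.800, "whip": 1.20, "postseason": True},
    {"year": 2017, "team": "B", "ops": 0.780, "whip": 1.35, "postseason": False},
    {"year": 2017, "team": "C", "ops": 0.720, "whip": 1.15, "postseason": True},
    {"year": 2017, "team": "D", "ops": 0.700, "whip": 1.40, "postseason": False},
]

hitting_rate = top_n_postseason_rate(sample, "ops", n=2)
pitching_rate = top_n_postseason_rate(sample, "whip", n=2, lower_is_better=True)
print(hitting_rate, pitching_rate)
```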

Who Are The Best OPS and WHIP Players?

If I’m going to try to recruit hitters with a good OPS and pitchers with a good WHIP, who’s out there? This is where the many-to-many joins come in: teams have many players and play across many years, and players play for many teams across many years.

If we look at the top hitters of all time, we can see that most are in the Baseball Hall of Fame, with a few famous exceptions. If we filter on players from the last three years who have had a minimum of 400 plate appearances, we get the following list of players, most of whom are household names.
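The minimum plate appearance filter matters because a player with a handful of lucky at-bats can post an absurd OPS. A minimal Python sketch of the filter-then-rank step follows. The player names and numbers are placeholders, not real players or values from the data set.

```python
MIN_PLATE_APPEARANCES = 400

# Hypothetical hitters, made up for illustration only.
hitters = [
    {"name": "Player A", "plate_appearances": 650, "ops": 0.950},
    {"name": "Player B", "plate_appearances": 120, "ops": 1.100},
    {"name": "Player C", "plate_appearances": 410, "ops": 0.880},
]

# Drop small samples first, then rank the remaining hitters by OPS.
qualified = sorted(
    (h for h in hitters if h["plate_appearances"] >= MIN_PLATE_APPEARANCES),
    key=lambda h: h["ops"],
    reverse=True,
)
qualified_names = [h["name"] for h in qualified]
print(qualified_names)
```

Player B has the highest OPS in the sample but is excluded because 120 plate appearances is too small a sample to trust.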

We can do the same exercise with pitching too. We can divide pitchers into “starters” and “relievers” based on whether they start games or not. If we filter for just the current era, we get a list of the pitchers with the lowest WHIP (RIP, Jose Fernandez).
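One simple way to make the starter and reliever split, sketched in Python below, is to look at the share of a pitcher's appearances that were starts. The 50% threshold and the function name are my assumptions for illustration; the exact rule used in the analysis above is not specified.

```python
def pitcher_role(games: int, games_started: int, threshold: float = 0.5) -> str:
    """Label a pitcher by the share of appearances that were starts."""
    if games == 0:
        return "unknown"
    share_started = games_started / games
    return "starter" if share_started >= threshold else "reliever"


# Hypothetical season lines, made up for illustration only.
print(pitcher_role(games=32, games_started=32))
print(pitcher_role(games=70, games_started=0))
```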

Do I Need to Spend to Win?

This is an obvious stat, but winning teams have significantly higher payrolls than the losing teams.

And there’s a correlation between winning and salary. That is, the higher the salary, the more likely you are to win. We can also see that there are some very high payroll teams that don’t win (bottom right in gray) and a few low payroll teams that win (top left in blue). So while it’s not impossible, the odds are stacked against you. Some outliers include the Mariners (2001), Cubs (2016), Royals, and Yankees (1998), who had below-average payrolls but made the postseason.
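To put a number on a relationship like this, one common choice is the Pearson correlation coefficient. The sketch below computes it in plain Python. The payroll and win figures are invented for the example, and this is not necessarily the measure the charting tool uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / sqrt(var_x * var_y)


# Hypothetical payrolls (in millions) and season wins, made up for illustration only.
payrolls = [60, 90, 120, 150, 200]
wins = [70, 78, 85, 88, 95]

r = pearson(payrolls, wins)
print(round(r, 2))
```

A value near 1 means higher payrolls go with more wins in the sample; it says nothing about the individual outliers, which is why the scatter plot is still worth looking at.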

But what seems a little contradictory is that since 2000, only 37% of the division winners were in the top of their division for payroll.

Here’s a look at the Top 10 high salary teams and the fact that only half of them made the postseason.

Which Players Have Teams Invested the Most In?

Here are some staggering numbers on the top players in baseball.

What Does All of This Mean?

While I was able to very easily ask and answer lots of questions in the data, my answers weren’t very eye-opening.  Here’s what I found:

  • These new statistics (OPS and WHIP) are excellent indicators of a team’s success and far better than the old ones (batting average and home runs).
  • Being in the Top 5 in both OPS and WHIP will give you a 98% chance of making the postseason.
  • Being a good pitching team is better than being a good hitting team. 71% of the teams that are in the Top 5 in WHIP make the postseason versus 58% of those in the Top 5 in OPS.
  • To make the postseason, you need to spend 30% more than the average team’s payroll.
  • The top players (OPS and WHIP) are the top yearly earners.

What Next?

Just like in your organization, data is everywhere. However, the real analysis is too often left to too few people. This is another fun example of how analytics can be applied, in this case to understand what drives baseball success.

This story originally appeared on the SAP Analytics blog.