First Month of MLB — Standings, Analysis, Projections

RyanSportsAnalytics
11 min read · May 1, 2021

Is there any better way to judge a baseball team than in the first month of the season? Of course not. That’s what I’m going to do. Some of the expected bottom dwellers are scorching hot to start the season. Are they bound to make the playoffs based on just a handful of games? Several strong playoff contenders who only got stronger in the offseason are struggling to string together wins. Are they in danger of missing the playoffs?

People love to judge a team and its chances of success off a small subset of the season, especially at the start of a season. This year is no different. Many are anxious to see baseball return: a season of more than 60 games, one that promises a return to the 162-game normalcy we’ve become accustomed to. Just how predictive is a month of baseball? How valuable are those first 25 games of the season?

Let’s first assume that winning percentage at the end of April is predictive of winning percentage at the end of the season. That means that we plot April winning percentage on the x-axis as the independent variable and end of season winning percentage on the y-axis as the dependent variable. Below is a plot of 20 years of data: 30 teams, each with winning percentage data from 2000–2019. I decided to exclude 2020 because (1) the season didn’t start in April and (2) a month’s worth of games in 2020 was almost half the entire season, whereas a month’s worth of games in any other year is less than 20% of a season.

Figure 1: Winning percentage after one month and at the end of the season

Correlation isn’t bad. The data makes a nice ellipse (essentially an oval) around 0.500, as expected. Let’s try for a more robust relationship.
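For anyone who wants to put a number on "isn’t bad," here is a minimal sketch of a Pearson correlation between April and end of season winning percentage. The five data points are made up for illustration and are not taken from the 2000–2019 data set, and `pearson_r` is just a name I picked.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical teams: April winning percentage vs. final winning percentage.
april_pct = [0.640, 0.560, 0.500, 0.440, 0.360]
final_pct = [0.580, 0.510, 0.520, 0.480, 0.440]

r = pearson_r(april_pct, final_pct)
print(f"r = {r:.3f}")
```

A value near 1 means April tracks the final record closely, while a value near 0 means April tells you very little.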

Pythagorean Winning Percentage

There are two other ways to calculate expected winning percentage. These go beyond traditional wins and losses and use runs scored and runs allowed to better predict a team’s expected record. For example, a team scoring 500 runs and allowing 500 runs is likely a 0.500 team. However, you wouldn’t expect a team that scores 600 runs and only lets up 400 runs to have a 0.500 record. You would expect a higher winning percentage.

The first calculation is Pythagorean winning percentage. It’s commonly seen on Baseball Reference and FanGraphs. It is calculated as follows:

W% = [(Runs Scored) ^ 2] / [(Runs Scored) ^ 2 + (Runs Allowed) ^2]
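As a quick sanity check on the formula, here is a small sketch that plugs in the two example teams from above. The function name and the `exponent` parameter are my own choices for illustration. The exponent defaults to 2 to match the formula as written.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    if scored + allowed == 0:
        raise ValueError("a team needs at least one run scored or allowed")
    return scored / (scored + allowed)


# The two example teams from the text.
even_team = pythagorean_win_pct(500, 500)
strong_team = pythagorean_win_pct(600, 400)

print(f"500 scored, 500 allowed: {even_team:.3f}")
print(f"600 scored, 400 allowed: {strong_team:.3f}")
```

The 600/400 team comes out at roughly 0.692, well above 0.500, which is exactly the intuition the formula is meant to capture.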

Much like I did with regular winning percentage, I plotted April winning percentage and end of season winning percentage, but with the Pythagorean winning percent instead.

Figure 2: Pythagorean winning percentage after one month and at the end of the season

Looks a bit better than regular winning percentage. Same kind of shape and slightly tighter grouping. Let’s look at one more method.

Linear Regression Winning Percentage (LinReg%)

A far more novel approach is linear regression. It is novel not in the sense that we’ve never used linear regression before, but rather in a baseball winning percentage context. Essentially, this uses the difference between runs scored and runs allowed to estimate the relationship with winning percentage (slightly different than Pythagorean). The larger the difference, the more wins you’d expect. I recommend checking out this piece from the guys over at SABR* to see more about how it’s calculated. I used these formulas to develop an equation for the whole 20-year period as opposed to one single year.
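To give a feel for the idea, here is a rough sketch of fitting a straight line from run differential per game to winning percentage with ordinary least squares. This is not the exact equation from the SABR piece or the one I fit on the 20-year data. The sample points are invented so the fit comes out clean, and `fit_line` and `linreg_win_pct` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def linreg_win_pct(run_diff_per_game, slope, intercept):
    """Predicted winning percentage for a given run differential per game."""
    return intercept + slope * run_diff_per_game


# Invented sample: run differential per game and winning percentage.
run_diff_per_game = [-1.0, -0.5, 0.0, 0.5, 1.0]
win_pct = [0.40, 0.45, 0.50, 0.55, 0.60]

slope, intercept = fit_line(run_diff_per_game, win_pct)
predicted = linreg_win_pct(0.5, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, predicted={predicted:.3f}")
```

A team with a run differential of zero lands at the intercept, which you would expect to sit right around 0.500.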

Figure 3: Linear regression winning percentage after one month and at the end of the season

The biggest takeaway, while it might be a bit difficult to notice at first, is regression toward the mean. After the first month there are dozens of teams above 0.600, yet only a few manage to finish above that mark at the end of the season. By the same token, there are dozens of teams below 0.400, yet only a few manage to finish below that mark at the end of the season. What better way to visualize this than to check out the graphs? In each of the three there are only a handful of teams to the right of 0.600 on the x-axis who also find themselves above the green “perfectly correlated” line. Only a handful of teams left of 0.400 on the x-axis find themselves below the green “perfectly correlated” line. What’s fascinating is that a team under 0.400 is significantly more likely to finish below 0.400 at the end of the season than a team over 0.600 is to finish above 0.600. Here’s a breakdown:

17.9% of teams that start 0.600 or better finish the season with a winning percentage over 0.600: 21 of 117 teams

39.5% of teams that start 0.400 or worse finish the season with a winning percentage under 0.400: 51 of 129 teams

While teams at each end of the spectrum are likely to regress toward the mean — a 0.500 winning percentage — worse teams are even more likely to stay bad throughout the year.
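The breakdown above is just counting, and a minimal sketch of that counting looks like this. The six (April, final) pairs are hypothetical, and `share_staying_extreme` is a name made up for this example. The last two lines simply re-derive the quoted percentages from the quoted team counts.

```python
def share_staying_extreme(seasons, threshold, hot):
    """Share of extreme April starters that also finish beyond the threshold.

    seasons: list of (april_pct, final_pct) pairs.
    hot=True  -> start at or above threshold, finish above it.
    hot=False -> start at or below threshold, finish below it.
    """
    if hot:
        starters = [(a, f) for a, f in seasons if a >= threshold]
        stayed = [(a, f) for a, f in starters if f > threshold]
    else:
        starters = [(a, f) for a, f in seasons if a <= threshold]
        stayed = [(a, f) for a, f in starters if f < threshold]
    return len(stayed) / len(starters) if starters else 0.0


# Hypothetical team seasons: (April winning pct, final winning pct).
seasons = [
    (0.650, 0.610), (0.620, 0.540), (0.600, 0.500),
    (0.380, 0.390), (0.350, 0.450), (0.500, 0.520),
]
hot_share = share_staying_extreme(seasons, 0.600, hot=True)
cold_share = share_staying_extreme(seasons, 0.400, hot=False)

# Re-deriving the percentages quoted in the breakdown from the counts.
hot_quoted = round(100 * 21 / 117, 1)
cold_quoted = round(100 * 51 / 129, 1)
print(hot_share, cold_share, hot_quoted, cold_quoted)
```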

More Trends

Here’s another interesting tidbit about the deviation: 39.5% of teams finish the season within 0.050 of their winning percentage after April, meaning that 60.5% of teams finish more than 0.050 away from it. For example, let’s say the Blue Jays finish the month of April at 14–14, good for a 0.500 winning percentage. There’s only a 39.5% chance, based on all the historical data dating back to 2000, that the Blue Jays finish with a winning percentage between 0.450 and 0.550. In terms of wins and losses, that’s a record between 73–89 and 89–73. To emphasize, a team that starts at 0.500 after the month of April has a 60.5% chance of winning fewer than 73 games or more than 89 games. That seems like a pretty big spread.
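The jump from a winning percentage band to a win total is simple arithmetic over 162 games. A small sketch, with `win_band` as a made-up helper name and rounding to the nearest whole win as my assumption:

```python
def win_band(april_pct, half_width=0.050, games=162):
    """Convert a winning percentage band into (low, high) win totals."""
    low = round((april_pct - half_width) * games)
    high = round((april_pct + half_width) * games)
    return low, high


low_wins, high_wins = win_band(0.500)
print(f"{low_wins}-{162 - low_wins} to {high_wins}-{162 - high_wins}")
```

For a 0.500 start that works out to 0.450 x 162 = 72.9 and 0.550 x 162 = 89.1, which round to the 73 and 89 wins quoted above.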

There are a couple more things to look at. Since two of the methodologies above look at how runs scored and runs allowed play into wins and losses, let’s look at how those change over the course of the season too. Of the 600 team seasons studied (30 teams x 20 seasons), 60.2% finished the season within 0.5 runs per game of their offensive output at the end of April. Along the same lines, 60.7% finished the season within 0.5 runs per game of what they let up through the end of April. For example, let’s say the White Sox scored 4.5 runs per game and let up 5.0 runs per game in the 25 games they played by the end of April. There’s a 60.2% chance they finish the season averaging between 4.0 and 5.0 runs scored per game. There’s also a 60.7% chance they finish the season averaging 4.5 to 5.5 runs allowed per game. Even so, a half run spread in either direction is rather significant.

2021

Let’s return to this year. We’ve just wrapped up the first month of baseball in 2021. It’s the first time we’ve seen baseball in the month of April since 2019. The absence of a full season in 2020 makes the start of 2021 all the more interesting. With all this background, how should we interpret the standings as they appear now? Here’s a look at the standings at the end of April 2021. They are sorted by league and then by traditional winning percentage.

Figure 4: Standings at the end of April 2021

Most teams are right around the 25 game mark — that’s about 15% of a full season. Let’s see how each team would finish their season using the percentages above: actual winning percentage, Pythagorean winning percentage, and linear regression winning percentage. Since we’ve seen how Pythagorean and linear regression after one month correlate more closely with winning percentage at the end of the season, those are the numbers to focus on. Below is a figure with each of the 30 MLB teams. It includes each of the three percentages and the corresponding number of wins.

Figure 5: Projected wins from each of the three winning percentage metrics

I added a color scale to highlight the teams at the top as well as the teams at the bottom. Green is good, red is bad. There’s a clear green-to-red transition for traditional win percentage, as expected, but that’s not the case for the other two metrics. Kansas City surprised us all by taking a firm lead in the AL Central. The Oakland A’s lost the first six games of the season but later won 13 in a row to place themselves atop the AL West. The Milwaukee Brewers proved all the NL Central doubters wrong (many were under the impression we might not even see a 0.500 team in that division) with a winning percentage over 0.600 out of the gate. But what do all these teams have in common? Their underlying winning percentage metrics, Pythagorean and linear regression, are significantly different from their actual percentages. And in the wrong direction. Kansas City and Oakland aren’t even playing 0.500 ball according to them. Milwaukee is playing at a clip that would net 10 more wins over the course of the season than expected. On the other end of the spectrum, the Toronto Blue Jays, New York Yankees, and Miami Marlins are doing just the opposite: underperforming their underlying numbers.
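The "wins over expected" idea is just the gap between two projections over a full season. Here is a minimal sketch of that gap. The 0.615 and 0.555 inputs are hypothetical and are not any team’s actual April numbers, and both function names are my own.

```python
GAMES = 162


def projected_wins(win_pct, games=GAMES):
    """Full-season wins implied by a winning percentage."""
    return round(win_pct * games)


def surplus_wins(actual_pct, expected_pct, games=GAMES):
    """Positive when a team is outplaying its underlying metric."""
    return projected_wins(actual_pct, games) - projected_wins(expected_pct, games)


# Hypothetical team: 0.615 actual vs. 0.555 by an underlying metric.
actual_wins = projected_wins(0.615)
expected_wins = projected_wins(0.555)
gap = surplus_wins(0.615, 0.555)
print(actual_wins, expected_wins, gap)
```

A positive gap flags a team winning more than its run totals suggest, and a negative gap flags the underperformers.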

I also decided to take things one step further. What better way to predict the future than looking at the most similar teams from the past? In other words, how did every other team at the same winning percentage after one month ultimately finish the season? For example, the Tampa Bay Rays started the 2021 season at a 0.460 LinReg%. There were 39 teams that had a LinReg% within 0.01 of 0.460 in the last 20 years. What were their outcomes and what can we predict from them? Breaking this down further, I am essentially finding teams whose run differential is nearly identical, as that is the main basis for LinReg%.

A difference of 0.01 in winning percentage is equal to a difference of approximately 3 runs in run differential (runs scored minus runs allowed). This is all according to the linear regression formula I developed. Think of a scenario where Team A scores 100 runs and gives up 88 runs. Run differential equals 12 runs. Team B scores 102 runs and gives up 87 runs, so runs scored minus runs allowed is 15 runs. Team C scores 101 runs and allows 85 runs, so runs scored minus runs allowed is 16. Team A and Team B would be grouped together (15 minus 12 equals 3). Team C would be grouped with Team B but not Team A (16 minus 15 equals 1, but 16 minus 12 equals 4). As you can see, the teams in a group are almost identical.
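The Team A, B, and C example can be written out in a few lines. This is only a sketch of the grouping rule described above, using the 3-run window from the text. The function names are made up for illustration.

```python
def run_differential(runs_scored, runs_allowed):
    """Runs scored minus runs allowed."""
    return runs_scored - runs_allowed


def same_cohort(team_x, team_y, max_gap=3):
    """True when two teams' run differentials are within max_gap runs."""
    gap = abs(run_differential(*team_x) - run_differential(*team_y))
    return gap <= max_gap


# (runs scored, runs allowed) for the three teams in the example.
team_a = (100, 88)  # run differential 12
team_b = (102, 87)  # run differential 15
team_c = (101, 85)  # run differential 16

print(same_cohort(team_a, team_b))  # A and B: gap of 3
print(same_cohort(team_b, team_c))  # B and C: gap of 1
print(same_cohort(team_a, team_c))  # A and C: gap of 4
```

Note that cohorts are built around each team individually, which is why B can share a group with both A and C even though A and C do not share one with each other.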

Anyway, each team in 2021 has at least 25 teams in its cohort: teams from the last 20 years that share an exceptionally similar linear regression winning percentage. I found the maximum and minimum LinReg% as well as the average LinReg% at the end of the season for each cohort. This allows us to see the best case scenario for a team in their position, the worst case scenario for a team in their position, and how it averages out.
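Here is a rough sketch of that cohort lookup: find every past team within 0.01 of a target LinReg%, then take the best, worst, and average finish. The five historical rows are invented for illustration and are not from the real 2000–2019 data, and `cohort_summary` is a hypothetical helper.

```python
def cohort_summary(history, target_pct, window=0.01):
    """Best, worst, and average final LinReg% for teams near target_pct.

    history: list of (april_linreg_pct, final_linreg_pct) pairs.
    """
    finals = [
        final for april, final in history
        if abs(april - target_pct) <= window
    ]
    if not finals:
        return None
    return {
        "teams": len(finals),
        "best": max(finals),
        "worst": min(finals),
        "average": sum(finals) / len(finals),
    }


# Invented history: (April LinReg%, end of season LinReg%).
history = [
    (0.455, 0.520), (0.462, 0.480), (0.468, 0.610),
    (0.500, 0.550), (0.451, 0.400),
]
summary = cohort_summary(history, target_pct=0.460)
print(summary)
```

Multiplying the best, worst, and average percentages by 162 turns them into the projected win totals.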

Figure 6: Projected wins based on historical data from 2000–2019

Surprisingly (or somewhat unsurprisingly), the averages deviated only a few wins from 0.500. Maximums and minimums varied wildly, showing that even some teams that had a putrid start were still able to win close to 100 games, and some teams that started hot struggled to win 70 games. While this helps us understand that there’s a high density of teams that finish where you might expect, it also shows the volatility in baseball. While it’s highly unlikely a team wins 100 games after starting the season below 0.500, it’s not impossible. The same goes for a team losing 100 games after starting the season above 0.500.

If every team finished based on the average seen from the past 20 years of data, we would only see a single team win more than 90 games and a single team win fewer than 70 games. I think I can say with relative confidence that the Dodgers won’t be the only ones to surpass the 90 win mark. The Tigers probably won’t be the only team to go south of 70 wins either. Why is this? Well, every team isn’t a carbon copy of teams from the past. LinReg% from month one isn’t even close to perfectly correlated with final win percentage. Injuries derail teams. Scheduling may be easier in the first month and significantly tougher the rest of the way, or harder at the start and easier at the end. There are tons of variables that influence a team’s final season record outside of the run differential in the first month of the season, many of which aren’t even quantifiable.

Other Fun Tidbits

While analyzing all this data I found a ton of interesting tidbits about the history of the first month of baseball. I thought I’d share some of the coolest ones below.

There were 4 teams that started below 0.500 (LinReg%) and finished with 100 wins. This included:

2001 Oakland A’s (who also had an abysmal 8–17 record to start the year), 2002 Atlanta Braves, 2004 New York Yankees, and 2009 New York Yankees. This means that only 4 teams that started the season with a negative run differential ended the season with 100 wins. None in the last 11 years.

There were 3 teams that started above 0.500 (LinReg%) and finished with 100 losses. This included:

2008 Seattle Mariners, 2010 Seattle Mariners, and 2012 Houston Astros. This means that only 3 teams that started the season with a positive run differential ended the season with 100 losses. None in the last 8 years.

The largest decline in winning percentage from month one to the end of the season belongs to the 2011 Cleveland Indians. They started 18–8 in the month of April scoring 5.42 runs and allowing just 3.65 runs per game. They finished below 0.500 at 80–82 scoring just 4.35 runs and allowing 4.69 runs per game.

The biggest improvement in winning percentage from month one to the end of the season belongs to the 2006 Minnesota Twins. They started 9–15 scoring just 4.04 runs and allowing an astounding 6.17 runs per game. They turned it around winning 96 games and losing just 66. They improved their runs scored to 4.94 and runs allowed to 4.22 per game.

Sources

*https://sabr.org/journal/article/a-new-formula-to-predict-a-teams-winning-percentage/

https://www.baseball-reference.com/
