Comparing Our Rating Systems

We publish four independent ratings. Here’s what research tells us about their accuracy, where each one breaks down, and whether combining them helps.

What does “accurate” mean?

The most common test for a ranking system’s accuracy is predictive validity: if you use today’s ratings to predict tomorrow’s game outcomes, how often does the higher-rated team win? A system that does this well reflects true team quality, not just luck or schedule quirks.
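That test is easy to make concrete. Here is a hedged sketch (the function name and data layout are invented for illustration, not our real data model): given each team's pre-game rating and a list of results, count how often the higher-rated team actually won.

```python
# Illustrative sketch of a predictive-validity check.
# `ratings` and `games` are hypothetical stand-ins for real data.
def predictive_accuracy(ratings, games):
    """Fraction of decided games won by the higher-rated team.

    ratings: dict mapping team -> rating before the game
    games:   list of (home, away, home_goals, away_goals)
    """
    hits = total = 0
    for home, away, hg, ag in games:
        # Skip ties and games where the ratings make no prediction
        if hg == ag or ratings[home] == ratings[away]:
            continue
        total += 1
        predicted = home if ratings[home] > ratings[away] else away
        actual = home if hg > ag else away
        hits += predicted == actual
    return hits / total

ratings = {"A": 1.5, "B": -0.5, "C": 0.8}
games = [("A", "B", 4, 2), ("B", "C", 1, 3), ("A", "C", 2, 2)]
print(predictive_accuracy(ratings, games))  # 1.0: both decided games predicted correctly
```

Run over a full season of games, this single number is what "predictive accuracy" in the table below summarizes.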

A second test is stability: does a team’s rank change dramatically from week to week based on a single game, or does it move smoothly as more evidence accumulates? Stable ratings are more useful for seeding and planning. Volatile ratings can mislead early in the season.

How each system performs
| Rating | Uses Goals? | Schedule-Adjusted? | Predictive Accuracy | Main Weakness |
| --- | --- | --- | --- | --- |
| SRS | Yes (margin of victory) | Yes (iterative) | Highest | Can overrate teams that ran up scores against weak opponents |
| KRACH | No (win/loss only) | Yes (iterative) | Very high | A 1–0 win and a 7–0 win are treated identically |
| NPI | No (win/loss only) | Yes (iterative) | Good | 75% SOS weighting makes it sensitive to schedule imbalance |
| PairWise | Indirectly (via RPI) | Partially (via RPI) | Moderate | Designed for fairness and transparency, not prediction |

Why goal differential predicts better

A team that wins 5–2 every game is giving you more information than a team that wins 2–1 every game, even if both finish with the same record. The margin reveals how consistently a team outplays opponents — which is harder to do by luck than simply winning close games.

This effect is especially pronounced in shorter seasons, like amateur leagues that play 20–30 games. With a small sample of games, win-loss records are noisier because a few lucky bounces can flip outcomes. Goal differential averages out over those same games and converges on true team quality faster.
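The small-sample claim can be checked with a toy Monte Carlo. Everything here is a made-up illustration: two hypothetical teams whose goals per game follow Poisson distributions with means 3.2 and 2.8, a 20-game sample, and the question of how often each statistic correctly identifies the stronger team.

```python
import math
import random

def poisson(lam):
    """Draw from a Poisson distribution (Knuth's method, stdlib only)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

random.seed(1)
trials, n_games = 2000, 20
wl_correct = gd_correct = 0
for _ in range(trials):
    x_wins = y_wins = diff = 0
    for _ in range(n_games):
        gx, gy = poisson(3.2), poisson(2.8)  # X is the truly stronger team
        diff += gx - gy
        x_wins += gx > gy
        y_wins += gy > gx
    wl_correct += x_wins > y_wins  # record identifies X (ties count as a miss)
    gd_correct += diff > 0         # goal differential identifies X
print(wl_correct / trials, gd_correct / trials)
```

Under these assumed parameters, goal differential picks out the stronger team noticeably more often than the win-loss record over the same 20 games.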

The ±7 cap in our SRS is important: it means a team can’t climb the rankings by running up the score in blowouts. Only the first seven goals of margin per game count, so consistently competitive games matter more than lopsided ones.
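The cap rule itself is a one-liner. This is a sketch of the idea (the helper name is ours for illustration, not the production code):

```python
def capped_margin(goals_for, goals_against, cap=7):
    """Margin of victory clipped to ±cap; goals beyond the cap add nothing."""
    return max(-cap, min(cap, goals_for - goals_against))

print(capped_margin(12, 1))  # an 11-goal blowout counts as 7
print(capped_margin(2, 4))   # a 2-goal loss counts as -2
```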

Where win/loss systems (KRACH, NPI) still shine

Purely win-based systems have one major advantage: they can’t be gamed by score manipulation. A team that plays hard for the full 60 minutes of a 5–0 game provides the same information to KRACH as one that coasted to a 2–0 win. This matters when comparing teams across different divisions or when sportsmanship norms vary.

KRACH also has a clean probabilistic interpretation. If Team A has a KRACH of 800 and Team B has 200, Team A's expected win probability on a neutral site is 800 / (800 + 200) = 80%. No other rating we publish gives you that directly.
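That interpretation comes from KRACH's Bradley–Terry form, and it is trivial to compute:

```python
def krach_win_prob(k_a, k_b):
    """Expected probability that A beats B on neutral ice, per Bradley-Terry."""
    return k_a / (k_a + k_b)

print(krach_win_prob(800, 200))  # 0.8
```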

Late in a long season, KRACH and SRS tend to converge — large sample sizes mean close and blowout wins even out, and both systems land on similar orderings. The biggest differences show up early in the season when the sample is small.

Why PairWise exists (and what it’s actually for)

RPI and PairWise were designed for a specific purpose: giving an NCAA selection committee a defensible, auditable process for choosing tournament teams. The 25/50/25 weighting in RPI wasn’t chosen because it’s statistically optimal — it was chosen because it’s simple to explain to coaches and athletic directors who might dispute a selection.

The head-to-head and common-opponent comparisons in PairWise add another layer of transparency: you can sit down with any two teams and explain exactly why one ranked above the other. That’s valuable for tournament seeding even if it’s not the most predictive approach.

The bottom line: use PairWise to decide tournament bids; use KRACH or SRS to predict game outcomes.

Would a hybrid rating do better?

Yes — and this is standard practice in serious sports analytics. Combining multiple independent systems tends to outperform any single method because their errors are largely uncorrelated. KRACH might underrate a team that's been unlucky in close games. SRS might overrate a team that dominated weak opponents. When you average them, much of that noise cancels out.

The simplest useful hybrid is a composite rank: average each team’s rank across KRACH, SRS, and NPI. A team that consistently places in the top three across all systems has a much stronger claim to being the best team than one that dominates in only one metric.
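A minimal sketch of that composite, using invented team names and ranks:

```python
def composite_ranks(rank_tables):
    """Order teams by their average rank across several systems (1 = best).

    rank_tables: list of dicts, each mapping team -> rank in one system.
    """
    teams = rank_tables[0].keys()
    avg = {t: sum(tbl[t] for tbl in rank_tables) / len(rank_tables) for t in teams}
    return sorted(avg, key=avg.get)

# Hypothetical standings in each system
krach = {"Ice Hawks": 1, "Polar Kings": 2, "River Rats": 3}
srs   = {"Ice Hawks": 2, "Polar Kings": 1, "River Rats": 3}
npi   = {"Ice Hawks": 1, "Polar Kings": 3, "River Rats": 2}
print(composite_ranks([krach, srs, npi]))
# Ice Hawks average 1.33, Polar Kings 2.0, River Rats 2.67
```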

A reasonable weighted approach:

  • 35% KRACH — schedule-adjusted win probability, rigorous at scale
  • 35% SRS — margin of victory, strongest early-season predictor
  • 30% NPI — schedule-adjusted win percentage, independent signal
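One way to realize those weights is to put the three ratings on a common scale first, since raw KRACH, SRS, and NPI values are not comparable. The sketch below z-scores each system's ratings and then applies the 35/35/30 blend; the team names and rating values are invented for illustration.

```python
def zscores(vals):
    """Standardize a dict of team -> rating to mean 0, standard deviation 1."""
    mean = sum(vals.values()) / len(vals)
    var = sum((v - mean) ** 2 for v in vals.values()) / len(vals)
    sd = var ** 0.5 or 1.0  # guard against all-equal ratings
    return {t: (v - mean) / sd for t, v in vals.items()}

def blended(krach, srs, npi, w=(0.35, 0.35, 0.30)):
    """Weighted composite score per team; higher is better."""
    zk, zs, zn = zscores(krach), zscores(srs), zscores(npi)
    return {t: w[0] * zk[t] + w[1] * zs[t] + w[2] * zn[t] for t in krach}

# Hypothetical ratings on each system's native scale
krach = {"A": 900.0, "B": 400.0, "C": 150.0}
srs   = {"A": 1.8,   "B": 2.1,   "C": -1.2}
npi   = {"A": 0.62,  "B": 0.55,  "C": 0.48}
scores = blended(krach, srs, npi)
print(sorted(scores, key=scores.get, reverse=True))  # best team first
```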

PairWise is deliberately left out of a composite — it’s doing a different job (tournament selection) and including it would dilute the predictive accuracy of the blend.

Practical takeaway for amateur hockey

In a typical amateur league season of 20–30 games, SRS is likely the single most informative number because goal differential accumulates useful signal faster than win-loss records in short seasons. A team that piles up lopsided wins but goes 10–6 because of bad luck in one-goal games will look much stronger in SRS than in KRACH or NPI.

That said, showing all four ratings — as we do — is the most honest approach. When multiple systems agree on a team’s rank, you can be confident. When they disagree, it usually means something interesting is happening: a great record against a weak schedule, or a strong team that’s been unlucky in one-goal games.

All ratings are updated every Monday morning. Ratings only include games played through the current week’s cutoff date.
