Methodology January 14, 2025 Desta Birru

Measuring Forecast Skill Against NWP Baselines

How we define and report forecast skill score and why measuring improvement against the raw NWP baseline matters more than comparing against climatological persistence for operational grid decisions.

Forecast accuracy reporting in the energy industry has a persistent problem: vendors report MAE or RMSE in isolation, without reference to a baseline that allows the reader to judge whether the accuracy is good or just plausible-sounding. A 4.8% normalized MAE for a solar production forecast might represent a significant improvement over raw NWP — or it might be barely better than a naive persistence model. The number alone does not tell you which.

Forecast skill score is the metric that resolves this ambiguity, and it is underused in operational energy forecasting despite being standard practice in meteorological verification. The core concept is simple: skill measures how much better your forecast is than a defined reference forecast, expressed as a fraction of the reference forecast's error.

The skill score formula and what it means

The standard formulation for skill score against a reference baseline is:

SS = 1 - (MAE_model / MAE_reference)

A skill score of 0.20 means your model's MAE is 20% lower than the reference. A skill score of 0.0 means your model performs identically to the reference. A negative skill score means your model is worse than the reference — which is more common than it sounds, particularly for models evaluated outside their training distribution.

The choice of reference forecast is the critical decision. Three common choices are: (1) persistence — tomorrow's production equals today's production in the corresponding time interval; (2) climatological mean — production equals the long-run seasonal average for that hour, day-of-year, and site; or (3) raw NWP output — production converted from NWP irradiance or wind speed using a generic power curve without site calibration.

Each reference has a different implication. Skill against persistence answers: "is this model better than doing nothing?" Skill against climatology answers: "does this model add information beyond knowing the season and time of day?" Skill against raw NWP answers: "does the ML calibration layer actually improve on the base weather model?"

Why the choice of reference matters for energy decisions

For grid operations, the most decision-relevant comparison is skill against raw NWP, not against persistence. Here is why: a utility that is evaluating whether to purchase a commercial forecasting system already has access to free NWP output from NOAA's operational models. The relevant question is whether the commercial system's calibration and site-specific corrections are worth the cost differential over simply running an internal irradiance-to-power conversion on GFS output.

Reporting skill against persistence in this context is misleading because persistence performs poorly at multi-hour horizons — it degrades rapidly beyond 1–2 hours ahead as weather conditions evolve. Showing large skill against persistence at the 24-hour horizon tells you that your model is better than assuming tomorrow will look exactly like today, which is a low bar. It tells you nothing about whether the commercial system is worth the incremental cost over free NWP data.

This is an uncomfortable fact for forecast vendors: the more honest comparison is against raw NWP, and the skill improvement over that baseline is smaller than skill against persistence. For well-calibrated systems at sites with sufficient training data, skill improvement over raw NWP on 15-minute solar irradiance typically falls in the range of 15–30%, depending on site complexity, training data volume, and the accuracy of the base NWP for that geographic region.

The horizon-dependent skill profile

Forecast skill is not constant across forecast horizons. For solar production forecasting, the skill profile over a 72-hour horizon typically shows a characteristic pattern:

0–4 hours: Both persistence (solar-adjusted) and NWP perform reasonably well, but satellite-derived nowcasting outperforms both. Skill of ML-corrected NWP over raw NWP is moderate because the correction layer has limited value at near-zero-lead-time where the primary source of error is cloud position uncertainty that neither NWP nor ML can fully resolve.
4–24 hours: NWP skill dominates. The ML correction layer provides meaningful improvement here by handling systematic site-specific biases in NWP irradiance decomposition (GHI to DNI/DHI errors under partial cloud cover), terrain effects, and soiling-related production offsets.
24–72 hours: NWP accuracy degrades as synoptic predictability decreases. ML correction skill against raw NWP decreases proportionally because the ML correction is trained to correct NWP biases, not to improve NWP synoptic skill — if the NWP model places a cloud front 200 km off target, no amount of ML calibration at the site level corrects that.

Reporting a single aggregate skill score across all horizons obscures this structure. A vendor reporting "28% skill improvement over NWP baseline" may be aggregating a strong 12–24 hour skill improvement with a weak 48–72 hour skill to produce a number that looks better than it should for the decision horizon that matters most to day-ahead scheduling.

Conditioning skill on weather regime

Aggregate skill score across all days is an averaged measure that includes clear-sky days (where NWP and any reasonable model perform similarly well) and complex cloud days (where the performance differential is large). Breaking skill score down by weather regime reveals where the correction model is actually adding value.

A site in the Colorado Front Range might show skill improvement over raw NWP of 8% on clear days, 35% on days with orographic cumulus development, and 22% on frontal passage days. The clear-day skill is low because there is little to correct — NWP does fine on cloudless days with simple irradiance geometry. The orographic cumulus skill is high because local terrain-driven convection is systematically under-resolved by global NWP models and the site-trained correction captures the statistical pattern.

That regime-conditional breakdown is operationally useful for dispatch planners: it tells them that the forecast system's additional value is concentrated on the cloudy and partially cloudy days, which are also the days with the highest ramp event risk. On clear days they could use raw NWP with comparable accuracy; on complex cloud days the calibrated system provides a meaningful improvement in reserve commitment quality.

How to request honest skill reporting from forecast vendors

When evaluating a commercial solar or wind forecasting system, the minimum specification for skill reporting should include: the reference baseline used (persistence, climatology, or raw NWP — and which NWP model), the horizon-disaggregated skill profile (not a single aggregate), the evaluation dataset description (in-sample versus out-of-sample hold-out, evaluation period, site count), and the weather regime conditioning if available.

A vendor who presents only aggregate skill against persistence on an in-sample evaluation window is presenting the most favorable interpretation of their accuracy while omitting the comparison that matters most for purchasing decisions. Insisting on an out-of-sample hold-out evaluation against the NWP baseline is the standard due diligence step before committing to a production deployment — and it is the evaluation that the system should be able to pass if the calibration approach is genuinely sound rather than over-fit to the training data.

The skill score is not a proprietary methodology. It is a standard verification tool that any competent forecasting team can compute. The question of whether to report it honestly is a product integrity question, not a technical one.

The skill score formula and what it means

Why the choice of reference matters for energy decisions

The horizon-dependent skill profile

Conditioning skill on weather regime

How to request honest skill reporting from forecast vendors

More from the blog

Gradient Boosting vs. Deep Learning for Solar Site Calibration

ECMWF vs. GFS: Which NWP Model Performs Better for Solar Forecasting?