Methodology Desta Birru

Gradient Boosting vs. Deep Learning for Solar Site Calibration

We tested gradient boosting and neural network approaches on 18 months of actuals across six solar installations. The results were more nuanced than "deep learning wins in every regime."

When we built the first version of the ML correction layer in 2022, the choice between gradient boosting and deep learning was not a theoretical discussion — it was a decision with practical consequences for deployment latency, retraining stability, and the amount of historical SCADA data we could realistically expect from new pilot customers. We have since run both approaches against actuals data across multiple sites. The results were more nuanced than a clean "deep learning wins" or "gradient boosting wins" framing suggests.

This post documents what we found and the specific conditions under which each approach performs better — not as a definitive answer for all forecasting contexts, but as an honest account of what the tradeoffs looked like on real operational data.

The experimental setup

We evaluated both approaches on 18 months of 15-minute SCADA actuals across six solar installations: two fixed-tilt utility-scale sites in the Colorado Front Range (75 MW and 120 MW respectively), two single-axis tracker sites in the high desert of New Mexico (200 MW and 85 MW), and two rooftop/commercial-scale installations included for comparison. NWP inputs were GFS 0.25° and NAM-3km, processed to site-level GHI via the Perez irradiance decomposition model. Target variable was inverter-level AC output, aggregated to site level.

For gradient boosting we used LightGBM, which has become a standard benchmark for tabular forecasting tasks due to its speed and regularization properties. For deep learning we evaluated a standard LSTM architecture and a Temporal Fusion Transformer (TFT). Each model was trained on a 12-month rolling window and evaluated on the subsequent 6-month hold-out period. We report normalized MAE (as a percentage of nameplate capacity) and a probabilistic skill score against the raw NWP baseline.
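
To make the first metric concrete, here is a minimal sketch of normalized MAE and a skill score relative to a baseline. The function names and the sample values are illustrative assumptions, not our evaluation code or data. The skill score shown is a simple deterministic, MAE-based one; a probabilistic skill score would substitute a probabilistic loss (for example CRPS or pinball loss) for MAE.

```python
from statistics import fmean


def normalized_mae(forecast_mw, actual_mw, nameplate_mw):
    """Mean absolute error expressed as a percentage of nameplate capacity."""
    if len(forecast_mw) != len(actual_mw) or not actual_mw:
        raise ValueError("forecast and actuals must be non-empty and equal length")
    mae = fmean(abs(f - a) for f, a in zip(forecast_mw, actual_mw))
    return 100.0 * mae / nameplate_mw


def mae_skill_score(model_nmae, baseline_nmae):
    """Fractional error reduction versus the baseline (0.0 means no gain)."""
    return 1.0 - model_nmae / baseline_nmae


# Illustrative values only (not measured data): a 75 MW site, four intervals.
actual = [30.0, 42.0, 51.0, 47.0]
raw_nwp = [35.0, 38.0, 56.0, 43.0]
corrected = [32.0, 40.0, 53.0, 46.0]

nwp_nmae = normalized_mae(raw_nwp, actual, nameplate_mw=75.0)
model_nmae = normalized_mae(corrected, actual, nameplate_mw=75.0)
skill = mae_skill_score(model_nmae, nwp_nmae)
print(f"raw NWP nMAE: {nwp_nmae:.2f}%  corrected nMAE: {model_nmae:.2f}%  skill: {skill:.2f}")
```

Normalizing by nameplate rather than by mean production keeps the metric comparable across sites of very different size, such as the 75 MW and 200 MW installations above.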

Performance results and the conditions that drove the differences

On the two Front Range fixed-tilt sites, LightGBM and the TFT performed comparably: both achieved approximately 3.8–4.2% normalized MAE on the hold-out period, compared to 5.6–6.1% for raw NWP. Skill improvement over the NWP baseline was similar for both approaches. The TFT showed a marginally better skill score at the 4–12 hour forecast horizon, while LightGBM did better at the 12–36 hour horizon, where longer-range feature interactions matter less and tabular feature engineering adds more value.

The performance gap appeared more clearly on the New Mexico tracker sites. On the 200 MW tracker installation, the TFT outperformed LightGBM by approximately 0.6 percentage points of normalized MAE on the hold-out set. The difference is small in absolute terms but was consistent across the evaluation period and concentrated in specific regimes: early-morning tracking initiation and late-afternoon stow transitions, where the temporal self-attention mechanism in the TFT captured the sequential production dynamics better than the LightGBM lag-feature structure.
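
For readers unfamiliar with what a lag-feature structure looks like, here is a minimal sketch of how a time series can be flattened into tabular rows for a tree model. The feature names (`err_lag1`, `target_correction`, and so on) and the choice of three lags are hypothetical; they are not the roughly 40 production features. The sketch also assumes a one-step-ahead forecast, so a real multi-hour horizon would need the lags shifted back by at least the horizon to avoid using actuals that are not yet available at issue time.

```python
def build_lag_rows(actual_mw, nwp_mw, n_lags=3):
    """Return one feature dict per interval that has n_lags intervals of history.

    Each row pairs the current NWP-derived forecast with the most recent
    forecast errors (actual minus NWP). A tree model sees these as
    independent columns rather than as an ordered sequence.
    """
    if len(actual_mw) != len(nwp_mw):
        raise ValueError("actuals and NWP series must be the same length")
    errors = [a - f for a, f in zip(actual_mw, nwp_mw)]
    rows = []
    for t in range(n_lags, len(errors)):
        row = {"nwp_mw": nwp_mw[t]}
        for k in range(1, n_lags + 1):
            row[f"err_lag{k}"] = errors[t - k]  # error observed k intervals ago
        row["target_correction"] = errors[t]    # what the model learns to predict
        rows.append(row)
    return rows


# Illustrative values only: five consecutive 15-minute intervals.
actual = [10.0, 12.0, 15.0, 18.0, 20.0]
nwp = [9.0, 12.0, 13.0, 17.0, 21.0]
rows = build_lag_rows(actual, nwp)
for row in rows:
    print(row)
```

A fixed set of lag columns gives the model only a short, fixed window of history, which is one plausible reason a sequence model does better during multi-interval transitions such as tracker start-up and stow.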

The rooftop comparison sites showed a reversal: LightGBM significantly outperformed LSTM on small data sets (less than 90 days of clean training actuals), while TFT required substantially more data to overcome its initialization and regularization overhead. For customers with limited historical actuals — common in new installations — gradient boosting produces more reliable initial accuracy and degrades more gracefully as training data shrinks.

Training time and retraining cadence

Our target retraining cadence is every 6 hours, with the correction model updating on a rolling window of recent actuals. Training time on a standard CPU instance (no GPU) for LightGBM on a 90-day 15-minute actuals dataset (8,640 training samples, ~40 features) runs in 8–15 seconds. A full LSTM training run on the same dataset runs in 4–7 minutes without GPU acceleration; the TFT model runs 12–20 minutes.

This is not a concern for overnight batch training but becomes relevant when the retraining is expected to complete within the 6-hour update cycle including NWP download, feature construction, model training, inference, and API delivery. On a constrained compute budget — important for a bootstrapped company managing infrastructure costs — running TFT for every asset on every 6-hour cycle becomes expensive quickly. Our current production deployment uses LightGBM for the primary correction layer and runs TFT as a slower-cycle ensemble member on a 24-hour retrain schedule for assets where TFT showed meaningful skill improvement in the validation period.
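
The cost argument can be checked with simple arithmetic. The sketch below uses the upper ends of the CPU-only training times quoted above and counts training time only; NWP download, feature construction, inference, and delivery are excluded. The function and variable names are illustrative.

```python
INTERVALS_PER_DAY = 24 * 60 // 15  # 96 fifteen-minute intervals per day


def training_samples(days):
    """Number of 15-minute samples in a window of the given length."""
    return days * INTERVALS_PER_DAY


def daily_training_minutes(train_minutes, retrain_every_hours):
    """CPU minutes per asset per day spent on training alone."""
    runs_per_day = 24 / retrain_every_hours
    return train_minutes * runs_per_day


# Upper ends of the CPU-only training times quoted in the text.
LIGHTGBM_MIN = 15 / 60  # 15 seconds
LSTM_MIN = 7
TFT_MIN = 20

budget = {
    "LightGBM @ 6h": daily_training_minutes(LIGHTGBM_MIN, 6),
    "LSTM @ 6h": daily_training_minutes(LSTM_MIN, 6),
    "TFT @ 6h": daily_training_minutes(TFT_MIN, 6),
    "TFT @ 24h": daily_training_minutes(TFT_MIN, 24),
}

print(f"samples in a 90-day window: {training_samples(90)}")
for name, minutes in budget.items():
    print(f"{name}: {minutes:.0f} CPU-min per asset per day")
```

Under these assumptions, TFT on a 6-hour cadence costs about 80 CPU-minutes per asset per day against roughly 1 for LightGBM; moving TFT to a 24-hour schedule brings it down to 20.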

Interpretability and operations team trust

An underappreciated dimension of the algorithm choice is interpretability. Grid operations teams — and their engineering management — are more willing to trust and act on a model output when they can explain why the model changed its prediction. SHAP value analysis of LightGBM models allows us to show, for any given 15-minute interval, which features drove the correction: "the model added +8 MW to the NWP baseline because the last three intervals showed GHI tracking 12% above NWP prediction, and the site historically produces above NWP estimate in this condition."
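
As a rough illustration of how an additive attribution becomes an operator-facing sentence, the sketch below takes per-feature contributions that are assumed to have been computed already (for example by a SHAP explainer) and summarizes the largest drivers. The feature names and MW values are hypothetical, and this is not our production reporting code.

```python
def explain_correction(base_value_mw, contributions_mw, top_n=2):
    """Summarize an additive attribution (such as SHAP values) as text.

    contributions_mw maps feature name -> signed contribution in MW. For an
    additive attribution, the predicted correction equals the base value
    plus the sum of all contributions.
    """
    correction = base_value_mw + sum(contributions_mw.values())
    ranked = sorted(contributions_mw.items(), key=lambda kv: abs(kv[1]), reverse=True)
    drivers = ", ".join(f"{name} ({value:+.1f} MW)" for name, value in ranked[:top_n])
    return correction, f"correction {correction:+.1f} MW vs NWP; main drivers: {drivers}"


# Hypothetical attribution for one 15-minute interval (illustrative values).
contribs = {
    "ghi_vs_nwp_last3": 5.5,      # recent GHI running above the NWP prediction
    "site_bias_clear_sky": 2.0,   # historical over-production in this condition
    "hour_of_day": 0.3,
    "nwp_cloud_cover": -0.8,
}
correction, text = explain_correction(base_value_mw=1.0, contributions_mw=contribs)
print(text)
```

Because the contributions sum exactly to the correction, the explanation accounts for the whole number the operator sees rather than an approximation of it.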

Explaining an LSTM's prediction requires the same kind of post-hoc interpretability processing, but the resulting explanations are less intuitively connected to the model's internal mechanism. The TFT's attention weights are more interpretable than LSTM hidden states but still require more translation effort to turn into an explanation that an operations engineer finds actionable rather than academic.

We're not saying that interpretability should override accuracy when the accuracy difference is large. We're saying that when the accuracy difference is small, as it was on most of our evaluation sites, interpretability is a real decision factor: a model that operations teams understand is a model they use correctly, and a model they don't understand gets overridden or ignored at exactly the intervals where its output is most valuable.

The data volume threshold question

The results converge on a practical heuristic: below approximately 90 days of clean 15-minute SCADA training actuals, LightGBM is the more reliable choice. Gradient boosting regularization is more stable with small sample sizes, and the risk of an LSTM overfitting to idiosyncratic patterns in a short training window is significant. Above 180 days, the performance advantage of deep learning architectures on sites with complex temporal dynamics (trackers, high-variability orographic sites) becomes large enough to justify the additional training cost.

Between 90 and 180 days — the range that covers most early pilot deployments — the results are genuinely ambiguous and site-specific. The most honest answer in that window is to run both, compare on a rolling hold-out week, and switch to the better performer as more training data accumulates.
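
The routing logic can be sketched as a small decision function using the 90- and 180-day boundaries of the ambiguous window. The function name, the `complex_site` flag, and two of the behaviors are assumptions for illustration: defaulting to gradient boosting when no hold-out scores exist yet, and keeping gradient boosting above 180 days for sites without complex temporal dynamics.

```python
def choose_model(days_of_clean_data, holdout_mae_gbm=None, holdout_mae_dl=None,
                 complex_site=False, low_days=90, high_days=180):
    """Pick a model class ("gbm" or "dl") for one asset.

    Below low_days, gradient boosting. Above high_days, deep learning only
    for sites with complex temporal dynamics. In between, whichever model
    has the lower error on the most recent rolling hold-out week.
    """
    if days_of_clean_data < low_days:
        return "gbm"
    if days_of_clean_data > high_days:
        # Assumption: simple sites stay on gradient boosting even with ample data.
        return "dl" if complex_site else "gbm"
    if holdout_mae_gbm is None or holdout_mae_dl is None:
        # Assumption: without hold-out scores, fall back to the cheaper model.
        return "gbm"
    return "dl" if holdout_mae_dl < holdout_mae_gbm else "gbm"


# Illustrative calls with made-up hold-out errors (normalized MAE, %).
print(choose_model(60))                                  # short history
print(choose_model(120, holdout_mae_gbm=4.0, holdout_mae_dl=3.5))
print(choose_model(120, holdout_mae_gbm=3.5, holdout_mae_dl=4.0))
print(choose_model(200, complex_site=True))
```

Re-running the function as training data accumulates lets an asset move between model classes on its own schedule.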

The market-facing claim that any single ML architecture is universally better for solar site calibration should be treated skeptically. The right answer depends on training data volume, site complexity, compute budget, and interpretability requirements — and it may be different per site within the same customer's portfolio. The forecasting system architecture that handles this heterogeneity — routing assets to the right model class based on measurable conditions rather than applying one algorithm to all assets — is more accurate than one that commits to a single approach for operational simplicity.
