SCADA historian data is the training ground for ML bias correction models in renewable energy forecasting. But unlike a curated research dataset, operational SCADA feeds contain artifacts that accumulate over years of plant operation: sensor drift, communication failures, stuck values, firmware resets, and timestamp irregularities introduced by historian write conflicts. These are not edge cases — they are ordinary characteristics of production industrial data systems, and they propagate into ML correction models in specific, predictable ways.
Understanding how data quality issues affect model behavior is not a prerequisite for getting a forecast system running. But it is a prerequisite for trusting what the forecast system produces, and for diagnosing accuracy problems when they appear.
The three failure modes that matter for ML training
Stuck sensor values
A stuck value occurs when a sensor, or the communication link between a field device and the historian, fails to update for one or more consecutive intervals, so the historian logs the same reading repeatedly until the connection is re-established. In a 5-minute SCADA feed, an inverter output reading stuck for four consecutive intervals is superficially indistinguishable from a genuine production plateau, which occurs near solar noon on cloudless days when the array is at or near rated output.
Stuck values are most damaging for ML training when they occur during production ramps. A training sample that shows NWP irradiance increasing from 400 to 650 W/m² over 30 minutes while SCADA output stays stuck at 40 MW teaches the correction model a false relationship between irradiance change and production response. That false relationship degrades the model's ability to capture ramp dynamics, which are precisely the intervals most critical for reserve commitment decisions.
Detection is straightforward: flag any interval where the rate of change in SCADA output is zero while the rate of change in NWP-derived irradiance exceeds a physics-based threshold. The threshold needs to be conditioned on solar position and current irradiance level: a flat production reading near solar noon on a clear day is plausible, while the same flat reading during a modeled sunrise ramp warrants a quality flag.
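A minimal sketch of that check, in plain Python. The Sample fields, the function name, and every threshold value here (25 W/m² per interval, an 850 W/m² saturation level) are illustrative assumptions rather than recommended settings; a production rule would derive them from site physics.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    power_mw: float             # SCADA output for the interval
    nwp_ghi_wm2: float          # NWP-derived irradiance for the same interval
    solar_elevation_deg: float  # solar position at the interval


def flag_stuck_intervals(samples, ramp_threshold_wm2=25.0,
                         saturation_ghi_wm2=850.0, flat_tol_mw=1e-6):
    """Return one boolean per sample; True marks a suspected stuck value."""
    flags = [False] * len(samples)
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if cur.solar_elevation_deg <= 0.0:
            continue  # night: flat zero output is expected
        output_flat = abs(cur.power_mw - prev.power_mw) <= flat_tol_mw
        irradiance_step = abs(cur.nwp_ghi_wm2 - prev.nwp_ghi_wm2)
        # Near rated output a plateau is plausible, so it is not flagged.
        near_rated = min(prev.nwp_ghi_wm2, cur.nwp_ghi_wm2) >= saturation_ghi_wm2
        flags[i] = (output_flat
                    and irradiance_step > ramp_threshold_wm2
                    and not near_rated)
    return flags


# The ramp described earlier: 400 -> 650 W/m² over 30 minutes, output stuck at 40 MW.
ramp = [Sample(40.0, 400.0 + 250.0 * k / 6, 15.0 + 2.0 * k) for k in range(7)]
ramp_flags = flag_stuck_intervals(ramp)

# A flat reading near solar noon with high irradiance is left unflagged.
noon = [Sample(100.0, 900.0, 60.0), Sample(100.0, 940.0, 61.0)]
noon_flags = flag_stuck_intervals(noon)
```

The first sample in a series is never flagged because it has no predecessor to compare against. Conditioning on irradiance level is reduced here to a single saturation cutoff, which is the simplest form of the conditioning described above.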
Meter register rollovers and firmware resets
Some SCADA integrations report cumulative energy counters rather than interval power readings. When these counters overflow a register maximum or are reset during firmware updates, the derivative calculation used to recover interval power produces anomalous values: a large negative spike when the counter wraps or is reset to zero, or a large positive jump when it is reinitialized to a value above the prior reading. These are single-interval events, but they produce training samples with extreme values that corrupt gradient-based ML models if not removed.
A utility-scale solar site whose cumulative energy meter rolls over a 32-bit register will appear to generate roughly negative 4,295 MWh in the affected interval if the register counts watt-hours (2^32 Wh ≈ 4,295 MWh). That is obviously impossible, but the data pipeline may not range-check the derivative output before passing it to the training queue. Even one such sample in a 10,000-interval training set can introduce a large gradient signal that skews model weights toward compensating for phantom negative production.
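One way to guard the derivative step is to unwrap a single register rollover and range-check the result before it reaches the training queue. The sketch below assumes a 32-bit counter in watt-hours and a hypothetical per-interval ceiling of 10 MWh; the function name and status labels are made up for illustration.

```python
def interval_energy_mwh(readings, register_max=2**32,
                        counts_per_mwh=1_000_000, max_plausible_mwh=10.0):
    """Convert cumulative counter readings to (energy_mwh, status) per interval."""
    out = []
    for prev, cur in zip(readings, readings[1:]):
        delta = cur - prev
        status = "ok"
        if delta < 0:
            # Assume a single wrap of the register and undo it.
            delta += register_max
            status = "rollover_corrected"
        energy = delta / counts_per_mwh
        if energy < 0 or energy > max_plausible_mwh:
            # Not explainable as production or a clean wrap: treat as a
            # reset or bad reading and leave the interval empty.
            out.append((None, "reset_or_invalid"))
        else:
            out.append((energy, status))
    return out


# Counter wraps between the second and third reading.
wrap = [2**32 - 3_000_000, 2**32 - 1_000_000, 1_000_000, 3_000_000]
wrap_result = interval_energy_mwh(wrap)

# Counter is reset to a low value after a firmware update.
reset = [5_000_000, 7_000_000, 500_000]
reset_result = interval_energy_mwh(reset)
```

A wrap is recoverable because the unwrapped difference lands in the plausible range, while a reset is not, so the affected interval is emitted as None rather than guessed.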
Temporal misalignment between SCADA and NWP inputs
NWP model output is timestamped to the valid time of the forecast interval, typically the end of the interval under GFS and ECMWF conventions. SCADA historians are less consistent: a tag may be stamped at interval start or interval end, and the convention can vary across tag configurations within the same historian installation. A 5-minute timestamp misalignment between NWP inputs and SCADA outputs causes the ML model to train on lagged input-output pairs, so it learns to predict production from irradiance values that do not correspond to the correct physical interval.
For short intervals (5 or 15 minutes), a single-interval lag can reduce model correlation significantly because irradiance is rapidly changing at the dawn and dusk transitions and during ramp events — exactly the intervals where production predictability matters most. Checking the NWP-SCADA timestamp alignment convention is part of basic integration setup, but it is frequently assumed rather than verified.
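A simple way to verify the convention rather than assume it is to correlate the two series at several integer lags and check that the peak sits at lag zero. This is a sketch of one such check, not a prescribed method; the function names are hypothetical, and it assumes both series are already resampled to the same interval length with no gaps.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def lag_correlations(nwp, scada, max_lag=3):
    """Correlate nwp[i] with scada[i + lag] for each lag in [-max_lag, max_lag]."""
    n = min(len(nwp), len(scada))
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        lo, hi = max(0, -lag), min(n, n - lag)
        x = [nwp[i] for i in range(lo, hi)]
        y = [scada[i + lag] for i in range(lo, hi)]
        result[lag] = pearson(x, y)
    return result


def best_lag(nwp, scada, max_lag=3):
    """Lag (in intervals) at which SCADA best tracks the NWP input."""
    corr = lag_correlations(nwp, scada, max_lag)
    return max(corr, key=corr.get)


# Synthetic day of irradiance with cloud dips; SCADA is stamped one interval late.
nwp_ghi = [0, 50, 120, 300, 280, 450, 600, 520, 700, 820,
           790, 640, 660, 480, 300, 330, 150, 60, 0, 0]
scada_mw = [0.0] + [0.1 * v for v in nwp_ghi[:-1]]
detected = best_lag(nwp_ghi, scada_mw)
```

A nonzero best lag on real data points to a timestamp convention mismatch worth checking in the historian tag configuration. Smooth clear-sky days give a broad correlation peak, so days with ramps and cloud transients are the more informative ones to test on.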
Threshold-based exclusion versus imputation
Once quality issues are identified, the question is whether to exclude flagged intervals from the training set or impute replacements. The conventional approach is exclusion: flagged intervals are removed and the training window is extended until the effective clean sample count meets the minimum threshold (we target 2,000 15-minute intervals of clean training data per asset).
Imputation, meaning the substitution of a physics-based estimate for a flagged interval, is appropriate when the data quality issue is sparse and the imputed value can be derived reliably from nearby clean intervals or from the NWP-direct estimate. It is not appropriate for extended outages (multi-hour stuck values, multi-day sensor failures), where the imputed sequence would make up a substantial fraction of the training window and introduce systematic errors of its own. The imputed values would be NWP-direct estimates, and training the ML correction model on NWP-imputed data teaches it to agree with the NWP baseline rather than to correct it.
We're not saying that imputation is always wrong — for isolated single-interval flags, imputed values are preferable to creating gaps in the time series that can affect time-lagged feature construction. We're saying that bulk imputation of extended SCADA outages defeats the purpose of training a site-specific correction model.
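The distinction can be expressed as a run-length rule: impute isolated flags from their clean neighbors, and exclude anything longer. The sketch below is one possible encoding, assuming linear interpolation between neighbors as the imputation method and a hypothetical max_impute_run parameter of one interval.

```python
def apply_quality_policy(values, flagged, max_impute_run=1):
    """Return (cleaned, status); excluded intervals become None."""
    n = len(values)
    cleaned = list(values)
    status = ["clean"] * n
    i = 0
    while i < n:
        if not flagged[i]:
            i += 1
            continue
        # Find the end of this run of consecutive flagged intervals.
        j = i
        while j < n and flagged[j]:
            j += 1
        run_len = j - i
        has_neighbors = i > 0 and j < n
        if run_len <= max_impute_run and has_neighbors:
            # Short run: interpolate linearly between the clean neighbors.
            left, right = values[i - 1], values[j]
            for k in range(i, j):
                w = (k - i + 1) / (run_len + 1)
                cleaned[k] = left + w * (right - left)
                status[k] = "imputed"
        else:
            # Extended run (or no clean neighbor on one side): exclude.
            for k in range(i, j):
                cleaned[k] = None
                status[k] = "excluded"
        i = j
    return cleaned, status


power = [10.0, 12.0, 99.0, 16.0, 18.0, 18.0, 18.0, 18.0, 26.0]
flags = [False, False, True, False, False, True, True, True, False]
cleaned, status = apply_quality_policy(power, flags)
```

Keeping a status label per interval makes it possible to report how much of the training window was imputed versus excluded, and to drop imputed samples later if they turn out to matter.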
How data quality affects calibrated uncertainty estimates
ML bias correction models in renewable energy forecasting typically produce not just a corrected point estimate but a corrected distributional estimate — the P10/P50/P90 bands that grid operators use for reserve commitment. The calibration of these uncertainty bands depends on the residual distribution of training errors: if training residuals are computed on a contaminated dataset, the uncertainty estimates are miscalibrated.
Specifically, stuck-value episodes that persist through high-irradiance periods inflate the apparent forecast error in those regimes. The model sees large residuals when irradiance is high — driven by stuck sensor artifacts, not real production shortfalls — and learns to widen its confidence bands during high-irradiance intervals as a result. The P90 band becomes unnecessarily wide at solar noon, which causes dispatch operators to commit more spinning reserve than is warranted by the actual meteorological uncertainty. That is a direct, quantifiable cost driven by data quality contamination propagating into the uncertainty model.
Post-hoc calibration verification, which compares stated P90 bounds against actual exceedance frequency on a hold-out set of clean actuals, reveals whether contaminated training data has produced miscalibrated uncertainty. The diagnostic to look for is a P90 bound that actuals exceed significantly more or less than 10% of the time on clean hold-out data. A band inflated by stuck-value residuals shows up as an exceedance rate well below 10% in the affected irradiance regime.
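The exceedance check is a few lines of arithmetic. The sketch below also splits the hold-out set by irradiance regime, since the inflation described above is regime-specific; the 700 W/m² split point and the function names are assumptions for illustration.

```python
def exceedance_rate(actuals, upper_bounds):
    """Fraction of intervals where the actual exceeds the stated bound."""
    pairs = list(zip(actuals, upper_bounds))
    if not pairs:
        raise ValueError("no hold-out samples")
    return sum(1 for a, b in pairs if a > b) / len(pairs)


def exceedance_by_regime(actuals, upper_bounds, irradiance, high_ghi_wm2=700.0):
    """Exceedance rate split into low- and high-irradiance regimes."""
    regimes = {"low": ([], []), "high": ([], [])}
    for a, b, g in zip(actuals, upper_bounds, irradiance):
        key = "high" if g >= high_ghi_wm2 else "low"
        regimes[key][0].append(a)
        regimes[key][1].append(b)
    # Skip regimes with no samples rather than dividing by zero.
    return {k: exceedance_rate(a, b) for k, (a, b) in regimes.items() if a}


# Toy hold-out set: 10 low-irradiance and 10 high-irradiance intervals.
ghi = [300.0] * 10 + [900.0] * 10
p90 = [30.0] * 10 + [95.0] * 10
actual = [31.0] + [25.0] * 9 + [80.0] * 10
rates = exceedance_by_regime(actual, p90, ghi)
```

In this toy set the low-irradiance regime is exceeded at the nominal rate while the high-irradiance regime is never exceeded, which is the pattern an over-wide P90 band at solar noon would produce. Real hold-out sets need enough samples per regime for the rate to be meaningful.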
The data quality audit as an integration prerequisite
In practice, the data quality audit should happen before the ML training window is established, not during. Pulling 90 days of raw SCADA actuals, running the outlier and consistency checks, and generating a data quality report with coverage statistics by interval and by data quality category takes less time than diagnosing poor model performance after the fact. It also gives the asset operations team actionable information about sensor health and historian configuration issues that they may not be tracking systematically.
A useful summary metric is the effective clean coverage rate: the fraction of all intervals in the training window that pass quality checks. An 85% clean coverage rate on 90 days of nominal 5-minute data means approximately 12,960 usable training intervals if only daylight intervals are counted (roughly 15,250 in the window; counting all 25,920 nominal intervals, nighttime included, would give about 22,030), which is well above the effective minimum. A 55% clean coverage rate on the same nominal window means about 8,400 intervals on the same daylight basis: enough to train, but with concerning implications about the systematic nature of the quality failures that removed 45% of the data.
When clean coverage drops below 70%, the right response is usually to extend the historical window rather than train on the contaminated samples. The quality degradation that caused the low coverage rate often clusters in time — a specific sensor failure period, a historian migration, a plant availability event — and extending the window backwards typically recovers clean coverage without requiring imputation of the contaminated period.
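The coverage metric and the decision rule can be computed directly from the per-interval quality flags. This is a minimal sketch: the function names and the returned dictionary are hypothetical, and the min_clean default simply reuses the 2,000-interval target mentioned earlier without converting between 5-minute and 15-minute resolution.

```python
def clean_coverage(flags):
    """Fraction of intervals that pass quality checks (False = clean)."""
    if not flags:
        raise ValueError("empty training window")
    return sum(1 for f in flags if not f) / len(flags)


def plan_training_window(flags, min_clean=2000, min_coverage=0.70):
    """Decide whether to train on this window or extend it backwards."""
    clean = sum(1 for f in flags if not f)
    coverage = clean_coverage(flags)
    if coverage < min_coverage or clean < min_clean:
        action = "extend_window"
    else:
        action = "train"
    return {"clean_intervals": clean, "coverage": coverage, "action": action}


# Two synthetic windows of 10,000 intervals each (True = failed a quality check).
healthy = plan_training_window([False] * 8500 + [True] * 1500)
degraded = plan_training_window([False] * 5500 + [True] * 4500)
```

Note that the degraded window has more than enough clean intervals by count and is still sent back for extension, because the low coverage rate itself signals a systematic problem.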