Engineering April 30, 2025 Rafael Quispe

Building the Gridvynt REST API: Design Decisions and Tradeoffs

A technical post on how we designed the forecast delivery API: the response envelope format, pagination strategy for long-horizon 72h outputs, webhook push architecture, and the authentication model.

We spent three months designing the Gridvynt forecast delivery API before writing the first customer integration. The data model was not complex, but API design decisions for operational energy software are hard to reverse and have downstream consequences for every integration built against them. This post documents the key design decisions, the alternatives we considered, the tradeoffs that, in hindsight, we'd make the same way again, and one decision we would revisit.

The intended audience is integration engineers who are evaluating our API or who work in the energy data space and are building similar systems. We'll cover the response envelope format, forecast versioning strategy, pagination design, and webhook push architecture.

The forecast object model

The central resource in the API is a forecast object, which represents a single issued forecast for a single asset at a specific point in time. A forecast object contains:

{
  "forecast_id": "fcst_01hx...",
  "asset_id": "solar_co_75mw_site_a",
  "issued_at": "2025-04-15T06:12:34Z",
  "model_version": "correction_v2.4.1",
  "intervals": [
    {
      "interval_start": "2025-04-15T12:00:00Z",
      "interval_end": "2025-04-15T12:15:00Z",
      "p10_mw": 42.1,
      "p50_mw": 58.4,
      "p90_mw": 71.2,
      "source": "nwp_corrected"
    },
    ...
  ]
}

The issued_at timestamp records when this specific forecast run was finalized, not when the NWP model was initialized. The distinction matters for lineage: an issued_at of 06:12 UTC on a forecast initialized from the 06:00 UTC GFS run means the forecast reflects NWP data assimilated and output-available approximately 12 minutes before issue. For post-hoc analysis of how forecast accuracy varied with issue time, this level of precision is necessary.

The model_version field records the calibration model version that was active when this forecast was issued. When the correction model is retrained on new actuals data, the version increments. This allows customers to evaluate whether a change in their reported forecast accuracy correlates with a model version update, and allows our team to diagnose accuracy regressions by identifying which model version was active during a specific period.
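As a rough illustration of working with these fields, the sketch below parses a forecast payload into typed Python objects and computes the lag between an NWP initialization time and issued_at. The class names, the helper functions, and the forecast_id value are hypothetical. The payload does not carry the NWP initialization time, so the sketch assumes the caller supplies it separately.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp as timezone-aware UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


@dataclass(frozen=True)
class Interval:
    interval_start: datetime
    interval_end: datetime
    p10_mw: float
    p50_mw: float
    p90_mw: float
    source: str


@dataclass(frozen=True)
class Forecast:
    forecast_id: str
    asset_id: str
    issued_at: datetime
    model_version: str
    intervals: tuple


def parse_forecast(payload: dict) -> Forecast:
    """Convert the JSON-decoded forecast object into typed values."""
    intervals = tuple(
        Interval(
            interval_start=parse_utc(row["interval_start"]),
            interval_end=parse_utc(row["interval_end"]),
            p10_mw=row["p10_mw"],
            p50_mw=row["p50_mw"],
            p90_mw=row["p90_mw"],
            source=row["source"],
        )
        for row in payload["intervals"]
    )
    return Forecast(
        forecast_id=payload["forecast_id"],
        asset_id=payload["asset_id"],
        issued_at=parse_utc(payload["issued_at"]),
        model_version=payload["model_version"],
        intervals=intervals,
    )


payload = {
    "forecast_id": "fcst_example_001",  # hypothetical ID, not a real one
    "asset_id": "solar_co_75mw_site_a",
    "issued_at": "2025-04-15T06:12:34Z",
    "model_version": "correction_v2.4.1",
    "intervals": [
        {
            "interval_start": "2025-04-15T12:00:00Z",
            "interval_end": "2025-04-15T12:15:00Z",
            "p10_mw": 42.1,
            "p50_mw": 58.4,
            "p90_mw": 71.2,
            "source": "nwp_corrected",
        }
    ],
}

forecast = parse_forecast(payload)
# Assumption: the caller knows the NWP initialization time (06:00 UTC here).
nwp_init = parse_utc("2025-04-15T06:00:00Z")
issue_lag = forecast.issued_at - nwp_init
print(issue_lag)  # 0:12:34
```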

Forecast versioning: the design problem we almost got wrong

The most consequential design decision was how to handle forecast updates. A 6-hourly update cycle means we issue 4 new forecasts per day for each asset, each overlapping with the prior forecast's horizon. A customer who requested the forecast for asset A at 06:00 UTC will receive a different (hopefully better) forecast for the 12:00–24:00 window when they request again at 12:00 UTC after the new model run completes.

Our first design draft had a simple GET /forecasts/latest endpoint that always returned the most recent issued forecast, and a GET /forecasts/{forecast_id} endpoint for retrieving specific historical forecasts. This worked in testing but created a problem in production: customers integrating the API into automated EMS workflows were polling /forecasts/latest on a 15-minute cycle and updating their EMS committed schedules on every response. When the forecast for a future interval changed by more than a threshold between the 06:00 and 12:00 update, the EMS re-optimization was triggered mid-morning — appropriate when the change was meteorologically significant, inappropriate when it was just model noise in a low-uncertainty regime.

We redesigned to expose the forecast update as a first-class object: each new issued forecast for the same asset generates a forecast_delta object containing the interval-by-interval change from the prior issued forecast, expressed both as absolute MW deviation and as percentage of prior P50. The delta object is delivered alongside the new forecast in webhook notifications and is available via GET /forecast-deltas with filtering by asset and time window.
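To make the delta concrete, here is a minimal sketch of how an interval-by-interval delta could be derived from two issued forecasts. The function name, the output field names (delta_mw, delta_pct_of_prior_p50), and the choice to report no percentage when the prior P50 is zero are assumptions for illustration, not the actual forecast_delta schema.

```python
def forecast_delta(prior_p50: dict, new_p50: dict) -> list:
    """Compare two forecasts keyed by interval_start (ISO 8601 UTC strings).

    Only intervals present in both forecasts are compared, since each new
    forecast overlaps only part of the prior forecast's horizon.
    """
    deltas = []
    # ISO 8601 UTC strings in a fixed format sort chronologically.
    for interval_start in sorted(prior_p50.keys() & new_p50.keys()):
        prior = prior_p50[interval_start]
        delta_mw = new_p50[interval_start] - prior
        # Percentage is undefined when the prior P50 is zero (e.g. at night).
        pct = delta_mw * 100.0 / prior if prior != 0 else None
        deltas.append(
            {
                "interval_start": interval_start,
                "delta_mw": delta_mw,
                "delta_pct_of_prior_p50": pct,
            }
        )
    return deltas


prior = {
    "2025-04-15T06:00:00Z": 10.0,  # outside the new forecast's horizon
    "2025-04-15T12:00:00Z": 50.0,
    "2025-04-16T04:00:00Z": 0.0,
}
new = {
    "2025-04-15T12:00:00Z": 52.5,
    "2025-04-16T04:00:00Z": 0.0,
    "2025-04-16T12:00:00Z": 40.0,  # beyond the prior forecast's horizon
}

deltas = forecast_delta(prior, new)
for row in deltas:
    print(row)
```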

The customer's EMS integration now evaluates the delta at the interval level and applies a materiality threshold — configurable per customer, typically 5% of nameplate or 3% of prior P50, whichever is larger — to decide whether a re-optimization is warranted. Small forecast updates (delta within threshold) are logged but do not trigger EMS actions. This moved the decision about when to re-optimize back to the operator, which is where it belongs, rather than making it a function of the API's update frequency.
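The materiality check described above might look something like the following on the customer side. This is a sketch under stated assumptions: the function names and the (delta_mw, prior_p50_mw) tuple layout are hypothetical, the default fractions are the typical values quoted above, and a delta exactly equal to the threshold is treated as within threshold.

```python
def materiality_threshold_mw(
    nameplate_mw: float,
    prior_p50_mw: float,
    nameplate_frac: float = 0.05,
    p50_frac: float = 0.03,
) -> float:
    """Return the larger of the two configured thresholds, in MW."""
    return max(nameplate_frac * nameplate_mw, p50_frac * prior_p50_mw)


def is_material(delta_mw: float, nameplate_mw: float, prior_p50_mw: float) -> bool:
    """A delta is material only if its magnitude exceeds the threshold."""
    return abs(delta_mw) > materiality_threshold_mw(nameplate_mw, prior_p50_mw)


def should_reoptimize(interval_deltas: list, nameplate_mw: float) -> bool:
    """Trigger re-optimization if any interval has a material delta.

    interval_deltas is a list of (delta_mw, prior_p50_mw) tuples.
    """
    return any(
        is_material(delta_mw, nameplate_mw, prior_p50_mw)
        for delta_mw, prior_p50_mw in interval_deltas
    )


NAMEPLATE_MW = 75.0  # assumed from the asset_id in the sample payload

small_update = [(2.5, 58.4), (-1.0, 60.0)]
large_update = [(2.5, 58.4), (5.0, 60.0)]

print(should_reoptimize(small_update, NAMEPLATE_MW))  # False: log only
print(should_reoptimize(large_update, NAMEPLATE_MW))  # True: re-optimize
```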

Pagination design for large time range requests

A customer requesting historical forecast data for a 200 MW portfolio spanning 90 days at 15-minute resolution is requesting approximately 260,000 interval records — not a volume that should be returned in a single JSON response body, both for payload size reasons and for client memory management. Our initial implementation used offset-based pagination with page_size and page_offset parameters, which is familiar to most API consumers and easy to implement server-side.

The problem with offset pagination for time-series data is that it is unstable when new forecasts are being written while a paginated read is in progress. If the client is retrieving page 3 of a large historical query and a new forecast is written to the database for a time interval that falls within the already-returned page 2 window, the offset shifts and the client may either miss records or receive duplicates on subsequent page requests. For infrequently updated historical archives this is a theoretical issue; for our 6-hourly update system it was a practical one.
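The failure mode is easy to reproduce with a toy in-memory dataset. In the sketch below, the record names and the fetch_page helper are made up for illustration; the list stands in for a sorted database table, and the insert between page reads stands in for a concurrent forecast write.

```python
import bisect


def fetch_page(rows: list, page_offset: int, page_size: int) -> list:
    """Offset pagination: skip page_offset rows, return the next page_size."""
    return rows[page_offset:page_offset + page_size]


rows = [f"rec_{i:02d}" for i in range(6)]  # rec_00 .. rec_05, kept sorted
page_size = 2

seen = []
seen += fetch_page(rows, 0, page_size)  # page 1: rec_00, rec_01
seen += fetch_page(rows, 2, page_size)  # page 2: rec_02, rec_03

# A write lands inside the range the client has already read.
bisect.insort(rows, "rec_00a")

seen += fetch_page(rows, 4, page_size)  # page 3: rec_03 again, then rec_04
seen += fetch_page(rows, 6, page_size)  # page 4: rec_05

print(seen)
```

The client ends up with rec_03 twice and never sees rec_00a, even though every individual page request was answered correctly.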

We switched to cursor-based pagination using the forecast_id as the cursor value. The client receives a next_cursor field in the response envelope pointing to the first forecast_id beyond the current page boundary. Subsequent requests use cursor=fcst_01hx... to retrieve the next page from a stable position in the ordering. New forecasts written during a paginated read do not disrupt the cursor position, and the cursor values are deterministic and stable across multiple reads of the same dataset.
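A minimal sketch of the cursor approach follows, using the same toy setup as an in-memory stand-in for the database. It assumes forecast IDs have a stable sort order and that next_cursor is the first ID beyond the current page, as described above; the fetch_page helper and the ID values are hypothetical.

```python
import bisect
from typing import Optional


def fetch_page(sorted_ids: list, cursor: Optional[str], page_size: int):
    """Return (page, next_cursor), starting at the first ID >= cursor."""
    start = 0 if cursor is None else bisect.bisect_left(sorted_ids, cursor)
    end = start + page_size
    page = sorted_ids[start:end]
    # next_cursor is the first ID beyond this page, or None on the last page.
    next_cursor = sorted_ids[end] if end < len(sorted_ids) else None
    return page, next_cursor


ids = [f"fcst_{i:02d}" for i in range(6)]  # hypothetical IDs, kept sorted
seen = []

page, cursor = fetch_page(ids, None, 2)  # fcst_00, fcst_01
seen += page
page, cursor = fetch_page(ids, cursor, 2)  # fcst_02, fcst_03
seen += page

# A write lands behind the cursor while the read is in progress.
bisect.insort(ids, "fcst_00a")

while cursor is not None:
    page, cursor = fetch_page(ids, cursor, 2)
    seen += page

print(seen)  # no duplicates, no gaps among the original six IDs
```

Because each request resumes from an ID rather than a row count, the insert does not shift the client's position. The record written behind the cursor is simply not part of this read.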

Webhook delivery architecture and reliability design

For customers who need low-latency forecast delivery — typically those integrating into real-time EMS workflows where polling on a 15-minute cycle adds scheduling overhead — we implemented a webhook push architecture. When a new forecast is issued for an asset, the webhook system delivers a POST request to the customer's registered endpoint within 60 seconds of forecast finalization.

The reliability design follows the standard approach for operational webhook delivery: at-least-once delivery, with idempotent handling on the consumer side. Each webhook event carries a unique event_id in the payload, plus signature headers containing an HMAC-SHA256 of the payload body computed with the customer's signing secret. The customer's handler can verify the signature and use the event_id for deduplication if the event is retried.
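On the consumer side, signature verification and deduplication could be sketched as below. The hex encoding of the signature, the handler class, the sample secret and event_id, and the in-memory set of seen IDs are all assumptions for illustration; a production handler would read the signature from the request headers and persist seen event IDs durably.

```python
import hashlib
import hmac
import json


def sign(body: bytes, secret: bytes) -> str:
    """HMAC-SHA256 over the raw payload body, hex-encoded (assumed encoding)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(body: bytes, secret: bytes, signature_hex: str) -> bool:
    """Constant-time comparison avoids leaking signature bytes via timing."""
    return hmac.compare_digest(sign(body, secret), signature_hex)


class WebhookHandler:
    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen_event_ids = set()  # in-memory only, for illustration

    def handle(self, body: bytes, signature_hex: str) -> str:
        # Verify against the raw bytes before parsing anything.
        if not verify(body, self.secret, signature_hex):
            return "rejected"
        event = json.loads(body)
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"  # retried delivery, safe to acknowledge
        self.seen_event_ids.add(event["event_id"])
        return "processed"


secret = b"example-signing-secret"  # hypothetical secret
body = json.dumps({"event_id": "evt_example_1"}).encode("utf-8")
signature = sign(body, secret)

handler = WebhookHandler(secret)
first = handler.handle(body, signature)
second = handler.handle(body, signature)
forged = handler.handle(body, "0" * 64)
print(first, second, forged)  # processed duplicate rejected
```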

Our retry schedule uses escalating backoff: first retry at 30 seconds, then 2 minutes, 10 minutes, 30 minutes, 2 hours, and 6 hours. If all retries fail within 24 hours, the event is marked failed and logged for review, and the customer's webhook configuration is flagged as potentially unreachable. We send a daily delivery health digest to the customer's registered admin email showing delivery success rates and any failed event log entries.
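For a sense of how the schedule plays out, the snippet below accumulates the retry delays into offsets from the initial delivery attempt. It assumes each delay is measured from the previous attempt; the text above does not say whether the values are delays between attempts or offsets from the first attempt.

```python
from datetime import timedelta
from itertools import accumulate

# Delays between consecutive attempts (assumed interpretation).
RETRY_DELAYS = [
    timedelta(seconds=30),
    timedelta(minutes=2),
    timedelta(minutes=10),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=6),
]

# Time of each retry relative to the initial delivery attempt.
retry_offsets = list(accumulate(RETRY_DELAYS))

for attempt, offset in enumerate(retry_offsets, start=1):
    print(f"retry {attempt}: +{offset}")

final_retry = retry_offsets[-1]
print(final_retry)  # 8:42:30
```

Under that interpretation the last retry fires 8 hours 42 minutes 30 seconds after the first attempt, comfortably inside the 24-hour window.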

We made a deliberate decision not to guarantee exactly-once delivery. It is a significantly harder engineering problem, and forecast delivery does not operationally require it. A dispatch engineer whose EMS receives the same forecast twice (with the same event_id) and applies the deduplication logic correctly loses nothing. The engineering cost of exactly-once guarantees across a distributed delivery system would have added months to development without a commensurate operational benefit.

What we'd do differently

One decision we'd revisit is the interval timestamp convention. We shipped with interval-start timestamps, a convention we settled on during internal design review. Within the first six months, three separate integration engineers at customer sites asked about converting to interval-end timestamps to align with their historian conventions. The timestamp translation is a single-line operation, but it generates confusion in API documentation and onboarding calls disproportionate to the technical complexity. An explicit timestamp_convention query parameter that lets the caller specify interval-start, interval-end, or interval-midpoint would have been the right design from the start.
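The translation itself is small. A sketch of what a client-side helper might look like is shown below; the function name is hypothetical, and the convention strings simply mirror the three options named above rather than any shipped parameter values.

```python
from datetime import datetime, timedelta, timezone


def convert_timestamp(
    interval_start: datetime,
    interval_length: timedelta,
    convention: str = "interval-start",
) -> datetime:
    """Re-label an interval-start timestamp under another convention."""
    if convention == "interval-start":
        return interval_start
    if convention == "interval-end":
        return interval_start + interval_length
    if convention == "interval-midpoint":
        return interval_start + interval_length / 2
    raise ValueError(f"unknown timestamp convention: {convention!r}")


start = datetime(2025, 4, 15, 12, 0, tzinfo=timezone.utc)
length = timedelta(minutes=15)

end_label = convert_timestamp(start, length, "interval-end")
mid_label = convert_timestamp(start, length, "interval-midpoint")
print(end_label.isoformat())  # 2025-04-15T12:15:00+00:00
print(mid_label.isoformat())  # 2025-04-15T12:07:30+00:00
```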

The API is otherwise performing as designed. The forecast delta objects have become the most-used feature in customer EMS integrations, well beyond the utility we initially anticipated for them. Exposing how much the forecast changed across update cycles, not just the new forecast itself, turned out to be the operational signal that drives the highest-value dispatch decisions.
