Variable site capacity #295

@zakwatts

Description

This issue addresses two improvements:

  1. Changing the data loading pipeline for sites to expect a single file that contains all generation, capacity, and corresponding metadata for a site. This differs from the current implementation, which expects a separate generation file and metadata file. The reasons for this are to reduce the number of files needed when creating site data for training, and to simplify/reduce the code needed to handle a capacity that varies over time (see the sketch after this list).
  2. Handling variable capacity. I created an initial solution to this in PR Country site #239, but to account for the feedback to merge the generation and metadata files, and to reduce the downstream code, it may be better to work off the main branch rather than that existing PR (also because a few changes have landed since I started it).
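
To make the expected single-file format concrete, here is a minimal sketch of what such a combined file could look like. Everything here is an assumption to illustrate the idea, not the final spec: the variable names (generation_kw, capacity_kwp), the dimension names (time_utc, site_id), the metadata carried as coordinates, and NetCDF as the on-disk format would all be pinned down during implementation.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical single site file: generation, capacity and metadata together.
times = pd.date_range("2024-01-01", periods=48, freq="30min")
site_ids = [0, 1]

ds = xr.Dataset(
    data_vars={
        "generation_kw": (("time_utc", "site_id"), np.zeros((48, 2))),
        # A capacity value for every generation value, even when constant over time.
        "capacity_kwp": (("time_utc", "site_id"), np.tile([5.0, 10.0], (48, 1))),
    },
    coords={
        "time_utc": times,
        "site_id": site_ids,
        # Static per-site metadata carried as extra coordinates on site_id.
        "latitude": ("site_id", [51.5, 52.2]),
        "longitude": ("site_id", [-0.1, 0.1]),
    },
)
ds.to_netcdf("sites.nc")
```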

Desired implementation:

When training is kicked off for sites, validation is done on the combined metadata/generation file to check that it is in the expected format. It would be best for the open_site function (https://github.com/openclimatefix/ocf-data-sampler/blob/main/ocf_data_sampler/load/site.py) to expect a capacity value for every generation value in the dataset, even if the capacity stays the same across time. This should keep the code cleaner than my implementation in #239: for example, we would no longer need to add a parameter to the config and use it downstream, as I did in https://github.com/openclimatefix/ocf-data-sampler/pull/239/files#diff-b40689b299c299259884e6f50e9df2f3460799a45f3de4496aa76d2d4b39f515.
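
As a sketch of what that validation could look like (hedged: the variable names and the exact rules are assumptions from the example above, not the final spec):

```python
import xarray as xr

def validate_site_dataset(ds: xr.Dataset) -> None:
    """Check the combined generation/capacity file is in the expected shape.

    Sketch only: names and rules here are assumptions, not the final spec.
    """
    for var in ("generation_kw", "capacity_kwp"):
        if var not in ds.data_vars:
            raise ValueError(f"expected data variable '{var}' in the site file")
    # Require a capacity value for every generation value, even if constant.
    if dict(ds["capacity_kwp"].sizes) != dict(ds["generation_kw"].sizes):
        raise ValueError("capacity_kwp must align with generation_kw on all dims")
    if bool((ds["capacity_kwp"] <= 0).any()):
        raise ValueError("capacity_kwp values must be strictly positive")
```
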
The open_site function should return the whole generation_ds dataset, not just extract the generation as it currently does (cf76958#diff-c5d31e277f06a22594c11e50b6a1e61191deab35e6926c00be073ac69431101d); that commit introduced returning it as site_da = generation_ds.generation_kw.
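
Concretely, the change is roughly the following (a sketch against the hypothetical names above; the rest of the real open_site is elided):

```python
import xarray as xr

def open_site(path: str) -> xr.Dataset:
    generation_ds = xr.open_dataset(path)
    # Before (cf76958): return generation_ds.generation_kw -- this drops capacity.
    # After: return the whole Dataset so capacity travels with generation.
    return generation_ds
```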

The sites data then gets processed downstream in the _get_sample function:

def _get_sample(self, t0: pd.Timestamp, location: Location) -> dict:

It is important to ensure that the correct capacity data is preserved here: I previously came across some interesting behaviour where xarray drops coordinates and dimensions that are not part of the data variable being extracted from a dataset. Promoting capacity from a data variable to a coordinate at this stage could be one solution, as the sketch below shows.
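
A small demonstration of the behaviour in question, and of the coordinate-promotion workaround (hypothetical variable names as above):

```python
import pandas as pd
import xarray as xr

times = pd.date_range("2024-01-01", periods=4, freq="30min")
ds = xr.Dataset(
    {
        "generation_kw": ("time_utc", [1.0, 2.0, 3.0, 2.5]),
        "capacity_kwp": ("time_utc", [5.0, 5.0, 6.0, 6.0]),
    },
    coords={"time_utc": times},
)

# Pulling out one data variable silently drops the other:
da = ds.generation_kw
assert "capacity_kwp" not in da.coords  # capacity is gone

# Promoting capacity to a coordinate first keeps it attached:
da2 = ds.set_coords("capacity_kwp").generation_kw
assert "capacity_kwp" in da2.coords  # capacity travels with the DataArray
```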

Tests:

It would be useful to test the following use cases (a test sketch follows the list):

  • Several sites with and without variable capacity
  • 1 site with and without variable capacity
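
A hedged pytest sketch of those cases, using a hypothetical make_site_dataset helper and the variable names from the examples above:

```python
import numpy as np
import pandas as pd
import pytest
import xarray as xr

def make_site_dataset(n_sites: int, variable_capacity: bool) -> xr.Dataset:
    """Build a toy combined site dataset (hypothetical test helper)."""
    times = pd.date_range("2024-01-01", periods=4, freq="30min")
    capacity = np.full((len(times), n_sites), 5.0)
    if variable_capacity:
        capacity[len(times) // 2 :] = 6.0  # capacity changes mid-series
    return xr.Dataset(
        {
            "generation_kw": (("time_utc", "site_id"), np.ones((len(times), n_sites))),
            "capacity_kwp": (("time_utc", "site_id"), capacity),
        },
        coords={"time_utc": times, "site_id": np.arange(n_sites)},
    )

@pytest.mark.parametrize("n_sites", [1, 3])
@pytest.mark.parametrize("variable_capacity", [False, True])
def test_capacity_survives_processing(n_sites, variable_capacity):
    ds = make_site_dataset(n_sites, variable_capacity)
    da = ds.set_coords("capacity_kwp").generation_kw
    # Capacity must be preserved for every generation value.
    assert da.coords["capacity_kwp"].shape == da.shape
```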

Some things to note and keep in mind:

  • Moving from a DataArray to a Dataset, since we now have two variables (generation and capacity).
  • Backwards compatibility?
  • It might be cleaner to handle some of the data validation outside of the open_site function.
  • A fix for solar coords has been implemented in f593355, which I was working on in Country site #239.
  • Sample streaming
