Description
This issue will address two new improvements:
- Changing the data loading pipeline for sites to expect a single file that contains all generation, capacity, and corresponding metadata for a site. This differs from the current implementation, which expects a separate generation file and metadata file. The reasons for this are to reduce the number of files needed when creating site data for training, and to simplify/reduce the code needed to handle cases where capacity varies over time.
- Handling variable capacity. I created an initial solution to this in PR Country site #239, but to account for the feedback to merge the generation and metadata files, and to reduce the downstream code, it may be better to work off the main branch rather than that existing PR (also because a few changes have landed since I started the PR).
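The single-file idea above can be sketched as one xarray Dataset holding generation, capacity, and site metadata together. This is a minimal sketch, not the final schema: the variable names `generation_kw` and `capacity_kwp` and the metadata-as-coordinates layout are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical combined site file: generation and capacity share the same
# dimensions, so every generation value has a matching capacity value.
times = pd.date_range("2023-01-01", periods=4, freq="30min")
site_ids = [0, 1]

combined = xr.Dataset(
    data_vars={
        "generation_kw": (("time_utc", "site_id"), np.random.rand(4, 2)),
        "capacity_kwp": (("time_utc", "site_id"), np.full((4, 2), 5.0)),
    },
    coords={
        "time_utc": times,
        "site_id": site_ids,
        # per-site metadata stored as coordinates on the site_id dimension
        "latitude": ("site_id", [51.5, 52.1]),
        "longitude": ("site_id", [-0.1, 0.5]),
    },
)
# combined.to_netcdf("sites_combined.nc")  # one file instead of generation + metadata files
```

Storing metadata as coordinates keeps it attached to each site while the two data variables stay aligned per timestamp.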
Desired implementation:
When training is kicked off for sites, validation is done on the combined metadata/generation file to check it is in the expected format. It may be best for the open_site function (https://github.com/openclimatefix/ocf-data-sampler/blob/main/ocf_data_sampler/load/site.py) to expect a capacity value for every generation value in the dataset (even if the capacity stays the same over time). This should let us keep the code cleaner than my implementation in #239; for example, we would no longer need to add a parameter to the config and use it downstream in https://github.com/openclimatefix/ocf-data-sampler/pull/239/files#diff-b40689b299c299259884e6f50e9df2f3460799a45f3de4496aa76d2d4b39f515 as I did in my implementation.
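The "capacity value for every generation value" check could live in a small validation helper, which also keeps the loading function itself lean. A sketch, with the variable names (`generation_kw`, `capacity_kwp`) assumed rather than taken from the final schema:

```python
import xarray as xr


def validate_combined_site_data(ds: xr.Dataset) -> None:
    """Hypothetical validation sketch for the combined site file:
    require a capacity value for every generation value, even when
    capacity is constant over time."""
    for var in ("generation_kw", "capacity_kwp"):
        if var not in ds.data_vars:
            raise ValueError(f"Expected variable '{var}' in site dataset")
    if ds["generation_kw"].dims != ds["capacity_kwp"].dims:
        raise ValueError("generation_kw and capacity_kwp must share dimensions")
    if bool(ds["capacity_kwp"].isnull().any()):
        raise ValueError("capacity_kwp contains missing values")
```

Raising early here means downstream sampling code never needs to branch on whether capacity is present.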
The open_site function should return the whole generation_ds dataset, not just extract the generation as it currently does (cf76958#diff-c5d31e277f06a22594c11e50b6a1e61191deab35e6926c00be073ac69431101d; that commit introduced returning it as site_da = generation_ds.generation_kw).
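The difference can be illustrated in memory (variable names assumed): today only the generation DataArray survives loading, so capacity is dropped; the proposal is to return the full Dataset so both variables travel through the pipeline.

```python
import pandas as pd
import xarray as xr

times = pd.date_range("2023-01-01", periods=3, freq="30min")
generation_ds = xr.Dataset(
    {
        "generation_kw": ("time_utc", [1.0, 2.0, 3.0]),
        "capacity_kwp": ("time_utc", [5.0, 5.0, 5.0]),
    },
    coords={"time_utc": times},
)

# Current behaviour: only the generation DataArray is kept, so the
# capacity variable is lost at load time.
site_da = generation_ds.generation_kw  # xr.DataArray

# Proposed behaviour: keep the whole Dataset so capacity is available
# downstream alongside generation.
site_ds = generation_ds  # xr.Dataset with both variables
```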
The sites data then gets processed downstream in the _get_sample function here:

def _get_sample(self, t0: pd.Timestamp, location: Location) -> dict:
Tests:
It would be useful to test the following use cases:
- Several sites with and without variable capacity
- 1 site with and without variable capacity
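The four cases above can be generated from a single synthetic-data helper; a pytest-style sketch, again assuming the `generation_kw`/`capacity_kwp` naming:

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_site_dataset(n_sites: int, variable_capacity: bool) -> xr.Dataset:
    """Build a synthetic combined site dataset for tests (names assumed)."""
    times = pd.date_range("2023-01-01", periods=6, freq="30min")
    if variable_capacity:
        # capacity ramps over time, repeated for each site
        capacity = np.linspace(4.0, 6.0, 6)[:, None].repeat(n_sites, axis=1)
    else:
        capacity = np.full((6, n_sites), 5.0)
    return xr.Dataset(
        {
            "generation_kw": (("time_utc", "site_id"), np.random.rand(6, n_sites)),
            "capacity_kwp": (("time_utc", "site_id"), capacity),
        },
        coords={"time_utc": times, "site_id": np.arange(n_sites)},
    )


# 1 site and several sites, each with and without variable capacity
cases = [(n, v) for n in (1, 3) for v in (False, True)]
for n_sites, variable in cases:
    ds = make_site_dataset(n_sites, variable)
    assert ds["generation_kw"].shape == ds["capacity_kwp"].shape
    is_constant = bool((ds["capacity_kwp"].std("time_utc") == 0).all())
    assert is_constant != variable
```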
Some things to note and keep in mind:
- Moving from a DataArray to a Dataset, as we now have two variables in our dataset (generation and capacity).
- Backwards compatibility?
- It might be cleaner to handle some of the data validation outside of the function for cleaner code.
- A fix for solar coords has been implemented in f593355, which I was working on in Country site #239.
- Sample streaming
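One payoff of the DataArray-to-Dataset move noted above: once capacity is aligned per timestamp, normalising generation by a possibly time-varying capacity is a plain element-wise division, with no special-casing for capacity changes. A sketch with assumed variable names:

```python
import pandas as pd
import xarray as xr

times = pd.date_range("2023-01-01", periods=4, freq="30min")
ds = xr.Dataset(
    {
        "generation_kw": ("time_utc", [1.0, 2.0, 3.0, 4.0]),
        # capacity doubles halfway through, e.g. after a site extension
        "capacity_kwp": ("time_utc", [4.0, 4.0, 8.0, 8.0]),
    },
    coords={"time_utc": times},
)

# xarray aligns on time_utc, so the capacity change needs no branching
normalised = ds["generation_kw"] / ds["capacity_kwp"]
```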