Description
This issue will address two new improvements:
- Changing the data loading pipeline for sites to expect a single file that contains all generation, capacity, and corresponding metadata for a site. This differs from the current implementation, which expects a separate generation file and metadata file. The reasons for this are to reduce the number of files needed when creating site data for training, and to simplify/reduce the code needed to handle cases where capacity varies over time.
- Handling variable capacity. I created an initial solution to this in PR Country site #239, but to account for the feedback to merge the generation and metadata files, and to reduce the downstream code, it may be better to work off the main branch rather than that existing PR (also because a few changes have landed since I started the PR).
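The single-file idea above can be sketched as one xarray Dataset holding generation, capacity, and site metadata together. This is a minimal sketch, not the final schema: the variable names `generation_kw` and `capacity_kwp` and the metadata-as-coordinates layout are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical combined site file: generation and capacity share the same
# dimensions, so every generation value has a matching capacity value.
times = pd.date_range("2023-01-01", periods=4, freq="30min")
site_ids = [0, 1]

combined = xr.Dataset(
    data_vars={
        "generation_kw": (("time_utc", "site_id"), np.random.rand(4, 2)),
        "capacity_kwp": (("time_utc", "site_id"), np.full((4, 2), 5.0)),
    },
    coords={
        "time_utc": times,
        "site_id": site_ids,
        # per-site metadata stored as coordinates on the site_id dimension
        "latitude": ("site_id", [51.5, 52.1]),
        "longitude": ("site_id", [-0.1, 0.5]),
    },
)
# combined.to_netcdf("sites_combined.nc")  # one file instead of generation + metadata files
```

Storing metadata as coordinates keeps it attached to each site while the two data variables stay aligned per timestamp.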
Desired implementation:
When training is kicked off for sites, validation is done on the combined metadata/generation file to check it is in the expected format. It may be best for the open_site function (https://github.com/openclimatefix/ocf-data-sampler/blob/main/ocf_data_sampler/load/site.py) to expect a capacity value for every generation value in the dataset (even if the capacity stays the same over time). This should let us keep the code cleaner than my implementation in #239; for example, we would no longer need to add a parameter to the config and use it downstream in https://github.com/openclimatefix/ocf-data-sampler/pull/239/files#diff-b40689b299c299259884e6f50e9df2f3460799a45f3de4496aa76d2d4b39f515 as I did in my implementation.
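The "capacity value for every generation value" check could live in a small validation helper, which also keeps the loading function itself lean. A sketch, with the variable names (`generation_kw`, `capacity_kwp`) assumed rather than taken from the final schema:

```python
import xarray as xr


def validate_combined_site_data(ds: xr.Dataset) -> None:
    """Hypothetical validation sketch for the combined site file:
    require a capacity value for every generation value, even when
    capacity is constant over time."""
    for var in ("generation_kw", "capacity_kwp"):
        if var not in ds.data_vars:
            raise ValueError(f"Expected variable '{var}' in site dataset")
    if ds["generation_kw"].dims != ds["capacity_kwp"].dims:
        raise ValueError("generation_kw and capacity_kwp must share dimensions")
    if bool(ds["capacity_kwp"].isnull().any()):
        raise ValueError("capacity_kwp contains missing values")
```

Raising early here means downstream sampling code never needs to branch on whether capacity is present.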
The open_site function should return the whole generation_ds dataset, not just extract the generation as it currently does (cf76958#diff-c5d31e277f06a22594c11e50b6a1e61191deab35e6926c00be073ac69431101d; that commit introduced returning it as site_da = generation_ds.generation_kw).
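The difference can be illustrated in memory (variable names assumed): today only the generation DataArray survives loading, so capacity is dropped; the proposal is to return the full Dataset so both variables travel through the pipeline.

```python
import pandas as pd
import xarray as xr

times = pd.date_range("2023-01-01", periods=3, freq="30min")
generation_ds = xr.Dataset(
    {
        "generation_kw": ("time_utc", [1.0, 2.0, 3.0]),
        "capacity_kwp": ("time_utc", [5.0, 5.0, 5.0]),
    },
    coords={"time_utc": times},
)

# Current behaviour: only the generation DataArray is kept, so the
# capacity variable is lost at load time.
site_da = generation_ds.generation_kw  # xr.DataArray

# Proposed behaviour: keep the whole Dataset so capacity is available
# downstream alongside generation.
site_ds = generation_ds  # xr.Dataset with both variables
```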
The sites data then gets processed downstream in the _get_sample function here:

def _get_sample(self, t0: pd.Timestamp, location: Location) -> dict:
Tests:
It would be useful to test the following use cases:
- Several sites with and without variable capacity
- 1 site with and without variable capacity
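The four cases above can be generated from a single synthetic-data helper; a pytest-style sketch, again assuming the `generation_kw`/`capacity_kwp` naming:

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_site_dataset(n_sites: int, variable_capacity: bool) -> xr.Dataset:
    """Build a synthetic combined site dataset for tests (names assumed)."""
    times = pd.date_range("2023-01-01", periods=6, freq="30min")
    if variable_capacity:
        # capacity ramps over time, repeated for each site
        capacity = np.linspace(4.0, 6.0, 6)[:, None].repeat(n_sites, axis=1)
    else:
        capacity = np.full((6, n_sites), 5.0)
    return xr.Dataset(
        {
            "generation_kw": (("time_utc", "site_id"), np.random.rand(6, n_sites)),
            "capacity_kwp": (("time_utc", "site_id"), capacity),
        },
        coords={"time_utc": times, "site_id": np.arange(n_sites)},
    )


# 1 site and several sites, each with and without variable capacity
cases = [(n, v) for n in (1, 3) for v in (False, True)]
for n_sites, variable in cases:
    ds = make_site_dataset(n_sites, variable)
    assert ds["generation_kw"].shape == ds["capacity_kwp"].shape
    is_constant = bool((ds["capacity_kwp"].std("time_utc") == 0).all())
    assert is_constant != variable
```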
Some things to note and keep in mind:
- Moving from a DataArray to a Dataset, as we now have two variables in our dataset (generation and capacity).
- Backwards compatibility?
- It might be cleaner to handle some of the data validation outside of the function for cleaner code.
- A fix for solar coords has been implemented in f593355, which I was working on in Country site #239.
- Sample streaming
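One payoff of the DataArray-to-Dataset move noted above: once capacity is aligned per timestamp, normalising generation by a possibly time-varying capacity is a plain element-wise division, with no special-casing for capacity changes. A sketch with assumed variable names:

```python
import pandas as pd
import xarray as xr

times = pd.date_range("2023-01-01", periods=4, freq="30min")
ds = xr.Dataset(
    {
        "generation_kw": ("time_utc", [1.0, 2.0, 3.0, 4.0]),
        # capacity doubles halfway through, e.g. after a site extension
        "capacity_kwp": ("time_utc", [4.0, 4.0, 8.0, 8.0]),
    },
    coords={"time_utc": times},
)

# xarray aligns on time_utc, so the capacity change needs no branching
normalised = ds["generation_kw"] / ds["capacity_kwp"]
```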