
Conversation

dfulu
Member

@dfulu dfulu commented Aug 19, 2025

Pull Request

Description

This PR moves the normalisation to be in-place and hence bypasses some of the slow xarray functions.

This results in a modest speed-up of 2.7 ± 1.2% (range 1.01-4.96% across 7 trials) of the UK regional sampler. One nice side effect is that the normalisation is done only once for the UK concurrent sampler, rather than separately for each concurrent GSP in a single sample.

However, I wonder if this speed-up warrants the change.
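For illustration, a minimal before/after sketch with toy shapes (the real stats come from config_normalization_values_to_dicts):

import numpy as np
import xarray as xr

# Toy NWP-like DataArray: [time, channel, x, y]
da_nwp = xr.DataArray(
    np.random.rand(4, 2, 8, 8).astype(np.float32),
    dims=("time", "channel", "x", "y"),
)
# Per-channel stats shaped for numpy broadcasting: [1, channels, 1, 1]
channel_means = np.array([0.5, 0.3], dtype=np.float32).reshape(1, 2, 1, 1)
channel_stds = np.array([0.2, 0.1], dtype=np.float32).reshape(1, 2, 1, 1)

# Before: xarray arithmetic, which allocates intermediate DataArrays
# da_nwp = (da_nwp - channel_means) / channel_stds

# After: assign straight into the underlying numpy array
da_nwp.values = (da_nwp.values - channel_means) / channel_stds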

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@dfulu dfulu changed the title (Speed-up) In place normalisation (Speed-up) In-place normalisation Aug 19, 2025
channel_means = self.means_dict["nwp"][nwp_key]
channel_stds = self.stds_dict["nwp"][nwp_key]

da_nwp.values = (da_nwp.values - channel_means) / channel_stds
Member

Is it correct to say that the in-place normalisation here means that xarray's broadcasting and dimension alignment checks won't happen, so we would have to be sure that everything is in the correct order?

If so, I couldn't see a test around config_normalization_values_to_dicts, which is used above and seems to be the function that would define that order. Should we add a test there, or is there a test somewhere else that checks the ordering?

Member Author

Since channel_means and channel_stds are just arrays, we were already skipping xarray's broadcasting and dimension alignment. I didn't look into everything this skips past inside xarray, but I think in the old version there is internal copying happening inside xarray's sub and mult methods. Now we're just relying on the fact that channel_means/stds and da_nwp.values, which are numpy arrays, have compatible shapes for numpy broadcasting.

So no, we don't have tests around config_normalization_values_to_dicts and the dimension order. Right now channel_means/stds have shape [1, channels, 1, 1]. The sat data has shape [time, channels, x, y], and after time slicing the NWP has the same dimension order too.

So the alignment in the shapes is there, in that they need to be compatible arrays for numpy broadcasting. But yes, this could produce a silent error if we transposed the dimensions by accident and the norm values happened to have the same number of channels as another dimension of the sat/NWP. It will fail loudly if the number of channels doesn't match.
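To illustrate with toy shapes (not the real data):

import numpy as np

channel_means = np.zeros((1, 2, 1, 1))  # [1, channels, 1, 1]

# Loud failure: a channel count mismatch raises immediately
vals = np.zeros((4, 3, 8, 8))  # [time, channels, x, y] with 3 channels vs 2 means
try:
    vals - channel_means
except ValueError as err:
    print(err)  # operands could not be broadcast together

# Silent failure: if another dimension matches the channel count,
# an accidental transpose still broadcasts, just along the wrong axis
vals = np.zeros((2, 2, 8, 8))            # time == channels == 2
transposed = vals.transpose(1, 0, 2, 3)  # [channel, time, x, y] by mistake
_ = transposed - channel_means           # no error: means applied along time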

I think the risk is probably minimal, but I'm open to the idea that there could be a test. I'm just not sure what we'd be looking for in a test. Any thoughts?

Member

Thanks for explaining. Hmmm, maybe we could throw in a check rather than a test, something like this:

if da_nwp.dims[1] != "channel":
    raise ValueError(f"Expected second dimension to be 'channel', but got '{da_nwp.dims[1]}'") 

and a comment in the config_normalization_values_to_dicts function noting that the order is there to match how we transpose the dimensions.

I agree the chance is small, but it feels strange not to add some check for what could lead to a silent error. I don't feel super strongly about this though, so if you think it's overkill we can skip it.
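For placement, the guard would sit just before the in-place assignment, something like this (purely illustrative, reusing the names from the diff above):

# Guard against an accidental transpose before the unchecked numpy broadcast
if da_nwp.dims[1] != "channel":
    raise ValueError(f"Expected second dimension to be 'channel', but got '{da_nwp.dims[1]}'")
da_nwp.values = (da_nwp.values - channel_means) / channel_stds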

Member Author

@dfulu dfulu Aug 20, 2025


I'll have a look and see if there are any performance implications. We do a similar thing in the diff_channels() function, so that seems reasonable.

@Sukh-P
Member

Sukh-P commented Aug 19, 2025

Although the performance gain is modest, the change itself is straightforward, so it might be worth it. Hope this doesn't feel like a contradiction to my comment on the other PR 😅 Also, nice one pulling out the normalisation function; I think that makes it more readable and gives a clearer separation of responsibilities between the functions.
