
Conversation

@LydiaFrance (Contributor) commented Sep 24, 2025

As our visualisations will influence hyperparameter tuning and other refinement checks for people running models, I wanted to introduce tests which check that the plots and videos are being produced as expected. Testing the code end-to-end means we check that the behaviour on a known test case matches what we expect.

To do this, the tests run in two ways:

Fake Data

This is a small synthetic array shaped like sea ice data ([T, H, W], with values in 0-1), so we can produce plots very quickly during testing. It catches shape mismatches and similar problems in the plotting code.
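For illustration, here is a minimal sketch of how such a fake array could be built; the helper name, shape, and seed below are assumptions for this sketch, not the project's actual test code.

import numpy as np

def make_fake_sea_ice(t: int = 4, h: int = 32, w: int = 32, seed: int = 0) -> np.ndarray:
    """Return a small synthetic [T, H, W] array with values in [0, 1] (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Random field clipped to the 0-1 range expected for sea ice concentration
    return np.clip(rng.random((t, h, w)), 0.0, 1.0)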

Example checkpoint

A minimal frozen .ckpt file has been saved inside tests. In future we would want end-to-end tests that run a tiny version of the data through our models and then test the plots with the outputs, but for now this small file is helpful. There's nothing useful in this checkpoint; it's not from a real trained model.

Confirms the full pipeline works: checkpoint -> model -> outputs -> callback -> figure objects (images/videos)
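As a rough sketch of the checkpoint side of this, the test below only asserts that a frozen .ckpt file deserialises and exposes a state_dict; the path and key are assumptions based on typical Lightning-style checkpoints, not this repository's actual layout.

from pathlib import Path

import torch

CKPT_PATH = Path("tests/data/tiny_model.ckpt")  # hypothetical location of the frozen checkpoint

def test_frozen_checkpoint_loads() -> None:
    """The frozen checkpoint should deserialise and contain a state_dict (illustrative only)."""
    # Newer torch versions may need weights_only=False for checkpoints holding non-tensor objects
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    assert "state_dict" in ckpt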

Warning Badge

I added functionality that puts red text under the title of a plot if something is wrong with the colour scale, or if the scale of the prediction is very different from that of the ground truth.
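For context, the sketch below shows one generic way such a badge could be drawn with matplotlib (red text placed just below the suptitle); it illustrates the idea and is not the code added in this PR.

import matplotlib.pyplot as plt

def add_warning_badge(fig: plt.Figure, message: str) -> None:
    """Draw a red warning line just below the figure title (illustrative sketch)."""
    # Figure coordinates: (0.5, 0.94) sits centred, slightly below a default suptitle
    fig.text(0.5, 0.94, message, ha="center", va="top", color="red", fontsize=9)

fig, ax = plt.subplots()
fig.suptitle("Prediction vs Ground Truth")
add_warning_badge(fig, "Warning: prediction values fall outside the displayed colour range")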

What is tested

API behaviour

  • plot_maps returns a valid PIL image.
  • video_maps returns a valid MP4/GIF video buffer.
  • These are the exact objects that callbacks log to wandb.
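For a flavour of these checks, here is a hedged sketch; plot_maps and video_maps are named in this PR, but the import path, signatures, and fixture names below are assumptions.

import io

from PIL import Image

# Import path omitted deliberately; replace with the project's actual plotting module, e.g.
# from <plotting module> import plot_maps, video_maps

def test_plot_maps_returns_pil_image(fake_data, plot_spec):  # hypothetical fixtures
    img = plot_maps(fake_data, plot_spec)
    assert isinstance(img, Image.Image)

def test_video_maps_returns_video_buffer(fake_data, plot_spec):
    buf = video_maps(fake_data, plot_spec)
    assert isinstance(buf, io.BytesIO)
    assert buf.getbuffer().nbytes > 0  # non-empty MP4/GIF bytes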

Subplot Layout

No overlapping panels or colourbars for...

  • 2 panel vs 3 panel figures
  • Horizontal vs vertical colourbars
  • Shared vs separate colourbars

Colour scaling

  • Difference plots use a symmetric diverging scale centred on 0
  • Tick marks are sensible
  • Shared colourbars apply the same vmin/vmax to both Ground Truth and Prediction
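As an illustration of the symmetric diverging scale, the sketch below uses plain matplotlib rather than the project's helpers; for a shared Ground Truth/Prediction colourbar the same vmin/vmax would simply be passed to both imshow calls.

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import TwoSlopeNorm

diff = np.random.default_rng(0).normal(scale=0.2, size=(32, 32))  # fake difference field

# Symmetric limits around zero so equal over- and under-prediction get equal colour weight
limit = float(np.nanmax(np.abs(diff)))
norm = TwoSlopeNorm(vmin=-limit, vcenter=0.0, vmax=limit)

fig, ax = plt.subplots()
im = ax.imshow(diff, cmap="RdBu_r", norm=norm)
fig.colorbar(im, ax=ax)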

Range Check Reporting

  • Warnings are correctly generated if values fall outside the displayed colour range, or if the Prediction is much higher/lower than the Ground Truth
  • Confirms a warning badge is displayed under the figure title
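A rough sketch of the kind of robust-percentile comparison such a check might perform; the real _robust_p2_p98 in sanity.py may differ, and the threshold here is an arbitrary placeholder.

import numpy as np

def magnitude_mismatch_warning(ground_truth: np.ndarray, prediction: np.ndarray,
                               ratio_threshold: float = 10.0) -> str | None:
    """Compare robust (2nd-98th percentile) spreads and flag large mismatches (illustrative only)."""
    gt_lo, gt_hi = np.nanpercentile(ground_truth, [2, 98])
    pr_lo, pr_hi = np.nanpercentile(prediction, [2, 98])
    ratio = max(pr_hi - pr_lo, 1e-12) / max(gt_hi - gt_lo, 1e-12)
    if ratio > ratio_threshold:
        return "Prediction magnitude appears much higher than ground truth"
    if ratio < 1.0 / ratio_threshold:
        return "Prediction magnitude appears much lower than ground truth"
    return None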

Animations

  • Uses monkeypatching to prevent matplotlib from trying to open a window and slowing down the tests
  • Doesn't require the CI environment to have ffmpeg
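A hedged sketch of both points using standard pytest and matplotlib: the Agg backend and a patched plt.show keep things headless, and Pillow's GIF writer avoids any ffmpeg dependency.

import matplotlib
matplotlib.use("Agg")  # headless backend: never tries to open a window

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation, PillowWriter

def test_animation_renders_without_ffmpeg(tmp_path, monkeypatch):
    # Belt and braces: even if something calls plt.show(), make it a no-op
    monkeypatch.setattr(plt, "show", lambda *args, **kwargs: None)

    data = np.random.default_rng(0).random((3, 16, 16))
    fig, ax = plt.subplots()
    im = ax.imshow(data[0], vmin=0, vmax=1)

    def update(frame):
        im.set_data(data[frame])
        return (im,)

    anim = FuncAnimation(fig, update, frames=len(data))
    out = tmp_path / "anim.gif"
    anim.save(str(out), writer=PillowWriter(fps=2))  # GIF via Pillow, no ffmpeg required
    assert out.stat().st_size > 0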

Fixtures

In Python, pytest fixtures prepare data or objects that exist for the duration of a test; they act as reusable boilerplate that the tests can request when needed (see the sketch after this list).

  • Fake data fixture
  • ModelTestOutput from the tiny checkpoint file
  • PlotSpec
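For illustration, here is a minimal conftest.py-style fixture in the spirit of this list; the names and shapes are hypothetical, and the PR's actual fixtures may be organised differently.

# conftest.py (illustrative sketch only)
import numpy as np
import pytest

@pytest.fixture
def fake_sea_ice() -> np.ndarray:
    """Small synthetic [T, H, W] array with values in [0, 1]."""
    rng = np.random.default_rng(42)
    return rng.random((4, 32, 32))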

Running the tests with Pytest

We use pytest for testing.

pytest

By default this will also collect coverage information (because of our config). If you don’t care about coverage output, add --no-cov:

pytest --no-cov

If you just want to check the new visualisation code:

pytest tests/plotting --no-cov

For example, only test the plotting API:

pytest tests/plotting/test_api.py --no-cov

@LydiaFrance linked an issue ("Add tests for plotting") on Sep 24, 2025 that may be closed by this pull request.
@aranas (Collaborator) commented Oct 3, 2025

Overall this looks great — thanks for putting this together @LydiaFrance . I just have one bigger question:

In sanity.py, the magnitude mismatch check (_robust_p2_p98) feels more like a model quality issue than a visualization concern. Am I reading this right? Would it make more sense to move this into the model testing suite, or is there a specific reason it’s useful to surface at the plotting stage?

It would also be good to run the tests in this PR against the latest checkpoints to make sure everything works as expected — I don’t have access to Baskerville. Have you already run this with a recent checkpoint? If not, maybe someone else can do that before we merge.

I’ll leave a few smaller comments inline, but I wanted to raise these points early.

@aranas (Collaborator) left a comment:

Apart from the question above about the magnitude warning, this PR looks good to me. I have added some minor comments inline.


# Optional: visually mark NaNs as semi-transparent grey overlays

def _overlay_nans(ax: Axes, arr: np.ndarray) -> None:
@aranas (Collaborator) commented on the code above:
Why was this function added if not currently used?

@LydiaFrance (Contributor, Author) replied:
Because our data isn't applying any masks currently... I am unsure why! But I wanted to leave the function there for the future.

@LydiaFrance (Contributor, Author) commented Oct 3, 2025

> Overall this looks great — thanks for putting this together @LydiaFrance . I just have one bigger question:
>
> In sanity.py, the magnitude mismatch check (_robust_p2_p98) feels more like a model quality issue than a visualization concern. Am I reading this right? Would it make more sense to move this into the model testing suite, or is there a specific reason it’s useful to surface at the plotting stage?

Thanks Sophie! Appreciate your review :)

It does seem like a model quality test, but it is actually about the colour scale the user has chosen. Imagine the user has selected separate colourbars for each plot: the prediction map can look identical to the ground truth even though the prediction scale is 100x the ground truth. The colours will make the prediction look good, so this additional subtitle flags that for the user. The other kind of warning is that, with a shared colour scale, the prediction may not be shown adequately by the plot (the values are outside the colour limits). We definitely need model quality tests, but this is specifically a warning about colour scale choice.

> It would also be good to run the tests in this PR against the latest checkpoints to make sure everything works as expected — I don’t have access to Baskerville. Have you already run this with a recent checkpoint? If not, maybe someone else can do that before we merge.
>
> I’ll leave a few smaller comments inline, but I wanted to raise these points early.

And no, I haven't run it on Baskerville; I was mostly making sure it works as expected locally.

@aranas (Collaborator) commented Oct 3, 2025

> Imagine the user has selected separate colourbars for each plot: the prediction map can look identical to the ground truth even though the prediction scale is 100× the ground truth.

Let me check my understanding of the warning. I follow that if separate scales are used, a large magnitude mismatch could be visually hidden. For example:

Ground truth values [0.7, 0.8, 0.9] → scaled to 0.7–0.9
Prediction values [0.007, 0.008, 0.009] → scaled to 0.007–0.009

Both maps would then look very similar despite being 100× apart.

But under a shared scale of [0, 1], the two maps would look very different from each other, e.g. predictions flattened at the bottom and ground truth at the top. Some detail might get squashed, but the magnitude gap would be obvious.

Since the warning text says:
“Prediction magnitude appears much lower than ground truth under the shared scale”

Could you give me a concrete mini-example of values + shared scale settings where this warning would flag an actual issue, so I understand?

For clarity, it might help to make the check for this warning explicitly conditional on the user’s colorbar choice (e.g. only add this warning if a shared scale is in use). Right now I don’t see that dependency reflected in the code.

@LydiaFrance (Contributor, Author) commented:

As requested, some images of what the figure looks like when there's a colour scale problem:
[figure: sea-ice concentration static maps showing a colour-scale warning badge]

@LydiaFrance (Contributor, Author) commented Oct 8, 2025

And this is our dummy Arctic data for testing:
[figure: circular dummy Arctic data used for testing]

@jemrobinson (Member) commented:

From in-person review:

  • move make_circular_arctic to a conftest.py so it's available as a fixture

LGTM apart from that

@LydiaFrance (Contributor, Author) commented:

I have changed "dummy" to "fake" data in the code base, and changed "sanity" to "range check".

@jemrobinson (Member) left a comment:

Haven't run this myself but @LydiaFrance demonstrated during co-working and I'm happy to merge this as-is.

@LydiaFrance (Contributor, Author) commented Oct 8, 2025

@aranas @louisavz @jemrobinson

I made things more explicit thanks to Sophie's comment above.

Just to clear up some confusion about the different scenarios...

Good Prediction

Separate

[figure: good prediction with separate colourbars]

Shared

[figure: good prediction with a shared colourbar]

Looks like a good prediction because of colour range choice

At a glance it looks fine because we have separate colour bars:
[figure: misleading prediction with separate colourbars and a warning badge]

The warning highlights that something is wrong. Check out the colour bar scale!

With a shared colourbar, everything outside the colour range is simply clipped out (anything from 1.1 up to over 1000), so the magnitude warning tells the user these are wildly extreme values, which they couldn't otherwise tell from the plot:
[figure: same prediction with a shared colourbar and a magnitude warning]

Prediction is better but with extreme outliers beyond the colour limits

We can see the issue with the shared colourbar because the contour plot shows white instead, which tells us how much has been clipped out:

[figures: prediction with extreme outliers, shown with shared and separate colourbars]

Both too high and too low values

With a shared colourbar it is hard to see why the prediction is bad; the warning says there are both too-high and too-low values:
[figure: shared colourbar with a warning for both too-high and too-low values]

Easier to see why the prediction is bad with separate colourbars:
[figure: same prediction with separate colourbars]

Conclusion

  • Shared colourbar: clips out extreme values, so it is easier to see that something is wrong, but the user can't tell how bad the prediction is because the actual values are simply not displayed.
  • Separate colourbars: hide extreme values unless the user very carefully checks the scale every time.

The warnings should simply make these problems easier to spot.

@aranas (Collaborator) commented Oct 9, 2025

Thank you for those examples @LydiaFrance, they clarified the purpose of the warnings for me. I am happy for this to be merged.

@LydiaFrance merged commit 3b013f4 into main on Oct 9, 2025 (3 checks passed).
@LydiaFrance deleted the 86-plot-tests branch on October 9, 2025 at 10:12.