
Conversation

@LydiaFrance (Contributor) commented Sep 24, 2025

As our visualisations will influence hyperparameter tuning and other refinement checks for people running models, I wanted to introduce tests which check that the plots and videos are being produced as expected. Testing the code end-to-end means we check that the behaviour on a known test case matches what we expect.

To do this, the tests run in two ways:

Fake Data

This is a small synthetic array shaped like sea ice data ([T, H, W], with values in 0-1), so we can produce plots very quickly during testing. It catches shape mismatches and similar problems in the plotting code.
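For illustration, here is a minimal sketch of how such a fake array could be built; the helper name, shape, and seed below are assumptions for this sketch, not the project's actual test code.

import numpy as np

def make_fake_sea_ice(t: int = 4, h: int = 32, w: int = 32, seed: int = 0) -> np.ndarray:
    """Return a small synthetic [T, H, W] array with values in [0, 1] (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Random field clipped to the 0-1 range expected for sea ice concentration
    return np.clip(rng.random((t, h, w)), 0.0, 1.0)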

Example checkpoint

A minimal frozen .ckpt file has been saved inside tests. In future we would want end-to-end tests that run a tiny version of the data through our models and then test the plots with the outputs, but for now this small file is helpful. There's nothing useful in this checkpoint; it's not from a real trained model.

Confirms the full pipeline works: checkpoint -> model -> outputs -> callback -> figure objects (images/videos)
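As a rough sketch of the checkpoint side of this, the test below only asserts that a frozen .ckpt file deserialises and exposes a state_dict; the path and key are assumptions based on typical Lightning-style checkpoints, not this repository's actual layout.

from pathlib import Path

import torch

CKPT_PATH = Path("tests/data/tiny_model.ckpt")  # hypothetical location of the frozen checkpoint

def test_frozen_checkpoint_loads() -> None:
    """The frozen checkpoint should deserialise and contain a state_dict (illustrative only)."""
    # Newer torch versions may need weights_only=False for checkpoints holding non-tensor objects
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    assert "state_dict" in ckpt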

Warning Badge

I added functionality that puts red text under the title of a plot if something is wrong with the colour scale, or if the scale of the prediction is very different from that of the ground truth.
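For context, the sketch below shows one generic way such a badge could be drawn with matplotlib (red text placed just below the suptitle); it illustrates the idea and is not the code added in this PR.

import matplotlib.pyplot as plt

def add_warning_badge(fig: plt.Figure, message: str) -> None:
    """Draw a red warning line just below the figure title (illustrative sketch)."""
    # Figure coordinates: (0.5, 0.94) sits centred, slightly below a default suptitle
    fig.text(0.5, 0.94, message, ha="center", va="top", color="red", fontsize=9)

fig, ax = plt.subplots()
fig.suptitle("Prediction vs Ground Truth")
add_warning_badge(fig, "Warning: prediction values fall outside the displayed colour range")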

What is tested

API behaviour

  • plot_maps returns a valid PIL image.
  • video_maps returns a valid MP4/GIF video buffer.
  • These are the exact objects that callbacks log to wandb.
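For a flavour of these checks, here is a hedged sketch; plot_maps and video_maps are named in this PR, but the import path, signatures, and fixture names below are assumptions.

import io

from PIL import Image

# Import path omitted deliberately; replace with the project's actual plotting module, e.g.
# from <plotting module> import plot_maps, video_maps

def test_plot_maps_returns_pil_image(fake_data, plot_spec):  # hypothetical fixtures
    img = plot_maps(fake_data, plot_spec)
    assert isinstance(img, Image.Image)

def test_video_maps_returns_video_buffer(fake_data, plot_spec):
    buf = video_maps(fake_data, plot_spec)
    assert isinstance(buf, io.BytesIO)
    assert buf.getbuffer().nbytes > 0  # non-empty MP4/GIF bytes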

Subplot Layout

No overlapping panels or colourbars for...

  • 2 panel vs 3 panel figures
  • Horizontal vs vertical colourbars
  • Shared vs separate colourbars

Colour scaling

  • Difference plots use a symmetric diverging scale centred on 0
  • Tick marks are sensible
  • Shared colourbars apply the same vmin/vmax to both Ground Truth and Prediction
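As an illustration of the symmetric diverging scale, the sketch below uses plain matplotlib rather than the project's helpers; for a shared Ground Truth/Prediction colourbar the same vmin/vmax would simply be passed to both imshow calls.

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import TwoSlopeNorm

diff = np.random.default_rng(0).normal(scale=0.2, size=(32, 32))  # fake difference field

# Symmetric limits around zero so equal over- and under-prediction get equal colour weight
limit = float(np.nanmax(np.abs(diff)))
norm = TwoSlopeNorm(vmin=-limit, vcenter=0.0, vmax=limit)

fig, ax = plt.subplots()
im = ax.imshow(diff, cmap="RdBu_r", norm=norm)
fig.colorbar(im, ax=ax)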

Range Check Reporting

  • Warnings are correctly generated if values fall outside the displayed colour range, or if the Prediction is much higher/lower than the Ground Truth
  • Confirms a warning badge is displayed under the figure title
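A rough sketch of the kind of robust-percentile comparison such a check might perform; the real _robust_p2_p98 in sanity.py may differ, and the threshold here is an arbitrary placeholder.

import numpy as np

def magnitude_mismatch_warning(ground_truth: np.ndarray, prediction: np.ndarray,
                               ratio_threshold: float = 10.0) -> str | None:
    """Compare robust (2nd-98th percentile) spreads and flag large mismatches (illustrative only)."""
    gt_lo, gt_hi = np.nanpercentile(ground_truth, [2, 98])
    pr_lo, pr_hi = np.nanpercentile(prediction, [2, 98])
    ratio = max(pr_hi - pr_lo, 1e-12) / max(gt_hi - gt_lo, 1e-12)
    if ratio > ratio_threshold:
        return "Prediction magnitude appears much higher than ground truth"
    if ratio < 1.0 / ratio_threshold:
        return "Prediction magnitude appears much lower than ground truth"
    return None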

Animations

  • Uses monkeypatching to prevent matplotlib from trying to open a window and slowing down the tests
  • Doesn't require the CI environment to have ffmpeg
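A hedged sketch of both points using standard pytest and matplotlib: the Agg backend and a patched plt.show keep things headless, and Pillow's GIF writer avoids any ffmpeg dependency.

import matplotlib
matplotlib.use("Agg")  # headless backend: never tries to open a window

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation, PillowWriter

def test_animation_renders_without_ffmpeg(tmp_path, monkeypatch):
    # Belt and braces: even if something calls plt.show(), make it a no-op
    monkeypatch.setattr(plt, "show", lambda *args, **kwargs: None)

    data = np.random.default_rng(0).random((3, 16, 16))
    fig, ax = plt.subplots()
    im = ax.imshow(data[0], vmin=0, vmax=1)

    def update(frame):
        im.set_data(data[frame])
        return (im,)

    anim = FuncAnimation(fig, update, frames=len(data))
    out = tmp_path / "anim.gif"
    anim.save(str(out), writer=PillowWriter(fps=2))  # GIF via Pillow, no ffmpeg required
    assert out.stat().st_size > 0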

Fixtures

In Python, pytest fixtures prepare data or objects that exist for the duration of a test; they act as reusable boilerplate that the tests can request when needed (see the sketch after this list).

  • Fake data fixture
  • ModelTestOutput from the tiny checkpoint file
  • PlotSpec
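For illustration, here is a minimal conftest.py-style fixture in the spirit of this list; the names and shapes are hypothetical, and the PR's actual fixtures may be organised differently.

# conftest.py (illustrative sketch only)
import numpy as np
import pytest

@pytest.fixture
def fake_sea_ice() -> np.ndarray:
    """Small synthetic [T, H, W] array with values in [0, 1]."""
    rng = np.random.default_rng(42)
    return rng.random((4, 32, 32))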

Running the tests with Pytest

We use pytest for testing.

pytest

By default this will also collect coverage information (because of our config). If you don’t care about coverage output, add --no-cov:

pytest --no-cov

If you just want to check the new visualisation code:

pytest tests/plotting --no-cov

For example, only test the plotting API:

pytest tests/plotting/test_api.py --no-cov

@LydiaFrance linked an issue ("Add tests for plotting") on Sep 24, 2025 that may be closed by this pull request.
@aranas (Collaborator) commented Oct 3, 2025

Overall this looks great — thanks for putting this together @LydiaFrance . I just have one bigger question:

In sanity.py, the magnitude mismatch check (_robust_p2_p98) feels more like a model quality issue than a visualization concern. Am I reading this right? Would it make more sense to move this into the model testing suite, or is there a specific reason it’s useful to surface at the plotting stage?

It would also be good to run the tests in this PR against the latest checkpoints to make sure everything works as expected — I don’t have access to Baskerville. Have you already run this with a recent checkpoint? If not, maybe someone else can do that before we merge.

I’ll leave a few smaller comments inline, but I wanted to raise these points early.

@aranas (Collaborator) left a comment:

Apart from the question above about the magnitude warning, this PR looks good to me. I have added some minor comments inline.


# Optional: visually mark NaNs as semi-transparent grey overlays

def _overlay_nans(ax: Axes, arr: np.ndarray) -> None:
@aranas (Collaborator) commented on the code above:
Why was this function added if not currently used?

@LydiaFrance (Contributor, Author) replied:
Because our data isn't applying any masks currently... I am unsure why! But I wanted to leave the function there for the future.

@LydiaFrance (Contributor, Author) commented Oct 3, 2025

> Overall this looks great — thanks for putting this together @LydiaFrance . I just have one bigger question:
>
> In sanity.py, the magnitude mismatch check (_robust_p2_p98) feels more like a model quality issue than a visualization concern. Am I reading this right? Would it make more sense to move this into the model testing suite, or is there a specific reason it’s useful to surface at the plotting stage?

Thanks Sophie! Appreciate your review :)

It does seem like a model quality test, but it is actually about the colour scale the user has chosen. Imagine the user has selected separate colourbars for each plot: the prediction map can look identical to the ground truth even though the prediction scale is 100x the ground truth. The colours will make the prediction look good, so this additional subtitle flags that for the user. The other kind of warning is that, with a shared colour scale, the prediction may not be shown adequately by the plot (the values are outside the colour limits). We definitely need model quality tests, but this is specifically a warning about colour scale choice.

> It would also be good to run the tests in this PR against the latest checkpoints to make sure everything works as expected — I don’t have access to Baskerville. Have you already run this with a recent checkpoint? If not, maybe someone else can do that before we merge.
>
> I’ll leave a few smaller comments inline, but I wanted to raise these points early.

And no, I haven't run it on Baskerville; I was mostly making sure it works as expected locally.

@aranas (Collaborator) commented Oct 3, 2025

> Imagine the user has selected separate colourbars for each plot: the prediction map can look identical to the ground truth even though the prediction scale is 100× the ground truth.

Let me check my understanding of the warning. I follow that if separate scales are used, a large magnitude mismatch could be visually hidden. For example:

Ground truth values [0.7, 0.8, 0.9] → scaled to 0.7–0.9
Prediction values [0.007, 0.008, 0.009] → scaled to 0.007–0.009

Both maps would then look very similar despite being 100× apart.

But under a shared scale of [0, 1], the two maps would look very different from each other, e.g. predictions flattened at the bottom and ground truth at the top. Some detail might get squashed, but the magnitude gap would be obvious.

Since the warning text says:
“Prediction magnitude appears much lower than ground truth under the shared scale”

Could you give me a concrete mini-example of values + shared scale settings where this warning would flag an actual issue, so I understand?

For clarity, it might help to make the check for this warning explicitly conditional on the user’s colorbar choice (e.g. only add this warning if a shared scale is in use). Right now I don’t see that dependency reflected in the code.

@LydiaFrance (Contributor, Author) commented:

As requested, some images of what the figure looks like when there's a colour scale problem:
[figure: sea-ice concentration static maps showing a colour-scale warning badge]

@LydiaFrance (Contributor, Author) commented Oct 8, 2025

And this is our dummy Arctic data for testing:
[figure: circular dummy Arctic data used for testing]

@jemrobinson (Member) commented:

From in-person review:

  • move make_circular_arctic to a conftest.py so it's available as a fixture

LGTM apart from that

@LydiaFrance (Contributor, Author) commented:

I have changed "dummy" to "fake" data in the code base, and changed "sanity" to "range check".

@jemrobinson (Member) left a comment:

Haven't run this myself but @LydiaFrance demonstrated during co-working and I'm happy to merge this as-is.

@LydiaFrance (Contributor, Author) commented Oct 8, 2025

@aranas @louisavz @jemrobinson

I made things more explicit thanks to Sophie's comment above.

Just to clear up some confusion about the different scenarios...

Good Prediction

Separate

[figure: good prediction with separate colourbars]

Shared

[figure: good prediction with a shared colourbar]

Looks like a good prediction because of colour range choice

At a glance it looks fine because we have separate colour bars:
[figure: misleading prediction with separate colourbars and a warning badge]

The warning highlights that something is wrong. Check out the colour bar scale!

With a shared colourbar, everything outside the colour range is simply clipped out (anything from 1.1 up to over 1000), so the magnitude warning tells the user these are wildly extreme values, which they couldn't otherwise tell from the plot:
[figure: same prediction with a shared colourbar and a magnitude warning]

Prediction is better but with extreme outliers beyond the colour limits

We can see the issue with the shared colourbar because the contour plot shows white instead, which tells us how much has been clipped out:

[figures: prediction with extreme outliers, shown with shared and separate colourbars]

Both too high and too low values

With a shared colourbar it is hard to see why the prediction is bad; the warning says there are both too-high and too-low values:
[figure: shared colourbar with a warning for both too-high and too-low values]

Easier to see why the prediction is bad with separate colourbars:
[figure: same prediction with separate colourbars]

Conclusion

  • Shared colourbar: clips out extreme values, so it is easier to see that something is wrong, but the user can't tell how bad the prediction is because the actual values are simply not displayed.
  • Separate colourbars: hide extreme values unless the user very carefully checks the scale every time.

The warnings should simply make these problems easier to spot.

@aranas (Collaborator) commented Oct 9, 2025

Thank you for those examples @LydiaFrance, they clarified the purpose of the warnings for me. I am happy for this to be merged.

@LydiaFrance merged commit 3b013f4 into main on Oct 9, 2025 (3 checks passed).
@LydiaFrance deleted the 86-plot-tests branch on October 9, 2025 at 10:12.