E2E Tests for Plotting #104
Conversation
Overall this looks great — thanks for putting this together @LydiaFrance. I just have one bigger question, about the magnitude warning (expanded on below). It would also be good to run the tests in this PR against the latest checkpoints to make sure everything works as expected — I don't have access to Baskerville. Have you already run this with a recent checkpoint? If not, maybe someone else can do that before we merge. I'll leave a few smaller comments inline, but I wanted to raise these points early.
Apart from the above question about the magnitude warning, this PR looks good to me. I have added some minor comments inline.
```python
# Optional: visually mark NaNs as semi-transparent grey overlays
def _overlay_nans(ax: Axes, arr: np.ndarray) -> None:
```
Why was this function added if not currently used?
Because our data isn't applying any masks currently... I am unsure why! But I wanted to leave the function there for the future.
Thanks Sophie! Appreciate your review :) It does seem like a model quality test, but it is actually about which colour scale the user has chosen. Imagine the user selects the option for each plot to have its own colourbar, the prediction map looks identical to the ground truth, but in reality the prediction scale is 100x the ground truth: the colours will make the prediction look good, so this additional subtitle flags that for the user. The other kind of warning is that, by sharing a colour scale, the prediction is not being shown adequately by the plot (the numbers are outside the colour limits). We definitely need model quality tests, but this is specifically a warning about the colour scale choice.
And no, I haven't run it on Baskerville. I was mostly making sure it works as expected locally.
Let me check my understanding of the warning. I follow that if separate scales are used, a large magnitude mismatch could be visually hidden. For example, ground truth values [0.7, 0.8, 0.9] would be scaled to 0.7–0.9, and predictions 100× larger would be scaled to their own range, so both maps would look very similar despite being 100× apart. But under a shared scale of [0, 1], the two maps would look very different from each other, e.g. predictions flattened at the bottom and ground truth at the top. Some detail might get squashed, but the magnitude gap would be obvious.

Since the warning text says:

Could you give me a concrete mini-example of values + shared scale settings where this warning would flag an actual issue, so I understand?

For clarity, it might help to make the check for this warning explicitly conditional on the user's colorbar choice (e.g. only add this warning if a shared scale is in use). Right now I don't see that dependency reflected in the code.
From in-person review:
LGTM apart from that
I have changed |
Haven't run this myself but @LydiaFrance demonstrated during co-working and I'm happy to merge this as-is.
@aranas @louisavz @jemrobinson I made things more explicit thanks to Sophie's comment above. Just to clear up some confusion about the different scenarios...

Good prediction
With a good prediction, both the separate and shared colourbar versions look fine (example images).

Looks like a good prediction because of colour range choice
At a glance it looks fine because we have separate colour bars; the warning highlights that something is wrong. Check out the colour bar scale! With a shared colourbar it just clips out everything outside the colour range, anything from 1.1 to over 1000, so the magnitude warning tells the user these are wildly extreme values, which they couldn't otherwise tell.

Prediction is better but with extreme outliers beyond the colour limits
We can see the issue with the shared colourbar because the contour plot will show white instead, and this tells us how much is clipped out (example images).

Both too high and too low values
With a shared colourbar it is hard to see why the prediction is bad; the warning says there are both too-high and too-low values. It is easier to see why the prediction is bad with separate colourbars.

Conclusion
Shared colourbar: it will clip out extreme values, so it is easier to see something is wrong, but the user won't be able to tell how bad the prediction is, as the actual values are just not displayed.
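To make the clipping effect concrete, here is a minimal, illustrative sketch (the values and figure layout are assumptions, not the project's plotting code): predictions 100x the ground truth look fine with their own colourbar, while a shared 0–1 scale saturates them.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0.7, 0.9, size=(32, 32))  # plausible 0-1 style values
pred = truth * 100                            # magnitude mismatch: 100x too large

fig, axes = plt.subplots(1, 3, figsize=(10, 3))

# Separate colourbars: each panel is autoscaled, so the two maps look identical.
for ax, arr, title in zip(axes[:2], (truth, pred), ("truth", "pred (own scale)")):
    im = ax.imshow(arr)
    fig.colorbar(im, ax=ax)
    ax.set_title(title)

# Shared 0-1 scale: everything above vmax=1 is clipped to the top colour,
# which is the situation the magnitude warning is meant to flag.
im = axes[2].imshow(pred, vmin=0, vmax=1)
fig.colorbar(im, ax=axes[2])
axes[2].set_title("pred (shared 0-1 scale)")

fig.tight_layout()
fig.savefig("colour_scale_demo.png")
```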
Thank you for those examples @LydiaFrance, they clarified the purpose of the warnings for me.
As our visualisations will influence hyperparameter tuning and other refinement checks for people running models, I wanted to introduce tests which check that the plots and videos are being produced as expected. Testing the code end-to-end means we are checking that the behaviour on a known test case matches what we expect.
To do this, the tests run in two ways:
Fake Data
This is a small synthetic array shaped a little like sea ice data ([T, H, W] with 0–1 values), which means we can produce plots very quickly for testing. It catches shape mismatches and similar problems in the plotting code.
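For a rough idea of what such an array looks like, here is a minimal sketch (the sizes and seed are assumptions, not the exact fixture used in the tests):

```python
import numpy as np

# Small fake "sea ice" stack: [T, H, W] with values in [0, 1].
# The dimensions here are illustrative only.
rng = np.random.default_rng(seed=0)
fake_sea_ice = rng.uniform(0.0, 1.0, size=(4, 16, 16)).astype(np.float32)

assert fake_sea_ice.shape == (4, 16, 16)
assert fake_sea_ice.min() >= 0.0 and fake_sea_ice.max() <= 1.0
```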
Example checkpoint
A minimal frozen `.ckpt` file has been saved inside `tests`. In future we would want end-to-end tests that run a tiny version of the data through our models and then test the plots with the results, but for now this small file is helpful. There is nothing useful in this checkpoint; it is not from a real trained model. It confirms the full pipeline works: checkpoint -> model -> outputs -> callback -> figure objects (images/videos).
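As a rough illustration of the first step of that pipeline, a test might load the frozen checkpoint like the sketch below; the file path and the `state_dict` assertion are assumptions, not the exact code in this PR:

```python
from pathlib import Path

import torch

# Assumed location of the minimal frozen checkpoint; the real path may differ.
CKPT_PATH = Path("tests") / "minimal.ckpt"

def test_checkpoint_loads_on_cpu():
    # map_location="cpu" keeps the test runnable on machines without a GPU.
    checkpoint = torch.load(CKPT_PATH, map_location="cpu")
    # Lightning-style checkpoints store the weights under "state_dict".
    assert "state_dict" in checkpoint
```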
Warning Badge
I added functionality which puts red text under the title of a plot if something is wrong with the colour scale, or with the scale of the prediction in comparison to the ground truth.
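For illustration, a badge like this can be drawn with matplotlib roughly as follows; the helper name, wording, and placement are assumptions rather than the implementation in this PR:

```python
import matplotlib.pyplot as plt
import numpy as np

def add_warning_badge(ax, message):
    # Red text placed just below the axes title; coordinates are axes-relative.
    ax.text(0.5, 1.02, message, transform=ax.transAxes,
            ha="center", va="bottom", color="red", fontsize=8)

fig, ax = plt.subplots()
ax.imshow(np.random.default_rng(0).uniform(0, 1, (16, 16)), vmin=0, vmax=1)
ax.set_title("Prediction", pad=20)  # extra padding leaves room for the badge
add_warning_badge(ax, "Warning: values exceed the shared colour limits")
```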
What is tested

- API behaviour
- Subplot layout: no overlapping panels and colourbars for... (a sketch of this kind of check follows this list)
- Colour scaling
- Range check reporting
- Animations (`ffmpeg`)
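As an illustration of the layout checks, a test along these lines can assert that subplot panels do not overlap; the figure built here is a stand-in, not the project's plotting output:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for CI
import matplotlib.pyplot as plt

def test_panels_do_not_overlap():
    # Stand-in figure; the real tests build figures via the plotting callback.
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    boxes = [ax.get_position() for ax in axes.ravel()]
    for i, box_a in enumerate(boxes):
        for box_b in boxes[i + 1:]:
            # Bbox.overlaps returns True if the two rectangles intersect.
            assert not box_a.overlaps(box_b)
    plt.close(fig)
```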
Fixtures
In Python, pytest fixtures prepare data or objects that exist for the duration of a test; they act as reusable boilerplate code the tests can request when needed.
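As a small illustration, a fixture along these lines could hand the synthetic array to any test that requests it (the fixture name and array shape are assumptions):

```python
import numpy as np
import pytest

@pytest.fixture
def fake_sea_ice():
    # Provides a small [T, H, W] array of 0-1 values to any test that
    # lists `fake_sea_ice` as an argument.
    rng = np.random.default_rng(seed=0)
    return rng.uniform(0.0, 1.0, size=(4, 16, 16)).astype(np.float32)

def test_values_are_in_range(fake_sea_ice):
    assert fake_sea_ice.min() >= 0.0
    assert fake_sea_ice.max() <= 1.0
```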
Running the tests with Pytest
We use pytest for testing.
By default this will also collect coverage information (because of our config). If you don't care about coverage output, add `--no-cov`.

If you just want to check the new visualisation code, or only the plotting API, you can point pytest at the relevant test files or select tests by name; see the example commands below.
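These are sketches of the kinds of commands involved; the test file path and `-k` pattern are hypothetical and may not match this repository exactly:

```bash
# Run the full suite (coverage is collected by default via the project config)
pytest

# Skip coverage collection
pytest --no-cov

# Hypothetical path: run only the new visualisation tests
pytest tests/test_plotting.py

# Hypothetical selection: run only the plotting API tests
pytest -k "plot_api"
```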