Skip to content

Conversation

JeevanChevula
Copy link
Contributor

@JeevanChevula JeevanChevula commented Sep 2, 2025

Fixes #934

This PR adds a "saved_checkpoint" event that fires after successful checkpoint saves.

Usage:

# Users need to register the event first
engine.register_events("saved_checkpoint")

@trainer.on("saved_checkpoint")
def after_saving(engine):
    print("Checkpoint saved!")

@github-actions github-actions bot added the module: handlers Core Handlers module label Sep 2, 2025
@JeevanChevula JeevanChevula marked this pull request as ready for review September 2, 2025 06:16
@JeevanChevula JeevanChevula marked this pull request as draft September 3, 2025 06:51
@github-actions github-actions bot added the docs label Sep 3, 2025
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Sep 3, 2025

@JeevanChevula thanks for the PR. However, let's rework the API of the new feature you are working on:

  1. Let's avoid using string type as event name and do not ask users to register the event. This should be done internally by ignite.

  2. Let's define the event in the Checkpoint as following:

# checkpoint.py

class CheckpointEvents(EventEnum):
    SAVED_CHECKPOINT = "saved_checkpoint"


class Checkpoint(...):
    SAVED_CHECKPOINT = CheckpointEvents.SAVED_CHECKPOINT
    ...
  1. finally the usage can be something like this:
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, global_step_from_engine

trainer = ...
evaluator = ...
# Setup Accuracy metric computation on evaluator.
# evaluator.state.metrics contain 'accuracy',
# which will be used to define ``score_function`` automatically.
# Run evaluation on epoch completed event
# ...

to_save = {'model': model}
handler = Checkpoint(
    to_save, '/tmp/models',
    n_saved=2, filename_prefix='best',
    score_name="accuracy",
    global_step_transform=global_step_from_engine(trainer)
)

evaluator.add_event_handler(Events.COMPLETED, handler)

# ---- New API with Checkpoint.SAVED_CHECKPOINT event: -----

@evaluator.on(Checkpoint.SAVED_CHECKPOINT)
def notify_when_saved(eval_engine, chkpt_handler):   # we should pass to the attached handlers the engine and the checkpoint instance.
    assert eval_engine is engine
    assert chkpt_handler is handler
    print("Saved checkpoint:", chkpt_handler.last_checkpoint)
    
# ---- End of New API with Checkpoint.SAVED_CHECKPOINT event: -----

trainer.run(data_loader, max_epochs=10)
> ["best_model_9_accuracy=0.77.pt", "best_model_10_accuracy=0.78.pt", ]

Let me know what do you think?

@JeevanChevula
Copy link
Contributor Author

Thanks for the suggestion . I’ll try to work on updating the PR to follow the API approach you mentioned with

@JeevanChevula JeevanChevula marked this pull request as ready for review September 5, 2025 09:58
@JeevanChevula
Copy link
Contributor Author

Implementation Note:

Implemented EventEnum-based SAVED_CHECKPOINT event as requested. However, Ignite's event system only supports single-parameter handlers - the originally requested two-parameter signature (handler(engine, checkpoint_handler)) failed during event firing and registration.

Current implementation uses single parameter with checkpoint access via engine._current_checkpoint_handler. All 61 core tests pass, confirming functionality works without breaking existing features. The 3 distributed test errors are pre-existing infrastructure issues unrelated to this change.

Copy link
Collaborator

@vfdev-5 vfdev-5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this PR @JeevanChevula
I left few more comments to improve the PR

checkpoint["checkpointer"] = self.state_dict()

# Store reference to self in engine for event handlers to access
engine._current_checkpoint_handler = self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this code? I do not understand why we need this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a workaround for Ignite's event system limitation. You originally requested a two-parameter handler signature handler(engine, checkpoint_handler), but Ignite's fire_event() only supports single parameters and rejects handlers expecting additional arguments. This line stores the checkpoint reference in the engine so handlers can access it via engine._current_checkpoint_handler.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to check details about this limitation. Unfortunately engine._current_checkpoint_handler is not good enough for a public API usage.

Alternatively, we can pass the instance of the checkpointer as an arg when attaching:

checkpoint = Checkpoint(...)

@trainer.on(Checkpoint. SAVED_CHECKPOINT, checkpoint)
def handler(engine, chkpt_handler):
    assert engine is trainer
    assert chkpt_handler is checkpoint

Maybe, we can skip automatic chkpt_handler arg injection:

checkpoint = Checkpoint(...)

@trainer.on(Checkpoint. SAVED_CHECKPOINT)
def handler(engine):
    assert engine is trainer
```

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JeevanChevula please remove this private attribute

@JeevanChevula
Copy link
Contributor Author

Pushing current implementation with working SAVED_CHECKPOINT event functionality. Will add proper Google-style docstrings with version directives by Monday per contributing guidelines

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Sep 12, 2025

@JeevanChevula please rebase your PR branch, you have now some extra commits

@JeevanChevula JeevanChevula force-pushed the add-saved-checkpoint-event branch from d500fc8 to d81faa9 Compare September 18, 2025 11:03
@JeevanChevula JeevanChevula marked this pull request as draft September 18, 2025 12:19
@JeevanChevula JeevanChevula force-pushed the add-saved-checkpoint-event branch from d81faa9 to fe4942d Compare September 19, 2025 05:42
@JeevanChevula JeevanChevula force-pushed the add-saved-checkpoint-event branch from fe4942d to 25e6adf Compare September 19, 2025 12:24
Checkpoint Events
-----------------

.. versionadded:: 0.5.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be added to the docstrings. Can you locally render the docs and show how this look like?

Suggested change
.. versionadded:: 0.5.0
.. versionadded:: 0.5.3



class CheckpointEvents(EventEnum):
"""Events fired by Checkpoint handler"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Events fired by Checkpoint handler"""
"""Events fired by Checkpoint handler
.. versionadded:: 0.5.0
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback 🙏

I will update checkpoint.py so that the CheckpointEvents class docstring includes the version:

class CheckpointEvents(EventEnum):
    """Events fired by Checkpoint handler
    
    .. versionadded:: 0.5.3
    """
    SAVED_CHECKPOINT = `"saved_checkpoint"

One clarification: should I completely remove the Checkpoint Events section from docs/source/handlers.rst, or just update the version there from 0.5.0 → 0.5.3?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One clarification: should I completely remove the Checkpoint Events section from docs/source/handlers.rst, or just update the version there from 0.5.0 → 0.5.3?

Do not remove it. Change what is suggested and build and visualize the docs locally. I think versionadded part of docs:

Checkpoint Events
-----------------

.. versionadded:: 0.5.3

wont be rendered correctly. In this case just remove .. versionadded:: 0.5.3 and keep everything else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the CheckpointEvents docstring with the version directive and kept the RST documentation without the versionadded line as suggested. The local docs build hits a Windows-specific environment issue at 97% completion (not related to our changes - the RST parsing completed successfully).However, the documentation syntax appears correct based on the successful parsing before the build failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci CI docs module: engine Engine module module: handlers Core Handlers module module: metrics Metrics module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add new events
2 participants