Currently, nemo-rl always tries to resume from the last checkpoint in the checkpoint path. When the policy model is changed, the new model silently fails to load the old checkpoints, which has two negative consequences:
- New checkpoints will overwrite old checkpoints from a different model.
- The training step is counted from the old checkpoint, even though the new model is actually being trained from scratch.
I feel it would be better to fail explicitly when the policy model doesn't match the existing checkpoints, to prevent this kind of undefined behavior.
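
A minimal sketch of what the explicit check could look like. The function name, metadata filename, and JSON key below are hypothetical and not nemo-rl's actual API; the idea is simply to record which policy model produced a checkpoint directory and refuse to resume when the current config names a different one.

```python
# Hypothetical sketch, not nemo-rl's actual API. Assumes each checkpoint
# directory carries a small metadata file recording the policy model that
# produced it.
import json
import os


def verify_checkpoint_compatibility(checkpoint_dir: str, policy_model_name: str) -> None:
    """Raise instead of silently ignoring checkpoints saved by a different model."""
    metadata_path = os.path.join(checkpoint_dir, "policy_metadata.json")  # assumed filename
    if not os.path.exists(metadata_path):
        # No metadata to compare against; fall back to the existing behavior.
        return

    with open(metadata_path) as f:
        saved_model_name = json.load(f).get("policy_model_name")  # assumed key

    if saved_model_name != policy_model_name:
        raise ValueError(
            f"Checkpoints in {checkpoint_dir!r} were produced by "
            f"{saved_model_name!r}, but the current config specifies "
            f"{policy_model_name!r}. Refusing to resume: either point the "
            "checkpoint path at a fresh directory or restore the original "
            "policy model."
        )
```

The saving side would write the same metadata file alongside each checkpoint, so the cost is one extra small JSON file per checkpoint directory.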