Skip to content

Conversation

kakwok
Copy link

@kakwok kakwok commented Sep 15, 2025

Rebased #23 to CMSSW_15_1_0_pre6.

PR description:

  • Implement RetryActionDiffServer for SonicTriton using TritonService’s server registry; remove per-action alternate server parameters.
  • Use TritonClient::updateServer(TritonService::Server::fallbackName) to switch servers on retry, per review guidance.
  • Extend test coverage:
    • Add Catch2 unit test HeterogeneousCore/SonicTriton/test/test_RetryActionDiffServer.cc (arms → updateServer(fallback) → no-op on second retry → exception path is caught).
    • Extend HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py with --retryAction {same,diff} and a verbose confirmation line.
    • Add scripted test TestHeterogeneousCoreSonicTritonRetryActionDiff_Log to assert the selected retry policy.
  • Move all test definitions to HeterogeneousCore/SonicTriton/test/BuildFile.xml (remove package-level test entries; use proper with catch2).
  • Provide a protected default constructor for TritonClient (test double support), removing the unused testing flag constructor.
  • Aligns with the Retry framework expectations discussed in review (Test PR for new Retry Framework #19 ); no physics changes expected, functionality is exercised only when retry is configured.

PR validation:

  • Built and ran unit/integration tests in CMSSW_15_1_0_pre6 area:

    • scram b -j 8
  • TODO/To verify:

    • scram b runtests TEST=HeterogeneousCore/SonicTriton (passes)
    • Catch2 test validates action behavior;
    • cmsRun-based tests validate configuration wiring for both same and diff policies.
    • No changes to standard workflows unless Client.Retry is explicitly configured.

if (client_) {
client_->evaluate();
} else {
edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an exception rather than a LogError. (It may actually need to be a return false or similar, because the call chain is client->finish() -> action->retry() -> action->eval(), and only finish() should actually emit an exception.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has not been addressed yet

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be just return, similar to what client->evaluate does, if we want only finish() to throw exception

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we need to let finish() know that the eval() failed somehow... I guess it could directly call finish(false) as is done in evaluate()/handle_exception(). we may want to rethink this pattern in general, though, as I am becoming concerned that it will lead to excessive stack depth.

} catch (std::exception& e) {
edm::LogError("RetryActionDiffServer") << "Failed to retry with alternative server: " << e.what();
} catch (...) {
edm::LogError("RetryActionDiffServe: rUnknownFailure") << "An unknown exception was thrown";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@@ -0,0 +1,22 @@
#!/bin/bash

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment has not been addressed

if (client_) {
client_->evaluate();
} else {
edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has not been addressed yet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants