SonicRetry combined PR[CMSSW_15_1_0_pre6] #24

kakwok · 2025-09-15T14:15:02Z

Rebased #23 to CMSSW_15_1_0_pre6.

PR description:

Implement RetryActionDiffServer for SonicTriton using TritonService’s server registry; remove per-action alternate server parameters.
Use TritonClient::updateServer(TritonService::Server::fallbackName) to switch servers on retry, per review guidance.
Extend test coverage:
- Add Catch2 unit test HeterogeneousCore/SonicTriton/test/test_RetryActionDiffServer.cc (arms → updateServer(fallback) → no-op on second retry → exception path is caught).
- Extend HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py with --retryAction {same,diff} and a verbose confirmation line.
- Add scripted test TestHeterogeneousCoreSonicTritonRetryActionDiff_Log to assert the selected retry policy.
Move all test definitions to HeterogeneousCore/SonicTriton/test/BuildFile.xml (remove package-level test entries; use proper with catch2).
Provide a protected default constructor for TritonClient (test double support), removing the unused testing flag constructor.
Aligns with the Retry framework expectations discussed in review (Test PR for new Retry Framework #19). No physics changes are expected; the new functionality is exercised only when retry is configured.
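The behavior the unit test is described as checking (arm, switch to the fallback server, no-op on a second retry, exception caught) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: MockClient, DiffServerAction, and all of their members are hypothetical stand-ins, and the real TritonClient and TritonService interfaces may differ.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for TritonClient: it only records server switches
// and evaluations so the retry behavior can be observed.
class MockClient {
public:
  explicit MockClient(bool failOnUpdate = false) : failOnUpdate_(failOnUpdate) {}

  void updateServer(const std::string& name) {
    if (failOnUpdate_)
      throw std::runtime_error("cannot connect to " + name);
    server_ = name;
    ++updates_;
  }
  void evaluate() { ++evaluations_; }

  const std::string& server() const { return server_; }
  int updates() const { return updates_; }
  int evaluations() const { return evaluations_; }

private:
  bool failOnUpdate_;
  std::string server_ = "primary";
  int updates_ = 0;
  int evaluations_ = 0;
};

// Hypothetical retry action: switches to the fallback server at most once.
class DiffServerAction {
public:
  DiffServerAction(MockClient* client, std::string fallback)
      : client_(client), fallback_(std::move(fallback)) {}

  void start() { armed_ = true; }
  bool shouldRetry() const { return armed_; }

  void retry() {
    if (!armed_)
      return;        // already used: a second retry is a no-op
    armed_ = false;  // disarm first so a failed switch cannot loop
    try {
      client_->updateServer(fallback_);
      client_->evaluate();
    } catch (const std::exception& e) {
      // Caught here so the failure does not escape the action.
      std::cerr << "Failed to retry with alternative server: " << e.what() << '\n';
    }
  }

private:
  MockClient* client_;
  std::string fallback_;
  bool armed_ = false;
};
```

In this sketch the action disarms itself before attempting the switch, so a failed updateServer() call leaves it in the same state as a successful one; whether the real action should behave that way is a design choice for the PR, not something this example settles.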

PR validation:

Built and ran unit/integration tests in CMSSW_15_1_0_pre6 area:
- scram b -j 8
TODO/To verify:
- scram b runtests TEST=HeterogeneousCore/SonicTriton (passes)
- Catch2 test validates action behavior;
- cmsRun-based tests validate configuration wiring for both same and diff policies.
- No changes to standard workflows unless Client.Retry is explicitly configured.

HeterogeneousCore/SonicTriton/interface/TritonClient.h

HeterogeneousCore/SonicTriton/src/TritonClient.cc

HeterogeneousCore/SonicCore/src/SonicClientBase.cc

HeterogeneousCore/SonicCore/plugins/RetrySameServerAction.cc

kpedro88 · 2025-09-15T16:23:05Z

HeterogeneousCore/SonicCore/src/RetryActionBase.cc

+  if (client_) {
+    client_->evaluate();
+  } else {
+    edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";


This should be an exception rather than a LogError. (It may actually need to be a return false or similar, because the call chain is client->finish() -> action->retry() -> action->eval(), and only finish() should actually emit an exception.)

This comment has not been addressed yet

It should just return, similar to what client->evaluate() does, if we want only finish() to throw an exception.

But we need to let finish() know somehow that eval() failed... I guess it could directly call finish(false), as is done in evaluate()/handle_exception(). We may want to rethink this pattern in general, though, as I am becoming concerned that it will lead to excessive stack depth.
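One way to read the "return false or similar" suggestion from this thread is sketched below: eval() reports failure through its return value, retry() passes it up, and finish() remains the only function that throws. SketchClient, SketchAction, and their members are hypothetical names for illustration only; the real SonicClientBase and RetryActionBase interfaces are asynchronous and may look quite different.

```cpp
#include <stdexcept>

class SketchClient;

// Hypothetical retry action with a fixed number of allowed retries.
class SketchAction {
public:
  SketchAction(SketchClient* client, unsigned tries) : client_(client), tries_(tries) {}

  // Returns true only if a new evaluation was actually dispatched.
  bool retry() {
    if (tries_ == 0)
      return false;
    --tries_;
    return eval();
  }

private:
  bool eval();  // defined after SketchClient is complete

  SketchClient* client_;
  unsigned tries_;
};

// Hypothetical client: finish() is the only place that throws.
class SketchClient {
public:
  void setAction(SketchAction* action) { action_ = action; }
  void evaluate() { ++evaluations_; }

  void finish(bool success) {
    if (success)
      return;
    if (action_ != nullptr && action_->retry())
      return;  // evaluation was re-dispatched; nothing to report yet
    throw std::runtime_error("SketchClient: call failed and could not be retried");
  }

  int evaluations() const { return evaluations_; }

private:
  SketchAction* action_ = nullptr;
  int evaluations_ = 0;
};

inline bool SketchAction::eval() {
  if (client_ == nullptr)
    return false;  // report the problem upward instead of logging or throwing
  client_->evaluate();
  return true;
}
```

Because eval() returns instead of calling finish(false) itself, the chain finish() -> retry() -> eval() unwinds after each attempt in this sketch, which sidesteps the stack-depth concern raised above. It does not model the asynchronous callback path of the real client.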

HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py

kpedro88 · 2025-09-17T13:34:52Z

HeterogeneousCore/SonicTriton/src/RetryActionDiffServer.cc

+  } catch (std::exception& e) {
+    edm::LogError("RetryActionDiffServer") << "Failed to retry with alternative server: " << e.what();
+  } catch (...) {
+    edm::LogError("RetryActionDiffServer: UnknownFailure") << "An unknown exception was thrown";


kpedro88 · 2025-09-17T13:37:04Z

HeterogeneousCore/SonicTriton/test/retry_action_diff_log_test.sh

@@ -0,0 +1,22 @@
+#!/bin/bash


This comment has not been addressed.

kpedro88 · 2025-09-17T13:37:43Z

HeterogeneousCore/SonicCore/src/RetryActionBase.cc

+  if (client_) {
+    client_->evaluate();
+  } else {
+    edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";


This comment has not been addressed yet

HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py

Martin and others added 13 commits September 12, 2025 13:31

RetryAction compiles

71fc1e1

Include RetryAction in SonicClientBase

c111b04

Update PR comments

7d33515

PR comments, fix fillDescriptions

36e0e68

rebase 15_1_0_pre6

5d498b1

Add update server function for client

70e7b26

Add test for Triton retry action in BuildFile.xml

cd7ee5e

Implement retry logic in RetryActionDiffServer and add connectToServe…

b41f499

…r method in TritonClient. Update BuildFile.xml and fix formatting in header files.

Add RetryActionDiffServer class documentation, implement testing cons…

e7b9004

…tructor for TritonClient, and update BuildFile.xml to include Catch2 for testing.

SonicTriton: implement retry action against different server; update …

6336d0b

…tests; remove old cfg

Refactor RetryActionDiffServer to utilize TritonService for server se…

29414ad

…lection; remove unused parameters and improve documentation.

rebase 15_1_0_pre6

acfb4c9

Fixes to compile

a3a3811

kpedro88 reviewed Sep 15, 2025


Martin added 2 commits September 16, 2025 10:22

PR comments

71b2349

Move retry options to customize.py

7dc9e71

kpedro88 reviewed Sep 17, 2025


Martin added 4 commits September 17, 2025 18:47

more clean ups

49c9550

First draft for server health

318634f

remove redundant

34642d7

Use getBestServer in RetryDiffServerAction

db3f079


kakwok Sep 18, 2025


kpedro88 Sep 18, 2025
