Conversation

@kakwok kakwok commented Sep 15, 2025

Rebased #23 to CMSSW_15_1_0_pre6.

PR description:

  • Implement RetryActionDiffServer for SonicTriton using TritonService’s server registry; remove per-action alternate server parameters.
  • Use TritonClient::updateServer(TritonService::Server::fallbackName) to switch servers on retry, per review guidance (see the sketch after this list).
  • Extend test coverage:
    • Add Catch2 unit test HeterogeneousCore/SonicTriton/test/test_RetryActionDiffServer.cc (arms → updateServer(fallback) → no-op on second retry → exception path is caught).
    • Extend HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py with --retryAction {same,diff} and a verbose confirmation line.
    • Add scripted test TestHeterogeneousCoreSonicTritonRetryActionDiff_Log to assert the selected retry policy.
  • Move all test definitions to HeterogeneousCore/SonicTriton/test/BuildFile.xml (remove package-level test entries; declare the catch2 dependency with a proper use tag).
  • Provide a protected default constructor for TritonClient (test double support), removing the unused testing flag constructor.
  • Aligns with the Retry framework expectations discussed in review (Test PR for new Retry Framework #19); no physics changes are expected, and the functionality is exercised only when retry is configured.
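
A minimal sketch of the retry path described above. The class name and the updateServer/fallbackName calls come from this PR's description; the retry(TritonClient*) signature and the armed_ flag are assumptions for illustration only:

void RetryActionDiffServer::retry(TritonClient* client) {
  if (!armed_)
    return;  // second retry: no-op, matching the test flow above
  armed_ = false;
  // switch this client to the fallback server registered with TritonService
  client->updateServer(TritonService::Server::fallbackName);
}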

PR validation:

  • Built and ran unit/integration tests in a CMSSW_15_1_0_pre6 area:
    • scram b -j 8
    • scram b runtests TEST=HeterogeneousCore/SonicTriton (passes)
  • To verify:
    • The Catch2 test validates the retry-action behavior.
    • The cmsRun-based tests validate the configuration wiring for both the same and diff policies.
    • No changes to standard workflows unless Client.Retry is explicitly configured.

if (client_) {
  client_->evaluate();
} else {
  edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";
}


This should be an exception rather than a LogError. (It may actually need to be a return false or similar, because the call chain is client->finish() -> action->retry() -> action->eval(), and only finish() should actually emit an exception.)


This comment has not been addressed yet

Author:

It should be just a return, similar to what client->evaluate does, if we want only finish() to throw an exception.


But we need to let finish() know that the eval() failed somehow... I guess it could directly call finish(false) as is done in evaluate()/handle_exception(). We may want to rethink this pattern in general, though, as I am becoming concerned that it will lead to excessive stack depth.
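
One way to combine the two suggestions in this thread into a sketch. RetryActionBase, client_, and evaluate() are taken from the quoted code; the bool return and how the caller (retry()/finish()) reacts to it are assumptions:

bool RetryActionBase::eval() {
  if (!client_) {
    // report failure up the chain instead of throwing here;
    // only finish() should actually emit the exception
    edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";
    return false;
  }
  client_->evaluate();
  return true;
}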


//todo: use some algorithm to select server rather than just picking arbitrarily
const auto& server(servers_.find(serverName)->second);
return server;
const auto serverPair = servers_.find(serverName);


for clarity, this should be serverPairIter or similar (to indicate this is an iterator pointing to the pair<name,server> object, not the object itself)
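
A minimal application of the rename, with a defensive end() check added (the check and the exception category are assumptions, not part of the PR):

const auto serverPairIter = servers_.find(serverName);
if (serverPairIter == servers_.end())
  throw cms::Exception("MissingData") << "unknown server: " << serverName;
const auto& server = serverPairIter->second;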

Comment on lines +254 to +258
std::unique_ptr<tc::InferenceServerGrpcClient> client;
TRITON_THROW_IF_ERROR(
    tc::InferenceServerGrpcClient::Create(&client, server.url, false, server.useSsl, server.sslOptions),
    "TritonService(): unable to create inference context for " + serverName + " (" + server.url + ")",
    false);


since these exact lines now occur twice in this file, there should probably be a dedicated helper function.
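
A sketch of such a helper; the name makeGrpcClient and its placement on TritonService are assumptions, while the body is the quoted code:

std::unique_ptr<tc::InferenceServerGrpcClient> TritonService::makeGrpcClient(const std::string& serverName,
                                                                             const Server& server) const {
  std::unique_ptr<tc::InferenceServerGrpcClient> client;
  TRITON_THROW_IF_ERROR(
      tc::InferenceServerGrpcClient::Create(&client, server.url, false, server.useSsl, server.sslOptions),
      "TritonService(): unable to create inference context for " + serverName + " (" + server.url + ")",
      false);
  return client;
}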

Comment on lines +261 to +262
client->IsServerLive(&live);
client->IsServerReady(&ready);


if these remain false, will the subsequent server calls below fail?
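
One possible guard, assuming the surrounding per-server loop (the early continue and the comment are illustrative, not from the PR):

bool live = false, ready = false;
client->IsServerLive(&live);
client->IsServerReady(&ready);
if (!live || !ready) {
  // record the unhealthy status and skip the metadata/statistics calls,
  // which would otherwise fail against a down server
  continue;
}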

Comment on lines +292 to +297
ServerHealth& health = acc->second;
health.live = live;
health.ready = ready;
health.failureCount = failures;
health.avgQueueTimeMs = avgQueueTimeMs / queue_count;
health.avgInferTimeMs = avgInferTimeMs / infer_count;


there may need to be additional logic behind these updates. right now, everything is just overwritten with the latest status of the server. but we may actually want to track some of these from the client's perspective: e.g. how many times in the last [time interval] did this server return live = false? how many times did calls fail from within this job? etc. (this may imply more quantities to track)
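
A sketch of the extra quantities this might imply; the first five fields mirror the quoted code, and everything below the comment is hypothetical:

struct ServerHealth {
  bool live = false;
  bool ready = false;
  unsigned failureCount = 0;
  float avgQueueTimeMs = 0.f;
  float avgInferTimeMs = 0.f;
  // hypothetical client-side tracking, per the comment above
  unsigned notLiveInWindow = 0;     // times live == false in the last time interval
  unsigned clientCallFailures = 0;  // calls that failed from within this job
  std::chrono::steady_clock::time_point windowStart;  // start of the current interval
};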


} catch (const TritonException& e) {
// mark existing entry unhealthy if present
tbb::concurrent_hash_map<std::string, ServerHealth>::accessor acc;


typedef: using ServerHealthIter = tbb::concurrent_hash_map<std::string, ServerHealth>::accessor or similar
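
One way to apply the suggestion; aliasing the map type also shortens the accessor declaration (strictly, accessor names a lock-holding accessor rather than an iterator):

using ServerHealthMap = tbb::concurrent_hash_map<std::string, ServerHealth>;
ServerHealthMap::accessor acc;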

health.ready = false;
}
} catch (const std::exception& e) {
// fallback for other exceptions


if this does the same thing as the previous catch, they could be combined in one block (unless we eventually expect them to diverge)
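
Assuming TritonException derives from std::exception (it inherits from cms::Exception in SonicTriton), a single block would cover both; serverHealth_ is a hypothetical name for the hash-map member:

} catch (const std::exception& e) {
  // covers TritonException as well; mark any existing entry unhealthy
  tbb::concurrent_hash_map<std::string, ServerHealth>::accessor acc;
  if (serverHealth_.find(acc, serverName)) {
    acc->second.live = false;
    acc->second.ready = false;
  }
}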

Comment on lines +339 to +349
if (!health.live || !health.ready)
continue; // skip unhealthy

// Select server according to rules:
// 1) lowest failureCount
// 2) tie-breaker: lowest avgQueueTimeMs
if (!bestServerName || health.failureCount < bestHealth.failureCount ||
(health.failureCount == bestHealth.failureCount && health.avgQueueTimeMs < bestHealth.avgQueueTimeMs)) {
bestServerName = serverName;
bestHealth = health;
}


this logic should be encapsulated into a function, or even a plugin, so that different schemes can be specified
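
A sketch of the encapsulation; the function name is hypothetical, and the rules are exactly the two in the quoted code:

// rule 1: lowest failureCount; rule 2: tie-break on lowest avgQueueTimeMs
bool isBetterServer(const ServerHealth& candidate, const ServerHealth& best) {
  if (candidate.failureCount != best.failureCount)
    return candidate.failureCount < best.failureCount;
  return candidate.avgQueueTimeMs < best.avgQueueTimeMs;
}

// the loop body then reduces to:
if (!bestServerName || isBetterServer(health, bestHealth)) {
  bestServerName = serverName;
  bestHealth = health;
}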

  }
}
if (verbose_ && bestServerName) {
  edm::LogInfo("Chosen server for model '" + modelName + "': " + *bestServerName +


should be:

edm::LogInfo("TritonService") << [text];

Comment on lines +32 to +33
edm::LogWarning("RetryActionDiffServer")
    << "No alternative server found for model " << tritonClient->modelName();


this should be more than a warning; execution should stop in this case.


(but see next comment)
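
A sketch of stopping execution here instead; newServerName stands in for the lookup result, and cms::Exception with this category is only one option (the PR might prefer TritonException):

if (!newServerName) {
  throw cms::Exception("RetryActionDiffServer")
      << "No alternative server found for model " << tritonClient->modelName();
}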


std::optional<std::string> TritonService::getBestServer(const std::string& modelName,
                                                        const std::string& IgnoreServer) {
  std::optional<std::string> bestServerName;


I'm not sure about the use of optional here. If no remote server is found, the fallback should be used. If for some reason the fallback cannot be found or used, then execution should stop here.
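
A sketch of the suggested contract. Server::fallbackName appears in this PR; the non-optional return type, the servers_.count lookup, and the exception category are assumptions:

std::string TritonService::getBestServer(const std::string& modelName, const std::string& ignoreServer) {
  // ... scan healthy remote servers as in the quoted loop ...
  if (bestServerName)
    return *bestServerName;
  // no healthy remote server found: use the fallback
  if (servers_.count(Server::fallbackName))
    return Server::fallbackName;
  // not even the fallback is available: stop execution here
  throw cms::Exception("TritonService") << "no usable server (including fallback) for model " << modelName;
}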

trevin-lee pushed a commit to trevin-lee/cmssw that referenced this pull request Sep 27, 2025
…in the review (fastmachinelearning#24)

Address review comments + fixes for TDR regionizer (not used in default CMSSW configuration)