SonicRetry combined PR[CMSSW_15_1_0_pre6] #24
Conversation
…r method in TritonClient. Update BuildFile.xml and fix formatting in header files.
…tructor for TritonClient, and update BuildFile.xml to include Catch2 for testing.
…tests; remove old cfg
…lection; remove unused parameters and improve documentation.
if (client_) {
  client_->evaluate();
} else {
  edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";
This should be an exception rather than a LogError. (It may actually need to be a `return false` or similar, because the call chain is `client->finish()` -> `action->retry()` -> `action->eval()`, and only `finish()` should actually emit an exception.)
This comment has not been addressed yet
It should be just `return`, similar to what `client->evaluate` does, if we want only `finish()` to throw an exception.
But we need to let `finish()` know that `eval()` failed somehow... I guess it could directly call `finish(false)`, as is done in `evaluate()`/`handle_exception()`. We may want to rethink this pattern in general, though, as I am becoming concerned that it will lead to excessive stack depth.
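One possible shape of the non-throwing path discussed in this thread. This is a sketch only: the stub types, the `bool` return values, and the member names are assumptions modelled on the snippet and comments above, not the PR's actual interfaces.

```cpp
// Stand-ins only: the real TritonClient/RetryActionBase interfaces in this PR differ,
// and the bool return type is an assumption illustrating the discussion above.
struct TritonClientStub {
  void evaluate() {}
};

class RetryActionBase {
public:
  explicit RetryActionBase(TritonClientStub* client) : client_(client) {}

  // report failure via the return value instead of LogError or throwing here
  bool eval() {
    if (!client_)
      return false;
    client_->evaluate();
    return true;
  }

  // propagate the result upward; only finish() would turn a failure into an exception
  bool retry() { return eval(); }

private:
  TritonClientStub* client_;  // non-owning, may be null in the failure case discussed above
};
```

In this scheme `finish()` would inspect the returned flag (or receive `finish(false)`, as in `evaluate()`/`handle_exception()`) and remain the only place that raises an exception.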
HeterogeneousCore/SonicTriton/test/retry_action_diff_log_test.sh (outdated; conversation resolved)
if (client_) {
  client_->evaluate();
} else {
  edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";
This comment has not been addressed yet
//todo: use some algorithm to select server rather than just picking arbitrarily
const auto& server(servers_.find(serverName)->second);
return server;
const auto serverPair = servers_.find(serverName);
for clarity, this should be `serverPairIter` or similar (to indicate this is an iterator pointing to the `pair<name, server>` object, not the object itself)
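A self-contained illustration of the suggested rename, with an explicit `end()` check added for clarity (the `std::map`, the free function, and the `throw` are stand-ins, not what TritonService actually does with an unknown name):

```cpp
#include <map>
#include <stdexcept>
#include <string>

struct Server {
  std::string url;
};

// Illustration only: the real servers_ container and error handling in TritonService differ.
const Server& findServer(const std::map<std::string, Server>& servers, const std::string& serverName) {
  // serverPairIter: an iterator to the pair<name, Server>, not the Server object itself
  const auto serverPairIter = servers.find(serverName);
  if (serverPairIter == servers.end())
    throw std::runtime_error("unknown Triton server: " + serverName);
  return serverPairIter->second;
}
```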
std::unique_ptr<tc::InferenceServerGrpcClient> client;
TRITON_THROW_IF_ERROR(
    tc::InferenceServerGrpcClient::Create(&client, server.url, false, server.useSsl, server.sslOptions),
    "TritonService(): unable to create inference context for " + serverName + " (" + server.url + ")",
    false);
since these exact lines now occur twice in this file, there should probably be a dedicated helper function.
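A possible shape for that helper: it simply wraps the duplicated lines quoted above, so the Triton client calls are unchanged; the function name and its placement as a TritonService member are assumptions.

```cpp
// Hypothetical helper (name/location not from the PR) wrapping the duplicated creation code.
std::unique_ptr<tc::InferenceServerGrpcClient> TritonService::makeGrpcClient(const std::string& serverName,
                                                                             const Server& server) const {
  std::unique_ptr<tc::InferenceServerGrpcClient> client;
  TRITON_THROW_IF_ERROR(
      tc::InferenceServerGrpcClient::Create(&client, server.url, false, server.useSsl, server.sslOptions),
      "TritonService(): unable to create inference context for " + serverName + " (" + server.url + ")",
      false);
  return client;
}
```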
client->IsServerLive(&live);
client->IsServerReady(&ready);
if these remain false, will the subsequent server calls below fail?
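To make the concern concrete, one possible guard is sketched below: `IsServerLive`/`IsServerReady` return a `tc::Error` as well as filling the bool, so both can be checked before issuing the later metadata/statistics requests. The `continue` and the surrounding loop are assumptions about the context, not code from the PR.

```cpp
bool live = false;
bool ready = false;
// skip this server if the health checks themselves fail or report not-live/not-ready,
// rather than going on to query models/statistics on a server that cannot answer
if (!client->IsServerLive(&live).IsOk() || !live ||
    !client->IsServerReady(&ready).IsOk() || !ready) {
  continue;  // assumes this sits inside the per-server loop of this file
}
```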
ServerHealth& health = acc->second;
health.live = live;
health.ready = ready;
health.failureCount = failures;
health.avgQueueTimeMs = avgQueueTimeMs / queue_count;
health.avgInferTimeMs = avgInferTimeMs / infer_count;
there may need to be additional logic behind these updates. Right now, everything is just overwritten with the latest status of the server, but we may actually want to track some of these from the client's perspective: e.g. how many times in the last [time interval] did this server return `live = false`? How many times did calls fail from within this job? Etc. (This may imply more quantities to track.)
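One sketch of what such client-side tracking could look like. The five existing fields are taken from the snippet above; the windowed counters and the helper are purely illustrative assumptions.

```cpp
#include <chrono>
#include <deque>

struct ServerHealth {
  // fields currently updated in the PR
  bool live = false;
  bool ready = false;
  unsigned failureCount = 0;
  double avgQueueTimeMs = 0.;
  double avgInferTimeMs = 0.;

  // hypothetical client-side history: when did this job see live == false or a failed call?
  std::deque<std::chrono::steady_clock::time_point> recentNotLive;
  std::deque<std::chrono::steady_clock::time_point> recentCallFailures;

  // count events within the last `window` (the "[time interval]" in the comment above)
  static unsigned countRecent(const std::deque<std::chrono::steady_clock::time_point>& events,
                              std::chrono::steady_clock::duration window) {
    const auto cutoff = std::chrono::steady_clock::now() - window;
    unsigned n = 0;
    for (const auto& t : events)
      if (t >= cutoff)
        ++n;
    return n;
  }
};
```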
} catch (const TritonException& e) {
  // mark existing entry unhealthy if present
  tbb::concurrent_hash_map<std::string, ServerHealth>::accessor acc;
typedef: `using ServerHealthIter = tbb::concurrent_hash_map<std::string, ServerHealth>::accessor` or similar
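For example (names adapted from the suggestion above; this assumes the `tbb/concurrent_hash_map.h` header and the `ServerHealth` type already used in this file):

```cpp
// aliases so the full map type is spelled out only once
using ServerHealthMap = tbb::concurrent_hash_map<std::string, ServerHealth>;
using ServerHealthAccessor = ServerHealthMap::accessor;

ServerHealthAccessor acc;  // write-locking accessor, as in the snippet above
```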
    health.ready = false;
  }
} catch (const std::exception& e) {
  // fallback for other exceptions
if this does the same thing as the previous `catch`, they could be combined in one block (unless we eventually expect them to diverge)
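If `TritonException` ultimately derives from `std::exception` (it does, via `cms::Exception`), a single handler would catch both cases as long as the bodies stay identical. A sketch; the map member name `serverHealth_` is a placeholder, not taken from the PR:

```cpp
// single handler replacing the two blocks above; valid because TritonException
// (via cms::Exception) is-a std::exception. serverHealth_ is a placeholder name.
} catch (const std::exception& e) {
  // mark existing entry unhealthy if present
  tbb::concurrent_hash_map<std::string, ServerHealth>::accessor acc;
  if (serverHealth_.find(acc, serverName)) {
    acc->second.live = false;
    acc->second.ready = false;
  }
}
```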
if (!health.live || !health.ready)
  continue;  // skip unhealthy

// Select server according to rules:
// 1) lowest failureCount
// 2) tie-breaker: lowest avgQueueTimeMs
if (!bestServerName || health.failureCount < bestHealth.failureCount ||
    (health.failureCount == bestHealth.failureCount && health.avgQueueTimeMs < bestHealth.avgQueueTimeMs)) {
  bestServerName = serverName;
  bestHealth = health;
}
this logic should be encapsulated into a function, or even a plugin, so that different schemes can be specified
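A minimal version of such a factoring (the function name and free-function placement are assumptions; a plugin-based policy would generalize this further):

```cpp
// Returns true if `candidate` should be preferred over the current `best` server:
// rule 1) lower failureCount, rule 2) tie-break on lower avgQueueTimeMs.
bool preferServer(const ServerHealth& candidate, const ServerHealth& best) {
  if (candidate.failureCount != best.failureCount)
    return candidate.failureCount < best.failureCount;
  return candidate.avgQueueTimeMs < best.avgQueueTimeMs;
}
```

The loop above would then reduce to `if (!bestServerName || preferServer(health, bestHealth)) { ... }`, and alternative schemes could be swapped in behind the same signature.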
  }
}
if (verbose_ && bestServerName) {
  edm::LogInfo("Chosen server for model '" + modelName + "': " + *bestServerName +
should be: `edm::LogInfo("TritonService") << [text];`
edm::LogWarning("RetryActionDiffServer")
    << "No alternative server found for model " << tritonClient->modelName();
this should be more than a warning; execution should stop in this case.
(but see next comment)
std::optional<std::string> TritonService::getBestServer(const std::string& modelName,
                                                        const std::string& IgnoreServer) {
  std::optional<std::string> bestServerName;
I'm not sure about the use of `optional` here. If no remote server is found, the fallback should be used. If for some reason the fallback cannot be found or used, then execution should stop here.
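A sketch of the behaviour described here, returning a plain string instead of an optional: if no healthy remote server is found, fall back to `TritonService::Server::fallbackName` (referenced in the PR description below), and stop only if even that is unusable. Everything other than `fallbackName` and the `servers_` map is an assumption.

```cpp
// Sketch only; helper/member names other than servers_ and Server::fallbackName are assumptions.
std::string TritonService::getBestServer(const std::string& modelName, const std::string& ignoreServer) {
  std::string bestServerName;
  // ... existing health-based selection loop would fill bestServerName here ...
  if (bestServerName.empty()) {
    if (servers_.count(Server::fallbackName) > 0)
      return Server::fallbackName;  // use the fallback when no remote server qualifies
    throw cms::Exception("TritonService")
        << "no usable server (including the fallback) found for model " << modelName;
  }
  return bestServerName;
}
```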
…in the review (fastmachinelearning#24) Address review comments + fixes for TDR regionizer (not used in default CMSSW configuration)
Rebased #23 to CMSSW_15_1_0_pre6.
PR description:
- Add `RetryActionDiffServer` for SonicTriton using `TritonService`'s server registry; remove per-action alternate server parameters.
- Use `TritonClient::updateServer(TritonService::Server::fallbackName)` to switch servers on retry, per review guidance.
- Add `HeterogeneousCore/SonicTriton/test/test_RetryActionDiffServer.cc` (arms → updateServer(fallback) → no-op on second retry → exception path is caught).
- Extend `HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py` with `--retryAction {same,diff}` and a verbose confirmation line.
- Add `TestHeterogeneousCoreSonicTritonRetryActionDiff_Log` to assert the selected retry policy.

PR validation:
- Built and ran unit/integration tests in a `CMSSW_15_1_0_pre6` area: `scram b -j 8`

TODO/To verify:
- `scram b runtests TEST=HeterogeneousCore/SonicTriton` (passes)
- Behavior when `Client.Retry` is explicitly configured.