slurm: add process-level tags #21822
💡 Codex Review
Here are some automated review suggestions for this pull request.
slurm/datadog_checks/slurm/check.py
Outdated
    new_header = SCONTROL_TAG_MAPPING.get(header, f"slurm_{header.lower()}")
    tags.append(f"{new_header}:{value}")

    if new_header == "pid":
        tags.extend(tagger.tag(f"process://{value}", tagger.ORCHESTRATOR))
Guard against missing process tags
The new call to `tagger.tag` assumes the tagger returns a list, but the tagger returns `None` when it has no tags for an entity (e.g., when the process is not yet known or the companion Agent change is not deployed). Extending `tags` with `None` raises `TypeError: 'NoneType' object is not iterable`, which will crash the SLURM check on nodes where process tags are unavailable, precisely the scenario the description says should be harmless. Wrap the result in `or []` (or check for `None`) before extending so the check keeps running when no tags exist.
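A minimal sketch of the suggested guard, for illustration only. `StubTagger` is a hypothetical stand-in for the Agent's tagger (the real module is only available inside the Agent), and the tag value is made up. Whether the real `tagger.tag` can actually return `None` is disputed in this thread; the stub simply models that case.

```python
class StubTagger:
    """Hypothetical stand-in for the Agent tagger; not the real API surface."""

    ORCHESTRATOR = 1  # placeholder for the cardinality constant

    def __init__(self, known_tags):
        self._known_tags = known_tags

    def tag(self, entity_id, cardinality):
        # Return None for unknown entities to model the failure mode
        # described in the review comment.
        return self._known_tags.get(entity_id)


def process_tags(tagger, pid):
    """Return the tagger's tags for a PID, or an empty list if there are none."""
    return tagger.tag(f"process://{pid}", tagger.ORCHESTRATOR) or []


stub = StubTagger({"process://1234": ["example_tag:value"]})

tags = ["pid:1234"]
tags.extend(process_tags(stub, "1234"))  # known process: tags are appended

other_tags = ["pid:9999"]
other_tags.extend(process_tags(stub, "9999"))  # unknown process: no TypeError
```

With the `or []` fallback, an empty or missing tag set leaves the existing tags untouched, so the check behaves the same whether or not the companion Agent change is deployed.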
I think tagger.tag will always return a list, but @codex address that feedback anyways just in case.
To use Codex here, create an environment for this repo.
The following files, which will be shipped with the agent, were modified in this PR and You can ignore this if you are sure the changes in this PR do not require QA. Otherwise, consider removing the label. List of modified files that will be shipped with the agent |
@gjulianm Would you mind providing an example tag that is being added here? Or, even better, add it as a comment in the code?

Added example tags as a comment.
What does this PR do?
This PR adds tagging support to the SLURM check, so that processes with GPU activity will get automatically tagged.
Tested on a local Slurm cluster.
Motivation
https://datadoghq.atlassian.net/browse/EBPF-885
Related Agent PR: DataDog/datadog-agent#42524 (the integration will not fail if the Agent PR is not merged; the Agent will just return an empty tag set)
Review checklist (to be filled by reviewers)
- Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
- Add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged.