Skip to content

Conversation

ChrisPaulBennett
Copy link
Contributor

@ChrisPaulBennett ChrisPaulBennett commented Mar 12, 2025

This apart of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.

This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.

This adds a python profiler script. This profiler will will be ran by cylc in the same crgroup as the cylc task. It will periodically poll cgroups and save data to a file. Cylc will then store these values in the sql db file.

Linked to;
cylc/cylc-ui#2100
cylc/cylc-uiserver#675

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@ChrisPaulBennett ChrisPaulBennett marked this pull request as draft March 12, 2025 09:19
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Changed the name of the profiler module.
Linting

Profiler sends KB instead of bytes

Time Series now working

CPU/Memory Logging working
@ChrisPaulBennett ChrisPaulBennett force-pushed the cylc_profiler branch 2 times, most recently from 4a6dbe8 to 30a7bb0 Compare April 1, 2025 15:31
ChrisPaulBennett and others added 19 commits April 2, 2025 09:34
Initial profiler implementation (non working)

Changed the name of the profiler module.
Linting

Profiler sends KB instead of bytes

Time Series now working

CPU/Memory Logging working

Adding profiler unit tests

updating tests

Fail gracefully if cgroups cannot be found

Revert "Fail gracefully if cgroups cannot be found"

This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.

Linting

Modifying unit tests
Linting
Changed the name of the profiler module.

Profiler sends KB instead of bytes

Time Series now working
* The COPYING file appears to have moved from `dist-info/COPYING` into
  `dist-info/licenses/COPYING`.
* Attempt to fix flaky test.
* Cut out shell profile files to omit some spurious stderr.
* Attempt to fix flaky test.
* Cut out shell profile files to omit some spurious stderr.
Revert "Fail gracefully if cgroups cannot be found"

This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.

Adding profiler unit tests

updating tests
###############################################################################
# Save the data using cylc message and exit the profiler
cylc__kill_profiler() {
if [[ -n "${profiler_pid:-}" && -d "/proc/${profiler_pid}" ]]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The POSIX standard is a set of interfaces that all compliant operating systems must follow (Mac OS included).

The /proc filesystem ain't POSIX, it's a Linux specific thing, hence the fun you had with Mac OS testing.

The most POSIX way to determine if a process is running that I could find is ps -p:

Suggested change
if [[ -n "${profiler_pid:-}" && -d "/proc/${profiler_pid}" ]]; then
if [[ -n "${profiler_pid:-}" ]] && ps -p "$profiler_pid" >/dev/null;

You can see the POSIX description for the ps command here:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/ps.html

The index of commands is here:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/

Comment on lines +38 to +41
max_rss_location = None
cpu_time_location = None
cgroup_version = None
comms_timeout = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused global variables:

Suggested change
max_rss_location = None
cpu_time_location = None
cgroup_version = None
comms_timeout = None

Comment on lines +64 to +70
global comms_timeout
# Register the stop_profiler function with the signal library
signal.signal(signal.SIGINT, stop_profiler)
signal.signal(signal.SIGHUP, stop_profiler)
signal.signal(signal.SIGTERM, stop_profiler)

comms_timeout = options.comms_timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused global variables:

Suggested change
global comms_timeout
# Register the stop_profiler function with the signal library
signal.signal(signal.SIGINT, stop_profiler)
signal.signal(signal.SIGHUP, stop_profiler)
signal.signal(signal.SIGTERM, stop_profiler)
comms_timeout = options.comms_timeout
# Register the stop_profiler function with the signal library
signal.signal(signal.SIGINT, stop_profiler)
signal.signal(signal.SIGHUP, stop_profiler)
signal.signal(signal.SIGTERM, stop_profiler)

Comment on lines +151 to +161
# HPC uses cgroups v2 and SPICE uses cgroups v1
global cgroup_version
if Path.exists(Path(cgroup_location + cgroup_name)):
cgroup_version = 2
return cgroup_version
elif Path.exists(Path(cgroup_location + "/memory" + cgroup_name)):
cgroup_version = 1
return cgroup_version
else:
raise FileNotFoundError("Cgroup not found at " +
cgroup_location + cgroup_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused global variables:

(also Met Office specific comment)

Suggested change
# HPC uses cgroups v2 and SPICE uses cgroups v1
global cgroup_version
if Path.exists(Path(cgroup_location + cgroup_name)):
cgroup_version = 2
return cgroup_version
elif Path.exists(Path(cgroup_location + "/memory" + cgroup_name)):
cgroup_version = 1
return cgroup_version
else:
raise FileNotFoundError("Cgroup not found at " +
cgroup_location + cgroup_name)
if Path.exists(Path(cgroup_location + cgroup_name)):
return 2
elif Path.exists(Path(cgroup_location + "/memory" + cgroup_name)):
return 1
else:
raise FileNotFoundError("Cgroup not found at " +
cgroup_location + cgroup_name)

def get_cgroup_name():
"""Get the cgroup directory for the current process"""

# fugly hack to allow functional tests to use test data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤣

Comment on lines +188 to +208
global max_rss_location
global cpu_time_location
if version == 2:
max_rss_location = location + name + "/" + "memory.peak"
cpu_time_location = location + name + "/" + "cpu.stat"
return Process(
cgroup_memory_path=location +
name + "/" + "memory.peak",
cgroup_cpu_path=location +
name + "/" + "cpu.stat")

elif version == 1:
max_rss_location = (location + "/memory" +
name + "/memory.max_usage_in_bytes")
cpu_time_location = (location + "/cpu" +
name + "/cpuacct.usage")
return Process(
cgroup_memory_path=location + "/memory" +
name + "/memory.max_usage_in_bytes",
cgroup_cpu_path=location + "/cpu" +
name + "/cpuacct.usage")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused global variables:

Suggested change
global max_rss_location
global cpu_time_location
if version == 2:
max_rss_location = location + name + "/" + "memory.peak"
cpu_time_location = location + name + "/" + "cpu.stat"
return Process(
cgroup_memory_path=location +
name + "/" + "memory.peak",
cgroup_cpu_path=location +
name + "/" + "cpu.stat")
elif version == 1:
max_rss_location = (location + "/memory" +
name + "/memory.max_usage_in_bytes")
cpu_time_location = (location + "/cpu" +
name + "/cpuacct.usage")
return Process(
cgroup_memory_path=location + "/memory" +
name + "/memory.max_usage_in_bytes",
cgroup_cpu_path=location + "/cpu" +
name + "/cpuacct.usage")
if version == 2:
return Process(
cgroup_memory_path=location +
name + "/" + "memory.peak",
cgroup_cpu_path=location +
name + "/" + "cpu.stat")
elif version == 1:
return Process(
cgroup_memory_path=location + "/memory" +
name + "/memory.max_usage_in_bytes",
cgroup_cpu_path=location + "/cpu" +
name + "/cpuacct.usage")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These global variables are used. They are used in the stop_profiler function.
I can't see a way to pass it arguments when it is called by the signal library. Using globals is the only way I can see to do it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants