CPU and Max RSS Analysis tools #6663
Changed the name of the profiler module.
Linting
Profiler sends KB instead of bytes
Time Series now working
CPU/Memory Logging working
Initial profiler implementation (non working)
Changed the name of the profiler module.
Linting
Profiler sends KB instead of bytes
Time Series now working
CPU/Memory Logging working
Adding profiler unit tests
updating tests
Fail gracefully if cgroups cannot be found
Revert "Fail gracefully if cgroups cannot be found"
This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.
Linting
Modifying unit tests
Linting
Changed the name of the profiler module.
Profiler sends KB instead of bytes
Time Series now working
Co-authored-by: Hilary James Oliver <[email protected]>
* The COPYING file appears to have moved from `dist-info/COPYING` into `dist-info/licenses/COPYING`.
Co-authored-by: Hilary James Oliver <[email protected]>
* Attempt to fix flaky test.
* Cut out shell profile files to omit some spurious stderr.
Revert "Fail gracefully if cgroups cannot be found"
This reverts commit 92e1e11c9b392b4742501d399f191f590814e95e.
Adding profiler unit tests
updating tests
###############################################################################
# Save the data using cylc message and exit the profiler
cylc__kill_profiler() {
    if [[ -n "${profiler_pid:-}" && -d "/proc/${profiler_pid}" ]]; then
The POSIX standard is a set of interfaces that all compliant operating systems must follow (Mac OS included).
The `/proc` filesystem ain't POSIX, it's a Linux specific thing, hence the fun you had with Mac OS testing.
The most POSIX way to determine if a process is running that I could find is `ps -p`:
-    if [[ -n "${profiler_pid:-}" && -d "/proc/${profiler_pid}" ]]; then
+    if [[ -n "${profiler_pid:-}" ]] && ps -p "$profiler_pid" >/dev/null; then
You can see the POSIX description for the `ps` command here:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/ps.html
The index of commands is here:
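For the Python side of the profiler, a comparable portable check that avoids `/proc` is sending signal 0 with `os.kill`. This is a different technique from `ps -p` (it relies on the POSIX `kill()` behaviour where signal 0 only performs error checking), and the helper name `is_running` is hypothetical rather than something from this PR. A minimal sketch:

```python
import os


def is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX, no /proc needed)."""
    try:
        # Signal 0 delivers nothing; the call only checks that the PID exists
        # and that we are allowed to signal it.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


if __name__ == "__main__":
    print(is_running(os.getpid()))
```

Note that an unreaped (zombie) child still counts as existing under this check, as it would with `ps -p`.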
max_rss_location = None
cpu_time_location = None
cgroup_version = None
comms_timeout = None
Unused global variables:
-max_rss_location = None
-cpu_time_location = None
-cgroup_version = None
-comms_timeout = None
    global comms_timeout
    # Register the stop_profiler function with the signal library
    signal.signal(signal.SIGINT, stop_profiler)
    signal.signal(signal.SIGHUP, stop_profiler)
    signal.signal(signal.SIGTERM, stop_profiler)

    comms_timeout = options.comms_timeout
Unused global variables:
-    global comms_timeout
-    # Register the stop_profiler function with the signal library
-    signal.signal(signal.SIGINT, stop_profiler)
-    signal.signal(signal.SIGHUP, stop_profiler)
-    signal.signal(signal.SIGTERM, stop_profiler)
-    comms_timeout = options.comms_timeout
+    # Register the stop_profiler function with the signal library
+    signal.signal(signal.SIGINT, stop_profiler)
+    signal.signal(signal.SIGHUP, stop_profiler)
+    signal.signal(signal.SIGTERM, stop_profiler)
    # HPC uses cgroups v2 and SPICE uses cgroups v1
    global cgroup_version
    if Path.exists(Path(cgroup_location + cgroup_name)):
        cgroup_version = 2
        return cgroup_version
    elif Path.exists(Path(cgroup_location + "/memory" + cgroup_name)):
        cgroup_version = 1
        return cgroup_version
    else:
        raise FileNotFoundError("Cgroup not found at " +
                                cgroup_location + cgroup_name)
Unused global variables:
(also Met Office specific comment)
-    # HPC uses cgroups v2 and SPICE uses cgroups v1
-    global cgroup_version
-    if Path.exists(Path(cgroup_location + cgroup_name)):
-        cgroup_version = 2
-        return cgroup_version
-    elif Path.exists(Path(cgroup_location + "/memory" + cgroup_name)):
-        cgroup_version = 1
-        return cgroup_version
-    else:
-        raise FileNotFoundError("Cgroup not found at " +
-                                cgroup_location + cgroup_name)
+    if Path.exists(Path(cgroup_location + cgroup_name)):
+        return 2
+    elif Path.exists(Path(cgroup_location + "/memory" + cgroup_name)):
+        return 1
+    else:
+        raise FileNotFoundError("Cgroup not found at " +
+                                cgroup_location + cgroup_name)
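As a self-contained illustration of the suggested shape (return the version rather than store it in a global), here is a rough sketch. The function name `get_cgroup_version` and its signature are assumptions, since the diff only shows the function body; `cgroup_name` is assumed to begin with `/`, as the string concatenation in the PR implies.

```python
from pathlib import Path


def get_cgroup_version(cgroup_location: str, cgroup_name: str) -> int:
    """Return 2 or 1 depending on which cgroup layout exists on disk.

    Hypothetical signature; mirrors the checks shown in the diff.
    """
    # cgroups v2: a single unified hierarchy, e.g. <location><name>
    if Path(cgroup_location + cgroup_name).exists():
        return 2
    # cgroups v1: one hierarchy per controller, e.g. <location>/memory<name>
    if Path(cgroup_location + "/memory" + cgroup_name).exists():
        return 1
    raise FileNotFoundError(
        "Cgroup not found at " + cgroup_location + cgroup_name)
```

The caller can then keep the result in a local variable and pass it on explicitly, for example `version = get_cgroup_version(location, name)`.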
def get_cgroup_name():
    """Get the cgroup directory for the current process"""

    # fugly hack to allow functional tests to use test data
🤣
    global max_rss_location
    global cpu_time_location
    if version == 2:
        max_rss_location = location + name + "/" + "memory.peak"
        cpu_time_location = location + name + "/" + "cpu.stat"
        return Process(
            cgroup_memory_path=location +
            name + "/" + "memory.peak",
            cgroup_cpu_path=location +
            name + "/" + "cpu.stat")

    elif version == 1:
        max_rss_location = (location + "/memory" +
                            name + "/memory.max_usage_in_bytes")
        cpu_time_location = (location + "/cpu" +
                             name + "/cpuacct.usage")
        return Process(
            cgroup_memory_path=location + "/memory" +
            name + "/memory.max_usage_in_bytes",
            cgroup_cpu_path=location + "/cpu" +
            name + "/cpuacct.usage")
Unused global variables:
-    global max_rss_location
-    global cpu_time_location
-    if version == 2:
-        max_rss_location = location + name + "/" + "memory.peak"
-        cpu_time_location = location + name + "/" + "cpu.stat"
-        return Process(
-            cgroup_memory_path=location +
-            name + "/" + "memory.peak",
-            cgroup_cpu_path=location +
-            name + "/" + "cpu.stat")
-    elif version == 1:
-        max_rss_location = (location + "/memory" +
-                            name + "/memory.max_usage_in_bytes")
-        cpu_time_location = (location + "/cpu" +
-                             name + "/cpuacct.usage")
-        return Process(
-            cgroup_memory_path=location + "/memory" +
-            name + "/memory.max_usage_in_bytes",
-            cgroup_cpu_path=location + "/cpu" +
-            name + "/cpuacct.usage")
+    if version == 2:
+        return Process(
+            cgroup_memory_path=location +
+            name + "/" + "memory.peak",
+            cgroup_cpu_path=location +
+            name + "/" + "cpu.stat")
+    elif version == 1:
+        return Process(
+            cgroup_memory_path=location + "/memory" +
+            name + "/memory.max_usage_in_bytes",
+            cgroup_cpu_path=location + "/cpu" +
+            name + "/cpuacct.usage")
These global variables are used: they are read in the `stop_profiler` function.
I can't see a way to pass it arguments when it is called by the signal library, so using globals is the only way I can see to do it.
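One possible way around the globals, sketched under assumptions: `signal.signal` calls its handler as `handler(signum, frame)`, so any extra state can be pre-bound with `functools.partial` (a closure or a bound method would work equally well). The `Process` dataclass below is a stand-in built from the two keyword arguments visible in the diff, and the argument order of `stop_profiler` and the helper `register_handlers` are hypothetical, not the PR's actual code.

```python
import signal
from dataclasses import dataclass
from functools import partial


@dataclass
class Process:
    """Stand-in for the PR's Process object (fields taken from the diff)."""
    cgroup_memory_path: str
    cgroup_cpu_path: str


def stop_profiler(process, comms_timeout, signum, frame):
    """Hypothetical handler: extra state arrives as pre-bound arguments."""
    # The real handler would report the final values and exit; this sketch
    # only shows that the bound arguments are available without globals.
    print(f"signal {signum}: would read {process.cgroup_memory_path} "
          f"and {process.cgroup_cpu_path} (timeout {comms_timeout})")


def register_handlers(process, comms_timeout):
    """Register one handler for all three signals and return it."""
    # partial() fills in the first two arguments, leaving (signum, frame)
    # for the signal machinery to supply.
    handler = partial(stop_profiler, process, comms_timeout)
    for sig in (signal.SIGINT, signal.SIGHUP, signal.SIGTERM):
        signal.signal(sig, handler)
    return handler
```

`register_handlers` has to be called from the main thread, which is a general restriction of `signal.signal` in Python.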
This is part of 3 pull requests for adding CPU time and Max RSS analysis to the Cylc UI.
This adds the Max RSS and CPU time (as measured by cgroups) to the table view, box plot and time series views.
This adds a Python profiler script. The profiler will be run by Cylc in the same cgroup as the Cylc task. It will periodically poll cgroups and save data to a file. Cylc will then store these values in the SQL DB file.
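To make the polling step concrete, here is a rough sketch for the cgroups v2 case. It relies on two facts about the kernel interface: `memory.peak` contains a single integer (bytes) and `cpu.stat` contains `key value` lines including `usage_usec`. Everything else is an assumption for illustration: the function names, the output file format, and the bytes-to-KB conversion (suggested only by the "Profiler sends KB instead of bytes" commit message).

```python
import time
from pathlib import Path


def read_max_rss_kb(memory_peak_path: str) -> int:
    """Read peak memory usage from memory.peak (bytes) and return KB."""
    return int(Path(memory_peak_path).read_text().strip()) // 1024


def read_cpu_time_usec(cpu_stat_path: str) -> int:
    """Read total CPU time in microseconds from a cgroups v2 cpu.stat file."""
    for line in Path(cpu_stat_path).read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value)
    raise ValueError(f"usage_usec not found in {cpu_stat_path}")


def poll(memory_peak_path, cpu_stat_path, out_path, samples, delay):
    """Append one 'max_rss_kb cpu_time_usec' line per sample (assumed format)."""
    with open(out_path, "a") as out:
        for _ in range(samples):
            max_rss = read_max_rss_kb(memory_peak_path)
            cpu_time = read_cpu_time_usec(cpu_stat_path)
            out.write(f"{max_rss} {cpu_time}\n")
            out.flush()  # keep the file usable if the profiler is killed
            time.sleep(delay)
```

A real profiler would loop until signalled rather than for a fixed number of samples; the `samples` parameter just keeps the sketch bounded.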
Linked to:
cylc/cylc-ui#2100
cylc/cylc-uiserver#675
Check List
CONTRIBUTING.md and added my name as a Code Contributor.
setup.cfg (and conda-environment.yml if present).
?.?.x branch.