Add nightly scale tests for self-hosted runners #23
.github/workflows/stress-test.yml
Outdated
name: Self-hosted Runners Nightly Stress Test
on:
  schedule:
    # Triggers at 11pm every night.
Small nit: can you include the timezone in the comment so it's obvious to anyone reviewing?
Added the timezone comment. The schedule is in UTC, so I changed the time to make it run at 11pm PST.
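For reference, a minimal sketch of what the adjusted trigger might look like. The exact cron value is an assumption (the updated diff is not shown in this thread); it is derived from PST being UTC-8, so 23:00 PST falls at 07:00 UTC the following day.

```yaml
on:
  schedule:
    # Triggers at 11pm PST every night. Scheduled workflows use UTC, and
    # PST is UTC-8, so 23:00 PST corresponds to 07:00 UTC the next day.
    # Assumption: a fixed UTC-8 offset. While daylight saving time is in
    # effect (PDT, UTC-7), the same entry fires at midnight local time.
    - cron: '0 7 * * *'
```

Because the cron expression has a single fixed offset, the comment is the only place a reviewer can see which local time was intended.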
.github/workflows/stress-test.yml
Outdated
on:
  schedule:
    # Triggers at 11pm every night.
    - cron: '0 23 * * *'
I recommend adding a workflow_dispatch trigger to this as well, for testing and manual invocation.
Added.
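A rough sketch of the combined triggers, assuming no manual inputs are needed. The cron value is an assumption (11pm PST expressed in UTC), not copied from the updated file.

```yaml
on:
  schedule:
    # Assumed value: 11pm PST expressed in UTC.
    - cron: '0 7 * * *'
  # Allows the workflow to be started by hand from the Actions tab or the
  # API, which is useful for testing changes without waiting for the
  # nightly run. No inputs are declared in this sketch.
  workflow_dispatch:
```

Note that a scheduled run uses the workflow file on the default branch, while a manual run can target another branch, which makes it convenient for trying out changes before they merge.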
.github/workflows/stress-test.yml
Outdated
@@ -0,0 +1,35 @@
# Nightly Stress Test for self-hosted runners
name: Self-hosted Runners Nightly Stress Test
Instead of "stress", maybe we should call this a scale test. We don't actually do anything with the runners, which makes me lean away from calling it a stress test long term.
Changed to scale.
.github/workflows/stress-test.yml
Outdated
shell: bash -ex {0}
steps:
  - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # ratchet:actions/checkout@v4
  - name: Install JAX test requirements
Do you see value in leaving the install of test requirements if we don't end up using them?
I removed it since it's not being used.
runs-on: ${{ matrix.runners }}
container:
  image: ${{ (contains(matrix.runners, 't2a') && 'us-central1-docker.pkg.dev/tensorflow-sigs/tensorflow/build-arm64:jax-latest-multi-python') || 'index.docker.io/tensorflow/build@sha256:7fb38f0319bda36393cad7f40670aa22352b44421bb906f5cf34d543acd8e1d2' }}
timeout-minutes: 10
Do you know whether this timeout includes the initialization of the runner?
I don't think it does, since it's supposed to cover only the execution time.
In that case it's fine.
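If runner startup time ever turns out to matter, one option is to bound the work itself with a step-level timeout in addition to the job-level one. The sketch below is illustrative only: the job id, the step, and the 5-minute value are hypothetical, and it does not settle whether runner initialization counts toward the job-level limit.

```yaml
jobs:
  scale-test:              # hypothetical job id
    runs-on: self-hosted
    timeout-minutes: 10    # job-level limit, as in the diff above
    steps:
      - name: Placeholder step    # hypothetical step
        timeout-minutes: 5        # limits only this step's own execution
        run: echo "runner is up"
```

A step-level `timeout-minutes` applies to that step alone, so it stays meaningful regardless of how the job-level limit treats setup time.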
Thanks @MichaelHudgins. PTAL.
When run under an optimized build and Python 3.13.2t, I saw the following high-probability crash in lax_control_flow_test:

```
Stack trace of thread 3526917:
#0  0x00007f0898c4bf91 dump_frame (libpython3.13t.so.1.0 + 0x24bf91)
#1  0x00007f0898c4b73f dump_traceback (libpython3.13t.so.1.0 + 0x24b73f)
#2  0x00007f0898c4b86f _Py_DumpTracebackThreads (libpython3.13t.so.1.0 + 0x24b86f)
#3  0x00007f0898cd4fe0 faulthandler_dump_traceback (libpython3.13t.so.1.0 + 0x2d4fe0)
#4  0x00007f0898cd4f44 faulthandler_fatal_error (libpython3.13t.so.1.0 + 0x2d4f44)
#5  0x00007f0898849e20 __restore_rt (libc.so.6 + 0x3fe20)
#6  0x00007f07eb80e493 _ZNSt8__detail16_Hashtable_allocISaINS_10_Hash_nodeISt4pairIKN3jax15WeakrefLRUCache15WeakrefCacheKeyENS4_17WeakrefCacheValueEELb1EEEEE18_M_deallocate_nodeEPS9_ (libjax_common.so + 0x2c0e493)
#7  0x00007f07eb80e13e _ZN3jax15WeakrefLRUCache5ClearEv (libjax_common.so + 0x2c0e13e)
#8  0x00007f07eb812e37 _ZZN8nanobind6detail11func_createILb0ELb1EZNS_16cpp_function_defIN3jax15WeakrefLRUCacheEvS4_JEJNS_5scopeENS_4nameENS_9is_methodENS_9lock_selfEEEEvMT1_FT0_DpT2_EDpRKT3_EUlPS4_E_vJSJ_EJLm0EEJS5_S6_S7_S8_EEEP>
#9  0x00007f07eb7fff70 _ZN8nanobind6detailL25nb_func_vectorcall_simpleEP7_objectPKS2_mS2_ (libjax_common.so + 0x2bfff70)
#10 0x00007f0898dbbdee _PyObject_VectorcallTstate (libpython3.13t.so.1.0 + 0x3bbdee)
#11 0x00007f0898d1d4db _PyEval_EvalFrame (libpython3.13t.so.1.0 + 0x31d4db)
#12 0x00007f0898d1ee78 _PyObject_VectorcallTstate (libpython3.13t.so.1.0 + 0x31ee78)
#13 0x00007f0898dc0054 _PyVectorcall_Call (libpython3.13t.so.1.0 + 0x3c0054)
#14 0x00007f0898d1d4db _PyEval_EvalFrame (libpython3.13t.so.1.0 + 0x31d4db)
#15 0x00007f0898d1e02c _PyObject_VectorcallDictTstate (libpython3.13t.so.1.0 + 0x31e02c)
#16 0x00007f0898ed8e35 slot_tp_call (libpython3.13t.so.1.0 + 0x4d8e35)
#17 0x00007f0898dbc312 _PyObject_MakeTpCall (libpython3.13t.so.1.0 + 0x3bc312)
#18 0x00007f0898d1d4db _PyEval_EvalFrame (libpython3.13t.so.1.0 + 0x31d4db)
#19 0x00007f0898d1ef54 _PyObject_VectorcallTstate (libpython3.13t.so.1.0 + 0x31ef54)
#20 0x00007f0899094c1f thread_run (libpython3.13t.so.1.0 + 0x694c1f)
#21 0x00007f0898fa0c58 pythread_wrapper (libpython3.13t.so.1.0 + 0x5a0c58)
#22 0x00007f089889c103 start_thread (libc.so.6 + 0x92103)
#23 0x00007f089891a7b8 __clone3 (libc.so.6 + 0x1107b8)
```

It appears that this is due to freeing Python objects during unordered_map::clear(), which may release the enclosing critical section (`nb::lock_self()` on the method). Fix this by deferring destruction of both the keys and the values until after the map's destruction.
The test will run on all supported Linux self-hosted runners at 11pm each night. It will spin up 5 different runner instances.
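One way such a fan-out could be expressed is with a matrix, sketched below. Everything beyond `matrix.runners` and `timeout-minutes: 10` is an assumption: the job id, the runner labels, and the `instance` axis are hypothetical, and the sketch reads "5 different runner instances" as five jobs per runner type, which may not match the actual workflow.

```yaml
jobs:
  scale-test:                       # hypothetical job id
    strategy:
      fail-fast: false              # let the remaining jobs finish if one fails
      matrix:
        # Hypothetical labels; substitute the real self-hosted runner labels.
        runners: [linux-x86-example, linux-arm64-t2a-example]
        # Hypothetical axis: five jobs per runner label.
        instance: [1, 2, 3, 4, 5]
    runs-on: ${{ matrix.runners }}
    timeout-minutes: 10
    steps:
      - run: echo "instance ${{ matrix.instance }} on ${{ matrix.runners }}"
```

Whether each job lands on a distinct runner instance depends on how the self-hosted runners are provisioned (for example, ephemeral or autoscaled runners), which the workflow file alone does not control.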