@aybchan aybchan commented Jul 15, 2025

No description provided.

aybchan and others added 29 commits June 26, 2025 14:50
This meant that ~40-80GB GPUs would run >4 parallel jobs.
Add `GKE` `MaxText` train ([example
run](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/15744603099/job/44379358307))
and `NCCL` test ([example
run](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/15744603099/job/44378422712))
workflows with reusable composite action for managing `xpk` job
lifecycle (launch, logs streaming, clean up, artifact upload).

Patches on `xpk` address the following identified issues:
- AI-Hypercomputer/xpk#476
- AI-Hypercomputer/xpk#488
- AI-Hypercomputer/xpk#490
- AI-Hypercomputer/xpk#491
- AI-Hypercomputer/xpk#492

Cluster create with `xpk` ([example
run](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/15591134618/job/43910254644#step:5:1))
- added as a separate
[workflow](https://github.com/NVIDIA/JAX-Toolbox/pull/1481/files#diff-801fc28cafbf1e0fa0ea521355fa8a1c9e6c01dcb8b1083c47f66e2ead4d560a)
for demonstration purposes (will not be operational in the CI)

---------

Co-authored-by: Olli Lupton <[email protected]>
Upgrade werkzeug to avoid the vulnerabilities in 2.0.3. To be able to do
that, google-cloud-aiplatform needs to be at least 1.90.0 (refer to
https://github.com/googleapis/python-aiplatform/blob/v1.90.0/setup.py#L51).
This helps CUDA forward compatibility work when spawning processes over
SSH, as those processes do not see environment variables set by the
container entrypoint that handles forward compatibility.
`/usr/local/cuda/compat/lib` will only exist if the entrypoint detects
that forward compatibility mode is enabled.
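
The change itself is not shown in this conversation. As a rough illustration of the underlying problem, the sketch below shows how a process started outside the container entrypoint (for example over SSH) could add the compat directory to its own library path when that directory exists. The helper name `env_with_cuda_compat` and the use of `LD_LIBRARY_PATH` are assumptions made for this example, not the mechanism used in the commit.

```python
import os
from pathlib import Path

# Path taken from the commit message; it only exists when the entrypoint
# has detected that forward compatibility mode is enabled.
COMPAT_DIR = "/usr/local/cuda/compat/lib"


def env_with_cuda_compat(env=None, compat_dir=COMPAT_DIR):
    """Return a copy of `env` with `compat_dir` first in LD_LIBRARY_PATH.

    Hypothetical helper: if `compat_dir` does not exist, the environment
    is returned unchanged, mirroring "forward compatibility not enabled".
    """
    env = dict(os.environ if env is None else env)
    if not Path(compat_dir).is_dir():
        return env
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    if compat_dir not in parts:
        parts.insert(0, compat_dir)
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env
```

A launcher could pass the returned mapping as the `env` argument of `subprocess.run` so that child processes see the compat libraries even when they do not inherit the entrypoint's variables.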
They just so happened to get upgraded on July 31st together, but
- orbax-checkpoint 0.11.20 has issues with our internal checkpoint
testing
- pip-tools 7.5.0 will cause `ValueError: '/opt/maxtext/requirements.txt
(line 1)' is not in the subpath of '/opt/pip-tools.d'`. I'm guessing
something is not quite compatible with the Python 3.12 we currently have
in the base container. Theoretically, `-r ../maxtext/requirements.txt`
should work, but since we are using a specific version of pip, let's
play it safe at this point and use 7.4.1.
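
The wording "is not in the subpath of" matches the message that Python's `pathlib` raises from `relative_to` when one path is not located under the other. Whether pip-tools 7.5.0 fails at exactly such a call is an assumption; the sketch below only reproduces that class of error with the two paths from the message and shows that a `..`-style relative path, like the one in `-r ../maxtext/requirements.txt`, can still be computed. The helper names are hypothetical.

```python
import posixpath
from pathlib import PurePosixPath

# Paths taken from the error message quoted above.
REQUIREMENTS = PurePosixPath("/opt/maxtext/requirements.txt")
BASE_DIR = PurePosixPath("/opt/pip-tools.d")


def strict_relative(path, base):
    """Return `path` relative to `base`, or None if `path` is not under `base`.

    PurePosixPath.relative_to raises ValueError when `path` is not a
    descendant of `base`.
    """
    try:
        return str(path.relative_to(base))
    except ValueError:
        return None


def walked_relative(path, base):
    """Return `path` relative to `base`, allowing `..` components."""
    return posixpath.relpath(str(path), str(base))


print(strict_relative(REQUIREMENTS, BASE_DIR))  # None
print(walked_relative(REQUIREMENTS, BASE_DIR))  # ../maxtext/requirements.txt
```

The strict form fails because `/opt/maxtext` is a sibling of `/opt/pip-tools.d` rather than a child of it, which is consistent with the error appearing only when a tool insists on a descendant path.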