-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Add Github Actions workflow that runs on a self-hosted TPU VM runner. #13000
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
FYI the sample run is failing due to actual test failures. I'd like to check this in even if it's red initially (maybe it's a good thing there are no notifications yet :)), so we can start fixing + testing based on this and see how it goes. Also cc @yejingxin who's working on getting the tests green. |
schedule: | ||
- cron: "0 14 * * *" # daily at 7am PST | ||
workflow_dispatch: # allows triggering the workflow run manually | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to set permissions for the GitHub token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're already set to read-only via the project settings. Should I also set it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's sufficient, but it's best practice to also include them here, in case someone modifies the project settings later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, done. I put contents: read
, that seems to be sufficient (https://github.com/skye/jax/actions/runs/3365397735/jobs/5580821741). It's unclear to me if it even needs that, or if it gets the current checkout "for free".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't heard of that best practice before. I guess it makes some sense, but seems a bit verbose. The people changing the project settings should be at least repository admins, and if they change the default settings it's presumably because that's the goal they're trying to achieve. In IREE, I've actually set the default to read-only at the organization level and that appears to make it impossible to set to write at the repo level (which is actually odd; I thought that was only supposed to set the default). I guess this is more explicit in that it documents that this workflow only needs read access to the repository contents and not to releases or whatever. I don't know if I actually have a point...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config-as-code is nice to have during code / audit reviews and incidence response, especially that it's part of the commit history... unlike the settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, fair
schedule: | ||
- cron: "0 14 * * *" # daily at 7am PST | ||
workflow_dispatch: # allows triggering the workflow run manually | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't heard of that best practice before. I guess it makes some sense, but seems a bit verbose. The people changing the project settings should be at least repository admins, and if they change the default settings it's presumably because that's the goal they're trying to achieve. In IREE, I've actually set the default to read-only at the organization level and that appears to make it impossible to set to write at the repo level (which is actually odd; I thought that was only supposed to set the default). I guess this is more explicit in that it documents that this workflow only needs read access to the repository contents and not to releases or whatever. I don't know if I actually have a point...
|
||
jobs: | ||
cloud-tpu-test: | ||
runs-on: v4-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's "strongly recommended" to include "self-hosted" here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I also added "tpu" cuz why not.
python-version: ["3.10"] # TODO(jakevdp): update to 3.11 when available. | ||
jaxlib-version: [latest, nightly] | ||
steps: | ||
- uses: actions/checkout@v3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google security policy requires specifying an explicit hash here. See https://opensource.google/documentation/reference/github/services#actions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oo didn't know about this, done. (We need to update our other workflows too, but I'll do that as a separate PR)
pip uninstall -y jax jaxlib libtpu-nightly | ||
if [ "${{ matrix.jaxlib-version }}" == "latest" ]; then | ||
pip install .[tpu] \ | ||
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh unrelated to this PR, but I'd actually be interested in how you decided to set this up. We used to parse the GitHub releases page, but they changed the format, so now we make a hacky not-index and are looking for what the best long-term solution is: iree-org/iree#9532
I see you're using -f
and --pre
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only use --pre
when installing jaxlib nightlies, so pip picks up dev versions (the nightlies are released as jaxlib
dev versions). We generate the index file directly from the GCS bucket contents. We decided to do this because we host many large wheels so would quickly exceed pypi's size limits. Happy to answer more questions on this here or offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah let's take this offline :-)
- name: Run tests | ||
env: | ||
JAX_PLATFORMS: tpu,cpu | ||
run: python -m pytest --tb=short tests examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The brevity of this job is 👨🍳 💋
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I promise it'll get worse with time :)
@@ -0,0 +1 @@ | |||
ACTIONS_RUNNER_HOOK_JOB_STARTED=$HOME/jax/.github/workflows/self_hosted_runner_utils/validate_job.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you clone the whole jax repo onto the runner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeop
@@ -0,0 +1,59 @@ | |||
name: Cloud TPU nightly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Here for threading]
You will find that getting notifications is way harder than it should be. This is our issue for it: iree-org/iree#9857. Would love to combine efforts on this though. No point reinventing solutions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gahh! I kinda knew about this but didn't realize how dire the situation was. @yashk2810 will be particularly unhappy. But yes, let's combine efforts, I would really love to figure out a workable solution for notifications.
As a potential workaround for email altogether, @jakevdp wrote this workflow that files an issue on failure: https://github.com/google/jax/blob/main/.github/workflows/upstream-nightly.yaml#L114
It has some fancy logic for reusing an existing open issue, so it doesn't end up spamming the issue tracker on repeated failures.
Alternatively, @alexalemi wrote this workflow for pinging Google Chat on new releases: https://github.com/google/jax/blob/main/.github/workflows/release-notification.yml
Presumably this could be reused to start a chat thread on CI failure. Everyone uses chat instead of email now right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh filing issues! That's actually a great idea. We should do that. I'm less psyched about chat integration. I think it's bad the amount of stuff that has moved to chat services...
return 1 | ||
} | ||
|
||
if ! is_contained "${GITHUB_EVENT_NAME}" "${ALLOWED_EVENTS[@]}"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add other sort of validation like branch names? The triggers don't accept arbitrary branches. The workflow_dispatch does but is limited to users with push, so low risk I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to allow arbitrary branches actually, because it's useful for testing the workflow (since we can't use PRs). We could force devs to setup a fork with their own runner for testing, but given it's low risk to keep as-is, I'd prefer to keep it semi-convenient.
FYI I just made a fairly large change to setup_runner.sh and the corresponding instructions in go/jax-self-hosted-runners to create a new Thanks all for your comments here and in the doc. I'm gonna take a pass at the PR comments now and will push a new commit. |
So I deliberately made the runner user have sudo because that matches what GitHub hosted runners look like. That was also the recommendation from @sethvargo. root access can be quite useful for installing software, looking at low-level system info, etc. I guess in your case that's much less likely to be the case, since these are very specialized runners. Are you planning to do any benchmarking on these? That frequently requires root access to do properly. |
@@ -0,0 +1,61 @@ | |||
name: Cloud TPU nightly | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GMNGeoffrey stealing your idea of starting a thread.
So I deliberately made the runner user have sudo because that matches what GitHub hosted runners look like. That was also the recommendation from @sethvargo. root access can be quite useful for installing software, looking at low-level system info, etc. I guess in your case that's much less likely to be the case, since these are very specialized runners. Are you planning to do any benchmarking on these? That frequently requires root access to do properly.
Huh. Well I think like the idea of having a separate user anyway (I actually meant to ask you about this with your setup, but you were OOO, and then I forgot :)). We don't need sudo access for this workflow (yet?), and we can always enable that later. But it does add complexity. So I'm leaning towards keeping as-is, but could easily be convinced otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I do have a separate user. It just has sudo :-) I think there are arguments either way. I'm not sure how much protection you actually get from someone not having sudo? In our case the runners are ephemeral though, so it matters less
Does anyone have further comments I should address before submitting this change? (I'd like to keep the various discussions going, but would also like to get this change in and iterate.) I'll probably submit in a few hours if I don't hear from anyone, so please ping me if you'd like more time to review. |
echo "Event type '${GITHUB_EVENT_NAME}' is not allowed on this runner. Aborting workflow." | ||
# clean up any nefarious stuff we may have fetched in job setup. | ||
cd ~/actions-runner/_work | ||
rm -rfv _actions/ _temp/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I'm not totally sure that this is all the directories that the fetch step could have populated. In our case, we're using ephemeral runners, so that matters much less, but something to investigate. The whole thing where this fetching happens before the pre-job hook is totally weird to me: actions/runner#1951
schedule: | ||
- cron: "0 14 * * *" # daily at 7am PST | ||
workflow_dispatch: # allows triggering the workflow run manually | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, fair
pip uninstall -y jax jaxlib libtpu-nightly | ||
if [ "${{ matrix.jaxlib-version }}" == "latest" ]; then | ||
pip install .[tpu] \ | ||
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah let's take this offline :-)
@@ -0,0 +1,59 @@ | |||
name: Cloud TPU nightly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh filing issues! That's actually a great idea. We should do that. I'm less psyched about chat integration. I think it's bad the amount of stuff that has moved to chat services...
@@ -0,0 +1,61 @@ | |||
name: Cloud TPU nightly | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I do have a separate user. It just has sudo :-) I think there are arguments either way. I'm not sure how much protection you actually get from someone not having sudo? In our case the runners are ephemeral though, so it matters less
- cron: "0 14 * * *" # daily at 7am PST | ||
workflow_dispatch: # allows triggering the workflow run manually | ||
|
||
# Should also be set to read-only in the project settings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to tell here what the "should" means here. I'd expand a bit.
# Should also be set to read-only in the project settings. | |
# This should also be set to read-only in the project settings, but it's nice to document and enforce the permissions here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
runs-on: [self-hosted, tpu, v4-8] | ||
defaults: | ||
run: | ||
shell: bash -l {0} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on why login shells?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, I copied it from another workflow :) I tried without specifying shell
at all and it seems to work, so I'm just gonna remove this.
steps: | ||
# https://opensource.google/documentation/reference/github/services#actions | ||
# mandates using a specific commit for non-Google actions. | ||
- uses: actions/checkout@93ea575cb5d8a053eaa0ac8fa3b40d7e05a33cc8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our workflows we've added a comment like # v3
to indicate what version this commit points to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. FYI @sethvargo pointed me to https://github.com/sethvargo/ratchet, which I wanna try to update all of our workflows (but I'm saving that for a separate change).
- name: Set up Python ${{ matrix.python-version }} | ||
uses: actions/setup-python@13ae5bb136fac2878aff31522b9efb785519f984 | ||
with: | ||
python-version: ${{ matrix.python-version }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could cache the pip installs here
https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages
python-version: ${{ matrix.python-version }} | |
python-version: ${{ matrix.python-version }} | |
cache: 'pip' | |
cache-dependency-path: build/test-requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work? Like where is it cached? Right now everything is cached since it's not an ephemeral runner, maybe that's bad...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The GitHub actions cache. It actually might not work very well because you're not running in Azure right next to their caches
# GCP VM as `runner` using gcloud. Don't do that! | ||
sudo useradd runner -m | ||
|
||
# Find the latest actions-runner download |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you want automatic updates here and not a pinned version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I thought we had to upgrade to the latest version within 30 days of it being released, but rereading https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#controlling-runner-software-updates-on-self-hosted-runners, I think that's only if you've opted out of auto-updates. Either way, it's gonna update itself. So it would be arguably be simpler to use a pinned version and just have it update, but I'm gonna leave it as-is just in case since it sounds like Github isn't interested in maintaining old versions for very long. I added a comment explaining it'll auto-update itself and linking to the policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah if you turn off auto-update you have to pick up the new version within 30 days. I turned off auto-update and we manually manage bumping the pinned version. This is more work, but something something supply chain security (plus random breakages that happen at confusing times)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, basically the thing it says in the ratchet README :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrrrrmph I'm gonna risk it, I think the chances of us not updating within 30 days is high, especially since it looks like they update quite often? https://github.com/actions/runner/releases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah fair. It is a bit annoying to remember to update
--name $runner_name | ||
|
||
# Setup pre-job hook | ||
cat ~/jax/.github/workflows/self_hosted_runner_utils/runner.env | envsubst >> ~/actions-runner/.env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oo that's a neat trick. So envsubst allows you to write env variables in your config file and then they get filled in here (like you have $HOME)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup!
echo "actions_runner_download: $actions_runner_download" | ||
|
||
# Run the rest of the setup as `runner` | ||
sudo -i -u runner bash -ex <<EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I didn't like heredoc'ing this and created a separate script file. Since you've already cloned the whole repo, you could do that here also
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could! And in fact the logic here is suuuuper subtle because heredoc replaces variables from the containing environment, just as a double-quoted string would. You can use <<'EOF'
to get single-quote-like behavior, but I'm actually relying on the double-quote behavior to insert values like $jax_repo_url. IMO this is confusing because it looks like variables should be expanded according to runner
's env, not the containing one.
In fact, I realize I had a bug due to this subtle behavior! echo "@reboot $HOME/.../start_github_runner.sh" | crontab -
is wrong, it should be echo "@reboot \$HOME/..."
. Otherwise it expands to /home/skyewm
(or whoever's running the script) instead of /home/runner
. This worked because I check out the whole repo in order to run this file.
That said, I'm not gonna take your suggestion because I like having everything in one file :P I'll be sure to tell you if/when I regret this decision and split it up so you can tell me "I told you so". I will add a comment and fix the bug though.
# More or less copied from | ||
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config | ||
|
||
~/actions-runner/run.sh > /tmp/actions-runner.`date +"%Y%m%d-%H%M"`.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stderr also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, done
# limitations under the License. | ||
|
||
# More or less copied from | ||
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at this point, somewhat less :-) MIght want to tweak the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done ("Heavily influenced by ...")
@@ -0,0 +1,77 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
README file for all this might be nice. I guess you've got an internal google doc. Maybe link it from these files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. It mostly just links to the doc for now, but maybe I'll move more of it to be public.
Thanks @GMNGeoffrey for the thorough review! |
This also includes some utilites for setting up the self-hosted runner. Googlers, see go/jax-self-hosted-runners for more setup info. The workflow is pretty basic currently. We can and should add more functionality later, such as email notifications. I kept it simple here for easier reviewing. Testing: - Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180 - Sample PR attempt: (will add soon but I did verify validate_job.sh blocks pull_request workflows)
This was meant to be part of jax-ml#13000, oops
This was meant to be part of jax-ml#13000, oops
This also includes some utilites for setting up the self-hosted runner. Googlers, see go/jax-self-hosted-runners for more setup info.
The workflow is pretty basic currently. We can and should add more functionality later, such as email notifications. I kept it simple here for easier reviewing.
Testing: