Add Github Actions workflow that runs on a self-hosted TPU VM runner. #13000

skye · 2022-10-27T00:46:24Z

This also includes some utilites for setting up the self-hosted runner. Googlers, see go/jax-self-hosted-runners for more setup info.

The workflow is pretty basic currently. We can and should add more functionality later, such as email notifications. I kept it simple here for easier reviewing.

Testing:

Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180
Sample PR attempt: Test PR for triggering TPU VM presubmit skye/jax#5, https://github.com/skye/jax/actions/runs/3340271321/jobs/5530184281

skye · 2022-10-27T00:48:06Z

cc @will-cromar @GMNGeoffrey

skye · 2022-10-27T17:19:21Z

FYI the sample run is failing due to actual test failures. I'd like to check this in even if it's red initially (maybe it's a good thing there are no notifications yet :)), so we can start fixing + testing based on this and see how it goes.

Also cc @yejingxin who's working on getting the tests green.

haydentherapper · 2022-10-31T20:53:47Z

.github/workflows/cloud-tpu-ci-nightly.yml

+  schedule:
+    - cron: "0 14 * * *"  # daily at 7am PST
+  workflow_dispatch:  # allows triggering the workflow run manually
+


Do you want to set permissions for the GitHub token?

They're already set to read-only via the project settings. Should I also set it here?

That's sufficient, but it's best practice to also include them here, in case someone modifies the project settings later on.

Got it, done. I put contents: read, that seems to be sufficient (https://github.com/skye/jax/actions/runs/3365397735/jobs/5580821741). It's unclear to me if it even needs that, or if it gets the current checkout "for free".

I haven't heard of that best practice before. I guess it makes some sense, but seems a bit verbose. The people changing the project settings should be at least repository admins, and if they change the default settings it's presumably because that's the goal they're trying to achieve. In IREE, I've actually set the default to read-only at the organization level and that appears to make it impossible to set to write at the repo level (which is actually odd; I thought that was only supposed to set the default). I guess this is more explicit in that it documents that this workflow only needs read access to the repository contents and not to releases or whatever. I don't know if I actually have a point...

config-as-code is nice to have during code / audit reviews and incidence response, especially that it's part of the commit history... unlike the settings.

GMNGeoffrey · 2022-11-01T00:02:04Z

.github/workflows/cloud-tpu-ci-nightly.yml

+  schedule:
+    - cron: "0 14 * * *"  # daily at 7am PST
+  workflow_dispatch:  # allows triggering the workflow run manually
+


I haven't heard of that best practice before. I guess it makes some sense, but seems a bit verbose. The people changing the project settings should be at least repository admins, and if they change the default settings it's presumably because that's the goal they're trying to achieve. In IREE, I've actually set the default to read-only at the organization level and that appears to make it impossible to set to write at the repo level (which is actually odd; I thought that was only supposed to set the default). I guess this is more explicit in that it documents that this workflow only needs read access to the repository contents and not to releases or whatever. I don't know if I actually have a point...

GMNGeoffrey · 2022-11-01T00:06:07Z

.github/workflows/cloud-tpu-ci-nightly.yml

+
+jobs:
+  cloud-tpu-test:
+    runs-on: v4-8


I think it's "strongly recommended" to include "self-hosted" here:

https://docs.github.com/en/actions/hosting-your-own-runners/using-self-hosted-runners-in-a-workflow#using-self-hosted-runners-in-a-workflow

Done. I also added "tpu" cuz why not.

GMNGeoffrey · 2022-11-01T00:06:45Z

.github/workflows/cloud-tpu-ci-nightly.yml

+        python-version: ["3.10"]  # TODO(jakevdp): update to 3.11 when available.
+        jaxlib-version: [latest, nightly]
+    steps:
+      - uses: actions/checkout@v3


Google security policy requires specifying an explicit hash here. See https://opensource.google/documentation/reference/github/services#actions

oo didn't know about this, done. (We need to update our other workflows too, but I'll do that as a separate PR)

GMNGeoffrey · 2022-11-01T00:09:28Z

.github/workflows/cloud-tpu-ci-nightly.yml

+          pip uninstall -y jax jaxlib libtpu-nightly
+          if [ "${{ matrix.jaxlib-version }}" == "latest" ]; then
+            pip install .[tpu] \
+              -f https://storage.googleapis.com/jax-releases/libtpu_releases.html


Oh unrelated to this PR, but I'd actually be interested in how you decided to set this up. We used to parse the GitHub releases page, but they changed the format, so now we make a hacky not-index and are looking for what the best long-term solution is: iree-org/iree#9532

I see you're using -f and --pre

We only use --pre when installing jaxlib nightlies, so pip picks up dev versions (the nightlies are released as jaxlib dev versions). We generate the index file directly from the GCS bucket contents. We decided to do this because we host many large wheels so would quickly exceed pypi's size limits. Happy to answer more questions on this here or offline.

Yeah let's take this offline :-)

GMNGeoffrey · 2022-11-01T00:11:11Z

.github/workflows/cloud-tpu-ci-nightly.yml

+      - name: Run tests
+        env:
+          JAX_PLATFORMS: tpu,cpu
+        run: python -m pytest --tb=short tests examples


The brevity of this job is 👨‍🍳 💋

I promise it'll get worse with time :)

GMNGeoffrey · 2022-11-01T00:11:45Z

.github/workflows/self_hosted_runner_utils/runner.env

@@ -0,0 +1 @@
+ACTIONS_RUNNER_HOOK_JOB_STARTED=$HOME/jax/.github/workflows/self_hosted_runner_utils/validate_job.sh


Did you clone the whole jax repo onto the runner?

GMNGeoffrey · 2022-11-01T00:14:09Z

.github/workflows/cloud-tpu-ci-nightly.yml

@@ -0,0 +1,59 @@
+name: Cloud TPU nightly


[Here for threading]

You will find that getting notifications is way harder than it should be. This is our issue for it: iree-org/iree#9857. Would love to combine efforts on this though. No point reinventing solutions here.

Gahh! I kinda knew about this but didn't realize how dire the situation was. @yashk2810 will be particularly unhappy. But yes, let's combine efforts, I would really love to figure out a workable solution for notifications.

As a potential workaround for email altogether, @jakevdp wrote this workflow that files an issue on failure: https://github.com/google/jax/blob/main/.github/workflows/upstream-nightly.yaml#L114
It has some fancy logic for reusing an existing open issue, so it doesn't end up spamming the issue tracker on repeated failures.

Alternatively, @alexalemi wrote this workflow for pinging Google Chat on new releases: https://github.com/google/jax/blob/main/.github/workflows/release-notification.yml
Presumably this could be reused to start a chat thread on CI failure. Everyone uses chat instead of email now right?

Oooh filing issues! That's actually a great idea. We should do that. I'm less psyched about chat integration. I think it's bad the amount of stuff that has moved to chat services...

laurentsimon · 2022-11-01T11:02:32Z

.github/workflows/self_hosted_runner_utils/validate_job.sh

+  return 1
+}
+
+if ! is_contained "${GITHUB_EVENT_NAME}" "${ALLOWED_EVENTS[@]}"; then


Can we add other sort of validation like branch names? The triggers don't accept arbitrary branches. The workflow_dispatch does but is limited to users with push, so low risk I think

I'd prefer to allow arbitrary branches actually, because it's useful for testing the workflow (since we can't use PRs). We could force devs to setup a fork with their own runner for testing, but given it's low risk to keep as-is, I'd prefer to keep it semi-convenient.

skye · 2022-11-01T21:54:49Z

FYI I just made a fairly large change to setup_runner.sh and the corresponding instructions in go/jax-self-hosted-runners to create a new runner user, instead of whatever user happens to be running the setup commands. @hawkinsp suggested this to make sure the runner user doesn't have sudo permissions. This also seems like a more manageable choice in general. I also made more of the runner setup happen in the setup script now (since it should happen as the new runner user), which makes it more convenient to run but also more complicated to understand.

Thanks all for your comments here and in the doc. I'm gonna take a pass at the PR comments now and will push a new commit.

GMNGeoffrey · 2022-11-01T23:39:13Z

So I deliberately made the runner user have sudo because that matches what GitHub hosted runners look like. That was also the recommendation from @sethvargo. root access can be quite useful for installing software, looking at low-level system info, etc. I guess in your case that's much less likely to be the case, since these are very specialized runners. Are you planning to do any benchmarking on these? That frequently requires root access to do properly.

skye · 2022-11-02T00:06:19Z

.github/workflows/cloud-tpu-ci-nightly.yml

@@ -0,0 +1,61 @@
+name: Cloud TPU nightly
+


@GMNGeoffrey stealing your idea of starting a thread.

So I deliberately made the runner user have sudo because that matches what GitHub hosted runners look like. That was also the recommendation from @sethvargo. root access can be quite useful for installing software, looking at low-level system info, etc. I guess in your case that's much less likely to be the case, since these are very specialized runners. Are you planning to do any benchmarking on these? That frequently requires root access to do properly.

Huh. Well I think like the idea of having a separate user anyway (I actually meant to ask you about this with your setup, but you were OOO, and then I forgot :)). We don't need sudo access for this workflow (yet?), and we can always enable that later. But it does add complexity. So I'm leaning towards keeping as-is, but could easily be convinced otherwise.

Well I do have a separate user. It just has sudo :-) I think there are arguments either way. I'm not sure how much protection you actually get from someone not having sudo? In our case the runners are ephemeral though, so it matters less

skye · 2022-11-02T20:19:45Z

Does anyone have further comments I should address before submitting this change? (I'd like to keep the various discussions going, but would also like to get this change in and iterate.) I'll probably submit in a few hours if I don't hear from anyone, so please ping me if you'd like more time to review.

GMNGeoffrey · 2022-11-01T21:00:06Z

.github/workflows/self_hosted_runner_utils/validate_job.sh

+  echo "Event type '${GITHUB_EVENT_NAME}' is not allowed on this runner. Aborting workflow."
+  # clean up any nefarious stuff we may have fetched in job setup.
+  cd ~/actions-runner/_work
+  rm -rfv _actions/ _temp/


Note that I'm not totally sure that this is all the directories that the fetch step could have populated. In our case, we're using ephemeral runners, so that matters much less, but something to investigate. The whole thing where this fetching happens before the pre-job hook is totally weird to me: actions/runner#1951

GMNGeoffrey · 2022-11-03T00:24:25Z

.github/workflows/cloud-tpu-ci-nightly.yml

+  schedule:
+    - cron: "0 14 * * *"  # daily at 7am PST
+  workflow_dispatch:  # allows triggering the workflow run manually
+


GMNGeoffrey · 2022-11-03T00:24:39Z

.github/workflows/cloud-tpu-ci-nightly.yml

+          pip uninstall -y jax jaxlib libtpu-nightly
+          if [ "${{ matrix.jaxlib-version }}" == "latest" ]; then
+            pip install .[tpu] \
+              -f https://storage.googleapis.com/jax-releases/libtpu_releases.html


Yeah let's take this offline :-)

GMNGeoffrey · 2022-11-03T00:26:15Z

.github/workflows/cloud-tpu-ci-nightly.yml

@@ -0,0 +1,59 @@
+name: Cloud TPU nightly


Oooh filing issues! That's actually a great idea. We should do that. I'm less psyched about chat integration. I think it's bad the amount of stuff that has moved to chat services...

GMNGeoffrey · 2022-11-03T00:27:20Z

.github/workflows/cloud-tpu-ci-nightly.yml

@@ -0,0 +1,61 @@
+name: Cloud TPU nightly
+


Well I do have a separate user. It just has sudo :-) I think there are arguments either way. I'm not sure how much protection you actually get from someone not having sudo? In our case the runners are ephemeral though, so it matters less

GMNGeoffrey · 2022-11-03T00:29:02Z

.github/workflows/cloud-tpu-ci-nightly.yml

+    - cron: "0 14 * * *"  # daily at 7am PST
+  workflow_dispatch:  # allows triggering the workflow run manually
+
+# Should also be set to read-only in the project settings.


It's hard to tell here what the "should" means here. I'd expand a bit.

Suggested change

# Should also be set to read-only in the project settings.

# This should also be set to read-only in the project settings, but it's nice to document and enforce the permissions here.

GMNGeoffrey · 2022-11-03T00:29:20Z

.github/workflows/cloud-tpu-ci-nightly.yml

+    runs-on: [self-hosted, tpu, v4-8]
+    defaults:
+      run:
+        shell: bash -l {0}


Comment on why login shells?

I don't know, I copied it from another workflow :) I tried without specifying shell at all and it seems to work, so I'm just gonna remove this.

GMNGeoffrey · 2022-11-03T00:29:58Z

.github/workflows/cloud-tpu-ci-nightly.yml

+    steps:
+      # https://opensource.google/documentation/reference/github/services#actions
+      # mandates using a specific commit for non-Google actions.
+      - uses: actions/checkout@93ea575cb5d8a053eaa0ac8fa3b40d7e05a33cc8


In our workflows we've added a comment like # v3 to indicate what version this commit points to

Done. FYI @sethvargo pointed me to https://github.com/sethvargo/ratchet, which I wanna try to update all of our workflows (but I'm saving that for a separate change).

GMNGeoffrey · 2022-11-03T00:33:47Z

.github/workflows/cloud-tpu-ci-nightly.yml

+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@13ae5bb136fac2878aff31522b9efb785519f984
+        with:
+          python-version: ${{ matrix.python-version }}


You could cache the pip installs here

https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages

Suggested change

python-version: ${{ matrix.python-version }}

python-version: ${{ matrix.python-version }}

cache: 'pip'

cache-dependency-path: build/test-requirements.txt

How does this work? Like where is it cached? Right now everything is cached since it's not an ephemeral runner, maybe that's bad...

The GitHub actions cache. It actually might not work very well because you're not running in Azure right next to their caches

GMNGeoffrey · 2022-11-03T00:35:28Z

.github/workflows/self_hosted_runner_utils/setup_runner.sh

+# GCP VM as `runner` using gcloud. Don't do that!
+sudo useradd runner -m
+
+# Find the latest actions-runner download


Are you sure you want automatic updates here and not a pinned version?

So I thought we had to upgrade to the latest version within 30 days of it being released, but rereading https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#controlling-runner-software-updates-on-self-hosted-runners, I think that's only if you've opted out of auto-updates. Either way, it's gonna update itself. So it would be arguably be simpler to use a pinned version and just have it update, but I'm gonna leave it as-is just in case since it sounds like Github isn't interested in maintaining old versions for very long. I added a comment explaining it'll auto-update itself and linking to the policy.

Yeah if you turn off auto-update you have to pick up the new version within 30 days. I turned off auto-update and we manually manage bumping the pinned version. This is more work, but something something supply chain security (plus random breakages that happen at confusing times)

Oh, basically the thing it says in the ratchet README :-)

Hrrrrmph I'm gonna risk it, I think the chances of us not updating within 30 days is high, especially since it looks like they update quite often? https://github.com/actions/runner/releases

Yeah fair. It is a bit annoying to remember to update

GMNGeoffrey · 2022-11-03T00:40:55Z

.github/workflows/self_hosted_runner_utils/setup_runner.sh

+--name $runner_name
+
+# Setup pre-job hook
+cat ~/jax/.github/workflows/self_hosted_runner_utils/runner.env | envsubst >> ~/actions-runner/.env


Oo that's a neat trick. So envsubst allows you to write env variables in your config file and then they get filled in here (like you have $HOME)?

GMNGeoffrey · 2022-11-03T00:42:01Z

.github/workflows/self_hosted_runner_utils/setup_runner.sh

+echo "actions_runner_download: $actions_runner_download"
+
+# Run the rest of the setup as `runner`
+sudo -i -u runner bash -ex <<EOF


So I didn't like heredoc'ing this and created a separate script file. Since you've already cloned the whole repo, you could do that here also

I could! And in fact the logic here is suuuuper subtle because heredoc replaces variables from the containing environment, just as a double-quoted string would. You can use <<'EOF' to get single-quote-like behavior, but I'm actually relying on the double-quote behavior to insert values like $jax_repo_url. IMO this is confusing because it looks like variables should be expanded according to runner's env, not the containing one.

In fact, I realize I had a bug due to this subtle behavior! echo "@reboot $HOME/.../start_github_runner.sh" | crontab - is wrong, it should be echo "@reboot \$HOME/...". Otherwise it expands to /home/skyewm (or whoever's running the script) instead of /home/runner. This worked because I check out the whole repo in order to run this file.

That said, I'm not gonna take your suggestion because I like having everything in one file :P I'll be sure to tell you if/when I regret this decision and split it up so you can tell me "I told you so". I will add a comment and fix the bug though.

GMNGeoffrey · 2022-11-03T00:45:52Z

.github/workflows/self_hosted_runner_utils/start_github_runner.sh

+# More or less copied from
+# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config
+
+~/actions-runner/run.sh > /tmp/actions-runner.`date +"%Y%m%d-%H%M"`.log


stderr also?

Good call, done

GMNGeoffrey · 2022-11-03T00:46:18Z

.github/workflows/self_hosted_runner_utils/setup_runner.sh

+# limitations under the License.
+
+# More or less copied from
+# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config


I think at this point, somewhat less :-) MIght want to tweak the comment

Done ("Heavily influenced by ...")

GMNGeoffrey · 2022-11-03T00:46:47Z

.github/workflows/self_hosted_runner_utils/setup_runner.sh

@@ -0,0 +1,77 @@
+#!/bin/bash


README file for all this might be nice. I guess you've got an internal google doc. Maybe link it from these files?

Done. It mostly just links to the doc for now, but maybe I'll move more of it to be public.

skye · 2022-11-03T19:57:00Z

Thanks @GMNGeoffrey for the thorough review!

This also includes some utilites for setting up the self-hosted runner. Googlers, see go/jax-self-hosted-runners for more setup info. The workflow is pretty basic currently. We can and should add more functionality later, such as email notifications. I kept it simple here for easier reviewing. Testing: - Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180 - Sample PR attempt: (will add soon but I did verify validate_job.sh blocks pull_request workflows)

This was meant to be part of jax-ml#13000, oops

skye requested review from hawkinsp and yashk2810 October 27, 2022 00:46

apaszke assigned skye Oct 28, 2022

haydentherapper reviewed Oct 31, 2022

View reviewed changes

yashk2810 approved these changes Oct 31, 2022

View reviewed changes

google-ml-butler bot added kokoro:force-run pull ready Ready for copybara import and testing labels Oct 31, 2022

kokoro-team removed the kokoro:force-run label Oct 31, 2022

skye force-pushed the self_hosted_tpu branch from 1c572d3 to 75f7098 Compare October 31, 2022 23:08

hawkinsp approved these changes Oct 31, 2022

View reviewed changes

google-ml-butler bot added the kokoro:force-run label Oct 31, 2022

kokoro-team removed the kokoro:force-run label Oct 31, 2022

GMNGeoffrey reviewed Nov 1, 2022

View reviewed changes

laurentsimon reviewed Nov 1, 2022

View reviewed changes

hawkinsp approved these changes Nov 1, 2022

View reviewed changes

google-ml-butler bot added the kokoro:force-run label Nov 1, 2022

kokoro-team removed the kokoro:force-run label Nov 1, 2022

skye commented Nov 2, 2022

View reviewed changes

GMNGeoffrey reviewed Nov 3, 2022

View reviewed changes

skye force-pushed the self_hosted_tpu branch from 3e44ff0 to 8c22e34 Compare November 3, 2022 21:16

copybara-service bot merged commit dba9fc0 into jax-ml:main Nov 3, 2022

skye added a commit to skye/jax that referenced this pull request Nov 4, 2022

Add .github/workflows/self_hosted_runner_utils/README.md

cbd429f

This was meant to be part of jax-ml#13000, oops

skye added a commit to skye/jax that referenced this pull request Nov 4, 2022

Add .github/workflows/self_hosted_runner_utils/README.md

52775c4

This was meant to be part of jax-ml#13000, oops

skye mentioned this pull request Nov 4, 2022

Add .github/workflows/self_hosted_runner_utils/README.md #13114

Merged

		@@ -0,0 +1 @@
		ACTIONS_RUNNER_HOOK_JOB_STARTED=$HOME/jax/.github/workflows/self_hosted_runner_utils/validate_job.sh

	# Should also be set to read-only in the project settings.
	# This should also be set to read-only in the project settings, but it's nice to document and enforce the permissions here.

Add Github Actions workflow that runs on a self-hosted TPU VM runner. #13000

Add Github Actions workflow that runs on a self-hosted TPU VM runner. #13000

Uh oh!

Conversation

skye commented Oct 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

skye commented Oct 27, 2022

Uh oh!

skye commented Oct 27, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

laurentsimon Nov 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

skye commented Nov 1, 2022

Uh oh!

GMNGeoffrey commented Nov 1, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

skye commented Nov 2, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

skye commented Oct 27, 2022 •

edited

Loading

laurentsimon Nov 1, 2022 •

edited

Loading