Skip to content

Conversation

skye
Copy link
Member

@skye skye commented Oct 27, 2022

This also includes some utilites for setting up the self-hosted runner. Googlers, see go/jax-self-hosted-runners for more setup info.

The workflow is pretty basic currently. We can and should add more functionality later, such as email notifications. I kept it simple here for easier reviewing.

Testing:

@skye skye requested review from hawkinsp and yashk2810 October 27, 2022 00:46
@skye
Copy link
Member Author

skye commented Oct 27, 2022

cc @will-cromar @GMNGeoffrey

@skye
Copy link
Member Author

skye commented Oct 27, 2022

FYI the sample run is failing due to actual test failures. I'd like to check this in even if it's red initially (maybe it's a good thing there are no notifications yet :)), so we can start fixing + testing based on this and see how it goes.

Also cc @yejingxin who's working on getting the tests green.

schedule:
- cron: "0 14 * * *" # daily at 7am PST
workflow_dispatch: # allows triggering the workflow run manually

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to set permissions for the GitHub token?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're already set to read-only via the project settings. Should I also set it here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sufficient, but it's best practice to also include them here, in case someone modifies the project settings later on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, done. I put contents: read, that seems to be sufficient (https://github.com/skye/jax/actions/runs/3365397735/jobs/5580821741). It's unclear to me if it even needs that, or if it gets the current checkout "for free".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't heard of that best practice before. I guess it makes some sense, but seems a bit verbose. The people changing the project settings should be at least repository admins, and if they change the default settings it's presumably because that's the goal they're trying to achieve. In IREE, I've actually set the default to read-only at the organization level and that appears to make it impossible to set to write at the repo level (which is actually odd; I thought that was only supposed to set the default). I guess this is more explicit in that it documents that this workflow only needs read access to the repository contents and not to releases or whatever. I don't know if I actually have a point...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config-as-code is nice to have during code / audit reviews and incidence response, especially that it's part of the commit history... unlike the settings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair

@google-ml-butler google-ml-butler bot added kokoro:force-run pull ready Ready for copybara import and testing labels Oct 31, 2022
schedule:
- cron: "0 14 * * *" # daily at 7am PST
workflow_dispatch: # allows triggering the workflow run manually

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't heard of that best practice before. I guess it makes some sense, but seems a bit verbose. The people changing the project settings should be at least repository admins, and if they change the default settings it's presumably because that's the goal they're trying to achieve. In IREE, I've actually set the default to read-only at the organization level and that appears to make it impossible to set to write at the repo level (which is actually odd; I thought that was only supposed to set the default). I guess this is more explicit in that it documents that this workflow only needs read access to the repository contents and not to releases or whatever. I don't know if I actually have a point...


jobs:
cloud-tpu-test:
runs-on: v4-8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I also added "tpu" cuz why not.

python-version: ["3.10"] # TODO(jakevdp): update to 3.11 when available.
jaxlib-version: [latest, nightly]
steps:
- uses: actions/checkout@v3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google security policy requires specifying an explicit hash here. See https://opensource.google/documentation/reference/github/services#actions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oo didn't know about this, done. (We need to update our other workflows too, but I'll do that as a separate PR)

pip uninstall -y jax jaxlib libtpu-nightly
if [ "${{ matrix.jaxlib-version }}" == "latest" ]; then
pip install .[tpu] \
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh unrelated to this PR, but I'd actually be interested in how you decided to set this up. We used to parse the GitHub releases page, but they changed the format, so now we make a hacky not-index and are looking for what the best long-term solution is: iree-org/iree#9532

I see you're using -f and --pre

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only use --pre when installing jaxlib nightlies, so pip picks up dev versions (the nightlies are released as jaxlib dev versions). We generate the index file directly from the GCS bucket contents. We decided to do this because we host many large wheels so would quickly exceed pypi's size limits. Happy to answer more questions on this here or offline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah let's take this offline :-)

Comment on lines +56 to +59
- name: Run tests
env:
JAX_PLATFORMS: tpu,cpu
run: python -m pytest --tb=short tests examples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The brevity of this job is 👨‍🍳 💋

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I promise it'll get worse with time :)

@@ -0,0 +1 @@
ACTIONS_RUNNER_HOOK_JOB_STARTED=$HOME/jax/.github/workflows/self_hosted_runner_utils/validate_job.sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you clone the whole jax repo onto the runner?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeop

@@ -0,0 +1,59 @@
name: Cloud TPU nightly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Here for threading]

You will find that getting notifications is way harder than it should be. This is our issue for it: iree-org/iree#9857. Would love to combine efforts on this though. No point reinventing solutions here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gahh! I kinda knew about this but didn't realize how dire the situation was. @yashk2810 will be particularly unhappy. But yes, let's combine efforts, I would really love to figure out a workable solution for notifications.

As a potential workaround for email altogether, @jakevdp wrote this workflow that files an issue on failure: https://github.com/google/jax/blob/main/.github/workflows/upstream-nightly.yaml#L114
It has some fancy logic for reusing an existing open issue, so it doesn't end up spamming the issue tracker on repeated failures.

Alternatively, @alexalemi wrote this workflow for pinging Google Chat on new releases: https://github.com/google/jax/blob/main/.github/workflows/release-notification.yml
Presumably this could be reused to start a chat thread on CI failure. Everyone uses chat instead of email now right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh filing issues! That's actually a great idea. We should do that. I'm less psyched about chat integration. I think it's bad the amount of stuff that has moved to chat services...

return 1
}

if ! is_contained "${GITHUB_EVENT_NAME}" "${ALLOWED_EVENTS[@]}"; then
Copy link

@laurentsimon laurentsimon Nov 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add other sort of validation like branch names? The triggers don't accept arbitrary branches. The workflow_dispatch does but is limited to users with push, so low risk I think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to allow arbitrary branches actually, because it's useful for testing the workflow (since we can't use PRs). We could force devs to setup a fork with their own runner for testing, but given it's low risk to keep as-is, I'd prefer to keep it semi-convenient.

@skye
Copy link
Member Author

skye commented Nov 1, 2022

FYI I just made a fairly large change to setup_runner.sh and the corresponding instructions in go/jax-self-hosted-runners to create a new runner user, instead of whatever user happens to be running the setup commands. @hawkinsp suggested this to make sure the runner user doesn't have sudo permissions. This also seems like a more manageable choice in general. I also made more of the runner setup happen in the setup script now (since it should happen as the new runner user), which makes it more convenient to run but also more complicated to understand.

Thanks all for your comments here and in the doc. I'm gonna take a pass at the PR comments now and will push a new commit.

@GMNGeoffrey
Copy link
Contributor

So I deliberately made the runner user have sudo because that matches what GitHub hosted runners look like. That was also the recommendation from @sethvargo. root access can be quite useful for installing software, looking at low-level system info, etc. I guess in your case that's much less likely to be the case, since these are very specialized runners. Are you planning to do any benchmarking on these? That frequently requires root access to do properly.

@@ -0,0 +1,61 @@
name: Cloud TPU nightly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GMNGeoffrey stealing your idea of starting a thread.

So I deliberately made the runner user have sudo because that matches what GitHub hosted runners look like. That was also the recommendation from @sethvargo. root access can be quite useful for installing software, looking at low-level system info, etc. I guess in your case that's much less likely to be the case, since these are very specialized runners. Are you planning to do any benchmarking on these? That frequently requires root access to do properly.

Huh. Well I think like the idea of having a separate user anyway (I actually meant to ask you about this with your setup, but you were OOO, and then I forgot :)). We don't need sudo access for this workflow (yet?), and we can always enable that later. But it does add complexity. So I'm leaning towards keeping as-is, but could easily be convinced otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I do have a separate user. It just has sudo :-) I think there are arguments either way. I'm not sure how much protection you actually get from someone not having sudo? In our case the runners are ephemeral though, so it matters less

@skye
Copy link
Member Author

skye commented Nov 2, 2022

Does anyone have further comments I should address before submitting this change? (I'd like to keep the various discussions going, but would also like to get this change in and iterate.) I'll probably submit in a few hours if I don't hear from anyone, so please ping me if you'd like more time to review.

echo "Event type '${GITHUB_EVENT_NAME}' is not allowed on this runner. Aborting workflow."
# clean up any nefarious stuff we may have fetched in job setup.
cd ~/actions-runner/_work
rm -rfv _actions/ _temp/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm not totally sure that this is all the directories that the fetch step could have populated. In our case, we're using ephemeral runners, so that matters much less, but something to investigate. The whole thing where this fetching happens before the pre-job hook is totally weird to me: actions/runner#1951

schedule:
- cron: "0 14 * * *" # daily at 7am PST
workflow_dispatch: # allows triggering the workflow run manually

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair

pip uninstall -y jax jaxlib libtpu-nightly
if [ "${{ matrix.jaxlib-version }}" == "latest" ]; then
pip install .[tpu] \
-f https://storage.googleapis.com/jax-releases/libtpu_releases.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah let's take this offline :-)

@@ -0,0 +1,59 @@
name: Cloud TPU nightly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh filing issues! That's actually a great idea. We should do that. I'm less psyched about chat integration. I think it's bad the amount of stuff that has moved to chat services...

@@ -0,0 +1,61 @@
name: Cloud TPU nightly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I do have a separate user. It just has sudo :-) I think there are arguments either way. I'm not sure how much protection you actually get from someone not having sudo? In our case the runners are ephemeral though, so it matters less

- cron: "0 14 * * *" # daily at 7am PST
workflow_dispatch: # allows triggering the workflow run manually

# Should also be set to read-only in the project settings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to tell here what the "should" means here. I'd expand a bit.

Suggested change
# Should also be set to read-only in the project settings.
# This should also be set to read-only in the project settings, but it's nice to document and enforce the permissions here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

runs-on: [self-hosted, tpu, v4-8]
defaults:
run:
shell: bash -l {0}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on why login shells?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, I copied it from another workflow :) I tried without specifying shell at all and it seems to work, so I'm just gonna remove this.

steps:
# https://opensource.google/documentation/reference/github/services#actions
# mandates using a specific commit for non-Google actions.
- uses: actions/checkout@93ea575cb5d8a053eaa0ac8fa3b40d7e05a33cc8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our workflows we've added a comment like # v3 to indicate what version this commit points to

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. FYI @sethvargo pointed me to https://github.com/sethvargo/ratchet, which I wanna try to update all of our workflows (but I'm saving that for a separate change).

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@13ae5bb136fac2878aff31522b9efb785519f984
with:
python-version: ${{ matrix.python-version }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could cache the pip installs here

https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages

Suggested change
python-version: ${{ matrix.python-version }}
python-version: ${{ matrix.python-version }}
cache: 'pip'
cache-dependency-path: build/test-requirements.txt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work? Like where is it cached? Right now everything is cached since it's not an ephemeral runner, maybe that's bad...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GitHub actions cache. It actually might not work very well because you're not running in Azure right next to their caches

# GCP VM as `runner` using gcloud. Don't do that!
sudo useradd runner -m

# Find the latest actions-runner download
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you want automatic updates here and not a pinned version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I thought we had to upgrade to the latest version within 30 days of it being released, but rereading https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#controlling-runner-software-updates-on-self-hosted-runners, I think that's only if you've opted out of auto-updates. Either way, it's gonna update itself. So it would be arguably be simpler to use a pinned version and just have it update, but I'm gonna leave it as-is just in case since it sounds like Github isn't interested in maintaining old versions for very long. I added a comment explaining it'll auto-update itself and linking to the policy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if you turn off auto-update you have to pick up the new version within 30 days. I turned off auto-update and we manually manage bumping the pinned version. This is more work, but something something supply chain security (plus random breakages that happen at confusing times)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, basically the thing it says in the ratchet README :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrrrrmph I'm gonna risk it, I think the chances of us not updating within 30 days is high, especially since it looks like they update quite often? https://github.com/actions/runner/releases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah fair. It is a bit annoying to remember to update

--name $runner_name

# Setup pre-job hook
cat ~/jax/.github/workflows/self_hosted_runner_utils/runner.env | envsubst >> ~/actions-runner/.env
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oo that's a neat trick. So envsubst allows you to write env variables in your config file and then they get filled in here (like you have $HOME)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup!

echo "actions_runner_download: $actions_runner_download"

# Run the rest of the setup as `runner`
sudo -i -u runner bash -ex <<EOF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I didn't like heredoc'ing this and created a separate script file. Since you've already cloned the whole repo, you could do that here also

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could! And in fact the logic here is suuuuper subtle because heredoc replaces variables from the containing environment, just as a double-quoted string would. You can use <<'EOF' to get single-quote-like behavior, but I'm actually relying on the double-quote behavior to insert values like $jax_repo_url. IMO this is confusing because it looks like variables should be expanded according to runner's env, not the containing one.

In fact, I realize I had a bug due to this subtle behavior! echo "@reboot $HOME/.../start_github_runner.sh" | crontab - is wrong, it should be echo "@reboot \$HOME/...". Otherwise it expands to /home/skyewm (or whoever's running the script) instead of /home/runner. This worked because I check out the whole repo in order to run this file.

That said, I'm not gonna take your suggestion because I like having everything in one file :P I'll be sure to tell you if/when I regret this decision and split it up so you can tell me "I told you so". I will add a comment and fix the bug though.

# More or less copied from
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config

~/actions-runner/run.sh > /tmp/actions-runner.`date +"%Y%m%d-%H%M"`.log
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stderr also?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done

# limitations under the License.

# More or less copied from
# https://github.com/iree-org/iree/tree/main/build_tools/github_actions/runner/config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at this point, somewhat less :-) MIght want to tweak the comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ("Heavily influenced by ...")

@@ -0,0 +1,77 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

README file for all this might be nice. I guess you've got an internal google doc. Maybe link it from these files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It mostly just links to the doc for now, but maybe I'll move more of it to be public.

@skye
Copy link
Member Author

skye commented Nov 3, 2022

Thanks @GMNGeoffrey for the thorough review!

This also includes some utilites for setting up the self-hosted
runner. Googlers, see go/jax-self-hosted-runners for more setup info.

The workflow is pretty basic currently. We can and should add more
functionality later, such as email notifications. I kept it simple
here for easier reviewing.

Testing:
- Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180
- Sample PR attempt: (will add soon but I did verify validate_job.sh blocks pull_request workflows)
@copybara-service copybara-service bot merged commit dba9fc0 into jax-ml:main Nov 3, 2022
skye added a commit to skye/jax that referenced this pull request Nov 4, 2022
skye added a commit to skye/jax that referenced this pull request Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pull ready Ready for copybara import and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants