Skip to content

Conversation

@wwsean08
Copy link

This change implements #438 by creating a new package for interfacing with AWS's KMS service by changing how the KMS service is setup and passed around. For legacy purposes I added a new environment variable to set the KMS type but have defaulted it to GCP as to not break existing setups that pull in this change.

Testing done:
I've updated and created some unit tests, I would have created more for the AWS KMS service but it doesn't have an interface in the AWS sdk preventing me from mocking it, if you would like me to create tests by messing with the http client can do that as well.

In addition to that I have set this up and run it against AWS KMS, and have verified it works thru manual testing in a personal test organization using my own octosts app.

Copy link
Collaborator

@cpanato cpanato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks nice, I just would like to organize the code better then as you are creating a new kms provider

I was thinking in

- pkg
   |
   - kms
     | 
     - aws
     - gcp
     kms.go
     kms_test.go

something similar we do in https://github.com/sigstore/sigstore/tree/main/pkg/signature/kms

thanks for working on this

@wwsean08
Copy link
Author

looks nice, I just would like to organize the code better then as you are creating a new kms provider

I was thinking in

- pkg
   |
   - kms
     | 
     - aws
     - gcp
     kms.go
     kms_test.go

something similar we do in https://github.com/sigstore/sigstore/tree/main/pkg/signature/kms

thanks for working on this

That makes total sense, I'll run with it :)

@ChrisJBurns
Copy link

@wwsean08 Would love to hear more about how you've got this to work in AWS. What kind of infrastructure did you need? I'm having trouble at the moment trying to map the cloudrun apps for the webhooks and prober and what they would need to look like in AWS.

@wwsean08
Copy link
Author

@ChrisJBurns so I have a minimal setup for testing purposes running in a DigitalOcean VM authenticating with AWS because $ and not wanting to set off billing alarms in my personal AWS account (I did the testing on my own account since I mostly developed this on my own time)

I didn't actually setup the webhook and prober, just the main app, I imported the key into KMS following the instructions provided by AWS, then setup an IAM policy to allow access to the key and setup an IAM user which I then setup the credentials via environment variables which would get picked up by the AWS library. In theory if this is running on an EC2 instance then you could used a role binding on the instance which would work, and I will test this weekend just to verify, similarly if you were running in kubernetes I expect an IRSA role would work based on my experience in the past.

@wwsean08
Copy link
Author

wwsean08 commented Jun 3, 2025

Hey @ChrisJBurns , just letting you know, I haven't forgotten, I realized when trying to set it up in AWS that since I never bothered to setup the webhook I need to make additional changes to make that possible, which I will be working on this week.

@ChrisJBurns
Copy link

@wwsean08 Cheers bud! I'm getting more and more time to look at this on our-side too and what it would look like. Whether its a Kubernetes variant (most likely for now) or possible an ECS variant. Happy to hear that there are others also looking at the AWS side of things! 🚀

@wwsean08
Copy link
Author

wwsean08 commented Jun 4, 2025

Ok I finished writing up some quick refactoring to handle both AWS and GCP, I need to get them reviewed internally and test them before publishing the changes, but expect them either later this week or early next week, I'm currently on call and handling the support queue, so it's a busy week :)

@ChrisJBurns
Copy link

ChrisJBurns commented Jun 7, 2025

@wwsean08 I've been making steady progress on the AWS ECS setup. I've successfully deployed the main app container, using placeholder environment variables for the time being - still working through it step by step.

One challenge I’ve encountered is that the images hosted on cgr.dev aren't free to use, which complicates things a bit. For your changes, I ended up creating a standard Dockerfile and performed a manual go build. Going forward, it looks like we’ll need to maintain custom image builds, as the latest versions on cgr.dev aren't accessible.

Edit:
I opted for ECS over Kubernetes primarily because ECS within its own VPC allows me to provision a dedicated Application Load Balancer. This avoids having the Octo STS endpoint share an existing load balancer and URL that the EKS cluster has. It also allows us to somewhat isolate the blast radius of attack/outage. While I may eventually create a Helm chart for Octo-STS, my current goal is to have the ECS implementation mirror the GCP CloudRun deployment structure defined in the iac folder.

@wwsean08
Copy link
Author

wwsean08 commented Jun 8, 2025

So I've just pushed up a change that has a working* webhook, which you can see here.

Screenshot 2025-06-07 at 10 27 07 PM

I have that asterisks because I somehow with two changes hit API rate limits due to it trying to refresh the installation id token, and I'm unsure of why it's spamming that endpoint, but it doesn't seem like something that I introduced, but will keep digging in. I also looked over the probers and am fairly confidence the way they are coded won't work for anyone outside of chainguard to host the images, and if you wanted them you'd have to code up your own to be slightly different, pointing at your own repo/policies since the repo that it probes against is internal/private so we can't really see the contents of those policies at a minimum, and your github app likely won't be installed into chainguard's org thus not actually making it a valid test, though I'll defer to the maintainers on my analysis of that.

Chainguard does have some free images, I used them to build my test setup in EKS, here's an example of my dockerfile for building the app

FROM cgr.dev/chainguard/go AS builder
COPY . /app
ENV CGO_ENABLED=0
RUN cd /app/cmd/app && go build -ldflags '-s -w' -o app .

FROM cgr.dev/chainguard/static
COPY --from=builder /app/cmd/app/app /usr/bin/
ENTRYPOINT ["/usr/bin/app"]

@wwsean08 wwsean08 requested a review from cpanato June 8, 2025 17:19
@ChrisJBurns
Copy link

ChrisJBurns commented Jun 8, 2025

@wwsean08 The rate limiting stuff is interesting!! I'm also wondering if the eventing functionality is needed for AWS too? As I think that's only relevant for GCP but it still forces the user to provide a variable to it.

On the image building point, I believe that's what most will have to do unless they have a Chainguard Images membership. As the app and the webhook are only accesible to paying customers. However, I think it is possible that we can build a Dockerfile with wolfi as a base and just install the app or webhook apk's as the version history seems to be maintained!

@mattmoor
Copy link
Member

mattmoor commented Jun 8, 2025

You don't need a Dockerfile for this, see here:

app/iac/main.tf

Lines 21 to 24 in e463b50

resource "ko_build" "this" {
working_dir = "${path.module}/.."
importpath = "./cmd/app"
}

If you are using K8s manifests than I'd just use ko directly vs. via TF.

@ChrisJBurns
Copy link

Ahh even better, cheers @mattmoor

@wwsean08
Copy link
Author

After spending more time debugging it seems like it's related to the the event processing getting into an infinite loop somehow where it keeps creating new checks for the same change until it sorta crashes out due to github's api stopping it from sending more checks via rate limiting, so I think it's likely a configuration issue on my end that I've done that's causing what I'm seeing.

@ChrisJBurns
Copy link

ChrisJBurns commented Jun 19, 2025

@wwsean08 That's interesting because as far as I can see, you've not modified any event code. How did you configure your Github App? Is there anything in there that may cause a recursive event loop?

@ChrisJBurns
Copy link

ChrisJBurns commented Jun 19, 2025

I've given this a go on ECS and I'm getting the following in the main app container:

INFO exchange request: &v1.ExchangeRequest{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(*****)}, sizeCache:0, unknownFields:[]uint8(nil), Aud:[]string(nil), Scope:"$ORG/$REPO", Identity:"foo", Cap:[]string(nil), IdentityProvider:""}

But nothing tends to happen after that. I get Attempt 1 failed. Error: HTTP error! status: 500 logs in the Github Action too even though I get a 200 when I go to both the domain and webhook endpoint. The webhook also has tonnes of 2025/06/19 22:06:29 ERROR error validating payload: mime: no media type X-GitHub-Delivery="" X-GitHub-Event="" errors but I think that's just the health check - although at this point I'm not too sure.

---- EDIT
Ignore me, I forgot to install the Github App into the repository and only had it created under the org. I get a new error now, but i shall progress more. 🙃

@ChrisJBurns
Copy link

ChrisJBurns commented Jun 19, 2025

@wwsean08 I can confirm that I think I'm getting the same issue as you. My "Trusty Policy" passes but my pipeline action fails due to a 500 - still not sure why I'm getting 500 but I'm more interested in the looping over the checkruns that the webhook is doing.

I'm seeing a bunch of INFO created CheckRun X-GitHub-Delivery=........ in the webhook logs.

Here is my actions associated with a PR
image

I should add, I've pulled this PR branch and built an image and am deploying it to AWS.

@ChrisJBurns
Copy link

ChrisJBurns commented Jun 19, 2025

I've managed to somewhat stopped the looping of events by unchecking the check_run event in the Github App Permissions and Events page. However, I'm not sure if this is needed to run because I still get failures in the Github Action relating to:

Attempt 1 failed. Error: HTTP error! status: 500
Attempt 2 failed. Error: HTTP error! status: 500
Attempt 3 failed. Error: HTTP error! status: 500

I see a bunch of logs in the webhook and there don't seem to be any errors.

@mattmoor @cpanato out of curiosity, would you be able to provide a list of the events that the Octo STS App is subscribed too? I've noticed in the README.md you're explicit on the permissions it needs, would it be possible to document the events too? This allows me to know if the check_run needs to be unticked (and many more). In addition, is there any other way of documenting how the Github App is configured just so those who are self-hosting can be certain they aren't configuring it wrong?

@mattmoor
Copy link
Member

I think this is a @wlynch question

@ChrisJBurns
Copy link

Thanks @mattmoor .

@wlynch If you could give some direction on this that would be awesome! Would love to know what we're missing

@ChrisJBurns
Copy link

@wlynch just chasing the above incase it got lost in the notification craziness

Updates based on PR feedback

sort alphabetically

refactor to allow using either AWS secrets manager or GCP secrets manager in a repeatable pattern

implement tests for gcp and update aws dependencies

fix sort order

chore: update dependencies
@wwsean08
Copy link
Author

Just popping in to say I've updated the deps and dealt with the merge conflicts (also squashed it down to 1 commit to make future rebases easier)

@wlynch
Copy link
Collaborator

wlynch commented Jul 30, 2025

Yeah, currently the octo-sts app is configured to only listen to PR and Push events, so this is indeed a bug (but one that is easily fixable) - I'll open a PR for this separately!

pkg/kms/kms.go Outdated
Comment on lines 24 to 30
type kmsProvider struct {
ctx context.Context
provider string
kmsKey string
gcpClient *kmsGCP.KeyManagementClient
awsClient *kmsAWS.Client
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of storing all of the clients here, have each impl have a function that returns a struct that implements the KMS interface. Then we can do something like:

func NewKMS(ctx context.Context, provider, kmsKey string) (KMS, error) {
	switch strings.ToLower(provider) {
	case GCP:
		return gcp.NewProvider(ctx, kmsKey)
	case AWS:
		return aws.NewProvider(ctx, kmsKey)
	default:
		return nil, errors.New("unsupported kms provider")
	}
}

Alternatively, skip the KMS interface altogether and create funcs along the lines of gcp.NewDefault(ctx, key) that do the same thing here with default clients initialized.

@wlynch
Copy link
Collaborator

wlynch commented Jul 30, 2025

#988 for filtering out what I believe are the problematic webhook events

@ChrisJBurns
Copy link

@wwsean08 @wlynch Amazing thanks folks! I've been away this week but I plan on giving this another try on the weekend with only push and PR events!

@ChrisJBurns
Copy link

ChrisJBurns commented Aug 2, 2025

@wwsean08 I've got some changes made to implement the above feedback by @wlynch as well as the code style check fixes, not sure the best way of submitting them. I could either raise a PR against this repo (would mean a duplicate PR though), or raise it against your fork? To do the latter, I'd need to have permissions to push to a branch 😄


EDIT: Ignore me, I've seemed to dust of some git upstream commands from the ether of my memory and managed to get a PR up in your repo from my octo-sts/app fork 😄 without the need for any special perms. wbd-open-source#1

ChrisJBurns and others added 2 commits August 2, 2025 18:16
@ChrisJBurns
Copy link

@wlynch I believe @wwsean08 has merged my PR into his branch now, if you could give it another pass that would be amazing!

@ChrisJBurns
Copy link

Bumping this incase its got buried in notifications

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants