integration-race: bump timeout from 20 to 30 minutes #35669
Conversation
 env:
 - name: KUBE_TIMEOUT
-  value: "-timeout=20m"
+  value: "-timeout=30m"
TODO: set the job-level timeout in case this hangs, right now it's the 2h default we have config-wide
Sorry, I'm not following. Are you suggesting to increase the job timeout?
The job passes in ~70min pretty consistently: https://testgrid.k8s.io/sig-testing-canaries#integration-race-master&graph-metrics=test-duration-minutes
The problem is that scheduler_perf/misc and, to a lesser extent, scheduler_perf.affinity are close to the 20min per-directory limit, which leads to rare flakes.
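(For context, a sketch of the resulting env block with comments on what each variable is assumed to control; the mapping of KUBE_TIMEOUT onto the go test -timeout flag, applied per test directory, is an assumption based on this thread rather than quoted from the job config.)

  env:
  - name: KUBE_TIMEOUT
    value: "-timeout=30m"   # assumed per-directory go test timeout, raised from 20m
  - name: KUBE_RACE
    value: "-race"          # enables the Go race detector for the integration tests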
I think I might have understood what you meant now: I can safely reduce the job limit to e.g. 90min based on how long it takes in practice. Then if an individual test times out after 30min, we abort after 90min instead of 120min.
Doesn't look like a significant change, though?
Hmm, how do I actually set the job-level timeout? https://docs.prow.k8s.io/docs/jobs/ doesn't mention it.
I see

  decorate: true
  decoration_config:
    timeout: 5h

but where is that documented?
https://docs.prow.k8s.io/docs/components/pod-utilities/ mentions "decorate: true" and links to https://docs.prow.k8s.io/docs/components/deprecated/plank/
Sorry, I digress. Let's just use copy-and-paste... done.
Prow could do with revamped docs, amongst other things; SIG Testing is really light on maintainers there at the moment.
But yes, decoration_config timeout is it.
So is the PR okay now? I already reduced the timeout to 90 minutes.
I would suggest placing it much closer to the intended timeout of the workload, but the PR is fine to merge.
Closer would be something like this:
  decoration_config:
    timeout: 90m
  spec:
    containers:
    - image: us-central1-docker.pkg.dev/k8s-staging-test-infra/images/kubekins-e2e:v20250925-95b5a2c7a5-master
      command:
      - runner.sh
      env:
      - name: KUBE_TIMEOUT
        value: "-timeout=30m"
      - name: KUBE_RACE
        value: "-race"
IMHO that's not "close enough" to make the connection; some comments would have been better. It's also an unusual place for decoration_config compared to other jobs. Remember that at some point something besides timeout might need to be configured there.
I prefer to keep it as is and don't want to delay further to add comments.
/hold cancel
There were still a few job runs where some tests (most recently: test/integration/scheduler_perf/misc) timed out. We could split that up a bit more, but as integration testing with race detection isn't something that needs to complete quickly, it's simpler to raise the timeout. To prevent accidental long job runs when this individual timeout gets reached by a higher number of packages, the job timeout gets reduced from 2h (the default) to 90m.
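(Illustrative sketch of the two timeouts described above combined in one decorated job, assuming the layout shown earlier in the thread; not the exact merged config.)

  decoration_config:
    timeout: 90m               # job-level cap, reduced from the config-wide 2h default
  spec:
    containers:
    - command:
      - runner.sh
      env:
      - name: KUBE_TIMEOUT
        value: "-timeout=30m"  # per-directory test timeout, raised from 20m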
Force-pushed from dbff3b4 to 94dee86.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, pohly. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@pohly: Updated the

In response to this:

There were still a few job runs where some tests (most recently: test/integration/scheduler_perf/misc) timed out. We could split that up a bit more, but as integration testing with race detection isn't something that needs to complete quickly, it's simpler to raise the timeout.
/assign @BenTheElder

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.