pohly
Contributor

@pohly pohly commented Oct 9, 2025

There were still a few job runs where some tests (most recently: test/integration/scheduler_perf/misc) timed out. We could split that up a bit more, but as integration testing with race detection isn't something that needs to complete quickly, it's simpler to raise the timeout.

/assign @BenTheElder

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 9, 2025
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 9, 2025
     env:
     - name: KUBE_TIMEOUT
    -  value: "-timeout=20m"
    +  value: "-timeout=30m"
Member

TODO: set the job-level timeout in case this hangs; right now it's the 2h default we have config-wide.

Contributor Author

Sorry, I'm not following. Are you suggesting to increase the job timeout?

The job passes in ~70min pretty consistently: https://testgrid.k8s.io/sig-testing-canaries#integration-race-master&graph-metrics=test-duration-minutes

The problem is that scheduler_perf/misc and, to a lesser extent, scheduler_perf.affinity are close to the 20min per-directory limit, which leads to rare flakes.

Contributor Author

I might have understood what you meant: I can safely reduce the job limit to e.g. 90min, based on how long it takes in practice. Then if each individual test times out after 30min, we abort after 90min instead of 120min.

Doesn't look like a significant change, though?

Hmm, how do I actually set the job-level timeout? https://docs.prow.k8s.io/docs/jobs/ doesn't mention it.

I see

    decorate: true
    decoration_config:
      timeout: 5h

but where is that documented?

https://docs.prow.k8s.io/docs/components/pod-utilities/ mentions "decorate: true" and links to https://docs.prow.k8s.io/docs/components/deprecated/plank/

Sorry, I digress. Let's just use copy-and-paste... done.

Member

Prow could do with revamped docs, amongst other things; SIG Testing is really light on maintainers there at the moment.

But yes, decoration_config timeout is it.

Contributor Author

So is the PR okay now? I already reduced the timeout to 90 minutes.

Member

I would suggest placing it much closer to the intended timeout of the workload, but the PR is fine to merge.

Contributor Author

Closer would be something like this:

    decoration_config:
      timeout: 90m
    spec:
      containers:
      - image: us-central1-docker.pkg.dev/k8s-staging-test-infra/images/kubekins-e2e:v20250925-95b5a2c7a5-master
        command:
        - runner.sh
        env:
        - name: KUBE_TIMEOUT
          value: "-timeout=30m"
        - name: KUBE_RACE
          value: "-race"

IMHO that's not "close enough" to make the connection. Some comments would have been better.

It's also an unusual place for decoration_config compared to other jobs. Remember that at some point something besides timeout might need to be configured there.

I prefer to keep it as is and don't want to delay further to add comments.

/hold cancel

There were still a few job runs where some tests (most recently:
test/integration/scheduler_perf/misc) timed out. We could split that up a bit
more, but as integration testing with race detection isn't something that needs
to complete quickly, it's simpler to raise the timeout.

To prevent accidental long job runs when this individual timeout gets
reached by a higher number of packages, the job timeout gets reduced
from 2h (the default) to 90m.
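
The interaction between the per-package timeout and the job-level timeout can be illustrated with a little arithmetic. This is only a rough sketch, not part of the job configuration: it assumes, purely for illustration, that packages run one after another and that a hanging package consumes its full per-package timeout. The helper name `max_full_timeouts` is hypothetical; the 70-minute typical runtime is the approximate figure quoted in the discussion.

```python
from datetime import timedelta


def max_full_timeouts(job_timeout: timedelta, package_timeout: timedelta) -> int:
    """Count how many packages can each hit the full per-package timeout
    before the job-level timeout aborts the run.

    Assumption (not from the PR): packages run sequentially.
    """
    if package_timeout <= timedelta(0):
        raise ValueError("package_timeout must be positive")
    # Floor division of two timedeltas yields an int.
    return job_timeout // package_timeout


package_timeout = timedelta(minutes=30)  # KUBE_TIMEOUT: "-timeout=30m"
default_job = timedelta(hours=2)         # config-wide default job timeout
reduced_job = timedelta(minutes=90)      # decoration_config timeout: 90m
typical_run = timedelta(minutes=70)      # approximate runtime of a passing job

for label, job_timeout in (("default", default_job), ("reduced", reduced_job)):
    hangs = max_full_timeouts(job_timeout, package_timeout)
    headroom = job_timeout - typical_run
    print(f"{label}: up to {hangs} full package timeouts, "
          f"{headroom.total_seconds() / 60:.0f} min headroom over a typical run")
```

Under these assumptions, the 90m job limit still leaves about 20 minutes of slack over a typical passing run while cutting the worst case by half an hour.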
@pohly pohly force-pushed the integration-race-timeout branch from dbff3b4 to 94dee86 Compare October 10, 2025 06:22
@BenTheElder
Member

/lgtm
/approve
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 16, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, pohly


@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Oct 16, 2025
@k8s-ci-robot k8s-ci-robot merged commit e3ec1c4 into kubernetes:master Oct 17, 2025
6 checks passed
@k8s-ci-robot
Contributor

@pohly: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key integration.yaml using file config/jobs/kubernetes/sig-testing/integration.yaml


