@Neelabh94 (Contributor) commented Sep 22, 2025

The gke-hyperdisk-test has been failing continuously for roughly the last two months, even though no changes were made directly to the test. Root-cause analysis (RCA) traced the failure to version changes in its dependency modules.

This fix pins the versions of the transformers and datasets modules used in the hyperdisk example.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@Neelabh94 added the test-enhancement label (Tests enhancement or coverage improvement) Sep 22, 2025
@Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch from dd19029 to 11b9cc6 September 24, 2025 05:29
@Neelabh94 self-assigned this Sep 24, 2025
@Neelabh94 changed the title from "fix(gke): add compute.admin role to GKE node pool service account" to "fix(gke-hyperdisk-test): Pin the dependency versions" Sep 24, 2025
@Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch from 5eca840 to b6a98ca September 24, 2025 06:23
@Neelabh94 marked this pull request as ready for review September 24, 2025 09:47
@Neelabh94 requested review from samskillman and a team as code owners September 24, 2025 09:47
@Neelabh94 added the release-chore label (To not include into release notes) Sep 24, 2025
@bytetwin (Collaborator) left a comment

Can you point to the RCA showing how pinning the dependency versions makes the test pass? The test should not depend on a specific version of transformers, since it is not a core part of the blueprint under test.
Moreover, the Cloud docs for running TensorFlow examples do not pin the dependencies.
