Conversation

SamWheating

What changes were proposed in this pull request?

This brings in a fix proposed by @EnricoMi on SPARK-52090. I have made some minor fixes and added test coverage, but overall this is his suggestion.

Opening this PR as I want to get early feedback on this approach - I am new to the internals of Spark so I would really appreciate any input or suggestions.

Why are the changes needed?

Graceful decommissioning on Kubernetes does not currently work: almost every time an executor pod is decommissioned, we see FetchFailedExceptions that trigger stage retries.

As a result, it is very hard to run Spark jobs efficiently on an autoscaled cluster, as the evictions performed as a part of cluster autoscaling impede job progress. Marking pods as unsafe-to-evict fixes this, but can prevent efficient cluster scaling and reduce overall cluster resource allocation.
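For reference, graceful decommissioning is typically switched on with configuration along these lines. This is only a sketch: the exact keys and defaults depend on the Spark version, and the `safe-to-evict` annotation belongs to the Kubernetes Cluster Autoscaler, not to Spark itself.

```
# Sketch: enable graceful decommissioning with shuffle-block migration
# (check the configuration docs for your Spark version).
spark-submit \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  ...

# The "unsafe-to-evict" workaround mentioned above is a pod annotation,
# which can be set via Spark's executor annotation passthrough, e.g.:
#   spark.kubernetes.executor.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict=false
```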

Does this PR introduce any user-facing change?

No

How was this patch tested?

I added a unit test in `ShuffleBlockFetcherIteratorSuite`, and also ran some jobs against a patched build.

This seems to fix the main issue, but we are now seeing other issues during decommissioning related to SPARK-38101, which I will look into separately.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Aug 19, 2025
@EnricoMi
Contributor

You could rebase this PR onto my branch. Then the contributions are clear from the git history.

@SamWheating SamWheating force-pushed the sw-fix-graceful-decom-fetchfailed branch from 0de85ea to a75c5c8 Compare August 21, 2025 18:12
```scala
val newBlocksByAddr = blockId match {
  case ShuffleBlockId(shuffleId, _, reduceId) =>
    mapOutputTracker.unregisterShuffle(shuffleId)
    mapOutputTracker.getMapSizesByExecutorId(
```
Should the updated map output be available on the executor immediately, or should the flow sleep before reattempting the fetch?
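To illustrate the pattern under discussion, here is a minimal, self-contained sketch of "refresh on fetch failure": drop the stale shuffle mapping and re-resolve the block location before retrying. All names here (`Tracker`, `Location`, `fetchWithRefresh`) are hypothetical stand-ins for Spark's `MapOutputTracker` machinery, not the real API, and the sketch retries immediately with no sleep, which is exactly the open question above.

```scala
object RefreshOnFetchFailure {
  final case class Location(executorId: String)

  // Hypothetical stand-in for Spark's MapOutputTracker: maps a shuffle id
  // to the location of its map output, with a fallback once unregistered.
  class Tracker(initial: Map[Int, Location], fallback: Location) {
    private var locations = initial
    def unregisterShuffle(shuffleId: Int): Unit = locations -= shuffleId
    def getLocation(shuffleId: Int): Location =
      locations.getOrElse(shuffleId, fallback)
  }

  // On a failed fetch, unregister the stale mapping and retry once
  // against the refreshed location (analogous to unregisterShuffle
  // followed by a fresh getMapSizesByExecutorId call in the diff above).
  def fetchWithRefresh(tracker: Tracker, shuffleId: Int)
                      (fetch: Location => Either[String, String]): Either[String, String] =
    fetch(tracker.getLocation(shuffleId)) match {
      case Left(_) =>
        tracker.unregisterShuffle(shuffleId)   // drop the stale mapping
        fetch(tracker.getLocation(shuffleId))  // retry with the refreshed one
      case ok => ok
    }
}
```

In this toy version a single retry suffices because the fallback location is already correct; in a real cluster the refreshed mapping may itself lag, which is why a delay (or bounded retry loop) might be worth considering.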
