Conversation

SamWheating

What changes were proposed in this pull request?

This brings in a fix proposed by @EnricoMi on SPARK-52090. I have made some minor fixes and added test coverage, but overall this is his suggestion.

Opening this PR as I want to get early feedback on this approach - I am new to the internals of Spark so I would really appreciate any input or suggestions.

Why are the changes needed?

Graceful decommissioning on Kubernetes does not currently work: almost every time an executor pod is decommissioned, we see FetchFailedExceptions that trigger stage retries.

As a result, it is very hard to run Spark jobs efficiently on an autoscaled cluster, as the evictions performed as a part of cluster autoscaling impede job progress. Marking pods as unsafe-to-evict fixes this, but can prevent efficient cluster scaling and reduce overall cluster resource allocation.
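For reference, graceful decommissioning is typically switched on with configuration along these lines. This is only a sketch: the exact keys and defaults depend on the Spark version, and the `safe-to-evict` annotation belongs to the Kubernetes Cluster Autoscaler, not to Spark itself.

```
# Sketch: enable graceful decommissioning with shuffle-block migration
# (check the configuration docs for your Spark version).
spark-submit \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  ...

# The "unsafe-to-evict" workaround mentioned above is a pod annotation,
# which can be set via Spark's executor annotation passthrough, e.g.:
#   spark.kubernetes.executor.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict=false
```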

Does this PR introduce any user-facing change?

No

How was this patch tested?

I added a unit test in `ShuffleBlockFetcherIteratorSuite`, and also ran some jobs against a patched build.

This seems to fix the main issue, but we are now seeing other issues during decommissioning related to SPARK-38101, which I will look into separately.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Aug 19, 2025
@EnricoMi
Contributor

You could rebase this PR onto my branch. Then the contributions are clear from the git history.

@SamWheating SamWheating force-pushed the sw-fix-graceful-decom-fetchfailed branch from 0de85ea to a75c5c8 Compare August 21, 2025 18:12
```scala
val newBlocksByAddr = blockId match {
  case ShuffleBlockId(shuffleId, _, reduceId) =>
    mapOutputTracker.unregisterShuffle(shuffleId)
    mapOutputTracker.getMapSizesByExecutorId(
```
Should the updated map output be available on the executor immediately, or should the flow sleep before reattempting the fetch?
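To illustrate the pattern under discussion, here is a minimal, self-contained sketch of "refresh on fetch failure": drop the stale shuffle mapping and re-resolve the block location before retrying. All names here (`Tracker`, `Location`, `fetchWithRefresh`) are hypothetical stand-ins for Spark's `MapOutputTracker` machinery, not the real API, and the sketch retries immediately with no sleep, which is exactly the open question above.

```scala
object RefreshOnFetchFailure {
  final case class Location(executorId: String)

  // Hypothetical stand-in for Spark's MapOutputTracker: maps a shuffle id
  // to the location of its map output, with a fallback once unregistered.
  class Tracker(initial: Map[Int, Location], fallback: Location) {
    private var locations = initial
    def unregisterShuffle(shuffleId: Int): Unit = locations -= shuffleId
    def getLocation(shuffleId: Int): Location =
      locations.getOrElse(shuffleId, fallback)
  }

  // On a failed fetch, unregister the stale mapping and retry once
  // against the refreshed location (analogous to unregisterShuffle
  // followed by a fresh getMapSizesByExecutorId call in the diff above).
  def fetchWithRefresh(tracker: Tracker, shuffleId: Int)
                      (fetch: Location => Either[String, String]): Either[String, String] =
    fetch(tracker.getLocation(shuffleId)) match {
      case Left(_) =>
        tracker.unregisterShuffle(shuffleId)   // drop the stale mapping
        fetch(tracker.getLocation(shuffleId))  // retry with the refreshed one
      case ok => ok
    }
}
```

In this toy version a single retry suffices because the fallback location is already correct; in a real cluster the refreshed mapping may itself lag, which is why a delay (or bounded retry loop) might be worth considering.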
