-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-53339][CONNECT] Fix an issue which occurs when an operation in pending state is interrupted #52083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
cc: @peter-toth |
cc: @dongjoon-hyun too. |
Thank you for pinging me, @sarutak . I just came back from my vacation today. |
@@ -243,6 +243,9 @@ private[connect] class ExecuteHolder( | |||
* true if it was not interrupted before, false if it was already interrupted. | |||
*/ | |||
def interrupt(): Boolean = { | |||
if (eventsManager.status == ExecuteStatus.Pending) { | |||
return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
According to the function description,
false
is already occupied for the status where it was already interrupted. -
For the code change, according to the state transition, can we change the status to
ExecuteStatus.Canceled
status directly from the currentExecuteStatus.Pending
because it's not started yet? In this case, we can returntrue
.
Lines 38 to 47 in 79a0ca7
object ExecuteStatus { | |
case object Pending extends ExecuteStatus(0) | |
case object Started extends ExecuteStatus(1) | |
case object Analyzed extends ExecuteStatus(2) | |
case object ReadyForExecution extends ExecuteStatus(3) | |
case object Finished extends ExecuteStatus(4) | |
case object Failed extends ExecuteStatus(5) | |
case object Canceled extends ExecuteStatus(6) | |
case object Closed extends ExecuteStatus(7) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the code change, according to the state transition, can we change the status to ExecuteStatus.Canceled status directly from the current ExecuteStatus.Pending because it's not started yet? In this case, we can return true.
Actually, that was my first idea to solve this issue. But as I mentioned in the description, I found that didn't work because transitioning from Pending
to Canceled
causes another issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the function description, false is already occupied for the status where it was already interrupted.
Hmm, if it's OK to ignore interruption to a pending state operation and we need exactly tell already interrupted
from interruption failed
, how about returning the exact interruption result rather than boolean
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thank you. As long as the code and description are consistent, I'm okay for both. (1) Updating the description by changing the meaning of false
and (2) changing the return types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion. I'll simply update the description.
BTW, thank you for the investigation to identify the root cause. cc @grundprinzip , @hvanhovell , @zhengruifeng to ask if this was the intentional design of state transition or not. |
There are two things at play here: the internal state of the operation itself and the notification on the listener bus. If this patch simply ignores the interrupt on an operation in a pending state, there is a new edge case where we can never cancel this operation if it's stuck in a pending state for whatever reason. Previously, it seems that while the physical query was cancelled, only the observable operation state on the listener bus was not properly handled. I understand that there is another race condition when the interrupt happens right between the incoming request and the different posting states. I think the better solution is not to ignore the interruption, but we need to figure out how to avoid double-posting of events. |
Given that this is one of the long standing At the same time, we can discuss more in order to figure out the correct steps in Apache Spark 4.1.0 timeframe as @grundprinzip suggested in the above. |
@dongjoon-hyun |
What changes were proposed in this pull request?
This PR fixes an issue which occurs when an operation in pending state is interrupted.
Once an operation in pending state is interrupted, the interruption and following all interruption for the operation never work correctly.
You can easily reproduce this issue by modifying
SparkConnectExecutionManager#createExecuteHolderAndAttach
like as follows.And then run a test
interrupt all - background queries, foreground interrupt
inSparkSessionE2ESuite
.You will see the following error.
If an operation in pending state is interrupted, the interruption is handled in
ExecuteHolder#interrupt
and ErrorUtils.handleError is called inErrorUtils#handleError
, the operation status transitions toCanceled
by calling executeEventsManager.postCanceled.But
postCanceled
does not expect transition from pending state so an exception is thrown and propagated to the caller ofExecuteThreadRunner#interrupt
.The reason following all interruptions for the same operation never works correctly is that
ExecuteThreadRunner#state
has already been changed tointerrupted
here at the first call ofExecuteThreadRunner#interrupt
and following interruptions don't enter this loop and this method always returnsfalse
, causing the result of interruption is not correctly recognized.Another solution would be changing
postCanceled
to transition from pending state but it causes another issue.If it's changed as such,
postStarted
here could be called for an operation in calceled state.So, this PR just ignores interruption when the target operation is pending state.
Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add a test which confirms that interruption is ignored when the target operation is pending state.
I also confirmed that
SparkSessionE2ESuite
mentioned above succeeded.Was this patch authored or co-authored using generative AI tooling?
No.