Fix signalling around unblock #6876
Conversation
@daviddavis would you mind taking a look at this?
I'm not sure about the code as I haven't worked with Pulp's tasking system in a while, but I upgraded pulpcore to 3.84 in our pulp image and applied this change as a patch, and we're still getting stuck tasks.

So there must be a way for a new task to sneak in while the (single available) worker is unblocking, so that the signal gets lost. I still cannot really see it, though.
Is canceling a task part of the story when this happens? (I'm wondering if canceling a task properly retriggers unblocking subsequent tasks now.)
No, task cancellation isn't involved. Interesting find from testing things out, though: creating a new task causes the task stuck at "waiting" to be processed. By the way, if you want to experiment yourself, I think it should be really easy to reproduce: just run a single worker and start two tasks in parallel (or in quick succession). Happens like 80-90% of the time for me.
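A minimal sketch of that reproduction (the dispatch call exists in pulpcore, but the exact signature of the test sleep task is an assumption here):

```python
# Hypothetical reproduction, e.g. from `pulpcore-manager shell`, against a
# deployment with exactly one worker running.
from pulpcore.tasking.tasks import dispatch

# Dispatch two tasks back to back:
dispatch("pulpcore.app.tasks.test.sleep", args=(5,))
dispatch("pulpcore.app.tasks.test.sleep", args=(5,))
# The second task then frequently stays in "waiting" until some later
# event (e.g. dispatching yet another task) wakes the worker up again.
```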
That is absolutely expected, and one of the reasons why I'm not too concerned about this issue, at least for rather busy installations.
Interesting. I didn't get it to happen even once...
Force-pushed from 6d702ce to d992236
Force-pushed from d992236 to 1077a36
Ok, I get what is happening. In the sleep state (also in the supervise code, fwiw), the worker will only wake up to unblock if there was a message in the pg connection in the first place (pulpcore/pulpcore/tasking/worker.py, lines 370 to 373 at 14322ad).

But it can happen (safe to say that it IS happening) that something flushes the connection before the worker hits the select. So even though self.wakeup_unblock is set, the worker never reaches unblock_tasks(). So we should have something like this in the places where this happens:

```diff
 if connection.connection in r:
     connection.connection.execute("SELECT 1")
-    if self.wakeup_unblock:
-        self.unblock_tasks()
+if self.wakeup_unblock:
+    self.unblock_tasks()
```

Here are some actual (commented) logs from my experiments:

```
--------------------------------------------------------------------------------
App.publish(wakeup-unblock)
Worker.received(pulp_worker_wakeup:unblock)
Worker.wakeup(unblock)
Worker.publish(handle)
Worker.received(pulp_worker_wakeup:handle)
Worker.wakeup(handle) self.wakeup_unblock=False self.wakeup_handle=True
Worker.started(pulpcore.app.tasks.test.sleep)
Worker.connection(has-msg=False)
--------------------------------------------------------------------------------
# At this point, the worker is on supervise and handles the dispatch wakeup notification
App.publish(wakeup-unblock)
Worker.received(pulp_worker_wakeup:unblock)
# here the worker is finishing the task, triggered a new unblock notify and received the notification itself
# when the subscriber is the same as the publisher, the connection buffer is not used...
# We already have self.wakeup_unblock=True, anyway
Worker.connection(has-msg=False) # spying to see if there is something in the conn buffer. Never more
Worker.publish(unblock)
Worker.received(pulp_worker_wakeup:unblock)
Worker.finished(pulpcore.app.tasks.test.sleep)
Worker.connection(has-msg=False)
# We go to sleep state with self.wakeup_unblock=True but nothing in the connection buffer
Worker.sleep() self.wakeup_unblock=True self.wakeup_handle=False
Worker.connection(has-msg=False)
# and it ends here...
```
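Here is a self-contained sketch of the race, using a socket pair in place of the pg connection (illustrative only, not pulpcore code):

```python
import select
import socket

# The socket pair stands in for the worker's Postgres connection.
notify_r, notify_w = socket.socketpair()
wakeup_unblock = False

def drain(sock):
    """Read pending bytes and set the flag, like the NOTIFY handler does."""
    global wakeup_unblock
    sock.recv(4096)
    wakeup_unblock = True

# Something flushes the connection *before* the sleep loop runs select():
notify_w.send(b"x")
drain(notify_r)  # wakeup_unblock is now True, but the buffer is empty

r, _, _ = select.select([notify_r], [], [], 0.5)
if notify_r in r:        # never true here: there is nothing left to read,
    if wakeup_unblock:   # so the flag is never acted upon (the bug)
        print("unblocking (old code path)")

# The fix: check the flag unconditionally, outside the readability branch.
if wakeup_unblock:
    print("unblocking (fixed code path)")
```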
ps: this is basically what I wanted to make more predictable with the pubsub interface specification we've discussed a while back, more specifically with this not-so-elegant but apparently functional implementation of fileno().
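For illustration, a self-pipe is one way such a fileno() can work: buffered messages keep the descriptor readable, so select() always sees pending work even after the pg socket was drained. The names below are made up and this is not the actual pulpcore pubsub interface:

```python
import os
import select

class Subscriber:
    """Select()-compatible subscriber: buffered messages keep fileno() readable."""

    def __init__(self):
        self._buffer = []
        self._read_fd, self._write_fd = os.pipe()

    def fileno(self):
        # select() watches this pipe instead of the database socket.
        return self._read_fd

    def deliver(self, message):
        # Called while draining the connection, once per NOTIFY payload.
        self._buffer.append(message)
        os.write(self._write_fd, b"\0")  # make fileno() readable

    def get(self):
        os.read(self._read_fd, 1)
        return self._buffer.pop(0)

sub = Subscriber()
sub.deliver("pulp_worker_wakeup:unblock")
r, _, _ = select.select([sub], [], [], 0)
assert sub in r  # the pending message is visible to select()
print(sub.get())
```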
Great analysis! So what we could do in the current implementation is (in addition to what you suggested already) reduce the timeout of select to 0 whenever some message is pending. Or maybe we want to introduce a bit of laziness there deliberately.
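What that could look like (a sketch with placeholder names, not the merged code):

```python
import select

SLEEP_TIMEOUT = 5.0  # placeholder; the worker's real sleep interval differs

def wait_for_wakeup(conn, wakeup_unblock, wakeup_handle):
    # If a wakeup flag is already pending, don't sleep: select() with a
    # timeout of 0 only polls the connection and returns immediately.
    timeout = 0 if (wakeup_unblock or wakeup_handle) else SLEEP_TIMEOUT
    return select.select([conn], [], [], timeout)
```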
Force-pushed from 1077a36 to 8bb933e
Force-pushed from 8bb933e to a6bac24
These situations make me wonder (accepting that any sort of non-serial execution is complex) whether the whole concept could be expressed more intuitively in async Python.
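For comparison, a hedged sketch of the same wakeup pattern with asyncio.Event, whose wait() checks the flag itself, so a notification that arrives before the wait cannot be lost (illustrative only, not a pulpcore design):

```python
import asyncio

async def worker_loop(unblock_event: asyncio.Event) -> None:
    while True:
        await unblock_event.wait()  # returns immediately if already set
        unblock_event.clear()
        print("unblocking tasks")

async def main() -> None:
    event = asyncio.Event()
    event.set()  # the "notification" fires before the worker ever waits
    task = asyncio.create_task(worker_loop(event))
    await asyncio.sleep(0.1)  # the worker still wakes up; nothing is lost
    task.cancel()

asyncio.run(main())
```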
Nice, it makes sense to shorten the timeout. I'm good with 0 for now.
A little on the fence about the pacemaker, but it's fine. It will ensure tasks get unblocked if anything goes wrong.
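For reference, a rough sketch of what such a pacemaker safety net could look like (an assumption; the PR's actual implementation is not shown here):

```python
import threading

def run_pacemaker(unblock_tasks, stop_event, interval=30.0):
    """Re-run the unblock pass every `interval` seconds as a safety net."""
    # Event.wait(interval) sleeps, but returns True early once stop is set.
    # A lost notification can then delay a task, but never strand it.
    while not stop_event.wait(interval):
        unblock_tasks()

# Usage (names hypothetical):
# threading.Thread(target=run_pacemaker,
#                  args=(worker.unblock_tasks, threading.Event()),
#                  daemon=True).start()
```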
Backport to 3.90: 💚 backport PR created ✅ Backported as #6979 🤖 @patchback
TODO Clean up changelog before approving!!!
fixes #6873