
Conversation

mdellweg
Member

TODO Clean up changelog before approving!!!

fixes #6873 (Tasks stuck at "waiting" in 3.84.0)

@github-actions github-actions bot added the wip label Aug 21, 2025
@mdellweg
Member Author

@daviddavis would you mind taking a look at this?

@daviddavis
Contributor

I'm not sure about the code, since I haven't worked with Pulp's tasking system in a while, but I upgraded pulpcore to 3.84 in our Pulp image, applied this change as a patch, and we're still getting stuck tasks:

-[ RECORD 1 ]-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pulp_id                   | 0198ce0f-dd83-7f52-9c34-56670bb89ef3
pulp_created              | 2025-08-21 19:16:43.061592+00
pulp_last_updated         | 2025-08-21 19:16:43.013158+00
state                     | waiting
name                      | pulpcore.app.tasks.base.general_create
started_at                |
finished_at               |
error                     |
worker_id                 |
parent_task_id            |
logging_cid               | ab1a0818d3db40ff8ccbec75fdd22d29
reserved_resources_record | {shared:prn:core.domain:431fb816-c18e-4b8a-83c8-8b5b0013f67c}
task_group_id             |
pulp_domain_id            | 431fb816-c18e-4b8a-83c8-8b5b0013f67c
versions                  | "core"=>"3.84.0"
enc_args                  | ["gAAAAABop3Cbk_oPaXSI_oQi8ZxHNMGwJhRKGQJYj9jAgnDFl4ibh24pywSeO979poZDj4R2deeAJySKjVTe2VcTsVZxnjaRWA==", "gAAAAABop3Cbt3QHMmKIc5Y9lqfcOPrsMZooDkjKx2c3ZJtEBbTUh-RKE7yIvwe9xILMZlKpKvBd32k6smMVgTXDqQbe4GRR2MZKUCB9m3HXfev7PfBN5j4="]
enc_kwargs                | {"data": {"artifact": "gAAAAABop3CbpAdLGGx5lZGK2V8usCkn9KkGKYW__T19T439aCOxmRgF3sIDGvHXgLYPLSARCBxDknACHY0fqbzRM2gwNtzvmFJEo4K47nny9vZD8N4MWQGP59hdbQjL06eP4d2erpRb94hHVtmpBDODZzDbYQtXZA==", "relative_path": "gAAAAABop3CbEM7EUBe9Kr4D_ePc5t0p16Zb5Up45FcFew5_XgV04xhnpM0vynK00ZqZBx-o7UsfMaGrvob6yQZ3hm3DXRKBXQ=="}, "context": {}}
unblocked_at              |
deferred                  | t
immediate                 | f
profile_options           |
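(As an aside, records like this can be listed from a Django shell with a small ORM query; the sketch below is an assumption based on the field names in the output above, not a command taken from this report.)

# Minimal sketch: list tasks stuck in "waiting" that were never unblocked.
# Field names (state, unblocked_at, name) are taken from the record above.
from pulpcore.app.models import Task

for task in Task.objects.filter(state="waiting", unblocked_at=None):
    print(task.pulp_id, task.name, task.pulp_created)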

@mdellweg
Member Author

So there must be a way for a new task to sneak in while the (single available) worker is unblocking, so that the signal gets lost. I still cannot really see it, though.
Even weirder: due to a bad typo, the signal condition to unblock tasks was never reset.

@mdellweg
Member Author

mdellweg commented Sep 8, 2025

Is canceling a task part of the story when this happens? (I'm wondering if canceling a task properly retriggers unblocking subsequent tasks now.)

@daviddavis
Contributor

daviddavis commented Sep 8, 2025

No, task cancellation isn't involved. An interesting find from testing things out, though: creating a new task causes the task stuck at "waiting" to be processed.

By the way if you wanted to experiment yourself, I think it should be really easy to reproduce: just run a single worker and start two tasks in parallel (or quick succession). Happens like 80-90% of the time for me.

@mdellweg
Member Author

> [...] An interesting find from testing things out, though: creating a new task causes the task stuck at "waiting" to be processed.

That is absolutely expected. And it is one of the reasons why I'm not too concerned about this issue, at least for busier installations.

> By the way if you wanted to experiment yourself, I think it should be really easy to reproduce: just run a single worker and start two tasks in parallel (or quick succession). Happens like 80-90% of the time for me.

Interesting. I didn't get it to happen even once...
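For anyone else trying to reproduce this, here is a minimal sketch of David's recipe, run from a Django shell (pulpcore-manager shell) against a deployment with exactly one task worker. The test task and its argument are assumptions based on the logs further down; dispatch() is pulpcore's regular task dispatch helper.

import time

from pulpcore.app.tasks.test import sleep  # test task seen in the logs below
from pulpcore.tasking.tasks import dispatch

# With only one worker running, dispatch two tasks back to back.
# In the buggy code path the second task can remain in "waiting"
# until some unrelated event wakes the worker up again.
task1 = dispatch(sleep, args=(5,))  # assumed signature: sleep(interval)
task2 = dispatch(sleep, args=(5,))

time.sleep(15)  # give the single worker ample time to finish the first task
for task in (task1, task2):
    task.refresh_from_db()
    print(task.pulp_id, task.state)  # in the buggy case task2 stays "waiting"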

@mdellweg mdellweg force-pushed the unblock_signalling branch 3 times, most recently from 6d702ce to d992236 Compare September 24, 2025 10:32
@mdellweg mdellweg marked this pull request as ready for review September 24, 2025 10:32
@github-actions github-actions bot removed the wip label Sep 24, 2025
@mdellweg mdellweg changed the title WIP: Fix signalling around unblock Fix signalling around unblock Sep 24, 2025
@pedro-psb
Member

Ok, I get what is happening.

In the sleep state (also in the supervise code, fwiw), the worker will only wake up to unblock if there was a message in the pg connection in the first place.

if connection.connection in r:
    connection.connection.execute("SELECT 1")
    if self.wakeup_unblock:
        self.unblock_tasks()

But it can happen (safe to say that it IS happening) that something flushes the connection before the worker hits the select.select. Not sure exactly what, maybe a heartbeat, some janitorial work or whatever. It's not a big surprise.

So even though worker.wakeup_unblock=True at this point (flushing the pg connection means the pg_handler was called and updated the worker state), the connection is now empty, and when the worker hits select.select it won't try to unblock, because the unblock depends on the connection having something in it.

So we should have something like this in the places where this happens:

if connection.connection in r:
    connection.connection.execute("SELECT 1")
-    if self.wakeup_unblock:
-        self.unblock_tasks()
+if self.wakeup_unblock:
+    self.unblock_tasks()

Here are some actual (commented) logs from my experiments:

--------------------------------------------------------------------------------
App.publish(wakeup-unblock)
Worker.received(pulp_worker_wakeup:unblock)
Worker.wakeup(unblock)
Worker.publish(handle)
Worker.received(pulp_worker_wakeup:handle)
Worker.wakeup(handle) self.wakeup_unblock=False self.wakeup_handle=True
Worker.started(pulpcore.app.tasks.test.sleep)
Worker.connection(has-msg=False)
--------------------------------------------------------------------------------
# At this point, the worker is on supervise and handles the dispatch wakeup notification
App.publish(wakeup-unblock)
Worker.received(pulp_worker_wakeup:unblock)

# here the worker is finishing the task, triggered a new unblock notify and received the notification itself
# when the subscriber is the same as the publisher, the connection buffer is not used...
# We already have self.wakeup_unblock=True, anyway
Worker.connection(has-msg=False)  # spying to see if there is something in the conn buffer. Never more
Worker.publish(unblock)
Worker.received(pulp_worker_wakeup:unblock)
Worker.finished(pulpcore.app.tasks.test.sleep)
Worker.connection(has-msg=False)

# We go to sleep state with self.wakeup_unblock=True but nothing in the connection buffer
Worker.sleep() self.wakeup_unblock=True self.wakeup_handle=False
Worker.connection(has-msg=False)

# and it ends here...
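To make the lost wakeup concrete outside of pulpcore, here is a minimal, self-contained sketch of the pattern described above (not the actual worker code; the pg connection is replaced by a socket pair). The notify handler sets a flag, and the buggy sleep loop only honors that flag when the file descriptor happened to be readable.

import select
import socket


class WorkerSketch:
    def __init__(self):
        # rx stands in for the pg connection; tx is everyone else sending NOTIFYs.
        self.rx, self.tx = socket.socketpair()
        self.wakeup_unblock = False

    def drain(self):
        # Stands in for connection.execute("SELECT 1"), which drains pending
        # notifications and lets the notify handler update the worker state.
        while select.select([self.rx], [], [], 0)[0]:
            self.rx.recv(4096)
            self.wakeup_unblock = True  # what the pg notify handler would do

    def sleep_once_buggy(self, timeout=1.0):
        r, _, _ = select.select([self.rx], [], [], timeout)
        if self.rx in r:
            self.drain()
            if self.wakeup_unblock:  # flag only checked when the fd was readable
                print("unblock_tasks()")
                self.wakeup_unblock = False

    def sleep_once_fixed(self, timeout=1.0):
        r, _, _ = select.select([self.rx], [], [], timeout)
        if self.rx in r:
            self.drain()
        if self.wakeup_unblock:  # fix: check the flag unconditionally
            print("unblock_tasks()")
            self.wakeup_unblock = False


worker = WorkerSketch()
worker.tx.send(b"wakeup-unblock")
worker.drain()             # something drains the connection before the sleep loop runs
worker.sleep_once_buggy()  # prints nothing: the wakeup is lost
worker.sleep_once_fixed()  # prints unblock_tasks(): the pending flag is honored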

@pedro-psb
Member

PS: this is basically what I wanted to make more predictable with the pubsub interface specification we discussed a while back, more specifically with this not-so-elegant but apparently functional implementation of fileno().
In the context of the pubsub interface, and from a user perspective (e.g. the worker), fileno() doesn't refer specifically to the pg connection, which is an implementation detail, but to an abstract internal buffer that either has or hasn't a message in it.
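As a rough illustration of that idea (the names and structure here are my guesses, not the actual interface specification): a subscriber can buffer decoded messages internally and let fileno() report readiness of that buffer via a self-pipe, independent of the state of the underlying pg socket.

import os
from collections import deque


class SubscriberSketch:
    def __init__(self):
        self._messages = deque()
        # Self-pipe trick: one byte per buffered message, so select()/poll() on
        # fileno() wakes up exactly when a message is pending in the buffer.
        self._read_fd, self._write_fd = os.pipe()

    def _on_notify(self, payload):
        # Called by the transport (e.g. the pg notify handler) for each message.
        self._messages.append(payload)
        os.write(self._write_fd, b"\x00")

    def fileno(self):
        # Readable if and only if the internal buffer is non-empty.
        return self._read_fd

    def get(self):
        if not self._messages:
            return None
        os.read(self._read_fd, 1)
        return self._messages.popleft()

A worker could then pass such an object straight to select.select([subscriber], [], [], timeout) and never lose a buffered message to an early drain of the pg connection.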

@mdellweg
Member Author

Great analysis!

So what we could do in the current implementation is (in addition to what you suggested already) reduce the timeout of select to 0 whenever some message is pending. Or maybe we want to introduce a bit of laziness there deliberately.
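In code, that tweak could be as small as the following sketch (reusing the names from the snippets above; NORMAL_SLEEP_SECONDS stands in for whatever the regular sleep interval is, not an actual pulpcore constant):

# Sketch, not the actual worker code: skip the wait entirely when a wakeup
# flag is already pending, otherwise sleep for the usual interval.
timeout = 0 if (self.wakeup_unblock or self.wakeup_handle) else NORMAL_SLEEP_SECONDS
r, _, _ = select.select([connection.connection], [], [], timeout)
if connection.connection in r:
    connection.connection.execute("SELECT 1")
if self.wakeup_unblock:
    self.unblock_tasks()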

@mdellweg
Member Author

> PS: this is basically what I wanted to make more predictable with the pubsub interface specification we discussed a while back, more specifically with this not-so-elegant but apparently functional implementation of fileno(). In the context of the pubsub interface, and from a user perspective (e.g. the worker), fileno() doesn't refer specifically to the pg connection, which is an implementation detail, but to an abstract internal buffer that either has or hasn't a message in it.

These situations make me wonder (accepting that any sort of non-serial execution is complex) whether the whole concept could be expressed more intuitively in async Python.
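For what it's worth, here is a tiny illustrative sketch (not a proposal for the actual worker) of how the wakeup bookkeeping might look with asyncio: an asyncio.Event per wakeup reason cannot be lost, because set() persists until the waiter clears it, regardless of when the notification was handled.

import asyncio


class AsyncWorkerSketch:
    def __init__(self):
        self.wakeup_unblock = asyncio.Event()

    async def on_notify(self, channel):
        # Would be driven by an async pg client's notification stream.
        if channel == "pulp_worker_wakeup:unblock":
            self.wakeup_unblock.set()

    async def run_once(self):
        # wait() returns immediately if the event was already set,
        # so a notification handled "too early" is never lost.
        await self.wakeup_unblock.wait()
        self.wakeup_unblock.clear()
        print("unblock_tasks()")


async def main():
    worker = AsyncWorkerSketch()
    await worker.on_notify("pulp_worker_wakeup:unblock")  # signal arrives first
    await worker.run_once()                               # the worker still wakes up


asyncio.run(main())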

@pedro-psb pedro-psb left a comment
Member

Nice, it makes sense to shorten the timeout. I'm good with 0 for now.

A little on the fence about the pacemaker, but it's fine. We'll ensure tasks get unblocked if anything goes wrong.

@mdellweg mdellweg merged commit e5a125b into pulp:main Sep 25, 2025
25 of 27 checks passed
@mdellweg mdellweg deleted the unblock_signalling branch September 25, 2025 12:19

patchback bot commented Sep 25, 2025

Backport to 3.90: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.90/e5a125b1946f7394407ae5ca186d17bdb50286ee/pr-6876

Backported as #6979

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.
