Skip to content

Conversation

@andersfugmann
Copy link
Owner

@andersfugmann andersfugmann commented May 1, 2025

Potential solution to multipart message handling if INTR or EAGAIN is received.
In general, EAGAIN should never be received according to documentation, but it has to be verified. Potentially sending large multipart messages that exceeds local zmq buffers may cause EAGAIN to be raised (or the call to block).

This branch is based on IssuuArchive/ocaml-zmq#135 by @rand00

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces resumable variants for multipart send/receive operations to handle EINTR/EAGAIN, updates downstream bindings to use them, and adds a test for non-blocking EAGAIN behavior.

  • Add resumable type and *_r functions, deprecating their unsafe counterparts.
  • Update zmq-eio, zmq-deferred, and zmq-async bindings to use resumable functions and retry on signals.
  • Add a test for EAGAIN in nonblocking receive and update CHANGES.md.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
zmq/test/zmq_test.ml Added test_zmq_eagain and updated test suite names to use resumable variants.
zmq/src/zmq.mli Declared 'a resumable and new *_r functions; deprecated old functions.
zmq/src/zmq.ml Implemented *_wrapper_resumable and new *_r functions.
zmq-eio/src/socket.ml Switched to *_r resumable calls and catch EINTR in queue processing.
zmq-deferred/src/socket.ml Updated post to use resumable calls and handle EINTR/EAGAIN.
zmq-async/src/deferred.ml Fixed Async_kernel.Ivar deprecation by using fill_exn.
CHANGES.md Documented new resumable functions, deprecations, and version bump.
Comments suppressed due to low confidence (9)

zmq/src/zmq.mli:17

  • Fix spelling of "corretly" to "correctly".
    If EINTR is not handled corretly, the socket may be left with a half written/read

zmq/src/zmq.mli:21

  • Remove the duplicated "the the" and correct grammar (e.g., "avoid the socket from going into").
    For this reason, if EINTR is raised its important to repeat the the call to the resumable to avoid the the socket go into a broken

zmq-deferred/src/socket.ml:201

  • Fix typo in comment: change "Recevie" to "Receive".
    (** Recevie all message blocks. *)

CHANGES.md:5

  • Correct typo "EAGIN" to "EAGAIN".
*  to allow resuming if the calls raise EAGIN or EINTR.

CHANGES.md:4

  • [nitpick] Reference the correct function names: send_msg_all_r and recv_msg_all_r instead of send_all_msg_r/recv_all_msg_r.
*  Add `send_all_r`, `send_all_msg_r`, `recv_all_r` and `recv_all_msg_r`

zmq/test/zmq_test.ml:391

  • [nitpick] Update test name to "zmq exception EINTR" for clarity and consistency with EAGAIN test.
"zmq exception intr" >:: test_zmq_exception;

zmq/test/zmq_test.ml:392

  • [nitpick] Capitalize "EAGAIN" in the test name for consistency.
"zmq exception eagain" >:: test_zmq_eagain;

zmq/src/zmq.ml:502

  • [nitpick] Add unit tests covering recv_all_wrapper_resumable to verify it correctly retries on EINTR/EAGAIN and gathers all parts.
let recv_all_wrapper_resumable: (?block:bool -> _ t -> 'a) -> ?block:bool -> _ t -> 'a list resumable = fun f ?block socket ->

zmq/src/zmq.ml:519

  • [nitpick] Add unit tests for send_all_wrapper_resumable to ensure multipart sends resume correctly after EINTR/EAGAIN.
let send_all_wrapper_resumable: (?block:bool -> ?more:bool -> _ t -> 'a -> unit) -> ?block:bool -> _ t -> 'a list -> unit resumable = fun f ?block socket messages ->

@andersfugmann
Copy link
Owner Author

A little digging on the internet shows that we are not the only ones with problems with multipart messages. See. http://hintjens.com/blog:84. Its interesting that they are were already pondering about deprecating multipart messages back 10 years ago due to the high complexity.

Reading that, I'm inclined to simply ignore some of the problems - Modify the core implementation to always retry on EINTR and send all following multipart blocks in blocking mode (So EAGAIN may only be raised from the initial send).

@rand00
Copy link

rand00 commented May 8, 2025

Reading that, I'm inclined to simply ignore some of the problems - Modify the core implementation to always retry on EINTR and send all following multipart blocks in blocking mode (So EAGAIN may only be raised from the initial send).

Interesting read - but concerning this PR my observations are:

  • the ZMQ repo doesn't have much activity anymore, so I don't want to expect a change in API
  • they don't explain if/how pub/sub would implement topics when multipart was gone
  • the article expresses the goal of backwards compatibility - so multipart messages would probably not go away anyway - and the pub/sub API would not change either

Also - I think the nicest API for ocaml-zmq would be one where the user can choose to observe when EINTR happens. This is e.g. either possible via your suggestion of user manually looping around send - or could be via some optional logger hook. I'm fine with either - but my arguments for a logger-based solution would be:

  • (I guess) user wants to always retry on EINTR - so there is not much reason to expose this error
  • zmq as an API feels very simple and a lot of stuff happens 'behind the scenes' - exposing EINTR goes against this philosophy (even though zmq docs says that you should always do error-handling on any API call)

@rand00
Copy link

rand00 commented May 8, 2025

You were mentioning tests of:

  • EINTR being handled in the correct way - so zmq always continues working (i.e. process not being stuck)
  • EINTR being handled doesn't lead to process not being able to be signaled in general

What are your thoughts on:

  • the type of test - aren't both of these tests probabilistic in nature - as signals can hit zmq at different points of the control-flow - should they then run for a specified amount of time - e.g. 10min?
  • the existence of EINTR in itself communicates that a signal has been ignored - and should possibly be handled by finding out what signal was sent after-the-fact. This seems a bit orthogonal to fixing that EINTR is common for zmq - and should usually just lead to a retry of the operation

Edit:
From what I've read - I don't see what the correct/clean solution to EINTR happening is. If it's possible that the causing signal is e.g. SIGTERM - then it's a programmer-error to "just retry" no matter what the situaion is...


Edit:
Ahh, as I understand it the signal handler always run - and EINTR only happens on non-fatal signals - so SIGINT/SIGTERM can't lead to EINTR - unless one has overridden the signal handler

@rand00
Copy link

rand00 commented May 8, 2025

I btw. just reran my application handling EINTR with the following changes:

  • switched from ipc to tcp
  • avoided wrapping applications in dune exec

.. and the same behaviour continues - including same kind of strace output

@rand00
Copy link

rand00 commented May 8, 2025

As a sidenote I found this related info page: info --node='(libc)Interrupted Primitives' - where SA_RESTART can be set via sigaction - which will lead to the read/write being automatically retried instead of failing with EINTR. Though in OCaml's stdlib we don't have this available.. And ocaml-zmq should probably not set signals-handlers itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants