
Commit fc5266b

committed
man/io_uring_internal: Add man page about relevant internals for users
Adds a man page with details about the inner workings of io_uring that are likely to be useful for users as they relate to frequently misused flags of io_uring such as IOSQE_ASYNC and the taskrun flags. This mostly describes what needs to be done on the kernel side for each request, who does the work and most notably what the async punt is. Signed-off-by: Constantin Pestka <[email protected]>
1 parent 206650f commit fc5266b

File tree

1 file changed

+225
-0
lines changed


man/io_uring_internals.7

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
.TH io_uring_internals 7 2024-10-05 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring_internals \- inner workings of io_uring relevant to its users
.SH SYNOPSIS
.nf
.B "#include <linux/io_uring.h>"
.fi
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific asynchronous API that allows the submission of requests
to the kernel that would otherwise typically be performed via individual
syscalls. Requests are passed to the kernel via a shared ring buffer, the
.I Submission Queue
(SQ), and completion notifications are passed back to the application via the
.I Completion Queue
(CQ). An important detail here is that after a request has been submitted to
the kernel, some CPU time has to be spent in kernel space to perform the
required submission- and completion-related tasks.
The mechanism used to provide this CPU time, as well as which process
provides it and when, differs in
.I io_uring
from the traditional API provided by regular syscalls.

.PP
.SH Traditional Syscall Driven I/O
.PP
For regular syscalls, the CPU time for these tasks is provided directly by
the process issuing the syscall, with the submission-side tasks in kernel
space executed immediately after the context switch. In the case of polled
I/O, the CPU time for completion-related tasks is likewise provided directly
by the issuing process. In the case of interrupt-driven I/O, the CPU time is
provided, depending on the driver in question, either via the traditional
top- and bottom-half IRQ approach or via threaded IRQ handling. The CPU time
for completion tasks is thus provided by the CPU on which the hardware
interrupt arrives, as well as by the CPU to which the dedicated kernel worker
thread for threaded IRQ handling gets scheduled, if that is used.

.PP
.SH The Submission Side Work
.PP
The tasks required in kernel space on the submission side mostly consist of
checking the SQ for newly arrived SQEs, parsing them and checking them for
validity and permissions, and then passing them on to the responsible
subsystem, such as a block device driver. An important note here is that
.I io_uring
guarantees that submitting a request to the responsible subsystem, and thus
the
.IR io_uring_enter (2)
syscall made to submit new requests, will never block. However,
.I io_uring
relies on the capabilities of the responsible subsystem to perform the
submission without blocking:
.I io_uring
will first attempt to submit the request without blocking.
If this fails, e.g. because the respective subsystem does not support
non-blocking submission,
.I io_uring
will
.I async punt
the request, i.e. off-load it to the
.I IO work queue
(IO WQ) (see description below).

.PP
.SH The Completion Side Work
.PP
The tasks required in kernel space on the completion side mostly come in the
form of various request-type-dependent tasks, such as copying buffers,
parsing packet headers etc., as well as posting a CQE to the CQ to inform the
application of the completion of the request.

.PP
.SH Who Does the Work
.PP
One of the primary motivations behind
.I io_uring
was to reduce or entirely avoid the overhead of the syscalls used to provide
the required CPU time in kernel space. The mechanism that
.I io_uring
utilizes to achieve this differs depending on the configuration, with
different trade-offs between configurations with respect to e.g. CPU
efficiency and latency.

With the default configuration, the primary mechanism to provide the kernel
space CPU time in
.I io_uring
is also a syscall:
.IR io_uring_enter (2).
This still differs from requests made directly via their respective syscall,
such as
.IR read (2),
in that it allows for batching in a more flexible way than is possible via
e.g.
.IR readv (2),
as different request types can be freely mixed and matched, and chains of
dependent requests, such as a
.IR send (2)
followed by a
.IR recv (2),
can be submitted with one syscall. Furthermore it is possible to both process
requests for submission and process arrived completions within the same
.IR io_uring_enter (2)
call. Applications can set the flag
.I IORING_ENTER_GETEVENTS
to, in addition to processing any pending submissions, process any arrived
completions and optionally wait until a specified number of completions has
arrived before returning.

If polled I/O is used, all completion-related work is performed during the
.IR io_uring_enter (2)
call. For interrupt-driven I/O, the CPU receiving the hardware interrupt
schedules the remaining work to be performed, including posting the CQE, via
task work. Any outstanding task work is executed during any transition
between user and kernel space. Per default, the CPU that received the
hardware interrupt will, after scheduling the task work, interrupt a user
space process via an inter-processor interrupt (IPI), which causes it to
enter the kernel and thus perform the scheduled work. While this ensures
timely delivery of the CQE, it is a relatively disruptive and high-overhead
operation. To avoid this, applications can configure
.I io_uring
via
.I IORING_SETUP_COOP_TASKRUN
to elide the IPI. Applications must then ensure that they perform some
syscall every so often to be able to observe new completions, but benefit
from eliding the overhead of the IPIs. Additionally,
.I io_uring
can be configured to inform an application that it should now perform a
syscall to reap new completions by setting
.IR IORING_SETUP_TASKRUN_FLAG .
This will result in
.I io_uring
setting
.I IORING_SQ_TASKRUN
in the SQ flags once the application should do so. This mechanism can be
restricted further via
.IR IORING_SETUP_DEFER_TASKRUN ,
which results in the task work only being executed when
.IR io_uring_enter (2)
is called with
.I IORING_ENTER_GETEVENTS
set, rather than at any context switch. This gives the application more
agency over when the work is executed, thus enabling e.g. more opportunities
for batching.

.PP
.SH Submission Queue Polling
.PP
SQ polling, configured via
.IR IORING_SETUP_SQPOLL ,
introduces a dedicated kernel thread that performs essentially all
submission- and completion-related tasks, from fetching SQEs from the SQ,
submitting requests and polling requests (if configured for I/O polling) to
posting CQEs. Notably, async punted requests are still processed by the IO
WQ, so as not to hinder the progress of other requests. If the SQ thread
does not have any work to do for a user-supplied timeout, it goes to sleep.
SQ polling removes the need for any syscall during operation, besides waking
up the SQ thread after long periods of inactivity, and thus reduces
per-request overhead at the cost of a high constant upkeep cost.

.PP
.SH IO Work Queue
.PP
The IO WQ is a kernel thread pool used to execute any requests that cannot
be submitted in a non-blocking way to the underlying subsystem, due to
missing support in said subsystem. After either the SQ poll thread or a user
space thread calling
.IR io_uring_enter (2)
fails the initial attempt to submit the request without blocking, it passes
the request on to an IO WQ thread, which then performs the blocking
submission. While this mechanism ensures that
.IR io_uring ,
unlike e.g. AIO, never blocks on any of the submission paths, it is, as the
name of this mechanism, the async punt, suggests, not ideal. The blocking
nature of the submission, the passing of the request to another thread, as
well as the scheduling of the IO WQ threads are all overheads that are
ideally avoided. Significant IO WQ activity can thus be seen as an indicator
that something is very likely going wrong. Similarly, the flag
.I IOSQE_ASYNC
should only be used if the user knows that a request will always, or is very
likely to, async punt, and not to ensure that the submission will not block,
as
.I io_uring
guarantees to never block in any case.

.PP
.SH Kernel Thread Management
.PP
Each user space process utilizing
.I io_uring
possesses an
.I io_uring
context, which manages all
.I io_uring
instances created within said process via
.IR io_uring_setup (2).
Per default, both the SQ poll thread and the IO WQ thread pool are dedicated
to each
.I io_uring
instance; they are thus not shared within a process and are never shared
between different processes. However, sharing them between two or more
instances can be arranged during setup via
.IR IORING_SETUP_ATTACH_WQ .
The threads of the IO WQ are created lazily in response to requests being
async punted and fall into two accounts: the bounded account, responsible
for requests with a generally bounded execution time, such as block I/O, and
the unbounded account, for requests with unbounded execution time, such as
e.g. recv operations.
The maximum thread count of each account is per default 2 * NPROC and can be
adjusted via
.IR IORING_REGISTER_IOWQ_MAX_WORKERS .
Their CPU affinity can be adjusted via
.IR IORING_REGISTER_IOWQ_AFF .

.SH SEE ALSO
.BR io_uring (7),
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)
