|
| 1 | +.TH io_uring_internals 7 2024-10-5 "Linux" "Linux Programmer's Manual" |
| 2 | +.SH NAME |
| 3 | +io_uring_internals |
| 4 | +.SH SYNOPSIS |
| 5 | +.nf |
| 6 | +.B "#include <linux/io_uring.h>" |
| 7 | +.fi |
| 8 | +.PP |
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific, asynchronous API that allows the submission of requests
to the kernel that are otherwise typically performed via individual syscalls.
Requests are passed to the kernel via a shared ring buffer, the
.I Submission Queue
(SQ), and completion notifications are passed back to the application via a
second shared ring buffer, the
.I Completion Queue
(CQ). An important detail is that after a request has been submitted to the
kernel, some CPU time has to be spent in kernel space to perform the required
submission- and completion-related tasks.
The mechanism used to provide this CPU time, as well as which process
provides it and when, differs in
.I io_uring
from the traditional API provided by regular syscalls.

.PP
.SH Traditional Syscall Driven I/O
.PP
For regular syscalls, the CPU time for these tasks is provided directly by
the process issuing the syscall, with the submission-side tasks in kernel
space being executed directly after the context switch. In the case of
polled I/O, the CPU time for the completion-related tasks is subsequently
provided directly as well. In the case of interrupt-driven I/O, the CPU
time is provided, depending on the driver in question, either via the
traditional top- and bottom-half IRQ approach or via threaded IRQ handling.
The CPU time for completion tasks is thus in this case provided by the CPU
on which the hardware interrupt arrives, as well as, if threaded IRQ
handling is used, by the CPU to which the dedicated kernel worker thread
for the threaded IRQ handling is scheduled.

.PP
.SH The Submission Side Work
.PP
The tasks required in kernel space on the submission side mostly consist of
checking the SQ for newly arrived SQEs, parsing them and checking them for
validity and permissions, and then passing them on to the responsible
subsystem, such as a block device driver. An important note here is that
.I io_uring
guarantees that submitting a request to the responsible subsystem, and thus
the
.IR io_uring_enter (2)
syscall made to submit new requests, will never block. However,
.I io_uring
relies on the capabilities of the responsible subsystem to perform the
submission without blocking:
.I io_uring
will first attempt to submit the request without blocking.
If this fails, e.g. because the respective subsystem does not support
non-blocking submissions,
.I io_uring
will
.I async punt
the request, i.e. off-load it to the
.I IO work queue
(IO WQ) (see description below).

.PP
.SH The Completion Side Work
.PP
The tasks required in kernel space on the completion side mostly come in
the form of various request-type-dependent tasks, such as copying buffers,
parsing packet headers etc., as well as posting a CQE to the CQ to inform
the application of the completion of the request.

.PP
.SH Who does the work
.PP
One of the primary motivations behind
.I io_uring
was to reduce, or entirely avoid, the overhead of the syscalls used to
provide the required CPU time in kernel space. The mechanism that
.I io_uring
utilizes to achieve this differs depending on the configuration, with
different trade-offs between the configurations with respect to e.g. CPU
efficiency and latency.

With the default configuration, the primary mechanism to provide the kernel
space CPU time in
.I io_uring
is also a syscall:
.IR io_uring_enter (2).
This still differs from making requests via their respective syscalls
directly, such as
.IR read (2),
in that it allows for batching in a more flexible way than is possible via
e.g.
.IR readv (2):
different request types can be freely mixed and matched, and chains of
dependent requests, such as a
.IR send (2)
followed by a
.IR recv (2),
can be submitted with one syscall. Furthermore, it is possible to both
process new submissions and process arrived completions within the same
.IR io_uring_enter (2)
call. By setting the flag
.IR IORING_ENTER_GETEVENTS ,
applications can, in addition to submitting any pending requests, process
any arrived completions and optionally wait until a specified number of
completions have arrived before returning.
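.PP
The following is a minimal sketch of this batching, assuming an already
set up ring: the two SQE slots are taken to have been obtained from the
mmap'd SQE array, the SQ tail update is omitted, and
.IR sockfd ,
the buffers and the lengths are hypothetical. It links a send to a
subsequent receive and submits and waits for both with a single syscall.
.PP
.EX
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Prepare a send whose completion gates the subsequent recv. */
static void
prep_send_then_recv(struct io_uring_sqe *s, struct io_uring_sqe *r,
                    int sockfd, void *out, void *in, unsigned len)
{
    memset(s, 0, sizeof(*s));
    s->opcode = IORING_OP_SEND;
    s->fd = sockfd;
    s->addr = (unsigned long) out;
    s->len = len;
    s->flags = IOSQE_IO_LINK;  /* recv starts only after the send completes */

    memset(r, 0, sizeof(*r));
    r->opcode = IORING_OP_RECV;
    r->fd = sockfd;
    r->addr = (unsigned long) in;
    r->len = len;
}

/* Submit both requests and wait for both completions in one transition
   into the kernel; io_uring_enter(2) has no glibc wrapper, hence the
   raw syscall. */
static int
submit_and_wait_both(int ring_fd)
{
    return syscall(__NR_io_uring_enter, ring_fd, 2, 2,
                   IORING_ENTER_GETEVENTS, NULL, 0);
}
.EE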

If polled I/O is used, all completion-related work is performed during the
.IR io_uring_enter (2)
call. For interrupt-driven I/O, the CPU receiving the hardware interrupt
schedules the remaining work, including posting the CQE, to be performed
via task work. Any outstanding task work is performed during any
user-to-kernel-space transition. By default, the CPU that received the
hardware interrupt will, after scheduling the task work, interrupt a user
space process via an inter-processor interrupt (IPI), which causes it to
enter the kernel and thus perform the scheduled work. While this ensures
timely delivery of the CQE, it is a relatively disruptive and high-overhead
operation. To avoid this, applications can configure
.I io_uring
via
.I IORING_SETUP_COOP_TASKRUN
to elide the IPI. Applications must then ensure that they perform a
syscall every so often to be able to observe new completions, but benefit
from eliding the overhead of the IPIs. Additionally,
.I io_uring
can be configured to inform an application that it should now perform a
syscall to reap new completions by setting
.IR IORING_SETUP_TASKRUN_FLAG .
This results in
.I io_uring
setting
.I IORING_SQ_TASKRUN
in the SQ ring flags once the application should do so. This mechanism can
be restricted further via
.IR IORING_SETUP_DEFER_TASKRUN ,
which results in the task work only being executed when
.IR io_uring_enter (2)
is called with
.I IORING_ENTER_GETEVENTS
set, rather than at any context switch. This gives the application more
control over when the work is executed, thus enabling e.g. more
opportunities for batching.
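.PP
A minimal sketch of this cooperative mode follows, assuming
.I sq_flags
points at the flags word inside the mmap'd SQ ring (located at offset
.IR p.sq_off.flags ;
the mmap itself is omitted) and a hypothetical queue depth of 128:
.PP
.EX
#include <linux/io_uring.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create a ring that elides completion IPIs and advertises pending
   task work via the SQ ring flags. */
static int
setup_coop_ring(struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));
    p->flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
    return syscall(__NR_io_uring_setup, 128, p);
}

/* Called periodically on the submitting thread: if the kernel has
   queued task work, enter the kernel once so that it is run and the
   pending CQEs are posted. */
static void
reap_if_needed(int ring_fd, _Atomic unsigned *sq_flags)
{
    if (atomic_load_explicit(sq_flags, memory_order_relaxed)
        & IORING_SQ_TASKRUN)
        syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                IORING_ENTER_GETEVENTS, NULL, 0);
}
.EE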

.PP
.SH Submission Queue Polling
.PP
SQ polling introduces a dedicated kernel thread that performs essentially
all submission- and completion-related tasks: fetching SQEs from the SQ,
submitting requests, polling requests if configured for I/O polling, and
posting CQEs. Notably, async punted requests are still processed by the
IO WQ, so as not to hinder the progress of other requests. If the SQ
thread does not have any work to do for a user-supplied timeout, it goes
to sleep. SQ polling removes the need for any syscall during operation,
besides waking up the SQ thread after long periods of inactivity, and thus
reduces the per-request overhead at the cost of a high constant upkeep
cost.
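.PP
A minimal sketch, with an arbitrarily chosen queue depth and idle timeout;
.I sq_flags
is again assumed to point at the flags word in the mmap'd SQ ring:
.PP
.EX
#include <linux/io_uring.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create a ring with a dedicated SQ poll thread that goes to sleep
   after two seconds of inactivity. */
static int
setup_sqpoll_ring(struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));
    p->flags = IORING_SETUP_SQPOLL;
    p->sq_thread_idle = 2000;    /* milliseconds */
    return syscall(__NR_io_uring_setup, 128, p);
}

/* After writing new SQEs: no syscall is needed unless the poll
   thread has gone to sleep in the meantime. */
static void
wake_sq_thread_if_needed(int ring_fd, _Atomic unsigned *sq_flags)
{
    if (atomic_load_explicit(sq_flags, memory_order_relaxed)
        & IORING_SQ_NEED_WAKEUP)
        syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                IORING_ENTER_SQ_WAKEUP, NULL, 0);
}
.EE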

.PP
.SH IO Work Queue
.PP
The IO WQ is a kernel thread pool used to execute any requests that cannot
be submitted in a non-blocking way to the underlying subsystem, due to
missing support in said subsystem. After either the SQ poll thread or a
user space thread calling
.IR io_uring_enter (2)
fails the initial attempt to submit the request without blocking, it passes
the request on to an IO WQ thread, which then performs the blocking
submission. While this mechanism ensures that
.IR io_uring ,
unlike e.g. AIO, never blocks on any of the submission paths, it is, as the
name of this mechanism, the async punt, suggests, not ideal. The blocking
nature of the submission, the passing of the request to another thread, as
well as the scheduling of the IO WQ threads are all overheads that are
ideally avoided. Significant IO WQ activity can thus be seen as an
indicator that something is very likely going wrong. Similarly, the flag
.I IOSQE_ASYNC
should only be used if the user knows that a request will always, or is
very likely to, async punt, and not to ensure that the submission will not
block, as
.I io_uring
guarantees to never block in any case.
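.PP
A minimal sketch: for a request known to (almost) always punt, the doomed
non-blocking attempt can be skipped up front:
.PP
.EX
#include <linux/io_uring.h>

/* Ask io_uring to hand this request to the IO WQ right away instead
   of first attempting a non-blocking inline submission. */
static void
force_async(struct io_uring_sqe *sqe)
{
    sqe->flags |= IOSQE_ASYNC;
}
.EE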

.PP
.SH Kernel Thread Management
.PP
Each user space process utilizing
.I io_uring
possesses an
.I io_uring
context, which manages all
.I io_uring
instances created within said process via
.IR io_uring_setup (2).
By default, both the SQ poll thread and the IO WQ thread pool are dedicated
to each
.I io_uring
instance; they are thus not shared within a process and are never shared
between different processes. However, sharing them between two or more
instances can be arranged during setup via
.IR IORING_SETUP_ATTACH_WQ .
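.PP
A minimal sketch of attaching to the backend of an existing instance,
assuming
.I existing_fd
is the file descriptor of a previously created ring and a hypothetical
queue depth of 128:
.PP
.EX
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The new ring shares the SQ poll thread and IO WQ of existing_fd
   instead of creating its own. */
static int
setup_attached_ring(int existing_fd, struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));
    p->flags = IORING_SETUP_ATTACH_WQ;
    p->wq_fd = existing_fd;
    return syscall(__NR_io_uring_setup, 128, p);
}
.EE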
.PP
The threads of the IO WQ are created lazily in response to requests being
async punted and fall into two accounts: the bounded account, responsible
for requests with a generally bounded execution time, such as block I/O,
and the unbounded account, for requests with unbounded execution time,
such as e.g. recv operations.
The maximum thread count of each account is by default 2 * NPROC and can
be adjusted via
.IR IORING_REGISTER_IOWQ_MAX_WORKERS .
Their CPU affinity can be adjusted via
.IR IORING_REGISTER_IOWQ_AFF .
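.PP
A minimal sketch of capping both accounts, with four threads each as an
arbitrary example limit; to the author's understanding an entry of 0
leaves that account's limit unchanged, and the previous limits are written
back into the array:
.PP
.EX
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

/* vals[0] caps the bounded account, vals[1] the unbounded one. */
static int
cap_iowq_workers(int ring_fd)
{
    unsigned int vals[2] = { 4, 4 };

    return syscall(__NR_io_uring_register, ring_fd,
                   IORING_REGISTER_IOWQ_MAX_WORKERS, vals, 2);
}
.EE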

.SH SEE ALSO
.BR io_uring (7),
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)