---
layout: post
title: "Introducing the TCP-in-UDP solution"
---

The MPTCP protocol is complex, mainly to be able to survive on the Internet
where
[middleboxes](https://datatracker.ietf.org/doc/html/rfc8684#name-interactions-with-middlebox)
such as NATs, firewalls, IDSes or proxies can modify parts of the TCP packets.
In the worst case, an MPTCP connection has to fall back to "plain" TCP. Today,
such fallbacks are rarer than before -- probably because MPTCP has been used
since 2013 on millions of Apple smartphones worldwide -- but they can still
happen, e.g. on some mobile networks using Performance Enhancing Proxies (PEPs)
that do not bypass MPTCP connections. In such cases, a solution to continue
benefiting from MPTCP is to tunnel the MPTCP connections. Different solutions
exist, but they usually add extra layers, and require setting up a virtual
private network (VPN) with private IP addresses between the client and the
server.
Here, a simpler solution is presented:
[TCP-in-UDP](https://github.com/multipath-tcp/tcp-in-udp). This solution relies
on [eBPF](https://ebpf.io/), doesn't add extra data per packet, and doesn't
require a virtual private network. Read on to find out more about it!

<!--more-->

--------------------------------------------------------------------------------

> First, if the network you use blocks TCP extensions like MPTCP or other
> protocols, the best thing to do is to contact your network operator: maybe
> they are simply not aware of this issue, and can easily fix it.
## TCP-in-UDP

Many tunnel solutions exist, but they target other use-cases: getting access to
private networks, possibly with encryption -- with solutions like OpenVPN,
IPsec, WireGuard®, etc. -- or adding extra info to each packet for routing
purposes -- like GRE, GENEVE, etc. The Linux kernel
[supports](https://developers.redhat.com/blog/2019/05/17/an-introduction-to-linux-virtual-interfaces-tunnels)
many of these tunnels. In our case, the goal is not to get access to private
networks nor to add an extra layer of encryption, but to make sure packets are
not modified by the network.

For our use-case, it is then enough to "convert the TCP packets to UDP". This
is what [TCP-in-UDP](https://github.com/multipath-tcp/tcp-in-udp) does. This
idea is not new: it is inspired by an old [IETF
draft](https://datatracker.ietf.org/doc/html/draft-cheshire-tcp-over-udp-00.html).
In short, items from the TCP header are re-ordered to start with items from the
UDP header.

### TCP to UDP header

To better understand the translation, let's see what the different headers look
like:

- [UDP](https://www.ietf.org/rfc/rfc768.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

- [TCP](https://www.ietf.org/rfc/rfc9293.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data  |       |C|E|U|A|P|R|S|F|                               |
| Offset| Reser |W|C|R|C|S|S|Y|I|            Window             |
|       |       |R|E|G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |        Urgent Pointer         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      (Optional) Options                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

- [TCP-in-UDP](https://datatracker.ietf.org/doc/html/draft-cheshire-tcp-over-udp-00.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data  |       |C|E| |A|P|R|S|F|                               |
| Offset| Reser |W|C|0|C|S|S|Y|I|            Window             |
|       |       |R|E| |K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      (Optional) Options                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

As described
[here](https://perso.uclouvain.be/olivier.bonaventure/blog/html/2013/07/04/tcp_over_udp.html),
the first eight bytes of the TCP-in-UDP header correspond to the classical UDP
header. Then come the Data Offset, the flags and the Window field. Placing the
Data Offset right after the Checksum ensures that a value larger than `0x5`
will appear there, which is required for STUN traversal. The sequence and
acknowledgment numbers follow. With this translation, the TCP header has been
reordered so that it starts with a valid UDP header, without modifying the
packet length. The informed reader will have noticed that the `URG` flag and
the `Urgent Pointer` have disappeared: this field is rarely used, some
middleboxes reset it, and losing it is not a huge loss for most TCP
applications.

In other words, apart from a different order, the only two modifications are:

- the layer 4 protocol indicated in layer 3 (IPv4/IPv6)
- the switch from `Urgent Pointer` to `Length` (and the opposite)

These two modifications will of course affect the Checksum field, which will
need to be updated accordingly.

## Dealing with network stack optimisations

On paper, the required modifications -- the protocol number, one 16-bit word,
and the checksum -- are small, and should be easy to do using eBPF with TC
ingress and egress hooks. But doing that in a highly optimised stack is more
complex than expected.

### Accessing all required data

On Linux, all per-packet data are stored in a socket buffer, or
"[SKB](http://vger.kernel.org/~davem/skb.html)". In our case, the eBPF code
needs to access the packet headers, which should be available between
`skb->data` and `skb->data_end`. However, `skb->data_end` might not point to
the end of the packet: it typically points to the end of the packet headers.
This is an optimisation: the kernel often performs operations that depend only
on the packet headers, and doesn't really care about the content of the data,
which is usually destined for userspace, or to be forwarded to another network
interface.

In our case, in egress -- translation from TCP to UDP -- it is fine: the whole
TCP header is available, and that's where the modifications need to be done. In
ingress -- translation from UDP to TCP -- that's different: some network
drivers will only linearise data up to the end of the layer 4 protocol, so only
the 8 bytes of the UDP header here. This is not enough to do the translation,
which requires access to 12 more bytes. This issue is easy to fix: eBPF helpers
were introduced a long time ago to pull in non-linear data, e.g.
[`bpf_skb_pull_data`](https://docs.ebpf.io/linux/helper-function/bpf_skb_pull_data/)
or
[`bpf_skb_load_bytes`](https://docs.ebpf.io/linux/helper-function/bpf_skb_load_bytes/).
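
To illustrate the fix, here is a minimal TC ingress sketch in eBPF C
(simplified to IPv4 without VLAN; the program and constant names are made up,
this is not the actual tcp-in-udp code):

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch: make sure enough linear data is available in ingress. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Ethernet (14) + IPv4 without options (20) + translated header (20) */
#define HDR_NEEDED (14 + 20 + 20)

SEC("tc")
int ingress_udp_to_tcp(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;

	if (data + HDR_NEEDED > data_end) {
		/* Some drivers only linearise up to the end of the UDP
		 * header (8 bytes): pull the missing bytes in. */
		if (bpf_skb_pull_data(skb, HDR_NEEDED) < 0)
			return TC_ACT_OK; /* packet too short: ignore */

		/* The pull may have moved the data: reload the pointers. */
		data = (void *)(long)skb->data;
		data_end = (void *)(long)skb->data_end;
		if (data + HDR_NEEDED > data_end)
			return TC_ACT_OK;
	}

	/* ... the UDP-to-TCP translation itself would go here ... */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```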

### GRO & TSO/GSO

On the Internet, packets are usually limited to 1500 bytes or fewer. Each
packet still needs to carry some headers to indicate the source and
destination, but also per-packet information like the data sequence number.
Dealing with such "small" packets has a per-packet cost, which becomes very
high at very high throughput. To counter that, the Linux networking stack
prefers to deal with bigger chunks of data, with "internal" packets of tens of
kilobytes, and splits them into smaller packets with very similar headers
later on. Some network devices can even do this segmentation or aggregation
work in hardware. That's what GRO (Generic Receive Offload), and TSO (TCP
Segmentation Offload) / GSO (Generic Segmentation Offload) are for.

With TCP-in-UDP, it is required to act on a per-packet basis: each TCP packet
will be translated to a UDP packet containing the UDP header (8 bytes), the
rest of the TCP one (12 bytes + the TCP options), then the TCP payload. In
other words, each UDP payload will contain a part of the TCP header: data that
is specific to each packet. It means that the traditional GRO and TSO cannot
be used, because the data can no longer "simply" be merged with the next one
like before.

Informed readers will then say that these network device features can easily
be disabled using `ethtool`, e.g.

```
ethtool -K "${IFACE}" gro off gso off tso off
```

Correct, but even if all hardware offload accelerations are disabled, in
egress, the Linux networking stack still prefers to deal with bigger packets
internally, and to do the segmentation in software at the end. Because it is
not easily possible to modify how this segmentation is done with eBPF, it is
required to tell the stack not to apply this optimisation, e.g. with:

```
ip link set "${IFACE}" gso_max_segs 1
```

### Checksum

> The following was certainly the most frustrating issue to deal with!

Thanks to how the checksum is
[computed](https://datatracker.ietf.org/doc/html/rfc1071), moving 16-bit words
(or larger blocks) around doesn't change the checksum. Still, some fields need
to be updated:
| 208 | + |
| 209 | +- The layer 4 protocol, set in layer 3 (IPv4/IPv6) here, also used to compute |
| 210 | + the next layer (UDP/TCP) checksum. |
| 211 | +- The switch from the TCP `Urgent Pointer` (`0`) to the UDP `Length` (and the |
| 212 | + opposite). |
| 213 | + |
| 214 | +It is not required to recompute the full checksum. Instead, this can be done |
| 215 | +[incrementally](https://datatracker.ietf.org/doc/html/rfc1141), and some eBPF |
| 216 | +helpers can do that for us, e.g. |
| 217 | +[`bpf_l3_csum_replace`](https://docs.ebpf.io/linux/helper-function/bpf_l3_csum_replace/) |
| 218 | +and |
| 219 | +[`bpf_l4_csum_replace`](https://docs.ebpf.io/linux/helper-function/bpf_l4_csum_replace/). |
| 220 | + |
| 221 | +When testing with Network namespaces (`netns`) with one host dedicated to the |
| 222 | +translation when forwarding packets, everything was fine: the correct checksum |
| 223 | +was visible in each packet. But when testing with real hardware, with TCP-in-UDP |
| 224 | +eBPF hooks directly on the client and server, that was different: the checksum |
| 225 | +in egress was incorrect on most network interfaces, even when the transmission |
| 226 | +checksum offload (`tx`) was disabled on the network interface. |
| 227 | + |
| 228 | +After quite a bit of investigation, it appears that both the layer 3 and 4 |
| 229 | +checksums were correctly updated by the eBPF hook, but either the NIC or the |
| 230 | +networking stack was modifying the layer 4 checksum at the wrong place. This |
| 231 | +deserves some explanation. |
| 232 | + |
| 233 | +In egress, the Linux TCP networking stack of the sender will typically set |
| 234 | +`skb->ip_summed` to `CHECKSUM_PARTIAL`. In short, it means the TCP/IP stack will |
| 235 | +compute a part of the checksum, only the one covering the |
| 236 | +[pseudo-header](https://www.ietf.org/rfc/rfc9293.html#section-3.1-6.18.1): IP |
| 237 | +addresses, protocol number and length. The rest will be computed later on, |
| 238 | +ideally by the networking device. At that last stage, the device only needs to |
| 239 | +know where the layer 4 starts in the packet, but also where the checksum field |
| 240 | +is from the start of this layer 4. This info is internally registered in |
| 241 | +`skb->csum_offset`, and it is different for TCP and UDP because the checksum |
| 242 | +field is not at the same place in their headers. |
| 243 | + |
| 244 | +When switching from UDP to TCP, it is then not enough to change the protocol |
| 245 | +number in the layer 3, this internal checksum offset value also needs to be |
| 246 | +updated. If I'm not mistaken, today, it is not possible to update it directly |
| 247 | +with eBPF. A proper solution is certainly to add a new eBPF helper, but that |
| 248 | +would only work with newer kernels, or eventually with a custom module. Instead, |
| 249 | +a workaround has been found: chain the eBPF TC egress hook with a TC `ACT_CSUM` |
| 250 | +action when the packet is translated from TCP to UDP. This [`csum` |
| 251 | +action](https://www.man7.org/linux/man-pages/man8/tc-csum.8.html) triggers a |
| 252 | +software checksum recalculation of the specified packet headers. In other words |
| 253 | +and in our case, it is used to compute the rest of the checksum for a given |
| 254 | +protocol (UDP), and mark the checksum as computed (`CHECKSUM_NONE`). This last |
| 255 | +step is important, because even if it is possible to compute the full checksum |
| 256 | +with eBPF code like we did at some point, it is wrong to do so if we cannot |
| 257 | +change the `CHECKSUM_PARTIAL` flag which expect a later stage to update a |
| 258 | +checksum at a (wrong) offset with the rest of the data. |

So with a combination of both TC `ACT_CSUM` and eBPF, it is possible to get
the right checksum after having modified the layer 4 protocol.
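
As a rough sketch of such a chain (the object file and section names below are
assumptions based on the repository layout, not verified commands), this could
look like:

```
# Illustrative only: attach the egress eBPF program on the TC hook and
# chain it with a csum action recomputing the UDP checksum in software.
tc qdisc add dev "${IFACE}" clsact
tc filter add dev "${IFACE}" egress bpf obj tcp_in_udp_tc.o sec tc \
    action csum udp
```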

### MTU/MSS

This last point is not linked to the highly optimised Linux network stack,
but to the fact that, on the wire, the packets will be UDP and not TCP. It
means that some operations, like the dynamic adaptation of the MSS (TCP
Maximum Segment Size) -- aka MSS clamping -- will have no effect here. Many
mobile networks use encapsulation without jumbo frames, meaning that the
maximum packet size is lower than 1500 bytes. For performance reasons, and to
avoid having to deal with this, it is important to avoid IP fragmentation. In
other words, it might be required to adapt the interface Maximum Transmission
Unit (MTU), or the [MTU](https://man.archlinux.org/man/ip-route.8.en#mtu) /
[MSS](https://man.archlinux.org/man/ip-route.8.en#advmss) per destination.
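
For example, lowering the MTU and the advertised MSS towards one destination
could look like this (the addresses come from the documentation ranges and the
values are placeholders: adapt them to the actual path MTU of your network):

```
# Placeholder values: clamp the MTU/MSS on the route towards one server.
ip route replace 192.0.2.0/24 via 198.51.100.1 mtu 1400 advmss 1360
```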

## Conclusion

In conclusion, this new eBPF program can easily be deployed on both the client
and the server sides to circumvent middleboxes that are still blocking MPTCP
or other protocols. All you might still need to do is modify the destination
port, which is [currently
hardcoded](https://github.com/multipath-tcp/tcp-in-udp/blob/cde92b1cf8588f7cd3932b204cd51e0596a07ade/tcp_in_udp_tc.c#L29).

## Acknowledgments

Thanks to [Xpedite Technologies](https://xpedite-tech.com) for having supported
this work, and in particular [Chester](https://github.com/arinc9) for his help
investigating the checksum issues with real hardware. Also thanks to Nickz from
the eBPF.io community for his support while working on these checksum issues.