---
layout: post
title: "Introducing the TCP-in-UDP solution"
---

The MPTCP protocol is complex, mainly to be able to survive on the Internet
where
[middleboxes](https://datatracker.ietf.org/doc/html/rfc8684#name-interactions-with-middlebox)
such as NATs, firewalls, IDSes or proxies can modify parts of the TCP packets.
In the worst case, an MPTCP connection has to fall back to "plain" TCP. Today,
such fallbacks are rarer than before -- probably because MPTCP has been used
since 2013 on millions of Apple smartphones worldwide -- but they can still
happen, e.g. on some mobile networks using Performance Enhancing Proxies (PEPs)
that do not bypass MPTCP connections. In such cases, a solution to continue
benefiting from MPTCP is to tunnel the MPTCP connections. Different solutions
exist, but they usually add extra layers, and require setting up a virtual
private network (VPN) with private IP addresses between the client and the
server.
Here, a simpler solution is presented:
[TCP-in-UDP](https://github.com/multipath-tcp/tcp-in-udp). This solution relies
on [eBPF](https://ebpf.io/), doesn't add extra data per packet, and doesn't
require a virtual private network. Read on to find out more about it!

<!--more-->

--------------------------------------------------------------------------------

> First, if the network you use blocks TCP extensions like MPTCP or other
> protocols, the best thing to do is to contact your network operator: maybe
> they are simply not aware of this issue, and can easily fix it.
## TCP-in-UDP

Many tunnel solutions exist, but they target other use-cases: getting access to
private networks, possibly with encryption -- with solutions like OpenVPN,
IPsec, WireGuard®, etc. -- or adding extra info to each packet for routing
purposes -- like GRE, GENEVE, etc. The Linux kernel
[supports](https://developers.redhat.com/blog/2019/05/17/an-introduction-to-linux-virtual-interfaces-tunnels)
many of these tunnels. In our case, the goal is not to get access to private
networks nor to add an extra layer of encryption, but to make sure packets are
not modified by the network.

For our use-case, it is then enough to "convert the TCP packets to UDP". This
is what [TCP-in-UDP](https://github.com/multipath-tcp/tcp-in-udp) does. This
idea is not new: it is inspired by an old [IETF
draft](https://datatracker.ietf.org/doc/html/draft-cheshire-tcp-over-udp-00.html).
In short, items from the TCP header are re-ordered to start with items from the
UDP header.

### TCP to UDP header

To better understand the translation, let's see what the different headers look
like:

- [UDP](https://www.ietf.org/rfc/rfc768.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

- [TCP](https://www.ietf.org/rfc/rfc9293.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data  |       |C|E|U|A|P|R|S|F|                               |
| Offset| Reser |W|C|R|C|S|S|Y|I|            Window             |
|       |       |R|E|G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |        Urgent Pointer         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      (Optional) Options                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

- [TCP-in-UDP](https://datatracker.ietf.org/doc/html/draft-cheshire-tcp-over-udp-00.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data  |       |C|E| |A|P|R|S|F|                               |
| Offset| Reser |W|C|0|C|S|S|Y|I|            Window             |
|       |       |R|E| |K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      (Optional) Options                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

As described
[here](https://perso.uclouvain.be/olivier.bonaventure/blog/html/2013/07/04/tcp_over_udp.html),
the first eight bytes of the TCP-in-UDP header correspond to the classical UDP
header. Then come the Data Offset, the flags and the Window field. Placing the
Data Offset right after the Checksum ensures that a value larger than `0x5`
will appear there, which is required for STUN traversal. The sequence and
acknowledgment numbers follow. With this translation, the TCP header has been
reordered so that it starts with a valid UDP header, without modifying the
packet length. The informed reader will have noticed that the `URG` flag and
the `Urgent Pointer` have disappeared: this field is rarely used, some
middleboxes reset it, and losing it is not a huge loss for most TCP
applications.

In other words, apart from a different order, the only two modifications are:

- the layer 4 protocol indicated in layer 3 (IPv4/IPv6)
- the switch from `Urgent Pointer` to `Length` (and the opposite)

These two modifications will of course affect the Checksum field, which will
need to be updated accordingly.

## Dealing with network stack optimisations

On paper, the required modifications -- the protocol number, one 16-bit word,
and the checksum -- are small, and should be easy to do using eBPF with TC
ingress and egress hooks. But doing that in a highly optimised stack is more
complex than expected.

### Accessing all required data

On Linux, all per-packet data are stored in a socket buffer, or
"[SKB](http://vger.kernel.org/~davem/skb.html)". In our case, the eBPF code
needs to access the packet headers, which should be available between
`skb->data` and `skb->data_end`. However, `skb->data_end` might not point to
the end of the packet: it typically points to the end of the packet headers.
This is an optimisation: the kernel often performs operations that depend only
on the packet headers, and doesn't really care about the content of the data,
which is usually destined for userspace, or to be forwarded to another network
interface.

In our case, in egress -- translation from TCP to UDP -- it is fine: the whole
TCP header is available, and that's where the modifications need to be done. In
ingress -- translation from UDP to TCP -- that's different: some network
drivers will only linearise data up to the end of the layer 4 protocol, so only
the 8 bytes of the UDP header here. This is not enough to do the translation,
which requires access to 12 more bytes. This issue is easy to fix: eBPF helpers
were introduced a long time ago to pull in non-linear data, e.g.
[`bpf_skb_pull_data`](https://docs.ebpf.io/linux/helper-function/bpf_skb_pull_data/)
or
[`bpf_skb_load_bytes`](https://docs.ebpf.io/linux/helper-function/bpf_skb_load_bytes/).
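
To illustrate the fix, here is a minimal TC ingress sketch in eBPF C
(simplified to IPv4 without VLAN; the program and constant names are made up,
this is not the actual tcp-in-udp code):

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch: make sure enough linear data is available in ingress. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Ethernet (14) + IPv4 without options (20) + translated header (20) */
#define HDR_NEEDED (14 + 20 + 20)

SEC("tc")
int ingress_udp_to_tcp(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;

	if (data + HDR_NEEDED > data_end) {
		/* Some drivers only linearise up to the end of the UDP
		 * header (8 bytes): pull the missing bytes in. */
		if (bpf_skb_pull_data(skb, HDR_NEEDED) < 0)
			return TC_ACT_OK; /* packet too short: ignore */

		/* The pull may have moved the data: reload the pointers. */
		data = (void *)(long)skb->data;
		data_end = (void *)(long)skb->data_end;
		if (data + HDR_NEEDED > data_end)
			return TC_ACT_OK;
	}

	/* ... the UDP-to-TCP translation itself would go here ... */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```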

### GRO & TSO/GSO

On the Internet, packets are usually limited to 1500 bytes or fewer. Each
packet still needs to carry some headers to indicate the source and
destination, but also per-packet information like the data sequence number.
Dealing with such "small" packets has a per-packet cost, which becomes very
high at very high throughput. To counter that, the Linux networking stack
prefers to deal with bigger chunks of data, with "internal" packets of tens of
kilobytes, and splits them into smaller packets with very similar headers
later on. Some network devices can even do this segmentation or aggregation
work in hardware. That's what GRO (Generic Receive Offload), and TSO (TCP
Segmentation Offload) / GSO (Generic Segmentation Offload) are for.

With TCP-in-UDP, it is required to act on a per-packet basis: each TCP packet
will be translated to a UDP packet containing the UDP header (8 bytes), the
rest of the TCP one (12 bytes + the TCP options), then the TCP payload. In
other words, each UDP payload will contain a part of the TCP header: data that
is specific to each packet. It means that the traditional GRO and TSO cannot
be used, because the data can no longer "simply" be merged with the next one
like before.

Informed readers will then say that these network device features can easily
be disabled using `ethtool`, e.g.

```
ethtool -K "${IFACE}" gro off gso off tso off
```

Correct, but even if all hardware offload accelerations are disabled, in
egress, the Linux networking stack still prefers to deal with bigger packets
internally, and to do the segmentation in software at the end. Because it is
not easily possible to modify how this segmentation is done with eBPF, it is
required to tell the stack not to apply this optimisation, e.g. with:

```
ip link set "${IFACE}" gso_max_segs 1
```

### Checksum

> The following was certainly the most frustrating issue to deal with!

Thanks to how the checksum is
[computed](https://datatracker.ietf.org/doc/html/rfc1071), moving 16-bit words
(or larger blocks) around doesn't change the checksum. Still, some fields need
to be updated:
| 208 | + |
| 209 | +- The layer 4 protocol, set in layer 3 (IPv4/IPv6) here, also used to compute |
| 210 | + the next layer (UDP/TCP) checksum. |
| 211 | +- The switch from the TCP `Urgent Pointer` (`0`) to the UDP `Length` (and the |
| 212 | + opposite). |
| 213 | + |
| 214 | +It is not required to recompute the full checksum. Instead, this can be done |
| 215 | +[incrementally](https://datatracker.ietf.org/doc/html/rfc1141), and some eBPF |
| 216 | +helpers can do that for us, e.g. |
| 217 | +[`bpf_l3_csum_replace`](https://docs.ebpf.io/linux/helper-function/bpf_l3_csum_replace/) |
| 218 | +and |
| 219 | +[`bpf_l4_csum_replace`](https://docs.ebpf.io/linux/helper-function/bpf_l4_csum_replace/). |
| 220 | + |
| 221 | +When testing with Network namespaces (`netns`) with one host dedicated to the |
| 222 | +translation when forwarding packets, everything was fine: the correct checksum |
| 223 | +was visible in each packet. But when testing with real hardware, with TCP-in-UDP |
| 224 | +eBPF hooks directly on the client and server, that was different: the checksum |
| 225 | +in egress was incorrect on most network interfaces, even when the transmission |
| 226 | +checksum offload (`tx`) was disabled on the network interface. |
| 227 | + |
| 228 | +After quite a bit of investigation, it appears that both the layer 3 and 4 |
| 229 | +checksums were correctly updated by the eBPF hook, but either the NIC or the |
| 230 | +networking stack was modifying the layer 4 checksum at the wrong place. This |
| 231 | +deserves some explanation. |
| 232 | + |
| 233 | +In egress, the Linux TCP networking stack of the sender will typically set |
| 234 | +`skb->ip_summed` to `CHECKSUM_PARTIAL`. In short, it means the TCP/IP stack will |
| 235 | +compute a part of the checksum, only the one covering the |
| 236 | +[pseudo-header](https://www.ietf.org/rfc/rfc9293.html#section-3.1-6.18.1): IP |
| 237 | +addresses, protocol number and length. The rest will be computed later on, |
| 238 | +ideally by the networking device. At that last stage, the device only needs to |
| 239 | +know where the layer 4 starts in the packet, but also where the checksum field |
| 240 | +is from the start of this layer 4. This info is internally registered in |
| 241 | +`skb->csum_offset`, and it is different for TCP and UDP because the checksum |
| 242 | +field is not at the same place in their headers. |
| 243 | + |
| 244 | +When switching from UDP to TCP, it is then not enough to change the protocol |
| 245 | +number in the layer 3, this internal checksum offset value also needs to be |
| 246 | +updated. If I'm not mistaken, today, it is not possible to update it directly |
| 247 | +with eBPF. A proper solution is certainly to add a new eBPF helper, but that |
| 248 | +would only work with newer kernels, or eventually with a custom module. Instead, |
| 249 | +a workaround has been found: chain the eBPF TC egress hook with a TC `ACT_CSUM` |
| 250 | +action when the packet is translated from TCP to UDP. This [`csum` |
| 251 | +action](https://www.man7.org/linux/man-pages/man8/tc-csum.8.html) triggers a |
| 252 | +software checksum recalculation of the specified packet headers. In other words |
| 253 | +and in our case, it is used to compute the rest of the checksum for a given |
| 254 | +protocol (UDP), and mark the checksum as computed (`CHECKSUM_NONE`). This last |
| 255 | +step is important, because even if it is possible to compute the full checksum |
| 256 | +with eBPF code like we did at some point, it is wrong to do so if we cannot |
| 257 | +change the `CHECKSUM_PARTIAL` flag which expect a later stage to update a |
| 258 | +checksum at a (wrong) offset with the rest of the data. |

So with a combination of both TC `ACT_CSUM` and eBPF, it is possible to get
the right checksum after having modified the layer 4 protocol.
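
As a rough sketch of such a chain (the object file and section names below are
assumptions based on the repository layout, not verified commands), this could
look like:

```
# Illustrative only: attach the egress eBPF program on the TC hook and
# chain it with a csum action recomputing the UDP checksum in software.
tc qdisc add dev "${IFACE}" clsact
tc filter add dev "${IFACE}" egress bpf obj tcp_in_udp_tc.o sec tc \
    action csum udp
```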

### MTU/MSS

This last point is not linked to the highly optimised Linux network stack,
but to the fact that, on the wire, the packets will be UDP and not TCP. It
means that some operations, like the dynamic adaptation of the MSS (TCP
Maximum Segment Size) -- aka MSS clamping -- will have no effect here. Many
mobile networks use encapsulation without jumbo frames, meaning that the
maximum packet size is lower than 1500 bytes. For performance reasons, and to
avoid having to deal with this, it is important to avoid IP fragmentation. In
other words, it might be required to adapt the interface Maximum Transmission
Unit (MTU), or the [MTU](https://man.archlinux.org/man/ip-route.8.en#mtu) /
[MSS](https://man.archlinux.org/man/ip-route.8.en#advmss) per destination.
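
For example, lowering the MTU and the advertised MSS towards one destination
could look like this (the addresses come from the documentation ranges and the
values are placeholders: adapt them to the actual path MTU of your network):

```
# Placeholder values: clamp the MTU/MSS on the route towards one server.
ip route replace 192.0.2.0/24 via 198.51.100.1 mtu 1400 advmss 1360
```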

## Conclusion

In conclusion, this new eBPF program can easily be deployed on both the client
and the server sides to circumvent middleboxes that are still blocking MPTCP
or other protocols. All you might still need to do is modify the destination
port, which is [currently
hardcoded](https://github.com/multipath-tcp/tcp-in-udp/blob/cde92b1cf8588f7cd3932b204cd51e0596a07ade/tcp_in_udp_tc.c#L29).

## Acknowledgments

Thanks to [Xpedite Technologies](https://xpedite-tech.com) for having supported
this work, and in particular [Chester](https://github.com/arinc9) for his help
investigating the checksum issues with real hardware. Also thanks to Nickz from
the eBPF.io community for his support while working on these checksum issues.