Commit 6742771

Browse files
committed
post: new article: TCP-in-UDP
To expose this tool and the complexities linked to its creation. That might help users and devs in the same situation. Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
_posts/2025-07-14-TCP-in-UDP.md

Lines changed: 289 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,289 @@
---
layout: post
title: "Introducing TCP-in-UDP solution"
---

The MPTCP protocol is complex, mainly to be able to survive on the Internet
where
[middleboxes](https://datatracker.ietf.org/doc/html/rfc8684#name-interactions-with-middleboxes)
such as NATs, firewalls, IDS or proxies can modify parts of the TCP packets. In
the worst case scenario, an MPTCP connection has to fall back to "plain" TCP.
Today, such fallbacks are rarer than before -- probably because MPTCP has been
used since 2013 on millions of Apple smartphones worldwide -- but they can
still exist, e.g. on some mobile networks using Performance Enhancing Proxies
(PEPs) where MPTCP connections are not bypassed. In such cases, a solution to
continue benefiting from MPTCP is to tunnel the MPTCP connections. Different
solutions exist, but they usually add extra layers, and require setting up a
virtual private network (VPN) with private IP addresses between the client and
the server.

Here, a simpler solution is presented:
[TCP-in-UDP](https://github.com/multipath-tcp/tcp-in-udp). This solution relies
on [eBPF](https://ebpf.io/), doesn't add extra data per packet, and doesn't
require a virtual private network. Read on to find out more!

<!--more-->

--------------------------------------------------------------------------------

> First, if the network you use blocks TCP extensions like MPTCP or other
> protocols, the best thing to do is to contact your network operator: maybe
> they are simply not aware of this issue, and can easily fix it.

## TCP-in-UDP

Many tunnel solutions exist, but they have other use cases: getting access to
private networks, possibly with encryption -- with solutions like OpenVPN,
IPsec, WireGuard®, etc. -- or adding extra info in each packet for routing
purposes -- like GRE, GENEVE, etc. The Linux kernel
[supports](https://developers.redhat.com/blog/2019/05/17/an-introduction-to-linux-virtual-interfaces-tunnels)
many of these tunnels. In our case, the goal is neither to get access to
private networks, nor to add an extra layer of encryption, but to make sure
packets are not modified by the network.

For our use case, it is then enough to "convert the TCP packets into UDP". This
is what [TCP-in-UDP](https://github.com/multipath-tcp/tcp-in-udp) does. The
idea is not new; it is inspired by an old [IETF
draft](https://datatracker.ietf.org/doc/html/draft-cheshire-tcp-over-udp-00.html).
In short, the fields of the TCP header are re-ordered so that the packet starts
with the fields of a UDP header.

### TCP to UDP header

To better understand the translation, let's see what the different headers look
like:

- [UDP](https://www.ietf.org/rfc/rfc768.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

- [TCP](https://www.ietf.org/rfc/rfc9293.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |       |C|E|U|A|P|R|S|F|                               |
| Offset| Reser |W|C|R|C|S|S|Y|I|            Window             |
|       |       |R|E|G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |        Urgent Pointer         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      (Optional) Options                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

- [TCP-in-UDP](https://datatracker.ietf.org/doc/html/draft-cheshire-tcp-over-udp-00.html):

```
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |           Checksum            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |       |C|E| |A|P|R|S|F|                               |
| Offset| Reser |W|C|0|C|S|S|Y|I|            Window             |
|       |       |R|E| |K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      (Optional) Options                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

As described
[here](https://perso.uclouvain.be/olivier.bonaventure/blog/html/2013/07/04/tcp_over_udp.html),
the first eight bytes of the TCP-in-UDP header correspond to the classical UDP
header. Then, the Data Offset is placed with the flags and the window field.
Placing the Data Offset after the Checksum ensures that a value larger than
`0x5` will appear there, which is required for STUN traversal. Then, the
sequence and acknowledgment numbers follow. With this translation, the TCP
header has been reordered, but now starts with a UDP header, without modifying
the packet length. The informed reader will have noticed that the `URG` flag
and the `Urgent Pointer` have disappeared: this field is rarely used, and some
middleboxes reset it, so this is not a huge loss for most TCP applications.

In other words, apart from a different order, the only two modifications are:

- the layer 4 protocol indicated in layer 3 (IPv4/IPv6);
- the switch from `Urgent Pointer` to `Length` (and the opposite).

These two modifications will of course affect the Checksum field, which will
need to be updated accordingly.

## Dealing with network stack optimisations

On paper, the required modifications -- changing the protocol, swapping a
16-bit word, and adapting the checksum -- are small, and should be easy to do
using eBPF with TC ingress and egress hooks. But doing that in a highly
optimised stack is more complex than expected.

### Accessing all required data

On Linux, all per-packet data are stored in a socket buffer, or
"[SKB](http://oldvger.kernel.org/~davem/skb.html)". In our case, the eBPF code
needs to access the packet header, which should be available between
`skb->data` and `skb->data_end`. Except that `skb->data_end` might not point to
the end of the packet: typically, it points to the end of the packet headers.
This is an optimisation: the kernel often performs operations depending on the
packet header, and it doesn't really care about the content of the data, which
is usually destined for userspace, or to be forwarded to another network
interface.

In our case, in egress -- translation from TCP to UDP -- it is fine: the whole
TCP header is available, and that's where the modifications need to be done. In
ingress -- translation from UDP to TCP -- that's different: some network
drivers only provide linear data up to the end of the layer 4 header, i.e. the
8 bytes of the UDP header here. This is not enough to do the translation, as 12
more bytes need to be accessed. This issue is easy to fix: eBPF helpers were
introduced a long time ago to pull in non-linear data, e.g. via
[`bpf_skb_pull_data`](https://docs.ebpf.io/linux/helper-function/bpf_skb_pull_data/)
or
[`bpf_skb_load_bytes`](https://docs.ebpf.io/linux/helper-function/bpf_skb_load_bytes/).

### GRO & TSO/GSO

On the Internet, packets are usually limited to 1500 bytes or fewer. Each
packet still needs to carry some headers to indicate the source and
destination, but also per-packet information like the data sequence number.
Having to deal with "small" packets has a per-packet cost, which becomes
significant at very high throughput. To counter that, the Linux networking
stack prefers to deal with bigger chunks of data, with "internal" packets of
tens of kilobytes, and splits them into smaller ones with very similar headers
later on. Some network devices can even do this segmentation or aggregation
work in hardware. That's what GRO (Generic Receive Offload), and TSO (TCP
Segmentation Offload) / GSO (Generic Segmentation Offload) are for.

With TCP-in-UDP, it is required to act on a per-packet basis: each TCP packet
is translated to UDP, containing the UDP header (8 bytes), the rest of the TCP
one (12 bytes + the TCP options), then the TCP payload. In other words, for
each UDP packet, the UDP payload starts with a part of the TCP header: data
that is per-packet specific. It means that traditional GRO and TSO cannot be
used, because the data can no longer "simply" be merged with the next packet's
like before.

Informed readers will then say that these network device features can be easily
disabled using `ethtool`, e.g.

```
ethtool -K "${IFACE}" gro off gso off tso off
```

Correct, but even if all hardware offload accelerations are disabled, in
egress, the Linux networking stack still has an interest in dealing with bigger
packets internally, and doing the segmentation in software at the end. Because
it is not easily possible to modify how the segmentation is done with eBPF, it
is required to tell the stack not to do this optimisation, e.g. with:

```
ip link set "${IFACE}" gso_max_segs 1
```

### Checksum

> The following was certainly the most frustrating issue to deal with!

Thanks to how the checksum is
[computed](https://datatracker.ietf.org/doc/html/rfc1071), moving 16-bit words
(or bigger units) around doesn't change the checksum. Still, some fields need
to be updated:

- The layer 4 protocol, set in layer 3 (IPv4/IPv6) here, which is also used to
  compute the next layer's (UDP/TCP) checksum.
- The switch from the TCP `Urgent Pointer` (`0`) to the UDP `Length` (and the
  opposite).

It is not required to recompute the full checksum. Instead, this can be done
[incrementally](https://datatracker.ietf.org/doc/html/rfc1141), and some eBPF
helpers can do that for us, e.g.
[`bpf_l3_csum_replace`](https://docs.ebpf.io/linux/helper-function/bpf_l3_csum_replace/)
and
[`bpf_l4_csum_replace`](https://docs.ebpf.io/linux/helper-function/bpf_l4_csum_replace/).

When testing in network namespaces (`netns`), with one host dedicated to the
translation when forwarding packets, everything was fine: the correct checksum
was visible in each packet. But when testing with real hardware, with
TCP-in-UDP eBPF hooks directly on the client and server, that was different:
the checksum in egress was incorrect on most network interfaces, even when the
transmission checksum offload (`tx`) was disabled on the network interface.
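For reference, that offload can be toggled with `ethtool`, in the same way as GRO/TSO/GSO above (a config fragment, assuming `IFACE` is set):

```shell
# Disable transmission checksum offload on the interface.
ethtool -K "${IFACE}" tx off
```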

After quite a bit of investigation, it appeared that both the layer 3 and 4
checksums were correctly updated by the eBPF hook, but either the NIC or the
networking stack was modifying the layer 4 checksum at the wrong place. This
deserves some explanation.

In egress, the Linux TCP networking stack of the sender will typically set
`skb->ip_summed` to `CHECKSUM_PARTIAL`. In short, it means the TCP/IP stack
will compute only a part of the checksum, the one covering the
[pseudo-header](https://www.ietf.org/rfc/rfc9293.html#section-3.1-6.18.1): IP
addresses, protocol number and length. The rest will be computed later on,
ideally by the networking device. At that last stage, the device needs to know
where layer 4 starts in the packet, but also where the checksum field is,
relative to the start of this layer 4. This info is internally registered in
`skb->csum_offset`, and it is different for TCP and UDP because the checksum
field is not at the same place in their headers.

When switching from TCP to UDP, it is then not enough to change the protocol
number in layer 3: this internal checksum offset value also needs to be
updated. If I'm not mistaken, today, it is not possible to update it directly
with eBPF. A proper solution is certainly to add a new eBPF helper, but that
would only work with newer kernels, or possibly with a custom module. Instead,
a workaround has been found: chain the eBPF TC egress hook with a TC `ACT_CSUM`
action when the packet is translated from TCP to UDP. This [`csum`
action](https://www.man7.org/linux/man-pages/man8/tc-csum.8.html) triggers a
software checksum recalculation of the specified packet headers. In other
words, in our case, it is used to compute the rest of the checksum for the
given protocol (UDP), and to mark the checksum as computed (`CHECKSUM_NONE`).
This last step is important: even if it is possible to compute the full
checksum with eBPF code, like we did at some point, it is wrong to do so if we
cannot change the `CHECKSUM_PARTIAL` flag, which expects a later stage to
update the checksum at a (wrong) offset with the rest of the data.
259+
260+
So with a combination of both TC `ACT_CSUM` and eBPF, it is possible to get the
261+
right checksum after having modified the layer 4 protocol.
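As an illustration of this chaining, the setup could look like the following `tc` commands. This is only a sketch: the object file and section names are hypothetical, and the real project may load its programs differently:

```shell
# Attach a clsact qdisc, then an eBPF classifier on egress, chained with
# a csum action that recomputes the UDP checksum in software.
tc qdisc add dev "${IFACE}" clsact
tc filter add dev "${IFACE}" egress bpf obj tcp_in_udp_tc.o sec tc \
    action csum udp
```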
262+
263+
### MTU/MSS
264+
265+
This is not linked to the highly optimised Linux network stack, but, on the
266+
wire, the packets will be in UDP and not TCP. It means that some operations like
267+
the dynamic adaptation of the MSS (TCP Maximum Segment Size) -- aka MSS clamping
268+
-- will have no effects here. Many mobile networks uses encapsulation without
269+
jumbo frames, meaning that the maximum size is lower than 1500 bytes. For
270+
performance reasons, and not to have to deal with this, it is important to avoid
271+
IP fragmentation. In other words, it might be required to adapt the interface
272+
Maximum Transmission Unit (MTU), or the
273+
[MTU](https://man.archlinux.org/man/ip-route.8.en#mtu) /
274+
[MSS](https://man.archlinux.org/man/ip-route.8.en#advmss) per destination.
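For instance, setting a lower MTU and advertised MSS for a given destination could look like this, where the addresses and sizes are made-up examples:

```shell
# Clamp the route MTU and the advertised MSS for one destination only.
ip route add 192.0.2.0/24 via 198.51.100.1 mtu 1400 advmss 1360
```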

## Conclusion

In conclusion, this new eBPF program can easily be deployed on both the client
and server sides to circumvent middleboxes that are still blocking MPTCP or
other protocols. All you might still need to do is modify the destination port,
which is [currently
hardcoded](https://github.com/multipath-tcp/tcp-in-udp/blob/cde92b1cf8588f7cd3932b204cd51e0596a07ade/tcp_in_udp_tc.c#L29).

## Acknowledgments

Thanks to [Xpedite Technologies](https://xpedite-tech.com) for having supported
this work, and in particular [Chester](https://github.com/arinc9) for his help
investigating the checksum issues with real hardware. Also thanks to Nickz from
the eBPF.io community for his support while working on these checksum issues.
