| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "CHERI Myths: Writing C/C++ for CHERI is hard" |
| 4 | +date: 2024-08-22 |
| 5 | +categories: cheri myths |
| 6 | +author: David Chisnall |
| 7 | +--- |
| 8 | + |
| 9 | +I've had several conversations over the past six months where people who have never written C/C++ code on CHERI have told me that they expect it to be harder than on non-CHERI systems. |
| 10 | +I struggle a bit to understand this. |
| 11 | +If it were true, using tools like [valgrind](https://valgrind.org) and [Address Sanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer) would make development harder, which makes you wonder why these tools exist. |
| 12 | + |
| 13 | +I recently wrote some C and C++ for a non-CHERI target and, honestly, I can't believe I used to do that regularly given how much harder it is. |
| 14 | +Even in environments with a fully working interactive debugger, writing working C/C++ is more effort than on CHERIoT where we don't (yet) have debugger support. |
| 15 | + |
| 16 | +Imagine you have an off-by-one error that overflows a buffer. |
| 17 | +On non-CHERI systems, it's hard to track down. |
| 18 | +On the stack, it may be in padding and have no effect. |
| 19 | +It may have no effect in debug builds, but cause corruption in release builds where the stack layout is different. |
| 20 | +If it's in the heap, it may corrupt some unrelated object and the symptoms show up much later. |
| 21 | +I run the program under valgrind or address sanitiser and hope to get a useful result. |
| 22 | + |
| 23 | +On any CHERI target, I get a deterministic fault. |
| 24 | +Every time I read or write that one-byte-out-of-bounds value, I get the same fault. |
| 25 | +On [CheriBSD](https://www.cheribsd.org), I'd attach the debugger and see where it happened. |
| 26 | +On CHERIoT, until we get a working debugger, I'd include the (somewhat poorly named) [`fail-simulator-on-error.h`](https://github.com/CHERIoT-Platform/cheriot-rtos/blob/main/sdk/include/fail-simulator-on-error.h) header, which installs a default error handler. |
| 27 | +When the error is triggered, this prints the exact instruction that tried to read or write out of bounds. |
| 28 | +I'd then look in the dump file, which would tell me the line number, and fix it. |
| 29 | +This typically takes me a minute or two, if that. |
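The off-by-one case can be sketched in portable C as a software model of a bounds-checked store. This is only an illustration: `bounded_ptr`, `checked_store`, and `fill_with_off_by_one` are hypothetical names, and on a real CHERI target the bounds are carried by the pointer itself and checked by the hardware, with no explicit check in the source code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a bounded pointer. On CHERI hardware the
 * bounds are part of the capability and the CPU performs the check. */
typedef struct {
    uint8_t *base;   /* start of the object */
    size_t   length; /* size of the object in bytes */
} bounded_ptr;

/* Store one byte. Returns false (modelling a trap) instead of writing
 * outside [base, base + length), so nothing is ever corrupted. */
bool checked_store(bounded_ptr p, size_t index, uint8_t value)
{
    if (index >= p.length) {
        return false; /* a CHERI target would raise an exception here */
    }
    p.base[index] = value;
    return true;
}

/* Buggy fill loop: `<=` should be `<`. Returns the index of the first
 * store that was refused, or -1 if every store succeeded. */
ptrdiff_t fill_with_off_by_one(bounded_ptr p, uint8_t value)
{
    for (size_t i = 0; i <= p.length; i++) { /* bug: off by one */
        if (!checked_store(p, i, value)) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

For an 8-byte buffer, `fill_with_off_by_one` reports index 8 on every run, whatever happens to sit next to the buffer in memory. That is the deterministic behaviour described above, in contrast to a silent write into padding or a neighbouring object.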
| 30 | + |
| 31 | +Similarly, if I have a use-after-free error, there's some probability that address sanitiser will find it. |
| 32 | +Valgrind is a bit better, but is *very* slow. |
| 33 | +On CHERIoT, I get a trap as soon as I try to use the dangling pointer and I fix it in the same way as a spatial error. |
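The use-after-free case can be modelled in the same spirit. The sketch below is an assumption-laden illustration in plain C, not a description of how CHERIoT implements temporal safety: `model_allocation`, `model_ptr`, `model_free`, and `checked_load` are hypothetical names, and the model captures only the observable behaviour, that every copy of a pointer stops working once the object is freed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each allocation records whether it is still live. */
typedef struct {
    int  payload;
    bool live;
} model_allocation;

/* A pointer in this model simply refers to the allocation record. */
typedef struct {
    model_allocation *target;
} model_ptr;

/* Freeing marks the allocation dead, which affects every alias at once. */
void model_free(model_ptr p)
{
    p.target->live = false;
}

/* Load through the pointer. Returns false (modelling a trap) if the
 * allocation has been freed; *out is left untouched in that case. */
bool checked_load(model_ptr p, int *out)
{
    if (p.target == NULL || !p.target->live) {
        return false; /* a CHERIoT target would trap here */
    }
    *out = p.target->payload;
    return true;
}
```

A load through an alias succeeds before `model_free` and is refused immediately afterwards, so the bug shows up at the first use of the dangling pointer rather than at some later point when stale data has already been acted on.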
| 34 | + |
| 35 | +Importantly, the CHERI exception happens *before* any data corruption. |
| 36 | +I'm not trying to work backwards from a point where my heap or stack is corrupted to find the place where the corruption occurred; I'm told exactly where the bug is. |
| 37 | +The first use of a dangling pointer or the first out-of-bounds access to an object will trigger a CHERI exception and point to precisely the instruction that is doing the wrong thing. |
| 38 | + |
| 39 | +Note that all of this is about *incorrect* code. |
| 40 | +CHERI C and C++ try very hard to give you a standards-compliant (and de facto standards-compliant, allowing things that the standard leaves open to implementations but everyone assumes are fine) implementation. |
| 41 | +Almost all of the C and C++ code that we've tried to run on CHERIoT has worked with no source-code modifications. |
| 42 | +Most of these are well-tested codebases, sometimes MISRA C with loads of static analyses run, which *probably* don't have any memory-safety bugs. |
| 43 | + |
| 44 | +The things that cause CHERI traps are undefined behaviour in C/C++. |
| 45 | +When your program does something that is undefined behaviour, the space of possible behaviours is unbounded. |
| 46 | +You may get a segmentation fault. |
| 47 | +You may get arbitrary data corruption. |
| 48 | +You may get a totally unexpected sequence of instructions executed. |
| 49 | +Bugs that introduce undefined behaviour are the *hardest* to debug, because they mean that later code (or, in some exciting examples, [earlier code](https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=633)) is all depending on properties that are not true and so can do absolutely anything. |
| 50 | +Trapping on these things, rather than corrupting state, is a *huge* improvement to the debugging experience. |
| 51 | + |
| 52 | +If you're writing correct code, you probably won't notice the difference between CHERI and non-CHERI systems. |
| 53 | +If you're writing buggy code (which, let's face it, we all do, at least some of the time), CHERI lets you catch errors sooner. |
| 54 | + |
| 55 | +We've heard from several of the companies that prototyped on [Morello](https://www.morello-project.org) that they want to keep their Morello systems for CI for precisely this reason: testing in Morello finds bugs earlier. |
| 56 | + |
| 57 | +The 'shift-left' idea comes from the fact that bugs cost more the later they're found. |
| 58 | +If you can avoid bugs at design time, that's perfect. |
| 59 | +If you can avoid them before you ship a product, that's good. |
| 60 | +If you can detect them in production and recover, that's okay. |
| 61 | +If you don't detect them and they impact customers, that's the worst (just ask CrowdStrike). |
| 62 | +Developing for a CHERI target makes it easy to find bugs before you ship them. |
| 63 | +It typically costs at least one order of magnitude less to fix them at this point than after deployment. |
| 64 | + |
| 65 | +The 'shift-left' benefits for CHERIoT don't end at catching bugs early. |
| 66 | +If you compartmentalise your software, failures in production can become *recoverable* failures in production. |
| 67 | +For example, the CHERIoT network stack now [restarts the compartment that contains the FreeRTOS TCP/IP stack if it crashes](https://github.com/CHERIoT-Platform/network-stack/pull/27). |
| 68 | +From the perspective of the rest of the system, all connections drop (something that you have to handle anyway because networks are unreliable) and need to be reconnected. |
| 69 | + |
| 70 | +All of this makes developing and shipping products cheaper on CHERI systems than on conventional hardware. |