---
layout: post
title: "Maintainers' responsibilities"
---

Last month, I didn't publish any new blog post here. Not because there was
nothing to say, but simply because I was busy. Sadly, I was not 100% focused on
the new features I wanted to finish implementing, but mainly on resolving issues
discovered while working on them. Should I have closed my eyes and carried on?
Can maintainers do that? Read on to find out more about what happened recently!

<!--more-->

## Maintainers' responsibilities

"Maintenance" is a general term which, for a kernel maintainer of an active
subtree, includes:

- communicating with the community, organizing regular meetings, and answering
  questions;
- tracking, analysing, and fixing bugs;
- fixing issues with anything related to the workflow, like the CI and other
  tools and services;
- refactoring code to ease the inclusion of new features or fixes;
- reviewing and accepting work from others;
- sending modifications to be included in the official Linux kernel;
- helping with the backports, and doing the various follow-ups.

And I have probably missed some other tasks. It might not look like it, but
maintenance work in the kernel can be quite time-consuming. Some "small" tasks,
e.g. reviewing non-straightforward code or analysing bug reports, can quickly
take a few hours.

I have already tried to illustrate some of these aspects in my previous blog
posts. Here, I will focus on the responsibilities related to bugs discovered
while working on new features.

### Discovering new bugs

When bugs are discovered while working on something else, there are typically
three options: ignoring them, documenting them, or fixing them.

- I don't know if it is due to my personality or to a maintainer's duty, but I
  would feel bad ignoring them without doing anything. When someone is new to a
  project, it might not be clear whether something that looks strange is really
  a bug. Someone who maintains the code can tell more easily when something is
  not right. It is then hard not to think about the mid- or long-term
  consequences, and to ignore issues that will come back sooner or later,
  possibly with more pressure or worse consequences.

- Documenting the issue can be a "quick" solution, even if documenting issues
  can sometimes take almost as long as resolving them: while focusing on the
  issue, it is natural to already think about solutions, so why not try to fix
  it while everything is still "fresh" in mind? But sometimes there are other
  urgent tasks, resolving the bug would take a long time, or its priority is
  simply too low.

- Fixing bugs would be ideal. But fixing a bug also means understanding it by
  analysing the code, reproducing it by adding a regression test, and
  documenting it by providing all the required details in the commit messages.
  In projects like this one, that is rarely done in 5 minutes.

### Recent examples

Recently, I was working on documenting how the [MPTCP default
Path-Manager](https://www.mptcp.dev/pm.html) works, and on improving the user
experience. It was clear to me that I could not simply ignore the issues I found
while working on that. But documenting that "_something is supposed to work like
that, but doesn't in some cases_" also felt wrong.

I then looked at the first bug and, as is often the case, it was like opening
Pandora's box: one bug after another. The result: 30+ kernel patches, better
documentation, and the resolution of a few issues that had been reported by
users but could not be understood at the time with the information provided.
But also a clearer and more predictable software behaviour, which improves the
user experience in the end.

Here are two other examples, with test suites. The first one concerns the MPTCP
CI, which [reported](https://ci-results.mptcp.dev/flakes.html) a few unstable
tests over the last few months. They all used the same tools, and even if the
errors were quite rare, they happened with different tests. Because of that,
developers started to lose faith in these tests: an error was no longer seen as
a sign of an issue with the new code. After a while, developers might not even
look at the errors any more, blaming the tests instead, and possibly missing
real problems. A short-term solution was to re-launch the failing tests, and to
consider them problematic only if they failed twice in a row. That can be OK in
some specific cases, but it also means that real issues only happening under
some conditions might be missed as well. Fixing the root cause seems more
rewarding, and better in the long term. That's what has been
[done](https://github.com/multipath-tcp/packetdrill/pulls?q=is%3Apr+is%3Aclosed)
recently with the MPTCP Packetdrill tests. It is good to have a trusted test
suite!
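
The "re-launch, and only report an error after two failures in a row" policy
described above can be sketched as follows. This is a minimal Python
illustration, not the MPTCP CI's actual implementation: `run_with_retry()`,
`scripted_test()` and the `"ok"`/`"flaky"`/`"error"` labels are hypothetical.
One design choice worth highlighting: a test that only passes on the second
attempt is reported as `"flaky"` rather than `"ok"`, so the instability stays
visible instead of being silently hidden.

```python
from typing import Callable, Iterable


def run_with_retry(test: Callable[[], bool], attempts: int = 2) -> str:
    """Hypothetical helper: run `test` up to `attempts` times.

    Returns "error" only if every attempt failed (by default: failing twice
    in a row), "flaky" if it passed after at least one failure, and "ok" if
    it passed on the first attempt.
    """
    for attempt in range(1, attempts + 1):
        if test():
            return "ok" if attempt == 1 else "flaky"
    return "error"


def scripted_test(results: Iterable[bool]) -> Callable[[], bool]:
    """Hypothetical stand-in for a real test: returns pre-scripted outcomes."""
    outcomes = iter(results)
    return lambda: next(outcomes)


print(run_with_retry(scripted_test([True])))          # ok
print(run_with_retry(scripted_test([False, True])))   # flaky
print(run_with_retry(scripted_test([False, False])))  # error
```

The second case also shows the limit mentioned above: a real bug that only
triggers occasionally produces exactly the same pattern, and would never be
reported as an error with this policy alone.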

In the second example, another CI, the [Netdev
one](https://netdev.bots.linux.dev/status.html), reported that some specific
subtests were unstable. They were unstable only there, probably because too many
tests are executed at the same time. The issues have been tracked and
documented. They still need further investigation, but it looks like they are
either caused by the tests themselves, or their fixes would be non-trivial
optimizations: so either a (debatably) low priority, or significant work. In
such cases, it has been decided to clearly mark these tests as "unstable" rather
than "error". Doing so "_reduces the noise_", and spares new developers from
wondering why their modifications seem to have caused some unrelated issues.
Still, it is important to only mark the ones that have had a first analysis,
and to keep tracking their evolution. That's what has been [done](https://lore.kernel.org/mptcp/20240524-upstream-net-20240524-selftests-mptcp-flaky-v1-0-a352362f3f8e@kernel.org/)
recently with the MPTCP selftests.
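
Marking only analysed and tracked subtests as "unstable" can be sketched like
this. Again, this is a hypothetical Python illustration, not how the MPTCP
selftests or the Netdev CI actually implement it: the `KNOWN_UNSTABLE` table,
the subtest names, and the tracker IDs are made up. The idea is that a failure
is downgraded to "unstable" only if the subtest is explicitly listed together
with a reference to its first analysis, and only real errors make the whole run
fail.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class KnownUnstable:
    """A subtest that had a first analysis, with a link to its tracking issue."""
    reason: str
    tracker: str  # hypothetical tracking-issue identifier


# Hypothetical allow-list: every entry must point to an analysed, tracked issue.
KNOWN_UNSTABLE: Dict[str, KnownUnstable] = {
    "subtest_busy_host": KnownUnstable("timing-sensitive on loaded hosts", "issue-1"),
}


def classify(name: str, passed: bool) -> str:
    """Report a result as "ok", "unstable (<tracker>)", or "error"."""
    if passed:
        return "ok"
    known = KNOWN_UNSTABLE.get(name)
    if known is not None:
        return f"unstable ({known.tracker})"
    return "error"


def run_failed(results: Dict[str, bool]) -> bool:
    """Only unexpected errors fail the run; known unstable subtests don't."""
    return any(classify(name, passed) == "error" for name, passed in results.items())


results = {"subtest_busy_host": False, "subtest_basic": True}
for name, passed in results.items():
    print(name, classify(name, passed))
print("run failed:", run_failed(results))  # run failed: False
```

Keeping a reason and a tracker next to each entry makes the list easier to
review regularly, and to clean up once a root cause has been fixed.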

## Team work

As always, it is important to note that what I have presented so far is mostly
what I have been working on. But I'm not alone in this project. For example,
Geliang helped to reduce duplicated code in the BPF selftests, including the
MPTCP ones; Davide replaced a few unintentionally discriminatory words in the
code comments; Yonglong Li fixed bugs with MIB counters; Gregory continued his
experiments with the packet scheduler API; Paolo and Mat helped with the code
reviews; Christoph continued maintaining the SyzKaller infrastructure.

A great community!