Commit 06fa815

post: new article about maintainers' responsibilities
Trying to explain why things take time when it is done by the maintainers who have responsibilities, and cannot drop everything for a long period, and only focus on new features. Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
1 parent f731018 commit 06fa815

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
---
layout: post
title: "Maintainers' responsibilities"
---

Last month, I didn't publish any new blog post here. Not because there was
nothing to say, but simply because I was busy. Sadly, I was not 100% focused on
the new features I wanted to finish implementing, but mainly on resolving issues
discovered while working on them. Should I have closed my eyes and carried on?
Can maintainers do that? Read on to find out more about what happened recently!

<!--more-->

## Maintainers' responsibilities

"Maintenance" is a general term which, for a kernel maintainer of an active
subtree, includes: communicating with the community, organizing regular
meetings, answering questions, tracking, analysing and fixing bugs, fixing
issues with anything related to the workflow like the CI and other tools and
services, refactoring code to ease the inclusion of new features or fixes,
reviewing and accepting work from others, sending modifications to be included
in the official Linux kernel, helping with the backports, doing the various
follow-ups, and probably other tasks I forgot. It might not look like it, but
the maintenance work in the kernel can be quite time-consuming. Some "small"
tasks can quickly take a few hours, e.g. reviewing non-straightforward code, or
analysing bug reports.

I already tried to demonstrate some of these aspects in my previous blog posts.
Here, I will focus on the responsibilities related to bugs discovered while
working on new features.

### Discovering new bugs

When bugs are discovered while working on something else, there are typically a
few possibilities: ignoring, documenting, or fixing them.

- I don't know if it is due to my personality, or because of a maintainer's
  duty, but I would feel bad ignoring them and doing nothing else. When someone
  is new to a project, it might not be clear whether something that looks
  strange is really a bug. But for someone who maintains the code, it is
  clearer when something is not right. It is then hard not to think about the
  mid- or long-term consequences, and to ignore issues that will come back
  sooner or later, possibly with more pressure or worse consequences.

- Documenting the issue can be a "quick" solution, even if documenting an issue
  can sometimes take almost as long as resolving it: the focus is already on
  the issue and it is normal to start thinking about solutions, so why not try
  to fix it while everything is still "fresh" in mind? But sometimes there are
  more urgent matters, the bug resolution can take long, or the priority can be
  too low.

- Fixing bugs would be ideal. But fixing bugs also means understanding them by
  analysing code, reproducing them by adding a regression test, and documenting
  them by providing all required details in commit messages. Often in such
  projects, this is not done in 5 minutes.

### Recent examples

Recently, I was working on documenting how the [MPTCP default
Path-Manager](https://www.mptcp.dev/pm.html) works, and on improving the user
experience. It was clear to me that I could not simply ignore the issues I found
while working on that. But documenting that "_something is supposed to work like
that, but doesn't in some cases_" also felt wrong.

I then looked at the first bug and, as often in these cases, it was like opening
Pandora's box: one bug after another. The result was 30+ kernel patches, better
documentation, and the resolution of a few issues that users had reported but
that could not be understood at the time with the provided info. But also
clearer and more predictable software behaviour, which improves the user
experience in the end.

Here are two other examples with test suites. The first one is with the MPTCP
CI which [reported](https://ci-results.mptcp.dev/flakes.html) a few unstable
tests over the last few months. They all used the same tools, and even if the
errors were quite rare, they were happening with different tests. Because of
that, developers started to lose faith in them: an error was no longer seen as
a sign of an issue with the new code. After a bit of time, developers might even
stop looking at the errors, blaming the tests instead, and possibly missing real
problems. A short-term solution was to re-launch the tests, and to consider them
as problematic only if they failed twice in a row. That can be OK in some
specific cases, but it also means that real issues only happening in some
conditions might be missed as well. Fixing the root cause seems more rewarding,
and better in the long term. That's what has been
[done](https://github.com/multipath-tcp/packetdrill/pulls?q=is%3Apr+is%3Aclosed)
recently with MPTCP Packetdrill tests. It is good to have a trusted test suite!
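
To make the trade-off of the "re-launch and fail only after two failures in a
row" workaround more concrete, here is a minimal Python sketch. It is only an
illustration and not the actual MPTCP CI code: `run_with_retry` and `make_test`
are hypothetical names, and the "flaky" status is an assumption added here to
show what gets hidden when a retry passes.

```python
from typing import Callable, Iterable


def run_with_retry(test: Callable[[], bool], max_runs: int = 2) -> str:
    """Run `test` up to `max_runs` times; report "fail" only if all runs fail.

    Hypothetical helper for illustration, not taken from any real CI.
    """
    for attempt in range(1, max_runs + 1):
        if test():
            # A pass on a later attempt means an earlier failure was hidden.
            return "pass" if attempt == 1 else "flaky"
    return "fail"


def make_test(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Build a fake test returning the given outcomes, one per call."""
    remaining = iter(outcomes)
    return lambda: next(remaining)


print(run_with_retry(make_test([True])))
print(run_with_retry(make_test([False, True])))
print(run_with_retry(make_test([False, False])))
```

The second call is the interesting one: a real bug that only triggers under
some conditions would fail once, pass on the retry, and never be reported as a
failure. Keeping a separate status for that case at least leaves a trace to
investigate later.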

In the second example, another CI, the [Netdev
one](https://netdev.bots.linux.dev/status.html), reported that some specific
subtests were unstable. They have been unstable only there, probably because too
many tests are being executed at the same time. The issues have been tracked and
documented. They still need further investigation, but it looks like either the
test itself is at fault, or the fixes are non-trivial optimizations. So either
a debatably low priority, or a significant amount of work. In such cases, it has
been decided to clearly mark these tests as "unstable", and not as "error".
Doing that "_reduces the noise_", and spares new developers from wondering why
their modifications caused some unrelated issues. Yet, it is still important to
only mark the tests that had a first analysis, and to track their evolution.
That's what has been [done](https://lore.kernel.org/mptcp/20240524-upstream-net-20240524-selftests-mptcp-flaky-v1-0-a352362f3f8e@kernel.org/)
recently with MPTCP selftests.
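
The idea of reporting known-unstable subtests separately from real errors can
be sketched as follows. This is a simplified Python illustration under stated
assumptions, not how the MPTCP selftests implement it: the subtest names, the
`KNOWN_UNSTABLE` set, and the `classify` and `summarize` helpers are all
hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical set of subtests that already had a first analysis and are
# tracked as unstable. Only analysed subtests should ever be listed here.
KNOWN_UNSTABLE: Set[str] = {"example_subtest_a"}


def classify(name: str, passed: bool, known_unstable: Set[str]) -> str:
    """Return "ok", "unstable" or "error" for a single subtest result."""
    if passed:
        return "ok"
    return "unstable" if name in known_unstable else "error"


def summarize(results: Dict[str, bool]) -> Dict[str, List[str]]:
    """Group subtest names by status."""
    summary: Dict[str, List[str]] = {"ok": [], "unstable": [], "error": []}
    for name, passed in results.items():
        summary[classify(name, passed, KNOWN_UNSTABLE)].append(name)
    return summary


results = {
    "example_subtest_a": False,  # known to be unstable
    "example_subtest_b": False,  # unexpected failure
    "example_subtest_c": True,
}
summary = summarize(results)
print("unstable (tracked, not blocking):", summary["unstable"])
print("error (needs attention):", summary["error"])
```

With such a split, only the entries under "error" would point at the change
being tested, while the "unstable" ones stay visible so their evolution can
still be tracked.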

## Team work

As always, it is important to note that what I presented here so far is mostly
what I was working on. But I'm not alone in this project. For example, Geliang
helped to reduce duplicated code in BPF selftests, including the MPTCP ones;
Davide replaced a few unintentionally discriminatory words in the code comments;
Yonglong Li fixed bugs with MIB counters; Gregory continued his experiments with
the packet scheduler API; Paolo and Mat helped with the code reviews; Christoph
continued the syzkaller infrastructure maintenance.

A great community!
