-
Notifications
You must be signed in to change notification settings - Fork 206
psc_seq: rectifier ereports #2214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
HEY CHECK THIS OUT: eliza@niles ~ $ pilot sp -i axf8 exec -e 'ereports' BRM11230004
Sep 03 21:55:07.538 INFO creating SP handle on interface axf8, component: faux-mgs
Sep 03 21:55:07.540 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe06:119%28]:11111, interface: axf8, socket: control-plane-agent, component: faux-mgs
restart ID: 9a18e261-6284-16b9-26e9-1c1fc987d1e3
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 3
ereports:
0x1: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"ereport_message_version": Number(0),
"hubris_archive_id": String("h-RcgwEqyx4"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("packrat"),
"hubris_uptime_ms": Number(0),
"lost": Null,
}
0x2: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"dev_id": String("PSU1"),
"ereport_message_version": Number(0),
"hubris_archive_id": String("h-RcgwEqyx4"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("sequencer"),
"hubris_uptime_ms": Number(27287),
"k": String("psu.remove"),
"psu": Object {
"serial": String("LL2216RB0016"),
"slot": Number(1),
},
"v": Number(0),
}
0x3: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"dev_id": String("PSU1"),
"ereport_message_version": Number(0),
"hubris_archive_id": String("h-RcgwEqyx4"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("sequencer"),
"hubris_uptime_ms": Number(39542),
"k": String("psu.insert"),
"psu": Object {
"serial": String("LL2216RB0016"),
"slot": Number(1),
},
"v": Number(0),
} |
and here's what I see now when @ericaasen moves the rectifier in slot 1 to slot 4, and then causes an undervolt fault by removing input power on slot 2: eliza@niles ~ $ pilot sp -i axf8 exec -e 'ereports' BRM11230004
Sep 04 18:55:03.004 INFO creating SP handle on interface axf8, component: faux-mgs
Sep 04 18:55:03.006 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe06:119%28]:11111, interface: axf8, socket: control-plane-agent, component: faux-mgs
restart ID: f919674f-7485-8845-bb9e-9d1246653f03
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 6
ereports:
0x1: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"ereport_message_version": Number(0),
"hubris_archive_id": String("7Hf_wLsmHG8"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("packrat"),
"hubris_uptime_ms": Number(0),
"lost": Null,
}
0x2: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"dev_id": String("PSU1"),
"ereport_message_version": Number(0),
"fruid": Object {
"fw_rev": String("0762"),
"mfr": String("Murata-PS"),
"mpn": String("MWOCP68-3600-D-RM"),
"serial": String("LL2216RB0016"),
},
"hubris_archive_id": String("7Hf_wLsmHG8"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("sequencer"),
"hubris_uptime_ms": Number(2311241),
"k": String("psu.remove"),
"psu_slot": Number(1),
"v": Number(0),
}
0x3: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"dev_id": String("PSU4"),
"ereport_message_version": Number(0),
"fruid": Object {
"fw_rev": String("0762"),
"mfr": String("Murata-PS"),
"mpn": String("MWOCP68-3600-D-RM"),
"serial": String("LL2216RB0016"),
},
"hubris_archive_id": String("7Hf_wLsmHG8"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("sequencer"),
"hubris_uptime_ms": Number(2325199),
"k": String("psu.insert"),
"psu_slot": Number(4),
"v": Number(0),
}
0x4: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"dev_id": String("PSU2"),
"ereport_message_version": Number(0),
"fruid": Object {
"fw_rev": String("0762"),
"mfr": String("Murata-PS"),
"mpn": String("MWOCP68-3600-D-RM"),
"serial": String("LL2420RF000D"),
},
"hubris_archive_id": String("7Hf_wLsmHG8"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("sequencer"),
"hubris_uptime_ms": Number(2346517),
"k": String("psu.fault"),
"pmbus_status": Object {
"cml": Number(0),
"input": Number(0),
"iout": Number(0),
"mfr": Number(0),
"temp": Number(0),
"vout": Number(0),
"word": Number(8200),
},
"psu_slot": Number(2),
"v": Number(0),
}
0x5: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"dev_id": String("PSU2"),
"ereport_message_version": Number(0),
"fruid": Object {
"fw_rev": String("0762"),
"mfr": String("Murata-PS"),
"mpn": String("MWOCP68-3600-D-RM"),
"serial": String("LL2420RF000D"),
},
"hubris_archive_id": String("7Hf_wLsmHG8"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("sequencer"),
"hubris_uptime_ms": Number(2352516),
"k": String("psu.fault_cleared"),
"pmbus_status": Object {
"cml": Number(0),
"input": Number(0),
"iout": Number(0),
"mfr": Number(0),
"temp": Number(0),
"vout": Number(0),
"word": Number(10312),
},
"psu_slot": Number(2),
"v": Number(0),
}
0x6: {
"baseboard_part_number": String("913-0000003"),
"baseboard_rev": Number(9),
"baseboard_serial_number": String("BRM11230004"),
"dev_id": String("PSU2"),
"ereport_message_version": Number(0),
"fruid": Object {
"fw_rev": String("0762"),
"mfr": String("Murata-PS"),
"mpn": String("MWOCP68-3600-D-RM"),
"serial": String("LL2420RF000D"),
},
"hubris_archive_id": String("7Hf_wLsmHG8"),
"hubris_task_gen": Number(0),
"hubris_task_name": String("sequencer"),
"hubris_uptime_ms": Number(2352711),
"k": String("psu.fault"),
"pmbus_status": Object {
"cml": Number(0),
"input": Number(0),
"iout": Number(0),
"mfr": Number(0),
"temp": Number(0),
"vout": Number(0),
"word": Number(10312),
},
"psu_slot": Number(2),
"v": Number(0),
}
eliza@niles ~ $ Note that we have all the FRUID fields from PMbus now, as per @rmustacc's request. :) I think the fault-related ereports are not quite right: we see a "fault_cleared" event when the PSC tries to clear the fault by re-enabling the rectifier, but the fault does not clear, and then we see a subsequent ereport. That's not great --- |
e2bf2d1
to
005943e
Compare
The I'm hoping that's transient, I'm gonna restart it and see what happens. |
HRRRM I expect that should be transient? I'd have to go and re-look at what this test is actually doing. |
It appears to have transited: https://github.com/oxidecomputer/hubris/pull/2214/checks?check_run_id=49871665979 :) |
@cbiffle I think I've addressed all your suggestions, thanks! |
In [this comment][1], @cbiffle reminded me that we can use `ClaimOnceCell` to use a `static` to create byte buffers for encoding ereports. This way, the buffer is never on the stack, reducing the risk of stack overflows when recording an ereport. This commit changes `drv_gimlet_seq_server` and `drv_cosmo_seq_server` to use that technique for their ereports. [1]: #2214 (comment)
In [this comment][1], @cbiffle reminded me that we can use `ClaimOnceCell` to use a `static` to create byte buffers for encoding ereports. This way, the buffer is never on the stack, reducing the risk of stack overflows when recording an ereport. This commit changes `drv_gimlet_seq_server` and `drv_cosmo_seq_server` to use that technique for their ereports. [1]: #2214 (comment)
In [this comment][1], @cbiffle reminded me that we can use `ClaimOnceCell` to use a `static` to create byte buffers for encoding ereports. This way, the buffer is never on the stack, reducing the risk of stack overflows when recording an ereport. This commit changes `drv_gimlet_seq_server` and `drv_cosmo_seq_server` to use that technique for their ereports. [1]: #2214 (comment)
In [this comment][1], @cbiffle reminded me that we can use `ClaimOnceCell` to use a `static` to create byte buffers for encoding ereports. This way, the buffer is never on the stack, reducing the risk of stack overflows when recording an ereport. This commit changes `drv_gimlet_seq_server` and `drv_cosmo_seq_server` to use that technique for their ereports. [1]: #2214 (comment)
7331a67
to
6baa38b
Compare
ringbuf!((u64, Trace), 128, (0, Trace::Empty)); | ||
// Since entries in this ringbuffer contain timestamps, they will never be | ||
// de-duplicated. Thus, disable it. | ||
counted_ringbuf!(Event, 128, Event::None, no_dedup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should this be __EVENT
for consistency with the default __RINGBUF
name (and __TRACE
below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had wanted one of them to be the default for brevity's sake, but if you really don't like it, I can change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh whoops, I mis-parsed the macro as creating a ringbuf named Event
. Seems fine as-is!
@@ -551,32 +829,124 @@ impl Psu { | |||
_, | |||
_, | |||
) => { | |||
// Hello, who are you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only refresh the fruid
after settle_deadline
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷♀️ On one hand, the I2C reads are likelier to fail if the contacts are still scraping, on the other, it gives us a longer opportunity to get good reads for every field before we make an ereport to say "hey, this thing was plugged in". We start trying to do it above in the "presence state changed" case, so it seems good to keep trying here regardless of whether we start caring about power-good-ness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I don't feel strongly about it
Co-authored-by: Matt Keeter <[email protected]>
Co-authored-by: Matt Keeter <[email protected]>
This branch adds ereports to the
psc_seq
task for major PSU events. This is the first step in implementing the "Rectifier Removal" and "Upstream Power Loss" fault-management flows that @rmustacc described in RFD 589.In particular, we generate ereports for the following events:
psu.insert
: reported when a rectifier is inserted (a previously-deasserted presence pin is asserted)psu.remove
: reported when a previously present rectifier is removed (a previously-asserted presence pin is deasserted)psu.fault
: reported when a rectifier deassertsPOWER_GOOD
psu.fault_cleared
: reported when a rectifier that has been restarted due to a fault comes back and reassertsPOWER_GOOD
All these ereports contain a
fruid
key, which is an object containing the following keys:mfr
: the manufacturer part number of the rectifier read over PMBus (in practice this is always the string "Murata-PS")mpn
: the manufacturer part number of the rectifier read over PMBus (in practice, always the string "MWOCP68-3600-D-RM")serial
: the manufacturer serial of the rectifier, read over PMBusfw_rev
: the manufacturer firmware revision of the rectifier, read over PMbusIn order to also record FRUID when a rectifier is removed, the PSC sequencer has to hang onto it in memory, rather than reading it from the rectifier on-demand in when the ereport is recorded.
In addition, the
psu.fault
andpsu.fault_cleared
ereports also record the PMBus status registers of the rectifier, which the control plane can use to diagnose the fault.With help from @ericaasen, I've tested this change by removing and replacing rectifiers, and by removing input power from a rectifier (and eventually bringing it back). All these events cause the expected sequence of ereports to be generated.
This branch does not implement a number of other things we'll need to do in order to fully implement the power shelf fault-management flows from RFD 589. In particular, things I didn't do here include:
PMALERT_L
pins from the rectifiers. This change only detects faults by reading thePOWER_GOOD
pins. In practice, this seems sufficient to detect PMBus faults, as the rectifier will deassert itsPOWER_GOOD
when faulting. However, if we want to also receive PMBus warnings, where the rectifier does not deassertPOWER_GOOD
, we'll also need to watchPMALERT
. Presently, there won't be any such warnings, as we haven't set thresholds for warnings. But, we might want that later.control-plane-agent
. In order for the control plane to poll the state of the power shelf (such as in the event of ereport data loss), we'll also need health endpoints for the rectifiers, as described in RFD 589 § 1.6. That'll be done in a subsequent PR. See expose PSC sequencer's understanding of rectifier states to the management network #2216.control-plane-agent
's inventory API. See control-plane-agent inventory should include FRUID VPD #2212.TODO
before, and it still is one now. We'll definitely want to do this in a future change, though.Closes #2208
Closes #2143