Skip to content

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Sep 3, 2025

This branch adds ereports to the psc_seq task for major PSU events. This is the first step in implementing the "Rectifier Removal" and "Upstream Power Loss" fault-management flows that @rmustacc described in RFD 589.

In particular, we generate ereports for the following events:

  • psu.insert: reported when a rectifier is inserted (a previously-deasserted presence pin is asserted)
  • psu.remove: reported when a previously present rectifier is removed (a previously-asserted presence pin is deasserted)
  • psu.fault: reported when a rectifier deasserts POWER_GOOD
  • psu.fault_cleared : reported when a rectifier that has been restarted due to a fault comes back and reasserts POWER_GOOD

All these ereports contain a fruid key, which is an object containing the following keys:

  • mfr: the manufacturer part number of the rectifier read over PMBus (in practice this is always the string "Murata-PS")
  • mpn: the manufacturer part number of the rectifier read over PMBus (in practice, always the string "MWOCP68-3600-D-RM")
  • serial: the manufacturer serial of the rectifier, read over PMBus
  • fw_rev: the manufacturer firmware revision of the rectifier, read over PMbus

In order to also record FRUID when a rectifier is removed, the PSC sequencer has to hang onto it in memory, rather than reading it from the rectifier on-demand in when the ereport is recorded.

In addition, the psu.fault and psu.fault_cleared ereports also record the PMBus status registers of the rectifier, which the control plane can use to diagnose the fault.

With help from @ericaasen, I've tested this change by removing and replacing rectifiers, and by removing input power from a rectifier (and eventually bringing it back). All these events cause the expected sequence of ereports to be generated.

This branch does not implement a number of other things we'll need to do in order to fully implement the power shelf fault-management flows from RFD 589. In particular, things I didn't do here include:

  • Monitoring of PMBus PMALERT_L pins from the rectifiers. This change only detects faults by reading the POWER_GOOD pins. In practice, this seems sufficient to detect PMBus faults, as the rectifier will deassert its POWER_GOOD when faulting. However, if we want to also receive PMBus warnings, where the rectifier does not deassert POWER_GOOD, we'll also need to watch PMALERT. Presently, there won't be any such warnings, as we haven't set thresholds for warnings. But, we might want that later.
  • A mechanism for the control plane to read whether a rectifier is enabled/present. Right now, we only generate ereports, and don't expose our state over control-plane-agent. In order for the control plane to poll the state of the power shelf (such as in the event of ereport data loss), we'll also need health endpoints for the rectifiers, as described in RFD 589 § 1.6. That'll be done in a subsequent PR. See expose PSC sequencer's understanding of rectifier states to the management network #2216.
  • A way to read rectifier FRUID over the management network. Again, we only include FRUID VPD for rectifiers in ereports, and don't expose it in control-plane-agent's inventory API. See control-plane-agent inventory should include FRUID VPD #2212.
  • More sophisticated fault response based on the PSC sequencer's understanding of PMBus status bits. Lol. Nope. The behavior of the sequencer with regards to enabling/disabling rectifiers and taking other actions in response to faults has not changed here. All I've added is ereports.
  • Reading rectifier black-boxes and exfiltrating them over the network. That was a TODO before, and it still is one now. We'll definitely want to do this in a future change, though.

Closes #2208
Closes #2143

@hawkw
Copy link
Member Author

hawkw commented Sep 3, 2025

HEY CHECK THIS OUT:

eliza@niles ~ $ pilot sp -i axf8 exec -e 'ereports' BRM11230004
Sep 03 21:55:07.538 INFO creating SP handle on interface axf8, component: faux-mgs
Sep 03 21:55:07.540 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe06:119%28]:11111, interface: axf8, socket: control-plane-agent, component: faux-mgs
restart ID: 9a18e261-6284-16b9-26e9-1c1fc987d1e3
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 3

ereports:
0x1: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("h-RcgwEqyx4"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}

0x2: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "dev_id": String("PSU1"),
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("h-RcgwEqyx4"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("sequencer"),
    "hubris_uptime_ms": Number(27287),
    "k": String("psu.remove"),
    "psu": Object {
        "serial": String("LL2216RB0016"),
        "slot": Number(1),
    },
    "v": Number(0),
}

0x3: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "dev_id": String("PSU1"),
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("h-RcgwEqyx4"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("sequencer"),
    "hubris_uptime_ms": Number(39542),
    "k": String("psu.insert"),
    "psu": Object {
        "serial": String("LL2216RB0016"),
        "slot": Number(1),
    },
    "v": Number(0),
}

@hawkw
Copy link
Member Author

hawkw commented Sep 4, 2025

and here's what I see now when @ericaasen moves the rectifier in slot 1 to slot 4, and then causes an undervolt fault by removing input power on slot 2:

eliza@niles ~ $ pilot sp -i axf8 exec -e 'ereports' BRM11230004
Sep 04 18:55:03.004 INFO creating SP handle on interface axf8, component: faux-mgs
Sep 04 18:55:03.006 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe06:119%28]:11111, interface: axf8, socket: control-plane-agent, component: faux-mgs
restart ID: f919674f-7485-8845-bb9e-9d1246653f03
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 6

ereports:
0x1: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "ereport_message_version": Number(0),
    "hubris_archive_id": String("7Hf_wLsmHG8"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}

0x2: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "dev_id": String("PSU1"),
    "ereport_message_version": Number(0),
    "fruid": Object {
        "fw_rev": String("0762"),
        "mfr": String("Murata-PS"),
        "mpn": String("MWOCP68-3600-D-RM"),
        "serial": String("LL2216RB0016"),
    },
    "hubris_archive_id": String("7Hf_wLsmHG8"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("sequencer"),
    "hubris_uptime_ms": Number(2311241),
    "k": String("psu.remove"),
    "psu_slot": Number(1),
    "v": Number(0),
}

0x3: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "dev_id": String("PSU4"),
    "ereport_message_version": Number(0),
    "fruid": Object {
        "fw_rev": String("0762"),
        "mfr": String("Murata-PS"),
        "mpn": String("MWOCP68-3600-D-RM"),
        "serial": String("LL2216RB0016"),
    },
    "hubris_archive_id": String("7Hf_wLsmHG8"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("sequencer"),
    "hubris_uptime_ms": Number(2325199),
    "k": String("psu.insert"),
    "psu_slot": Number(4),
    "v": Number(0),
}

0x4: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "dev_id": String("PSU2"),
    "ereport_message_version": Number(0),
    "fruid": Object {
        "fw_rev": String("0762"),
        "mfr": String("Murata-PS"),
        "mpn": String("MWOCP68-3600-D-RM"),
        "serial": String("LL2420RF000D"),
    },
    "hubris_archive_id": String("7Hf_wLsmHG8"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("sequencer"),
    "hubris_uptime_ms": Number(2346517),
    "k": String("psu.fault"),
    "pmbus_status": Object {
        "cml": Number(0),
        "input": Number(0),
        "iout": Number(0),
        "mfr": Number(0),
        "temp": Number(0),
        "vout": Number(0),
        "word": Number(8200),
    },
    "psu_slot": Number(2),
    "v": Number(0),
}

0x5: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "dev_id": String("PSU2"),
    "ereport_message_version": Number(0),
    "fruid": Object {
        "fw_rev": String("0762"),
        "mfr": String("Murata-PS"),
        "mpn": String("MWOCP68-3600-D-RM"),
        "serial": String("LL2420RF000D"),
    },
    "hubris_archive_id": String("7Hf_wLsmHG8"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("sequencer"),
    "hubris_uptime_ms": Number(2352516),
    "k": String("psu.fault_cleared"),
    "pmbus_status": Object {
        "cml": Number(0),
        "input": Number(0),
        "iout": Number(0),
        "mfr": Number(0),
        "temp": Number(0),
        "vout": Number(0),
        "word": Number(10312),
    },
    "psu_slot": Number(2),
    "v": Number(0),
}

0x6: {
    "baseboard_part_number": String("913-0000003"),
    "baseboard_rev": Number(9),
    "baseboard_serial_number": String("BRM11230004"),
    "dev_id": String("PSU2"),
    "ereport_message_version": Number(0),
    "fruid": Object {
        "fw_rev": String("0762"),
        "mfr": String("Murata-PS"),
        "mpn": String("MWOCP68-3600-D-RM"),
        "serial": String("LL2420RF000D"),
    },
    "hubris_archive_id": String("7Hf_wLsmHG8"),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("sequencer"),
    "hubris_uptime_ms": Number(2352711),
    "k": String("psu.fault"),
    "pmbus_status": Object {
        "cml": Number(0),
        "input": Number(0),
        "iout": Number(0),
        "mfr": Number(0),
        "temp": Number(0),
        "vout": Number(0),
        "word": Number(10312),
    },
    "psu_slot": Number(2),
    "v": Number(0),
}

eliza@niles ~ $

Note that we have all the FRUID fields from PMbus now, as per @rmustacc's request. :)

I think the fault-related ereports are not quite right: we see a "fault_cleared" event when the PSC tries to clear the fault by re-enabling the rectifier, but the fault does not clear, and then we see a subsequent ereport. That's not great --- fault_cleared should mean "we turned it off and back on again and when we did that itcame back okay", and not just "we turned it off and back on again and then immediately saw it fault again". We should track that better --- will fix.

@hawkw hawkw changed the title [WIP] psc-seq: rectifier ereports psc-seq: rectifier ereports Sep 4, 2025
@hawkw hawkw changed the title psc-seq: rectifier ereports psc_seq: rectifier ereports Sep 4, 2025
@hawkw hawkw requested review from mkeeter and cbiffle September 4, 2025 20:24
@hawkw hawkw marked this pull request as ready for review September 4, 2025 20:24
@hawkw hawkw added service processor Related to the service processor. psc Related to the power shelf controller ⚠️ ereport if you see something, say something! labels Sep 4, 2025
@hawkw hawkw self-assigned this Sep 4, 2025
@hawkw hawkw force-pushed the eliza/psc-seq-ereport branch from e2bf2d1 to 005943e Compare September 5, 2025 19:31
@hawkw
Copy link
Member Author

hawkw commented Sep 8, 2025

The sign sp-1 Buildomat job failure for 005943e doesn't look like my fault; it seems something had already bound a port that something else was trying to use: https://buildomat.eng.oxide.computer/wg/0/details/01K4DNRWKQWD80Z883FQY9QHP7/DQ0VjZXxYtUR7MD0jlEMRHZY2iQLZc6uggyzijlnsFr5lbIn/01K4DNSJE0PKZWV6VQ5WMS95EA#S1739

I'm hoping that's transient, I'm gonna restart it and see what happens.

@labbott
Copy link
Collaborator

labbott commented Sep 8, 2025

The sign sp-1 Buildomat job failure for 005943e doesn't look like my fault; it seems something had already bound a port that something else was trying to use: https://buildomat.eng.oxide.computer/wg/0/details/01K4DNRWKQWD80Z883FQY9QHP7/DQ0VjZXxYtUR7MD0jlEMRHZY2iQLZc6uggyzijlnsFr5lbIn/01K4DNSJE0PKZWV6VQ5WMS95EA#S1739

I'm hoping that's transient, I'm gonna restart it and see what happens.

HRRRM I expect that should be transient? I'd have to go and re-look at what this test is actually doing.

@hawkw
Copy link
Member Author

hawkw commented Sep 8, 2025

The sign sp-1 Buildomat job failure for 005943e doesn't look like my fault; it seems something had already bound a port that something else was trying to use: https://buildomat.eng.oxide.computer/wg/0/details/01K4DNRWKQWD80Z883FQY9QHP7/DQ0VjZXxYtUR7MD0jlEMRHZY2iQLZc6uggyzijlnsFr5lbIn/01K4DNSJE0PKZWV6VQ5WMS95EA#S1739
I'm hoping that's transient, I'm gonna restart it and see what happens.

HRRRM I expect that should be transient? I'd have to go and re-look at what this test is actually doing.

It appears to have transited: https://github.com/oxidecomputer/hubris/pull/2214/checks?check_run_id=49871665979 :)

@hawkw
Copy link
Member Author

hawkw commented Sep 11, 2025

@cbiffle I think I've addressed all your suggestions, thanks!

@hawkw hawkw requested a review from cbiffle September 11, 2025 17:53
hawkw added a commit that referenced this pull request Sep 15, 2025
In [this comment][1], @cbiffle reminded me that we can use
`ClaimOnceCell` to use a `static` to create byte buffers for encoding
ereports. This way, the buffer is never on the stack, reducing the risk
of stack overflows when recording an ereport.

This commit changes `drv_gimlet_seq_server` and `drv_cosmo_seq_server`
to use that technique for their ereports.

[1]: #2214 (comment)
hawkw added a commit that referenced this pull request Sep 15, 2025
In [this comment][1], @cbiffle reminded me that we can use
`ClaimOnceCell` to use a `static` to create byte buffers for encoding
ereports. This way, the buffer is never on the stack, reducing the risk
of stack overflows when recording an ereport.

This commit changes `drv_gimlet_seq_server` and `drv_cosmo_seq_server`
to use that technique for their ereports.

[1]: #2214 (comment)
hawkw added a commit that referenced this pull request Sep 15, 2025
In [this comment][1], @cbiffle reminded me that we can use
`ClaimOnceCell` to use a `static` to create byte buffers for encoding
ereports. This way, the buffer is never on the stack, reducing the risk
of stack overflows when recording an ereport.

This commit changes `drv_gimlet_seq_server` and `drv_cosmo_seq_server`
to use that technique for their ereports.

[1]: #2214 (comment)
hawkw added a commit that referenced this pull request Sep 15, 2025
In [this comment][1], @cbiffle reminded me that we can use
`ClaimOnceCell` to use a `static` to create byte buffers for encoding
ereports. This way, the buffer is never on the stack, reducing the risk
of stack overflows when recording an ereport.

This commit changes `drv_gimlet_seq_server` and `drv_cosmo_seq_server`
to use that technique for their ereports.

[1]: #2214 (comment)
@hawkw hawkw force-pushed the eliza/psc-seq-ereport branch from 7331a67 to 6baa38b Compare September 15, 2025 19:47
ringbuf!((u64, Trace), 128, (0, Trace::Empty));
// Since entries in this ringbuffer contain timestamps, they will never be
// de-duplicated. Thus, disable it.
counted_ringbuf!(Event, 128, Event::None, no_dedup);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should this be __EVENT for consistency with the default __RINGBUF name (and __TRACE below)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had wanted one of them to be the default for brevity's sake, but if you really don't like it, I can change it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh whoops, I mis-parsed the macro as creating a ringbuf named Event. Seems fine as-is!

@@ -551,32 +829,124 @@ impl Psu {
_,
_,
) => {
// Hello, who are you?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only refresh the fruid after settle_deadline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♀️ On one hand, the I2C reads are likelier to fail if the contacts are still scraping, on the other, it gives us a longer opportunity to get good reads for every field before we make an ereport to say "hey, this thing was plugged in". We start trying to do it above in the "presence state changed" case, so it seems good to keep trying here regardless of whether we start caring about power-good-ness?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I don't feel strongly about it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⚠️ ereport if you see something, say something! psc Related to the power shelf controller service processor Related to the service processor.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ereport: PSC input undervolt/undercurrent ereport: power shelf rectifier fault
4 participants