
Conversation

faizana-nvidia
Contributor

@faizana-nvidia commented Jul 2, 2025

MCTP Bridge Endpoint poll

Reference : DSP0236 section 8.17.6 Reclaiming EIDs from hot-plug devices

The bus owner/bridge can detect a removed device or devices by validating the EIDs that are presently allocated to endpoints that are directly on the bus and identifying which EIDs are missing. It can do this by attempting to access each endpoint that the bridge has listed in its routing table as being a device that is directly on the particular bus. Attempting to access each endpoint can be accomplished by issuing the Get Endpoint ID command.

This PR aims to extend mctpd, as bus owner, to enhance the bridge support feature by incorporating downstream endpoint polling.
One way to detect the accessibility of its downstream devices is for the bus owner/bridge to probe each of the bridge's downstream endpoints using the Get Endpoint ID command.

Once it is established that the endpoint path is accessible, we stop polling that EID and publish its endpoint object.

This PR is based on the discussions in MCTP Bridge Support: #71
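
For illustration, a minimal sketch (in C) of this poll-until-responsive logic; the helper names here are hypothetical placeholders, not actual mctpd functions:

```c
/*
 * Minimal sketch of the poll-until-responsive pattern described above.
 * The probe/publish helpers are hypothetical placeholders standing in for
 * mctpd internals; they are declared here only to keep the sketch complete.
 */
#include <stdint.h>

typedef uint8_t mctp_eid_t;

/* hypothetical helpers (names are illustrative, not real mctpd symbols) */
extern int  probe_get_endpoint_id(int net, mctp_eid_t eid);    /* send Get Endpoint ID */
extern void publish_endpoint_object(int net, mctp_eid_t eid);  /* create the D-Bus endpoint object */
extern void stop_poll_timer(int net, mctp_eid_t eid);          /* cancel the periodic poll */

/* Called from a periodic timer for each EID in the bridge's allocated pool. */
int poll_bridged_eid(int net, mctp_eid_t eid)
{
	int rc = probe_get_endpoint_id(net, eid);
	if (rc < 0)
		return rc; /* no response yet; the timer fires again next interval */

	/* Endpoint answered Get Endpoint ID: publish it and stop polling this EID. */
	publish_endpoint_object(net, eid);
	stop_poll_timer(net, eid);
	return 0;
}
```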

@jk-ozlabs
Member

Bridges allocate EIDs for downstream endpoints but don't create peer structures until endpoints are actively discovered

Why not just store the allocated pool range on the bridge peer? This would be a considerably simpler change, and means the separate net->resv_eid_set doesn't get out-of-sync with the peer data.

@faizana-nvidia
Contributor Author

Bridges allocate EIDs for downstream endpoints but don't create peer structures until endpoints are actively discovered

Why not just store the allocated pool range on the bridge peer? This would be a considerably simpler change, and means the separate net->resv_eid_set doesn't get out-of-sync with the peer data.

For now we are storing the start of the pool and the pool size (in some sense it represents the allocated pool EID range); would this requirement still be needed?

@jk-ozlabs
Member

For now we are storing the start of the pool and the pool size (in some sense it represents the allocated pool EID range); would this requirement still be needed?

Not sure I understand the question - if the struct peer stores the allocated pool range, then we wouldn't need to store them elsewhere.

In your eccc13f change, the peer->pool_start and peer->pool_size already sort of represent this, but you don't have a clear indication of whether those represent an actual allocation, or a "proposed" pool. If you can clear that up, then you can just use those fields to represent the allocation data directly.

@faizana-nvidia
Contributor Author

For now we are storing the start of the pool and the pool size (in some sense it represents the allocated pool EID range); would this requirement still be needed?

Not sure I understand the question - if the struct peer stores the allocated pool range, then we wouldn't need to store them elsewhere.

In your eccc13f change, the peer->pool_start and peer->pool_size already sort of represent this, but you don't have a clear indication of whether those represent an actual allocation, or a "proposed" pool. If you can clear that up, then you can just use those fields to represent the allocation data directly.

I see, I think I understand what your ask is here: peer->pool_start and peer->pool_size should represent the "allocated" information (whatever the bridge's final say is in its response to the ALLOCATE_EID command). We initially store the "proposed" pool start, and the minimum of the "proposed" pool size and the actual pool_size (from the SET ENDPOINT ID response), in these two vars, and the request for ALLOCATION is sent based on that final proposed data; but once we get a response, whatever has been allocated by the bridge is kept as pool_start and pool_size inside the peer. Later this range is marked as reserved in the new per-network reserve_eid_set.

But I see your point now: our bridge could report any pool_size as the actual size in its response to SET_ENDPOINT_ID, while the proposed size could differ, and so could the finally allocated size (the response to ALLOCATE_ENDPOINT_ID).

Maybe we can introduce a new variable representing actual_pool_size (from the SET_ENDPOINT_ID response) and treat the two existing variables, peer->pool_start and peer->pool_size, as allocated data, while the network reserved_eid_set continues to interact with them as it has been doing.

@faizana-nvidia
Contributor Author

faizana-nvidia commented Jul 4, 2025

I also have added a TODO comment here:

In our current implementation of bus-owner pool and local EID assignment, we don't have a proper check to see whether a local EID which is being set via the mctp tool (mctp addr add <eid> dev <interface>) for any network has already been used by the bus owner when it assigned a dynamic/static EID to an MCTP device/endpoint.

Reference: I have two interfaces, mctpusb0 and mctpspi0_2. I try to add another local EID (20) to mctpspi0_2. Here EID 20 was already used as a bridge downstream EID under mctpusb0, hence the error message below. I still see the mctpd link map retaining EID 20 even though setting the local EID failed (the error is seen in mctpd, not in the mctp tool). Not sure if this was intentional?

mctpd[22871]: Adding local eid 20 net 1
mctpd[22871]: mctpd: Local eid 20 net 1 already exists?  <-----------
mctpd[22871]: mctpd: Error handling netlink update         <-----------
mctpd[22871]: 1 changes:
mctpd[22871]:   0 ADD_EID      ifindex   4 (mctpspi0_2          ) eid  20 old_net    0 old_up 5
mctpd[22871]: linkmap
mctpd[22871]:    1: lo, net 1 up local addrs []
mctpd[22871]:    4: mctpspi0_2, net 1 up local addrs [12, 20]
mctpd[22871]:    7: mctpusb0, net 1 up local addrs [8]

$mctp addr show
eid 12 net 1 dev mctpspi0_2
eid 20 net 1 dev mctpspi0_2
eid 8 net 1 dev mctpusb0

Since we are getting into the reservation of EIDs, a new local EID could very well fall within the reserved EID space. So should we introduce a restriction of this sort in the mctp tool/mctpd?

@faizana-nvidia mentioned this pull request Jul 4, 2025
@jk-ozlabs
Member

Maybe we can introduce a new variable representing actual_pool_size (from the SET_ENDPOINT_ID response) and treat the two existing variables, peer->pool_start and peer->pool_size, as allocated data, while the network reserved_eid_set continues to interact with them as it has been doing.

Yes, that's along the lines of what I was thinking. But maybe with the names adjusted: pool_start & pool_size would represent actually-allocated values, req_pool_size and req_pool_start for the (temporarily-valid) request details.
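
For reference, a rough sketch of what that bookkeeping could look like; the structure and field layout here are illustrative only, not the actual mctpd struct peer:

```c
/*
 * Illustrative only: possible pool bookkeeping on the bridge's peer entry,
 * following the naming discussed above. The real struct peer in mctpd has
 * many more members and may differ.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t mctp_eid_t;

struct bridge_pool_state {
	/* transient: the requested pool, valid between the Set Endpoint ID
	 * response and the Allocate Endpoint IDs exchange */
	mctp_eid_t req_pool_start;
	uint8_t    req_pool_size;

	/* persistent: what the bridge actually accepted, taken from its
	 * Allocate Endpoint IDs response */
	mctp_eid_t pool_start;
	uint8_t    pool_size;
	bool       pool_allocated; /* distinguishes "allocated" from "proposed" */
};
```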

And then as an improvement: since the req_ values are transient, is there any better way to pass them from the Set Endpoint ID response handler to the Allocate Endpoint IDs command? (ie., rather than setting transient values on the struct peer)

I also have added a TODO comment here

There's not much we can do about that case; mctpd isn't (currently) managing that local EID, so there is not much we can do on conflicts - as you have noted.

So should we introduce a restriction of this sort in the mctp tool/mctpd?

No, not while mctpd is not responsible for assigning local EIDs.

I see the best way to prevent this case is a facility to explicitly configure mctpd with the range of dynamically-allocatable EIDs, and it's up to the platform integrator to ensure that that range does not conflict with the potential set of local EIDs.

(that's one of the reasons for the configuration file support, for exactly this use-case)

@faizana-nvidia
Contributor Author

And then as an improvement: since the req_ values are transient, is there any better way to pass them from the Set Endpoint ID response handler to the Allocate Endpoint IDs command? (ie., rather than setting transient values on the struct peer)

  1. To add to this, we don't know how many bridges, on which bindings, will turn up on the network. Predicting their number would be tedious, and so would passing each pool size to its respective Allocate Endpoint IDs call. So in the worst-case scenario, assuming 255 bridges per network (similar to peer numbers), we can have a separate array structure representing bridge data/pool_size which would be filled right after SET_ENDPOINT_ID.

  2. Another way (though not tested) is to somehow find the capacity of the bridge just before issuing ALLOCATE_ENDPOINT_ID. Unfortunately we need to know this because we want to avoid requesting more endpoints than the bridge's dynamic pool size.

Table 23 – Allocate Endpoint IDs message

An error completion code (ERROR_INVALID_DATA should be returned) shall be returned if the number of EIDs being allocated (Number of Endpoint IDs) exceeds the Dynamic Endpoint ID Pool size. (This error condition does not apply to when the number of endpoint IDs passed in the request is 0x00).

  3. We can also treat the pool_size in the SET_ENDPOINT_ID response (transient) as the capacity of the bridge, while the allocated pool size and start represent the allocated values.

I also have added a TODO comment here

There's not much we can do about that case; mctpd isn't (currently) managing that local EID, so there is not much we can do on conflicts - as you have noted.

So should we introduce a restriction of this sort in the mctp tool/mctpd?

No, not while mctpd is not responsible for assigning local EIDs.

I see the best way to prevent this case is a facility to explicitly configure mctpd with the range of dynamically-allocatable EIDs, and it's up to the platform integrator to ensure that that range does not conflict with the potential set of local EIDs.

(that's one of the reasons for the configuration file support, for exactly this use-case)

Acknowledged.

@jk-ozlabs
Member

we can have a separate array structure representing bridge data/pool_size which would be filled right after SET_ENDPOINT_ID.

If there is no way to better handle the transient requested peer size, I would prefer to just store it on the struct peer.

Another way (though not tested) is to somehow find the capacity of the bridge just before issuing ALLOCATE_ENDPOINT_ID

We have the capacity of the bridge, from the Set Endpoint ID response, no? As you say:

We can also treat the pool_size in the SET_ENDPOINT_ID response (transient) as the capacity of the bridge, while the allocated pool size and start represent the allocated values.

I would suggest:

  1. record the requested pool size from the Set Endpoint ID response
  2. attempt to allocate that requested pool size (pending any configured limits), by querying the struct peer array
  3. record the allocated pool range on that peer
  4. perform Allocate Endpoint IDs from that allocated pool

If the allocation is not possible in (2), just fail for now. We can look at fallbacks in a follow-up.

I think this is fine to do without a new dbus call, but I am open to ideas about why we may need one.
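
A rough sketch of those four steps, again with hypothetical placeholder helpers rather than actual mctpd functions:

```c
/*
 * Sketch of the four-step flow suggested above. All helpers are hypothetical
 * placeholders for mctpd internals, declared only to keep the sketch complete.
 */
#include <stdint.h>

typedef uint8_t mctp_eid_t;

struct net;   /* opaque stand-ins for mctpd's own types */
struct peer;

extern uint8_t requested_pool_size(struct peer *bridge);             /* 1. from the Set Endpoint ID response */
extern int     find_free_eid_range(struct net *n, uint8_t size,
				   mctp_eid_t *start);                /* 2. scan the peer array, honouring limits */
extern void    record_allocated_pool(struct peer *bridge,
				     mctp_eid_t start, uint8_t size); /* 3. store the range on the bridge peer */
extern int     allocate_endpoint_ids(struct peer *bridge,
				     mctp_eid_t start, uint8_t size); /* 4. Allocate Endpoint IDs command */

int setup_bridge_pool(struct net *n, struct peer *bridge)
{
	uint8_t size = requested_pool_size(bridge);
	mctp_eid_t start;
	int rc;

	rc = find_free_eid_range(n, size, &start);
	if (rc < 0)
		return rc; /* no contiguous range available: fail for now */

	record_allocated_pool(bridge, start, size);
	return allocate_endpoint_ids(bridge, start, size);
}
```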

@faizana-nvidia
Contributor Author

we can have a separate array structure representing bridge data/pool_size which would be filled right after SET_ENDPOINT_ID.

If there is no way to better handle the transient requested peer size, I would prefer to just store it on the struct peer.

Another way (though not tested) is to somehow find the capacity of the bridge just before issuing ALLOCATE_ENDPOINT_ID

We have the capacity of the bridge, from the Set Endpoint ID response, no? As you say:

We can also treat the pool_size in the SET_ENDPOINT_ID response (transient) as the capacity of the bridge, while the allocated pool size and start represent the allocated values.

I would suggest:

  1. record the requested pool size from the Set Endpoint ID response
  2. attempt to allocate that requested pool size (pending any configured limits), by querying the struct peer array
  3. record the allocated pool range on that peer
  4. perform Allocate Endpoint IDs from that allocated pool

If the allocation is not possible in (2), just fail for now. We can look at fallbacks in a follow-up.

I think this is fine to do without a new dbus call, but I am open to ideas about why we may need one.

Please let me know if I have understood the steps correctly:

  1. Fetch the transient pool size (via SET_ENDPOINT_ID).
  2. Try to allocate (find a pool-sized chunk of the peer structure that is available to be used later by downstream EIDs, via get_pool_start()).
  3. If found, update the peer structure with req_pool_size = capacity and req_pool_start = get_pool_start().
  4. Also mark all such EIDs as reserved.
  5. From the dbus API AssignBridgeStatic, get pool_size and try to allocate that size at req_pool_start; then, with whatever comes back in the ALLOCATE_EID response, update the peer structure with alloc_pool_size = response(pool_size) and alloc_pool_start = response(pool_start).

This introduces req_pool_size and req_pool_start, and renames pool_size -> alloc_pool_size and pool_start -> alloc_pool_start.

If we are going with this, do we still need the pool_start argument from the dbus method AssignBridgeStatic if we have already decided what the pool start will be via req_pool_start?

Or do you mean not to use any new dbus method AssignBridgeStatic at all?

I think this is fine to do without a new dbus call, but I am open to ideas about why we may need one.

@jk-ozlabs
Member

Since the allocation mechanism is now in #71, I'll shift this discussion over there.

@faizana-nvidia force-pushed the mctp-bridge-poll branch 2 times, most recently from 2746a86 to a472892 on July 22, 2025 19:59
@faizana-nvidia changed the title from "MCTP Bridge Polling and Downstream EID reservation" to "MCTP Bridge Polling" on Jul 23, 2025
@faizana-nvidia
Contributor Author

Hello Jeremy,
I've rebased the PR and updated the logic based on the comments on the other PR (#71). Please help me with the review.

@jk-ozlabs
Member

I'm going to hold off on reviewing this one until we have the core bridge support done, given that's a hard dependency for this.

@faizana-nvidia
Contributor Author

I'm going to hold off on reviewing this one until we have the core bridge support done, given that's a hard dependency for this.

understood, sounds good

@faizana-nvidia
Contributor Author

faizana-nvidia commented Aug 15, 2025

Hello Jeremy,
I have some questions on the discussed approach to the polling (discussion) mechanism for MCTP bridge downstream endpoint management.
To give you a better picture of the MCTP network: we have an MCU acting as an MCTP bridge (over USB), managing GPUs downstream.

graph LR
    A["HMC/BMC"] --> B["MCTP Bridge"]
    B --> C["GPU1"]
    B --> D["GPU2"]                                          

I'm aware that in the (discussed) polling approach, mctpd doesn't care about the state of devices behind the bridge once they have been discovered.
In our platform, many of our applications, such as Nvidia NSM (used for GPU telemetry), rely directly on an event-subscription model or something similar, which creates a requirement to explicitly re-subscribe to endpoints when they undergo a reset. Owing to this, among other similar reasons, this PR (MCTP Bridge Polling) proposes keeping the polling active throughout the existence of the MCTP bridge itself, and informing other applications that monitor the d-bus endpoint objects when an MCTP device has fallen off the network. This would greatly help those applications to initiate their discovery sequence as early as possible, without losing valuable telemetry data and critical events, since they would be made aware of such a reset via mctpd.

If this seems to you like a legitimate reason to consider keeping continuous polling, I have a further question on the implementation.

The particular use case which might not be solved even with polling is a fast reset of the downstream endpoints themselves (such as the GPUs), which happens within a window of only 100-200 ms. The polling window is not narrow enough to capture such a fast reset, and from the MCTP perspective the endpoint never left the MCTP network. In that case NSM would not even be notified about the GPU's fast reset.

I need some help figuring out how we expect the endpoint to notify, or the bus owner to become aware of, such a fast-reset case.

I'm open to any further guidance/opinions on the polling mechanism itself, or on whether MCTP should even care about this, etc. Or maybe help me better understand the philosophy of mctpd not caring about the state of a device downstream of a bridge, or of a direct peer itself.

@jk-ozlabs
Member

I'm open to any further guidance/opinions on the polling mechanism itself, or on whether MCTP should even care about this, etc. Or maybe help me better understand the philosophy of mctpd not caring about the state of a device downstream of a bridge, or of a direct peer itself.

The main design point here is that mctpd manages the MCTP transport itself, so is only concerned with the state of connectivity to managed MCTP devices. It does not manage the general state of those devices, like your scenario above.

For example, in the fast-reset case: nothing has changed in mctpd's view of the network. The device has been reset, but it sounds like it has restored some MCTP state, in that it is still using its assigned EID - so no connectivity has been changed.

(or does a reset clear the EID? in which case the reset would be visible to mctpd?)

From what you have mentioned though, the issue there seems to be that the device is retaining some state (the EID) but not other state (the event subscriptions).

We do not perform ongoing polling of the MCTP devices, as we cede that responsibility to upper-level applications once connectivity has been established. If those applications detect a loss of connectivity, they call Recover to restore mctpd's responsibility to check (and potentially re-establish) connectivity.

This means that we do not detect endpoints that may disappear but have no upper-level applications interacting with them. This is a deliberate decision - if nothing is communicating with them, do we care that we have connectivity?

For the reset-detection in general though, some upper-layer protocols have mechanisms to detect that, which would typically re-trigger some form of setup.
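
As an aside, the Recover call mentioned above is a D-Bus method on the endpoint object; a minimal sd-bus sketch of invoking it follows. The bus name, object path, and interface are assumptions based on recent mctpd documentation and should be checked against the mctpd version in use.

```c
/*
 * Sketch: ask mctpd to re-check connectivity for an endpoint via its Recover
 * D-Bus method. The bus name, object path and interface below are assumptions
 * taken from recent mctpd documentation; verify them against the mctpd
 * version in use.
 */
#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>

int main(void)
{
	sd_bus *bus = NULL;
	sd_bus_error err = SD_BUS_ERROR_NULL;
	int rc;

	rc = sd_bus_default_system(&bus);
	if (rc < 0) {
		fprintf(stderr, "cannot connect to system bus: %s\n", strerror(-rc));
		return 1;
	}

	/* example object path: network 1, EID 20 */
	rc = sd_bus_call_method(bus,
			"au.com.codeconstruct.MCTP1",
			"/au/com/codeconstruct/mctp1/networks/1/endpoints/20",
			"au.com.codeconstruct.MCTP.Endpoint1",
			"Recover", &err, NULL, "");
	if (rc < 0)
		fprintf(stderr, "Recover failed: %s\n",
			err.message ? err.message : strerror(-rc));

	sd_bus_error_free(&err);
	sd_bus_unref(bus);
	return rc < 0;
}
```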

@faizana-nvidia
Contributor Author

faizana-nvidia commented Aug 19, 2025

Thank you for the clarification; your words do make sense to me.

(or does a reset clear the EID? in which case the reset would be visible to mctpd?)

For our platform devices, yes it does, but it's given back the same EID by the bridge, meaning that once the device is out of reset it responds to the next MCTP command (GET_ENDPOINT_ID).

The reset would be visible to mctpd if any application triggers a Recover call, right?

For the reset-detection in general though, some upper-layer protocols have mechanisms to detect that, which would typically re-trigger some form of setup.

Well, this is the problem I'm currently facing: Nvidia NSM doesn't have a direct mechanism for detecting a reset, especially one that happens that quickly (~100 ms). Even if it did something like polling to keep track of the device state, that would not be frequent enough to catch such a quick reset.

Had this been a direct endpoint (non-bridge) to the BMC, would the MCTP Discovery Notify command be the right way for the endpoint to indicate its reset, given that in that case the EID would be cleared for the endpoint and it would be a freshly added endpoint on the MCTP network? If so, is there any plan to include this support in mctpd? From your previous comments and other issue discussions, I suppose your take is different on Discovery Notify and when it is to be used.

@jk-ozlabs
Member

For our platform devices, yes it does, but it's given back the same EID by the bridge, meaning that once the device is out of reset it responds to the next MCTP command (GET_ENDPOINT_ID).

Ah, so there's no direct connection, I see.

Had this been a direct endpoint (non-bridge) to the BMC, would the MCTP Discovery Notify command be the right way for the endpoint to indicate its reset, given that in that case the EID would be cleared for the endpoint and it would be a freshly added endpoint on the MCTP network?

No, the discovery notify command is only between the endpoint and its direct bus owner. From DSP 0236 § 12.15 "Discovery Notify":

This message should only be sent from endpoints to the bus owner for the bus that the endpoint is on

To me, the spec reads as if the intention for Discovery Notify is to provide a bus-hotplug mechanism when the physical bus does not already provide it.

Even without the direct connection, it sounds like you're attempting to use the MCTP control protocol to communicate upper-layer protocol state. I would think that that device state (ie, the event subscriptions and/or reset condition) really belong as part of that upper-layer protocol.

For example, NVMe-MI has facilities to indicate that a subsystem has been reset (see the "NVM Subsystem Reset Occurred" event), perhaps you could do something similar here?

[I am not familiar with NSM though]

@faizana-nvidia
Contributor Author

faizana-nvidia commented Aug 25, 2025

For our platform devices, yes it does, but it's given back the same EID by the bridge, meaning that once the device is out of reset it responds to the next MCTP command (GET_ENDPOINT_ID).

Ah, so there's no direct connection, I see.

Had this been a direct endpoint (non-bridge) to the BMC, would the MCTP Discovery Notify command be the right way for the endpoint to indicate its reset, given that in that case the EID would be cleared for the endpoint and it would be a freshly added endpoint on the MCTP network?

No, the discovery notify command is only between the endpoint and its direct bus owner. From DSP 0236 § 12.15 "Discovery Notify":

This message should only be sent from endpoints to the bus owner for the bus that the endpoint is on

To me, the spec reads as if the intention for Discovery Notify is to provide a bus-hotplug mechanism when the physical bus does not already provide it.

Even without the direct connection, it sounds like you're attempting to use the MCTP control protocol to communicate upper-layer protocol state. I would think that that device state (ie, the event subscriptions and/or reset condition) really belong as part of that upper-layer protocol.

For example, NVMe-MI has facilities to indicate that a subsystem has been reset (see the "NVM Subsystem Reset Occurred" event), perhaps you could do something similar here?

[I am not familiar with NSM though]

Thank you for the clarification. (I agree with you; maybe there is some other way to handle this use case. I'll go over the NVMe-MI implementation and the NVM Subsystem Reset Occurred event.)

I'll rebase and update the PR with some more modifications. Please hold off on the review till then.

@faizana-nvidia force-pushed the mctp-bridge-poll branch 2 times, most recently from fdaa069 to 79e5e46 on August 28, 2025 19:32
@faizana-nvidia
Contributor Author

faizana-nvidia commented Aug 28, 2025

Hello Jeremy,

Updated the PR per the last discussion on stopping polling. It's up for review now.

Sorry to keep coming back to this, but is there any plan to support Discovery Notify for a direct peer of the BMC (as bus owner), specifically for a binding where we don't have explicit methods like I2C bus detection to identify endpoint presence?

@faizana-nvidia force-pushed the mctp-bridge-poll branch 2 times, most recently from cf98296 to c892110 on August 28, 2025 20:01
@jk-ozlabs
Member

Thanks for the updates.

You now seem to have a bunch of fixes/changes to the bridging implementation, which don't belong in the polling feature addition.

Can you separate those into a different PR please?

@faizana-nvidia
Contributor Author

Thanks for the updates.

You now seem to have a bunch of fixes/changes to the bridging implementation, which don't belong in the polling feature addition.

Can you separate those into a different PR please?

Created a new PR for tracking any other fixes needed for bridge support:
#110

But I'm keeping these commits in this PR too, to address the test cases; I will rebase once the other PR is approved.

Referring to DSP0236 sec 8.17.6 Reclaiming EIDs from hot-plug devices

Requirement: The bus owner/bridge needs to periodically poll with the
GET_ENDPOINT_ID control command to check whether a downstream endpoint
is accessible. Once it's established that the endpoint is accessible,
polling needs to stop.

Introduce a new endpoint polling configuration to be used as the periodic
interval for sending the GET_ENDPOINT_ID poll command by the bus owner/bridge.

Signed-off-by: Faizan Ali <[email protected]>
Since downstream endpoints are behind the bridge, the available
query_get_endpoint_id() won't be able to direct the GET_ENDPOINT_ID
command packet behind the bridge: the physical address of a downstream
endpoint would be the same as the bridge's own.

But we can query the endpoint directly, since routes have already
been laid out when the bridge was allocated the EID pool space.

Signed-off-by: Faizan Ali <[email protected]>
Implement a periodic endpoint polling mechanism to validate bridged
endpoint accessibility. Begin polling as soon as gateway routes are
created. Stop polling once it's established that the endpoint path is
accessible.

Publish the peer path once the downstream endpoint responds to the
poll command sent.

Signed-off-by: Faizan Ali <[email protected]>
Remove all downstream endpoints as well if the bridge is being removed.
Also stop periodic endpoint polling.

Signed-off-by: Faizan Ali <[email protected]>
Add a new bus-owner configuration for periodic polling of
bridged endpoints.

Signed-off-by: Faizan Ali <[email protected]>
Currently find_endpoint incorrectly tries to find the physical address
for the given EID. This fails when the EID is a gatewayed endpoint,
because such endpoints don't have neighbors or a physical address.

For gateway routes, return the gateway's physical address instead.
The gateway endpoint will then forward the message to the correct
bridged endpoint internally by looking into its bridged-endpoint list.

Signed-off-by: Faizan Ali <[email protected]>
Test that bridge polling discovers downstream endpoints and
creates endpoint d-bus objects.

Test that polling stops once a downstream endpoint is discovered,
and continues until the endpoint responds to the GET_ENDPOINT_ID
poll command.

Test that once a bridge endpoint is removed, the downstream endpoints
associated with the bridge are also removed.

Signed-off-by: Faizan Ali <[email protected]>
@faizana-nvidia
Contributor Author

Rebased the PR.
