Skip to content

Conversation

faizana-nvidia
Copy link
Contributor

This PR addresses minor fixes for bridge implementation.

@faizana-nvidia
Copy link
Contributor Author

faizana-nvidia commented Aug 29, 2025

I'm a bit confused here, while I tested the intended gateway route creation in test framework, that seem to be working as expected if ifindex = 0 as arg in mctp_nl_route_add().

But same fails in my machine.

Aug 29 07:56:49 hgxb300 mctpd[3669]: Error from kernel: Invalid argument (-22)
Aug 29 07:56:49 hgxb300 mctpd[3669]: incorrect format
Aug 29 07:56:49 hgxb300 mctpd[3669]: mctpd: Failed to add gateway route for EID 10: Invalid argument

though per my understanding of implementation func: fill_rtalter_args() in mctpd we do need to pass ifindex zero

	if (ifindex) {
		msg->nh.nlmsg_len += mctp_put_rtnlmsg_attr(
			&rta, &rta_len, RTA_OIF, &ifindex, sizeof(ifindex));
	} else {
		msg->nh.nlmsg_len += mctp_put_rtnlmsg_attr(
			&rta, &rta_len, RTA_GATEWAY, gw, sizeof(*gw));
	}

on further inspecting looks like RTA_GATEWAY rta policy is not supported yet in the latest kernel ?? I remember testing this with your branch you gave last time to test from https://github.com/CodeConstruct/linux/tree/dev/forwarding

@jk-ozlabs
Copy link
Member

The forwarding patches are upstream, but not in a released version as yet. 6.17 should have the merged gateway support.

@faizana-nvidia
Copy link
Contributor Author

faizana-nvidia commented Sep 1, 2025

The forwarding patches are upstream, but not in a released version as yet. 6.17 should have the merged gateway support.

Thanks for the infor, will this be a blocker in merge till release happens?

@jk-ozlabs
Copy link
Member

No, that shouldn't block us here, as long as you're able to test with a pre-release kernel.

@faizana-nvidia
Copy link
Contributor Author

No, that shouldn't block us here, as long as you're able to test with a pre-release kernel.

Ack, will update soon on testing.

@faizana-nvidia
Copy link
Contributor Author

I was able to test this.

mctp route show
eid min 11 max 25 net 1 gw 10 mtu 68
eid min 10 max 10 net 1 dev mctpusb0 mtu 0

@jk-ozlabs
Copy link
Member

On cc7ea36, you mention:

If endpoint responds with some invalid endpoint id range, assumed
bridge pool space needs to reset too, this is needed as it misleads
remove_peer() to delete gateway routes too which only be created
after bridge pool allocation

I think we also have an issue that the pool route is not created until after the allocate endpoint IDs call. This gives us the following race:

BO bridge EP
Sends Allocate Endpoint IDs
Allocates downstream endpoint ids
Sends Set Endpoint ID to EP
now has an EID, sends a message to BO
Receives message, cannot reply
Adds pool route

which is essentially the same thing fixed in 9711f0b, but with bridges.

@faizana-nvidia
Copy link
Contributor Author

faizana-nvidia commented Sep 3, 2025

On cc7ea36, you mention:

If endpoint responds with some invalid endpoint id range, assumed
bridge pool space needs to reset too, this is needed as it misleads
remove_peer() to delete gateway routes too which only be created
after bridge pool allocation

I think we also have an issue that the pool route is not created until after the allocate endpoint IDs call. This gives us the following race:

BO bridge EP
Sends Allocate Endpoint IDs
Allocates downstream endpoint ids
Sends Set Endpoint ID to EP
now has an EID, sends a message to BO
Receives message, cannot reply
Adds pool route
which is essentially the same thing fixed in 9711f0b, but with bridges.

If i understand the original problem statement correctly which got fixed in 9711f0b,, the issue is right after the moment EID is assigned to an endpoint, we were spending few moments during publishing the peer and then setting up the routes, right at this window, endpoint could very well begin communicating to BusOwner which BusOwner might not be able to respond to since routes doesn't exist yet.

Going by same implication, we should setup routes for downstream eid just before AllocateEndpoint ID and may have delete these routes if Allocation fails.

@faizana-nvidia faizana-nvidia force-pushed the fix_bridge_implementation branch from 1303e95 to 9baff49 Compare September 3, 2025 14:58
@jk-ozlabs
Copy link
Member

Going by same implication, we should setup routes for downstream eid just before AllocateEndpoint ID and may have delete these routes if Allocation fails.

yes, that's correct.

However, no specific need to have that fixed in this PR; more a note that we'll need to address it at some point.

@jk-ozlabs
Copy link
Member

However, no specific need to have that fixed in this PR; more a note that we'll need to address it at some point.

Ah, you have already added it here. All good!

Copy link
Member

@jk-ozlabs jk-ozlabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor things, plus a typo (birdge) in the commit message for patch 3.

Otherwise, all looks good.

For bridged endpoints, route setup is already done once pool space
is allocated successfully. These are destined to be routed from
the bridge itself.

Add appropiate check to avoid extra route creation while setting up
bridged endpoints.

Signed-off-by: Faizan Ali <[email protected]>
If endpoint responds with some error in
endpoint_send_set_endpoint_id() , assumed bridge pool space
needs to reset too, this is needed as it misleads remove_peer()
to delete gateway routes too which will only be created
after bridge pool allocation

Signed-off-by: Faizan Ali <[email protected]>
if ifindex is zero and mctp_fq_addr is valid then only
gateway routes are created.

Immediately after bridge allocates the EID to downstream
endpoint, they could initiate communication to bridge ->
busowner which requires routes to those endpoints for busowner
to respond back.

Fix: update ifindex as zero for gateway route creation and
deletion. Also setup routes for downstream endpoints right
allocation of eid happens by the bridge.

Signed-off-by: Faizan Ali <[email protected]>
@faizana-nvidia faizana-nvidia force-pushed the fix_bridge_implementation branch from 9baff49 to 3e0e9c2 Compare September 4, 2025 05:12
@faizana-nvidia
Copy link
Contributor Author

Addressed the comments, once this merges, will rebase other PR 85 to ease the review

@jk-ozlabs jk-ozlabs merged commit 3e0e9c2 into CodeConstruct:main Sep 8, 2025
3 checks passed
@jk-ozlabs
Copy link
Member

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants