@MasterSkepticista MasterSkepticista commented Apr 30, 2025

Motivation

Collaborators today fetch each model tensor from the aggregator via a dedicated RPC call. The aggregator has limited resources to serve requests, and it is not uncommon to have hundreds (if not thousands) of model tensors waiting to be served to each collaborator. A federation itself may have hundreds (if not thousands) of participants making these requests.

Example

Consider the torch/histology experiment.

  • Model size is ~64MB with about 20 tensors.
  • In a federation with 8 collaborators, a total of 160 RPC calls are made to the aggregator each round just to fetch the latest model tensors.
  • In contrast, only (get_tasks + 3 x send_local_task_results) x 8 collaborators = 32 RPC calls are made for other purposes in each round.
  • >80% of all requests are consumed in serving tensors. This also leads to thread contention: the aggregator generally cannot afford more than 8-32 threads, assuming modest CPU/memory availability.
  • Had the tensors been batched per request, each collaborator could be served by a single RPC thread, with no aggregator-side contention.

The problem gets worse when models have hundreds of tensors and experiments span more collaborators.
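
For concreteness, here is the arithmetic above as a tiny Python snippet (numbers from the torch/histology example; purely illustrative, not code from this PR):

num_tensors = 20                # tensors in the ~64MB model
num_collaborators = 8

per_tensor_rpcs = num_tensors * num_collaborators    # 20 x 8 = 160 fetch calls per round
other_rpcs = (1 + 3) * num_collaborators             # (get_tasks + 3 x send_local_task_results) x 8 = 32
batched_rpcs = 1 * num_collaborators                 # 8: one batched fetch call per collaborator

print(per_tensor_rpcs / (per_tensor_rpcs + other_rpcs))  # ~0.83, i.e. >80% of all requests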

Goal of this PR

This is Part 1 of 2 PRs that add support for batched fetching of tensors over gRPC. This PR makes the following major changes:

  1. Remove send_model_deltas and use_delta_updates support.
  2. Simplify nparray_to_named_tensor (henceforth called serialize_tensor) and named_tensor_to_nparray (henceforth called deserialize_tensor), removing brittle conditional checks; a sketch of the straightened path follows below.
  3. Batch gRPC requests for tensors on the collaborator and aggregator side.

Delta update support will be added back in Part 2. The reason for removing it first is to disentangle the logic of the communication pipeline (de/compression, de/serialization) from the aggregator/collaborator components.
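
To make change 2 concrete, here is a minimal, self-contained sketch of the straightened (de)serialization path. NamedTensor is a stand-in dataclass for the protobuf message, and dtype/shape are passed explicitly to keep the sketch runnable; in the actual pipeline they travel through the compression metadata:

from dataclasses import dataclass

import numpy as np


@dataclass
class NamedTensor:  # stand-in for the protobuf message
    name: str
    data_bytes: bytes


def serialize_tensor(name: str, array: np.ndarray) -> NamedTensor:
    # One flat, unconditional path: raw bytes in, no delta/compression branches.
    return NamedTensor(name=name, data_bytes=array.tobytes())


def deserialize_tensor(tensor: NamedTensor, dtype, shape) -> np.ndarray:
    return np.frombuffer(tensor.data_bytes, dtype=dtype).reshape(shape)


# Round-trip check.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
assert (deserialize_tensor(serialize_tensor("w", arr), np.float32, (2, 3)) == arr).all()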

Implementation

A new RPC call, get_aggregated_tensors (note the plural 's'), replaces get_aggregated_tensor. Collaborators request a batch of tensors using a new TensorSpec message:

message TensorSpec {
  string tensor_name = 1;
  int32 round_number = 2;
  bool report = 3;
  repeated string tags = 4;
  bool require_lossless = 5;
}

You may observe the resemblance to TensorKey, except that the origin field is missing.
The RPC request and response formats are shown below:

message GetAggregatedTensorsRequest {
  MessageHeader header = 1;
  repeated TensorSpec tensor_specs = 2;
}

message GetAggregatedTensorsResponse {
  MessageHeader header = 1;
  repeated NamedTensor tensors = 2;
}
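
For illustration, here is a hedged sketch of how the aggregator-side servicer could serve an entire batch on a single RPC thread, reusing the serialize_tensor sketch above; lookup_tensor and get_header are hypothetical helper names, not the PR's actual ones:

def GetAggregatedTensors(self, request, context):
    """Serve an entire batch of tensors within one RPC call."""
    tensors = []
    for spec in request.tensor_specs:
        nparray = self.lookup_tensor(  # hypothetical: resolve from the aggregator's tensor cache
            spec.tensor_name, spec.round_number, spec.report, tuple(spec.tags)
        )
        tensors.append(serialize_tensor(spec.tensor_name, nparray))
    return GetAggregatedTensorsResponse(header=self.get_header(), tensors=tensors)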

On the collaborator side, tensors are fetched via self.fetch_tensors_from_aggregator(tensor_keys), where tensor_keys is of type List[TensorKey].
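
A sketch of that call path, assuming TensorKey is the usual (tensor_name, origin, round_number, report, tags) named tuple; the stub and header handling are illustrative assumptions:

def fetch_tensors_from_aggregator(self, tensor_keys):
    """Build one batched request from a List[TensorKey] and issue a single RPC."""
    specs = [
        TensorSpec(
            tensor_name=tensor_name,
            round_number=round_number,
            report=report,
            tags=list(tags),
            require_lossless=True,  # assumption: request exact, lossless bytes
        )
        # `origin` is deliberately dropped, mirroring the TensorSpec message above.
        for tensor_name, origin, round_number, report, tags in tensor_keys
    ]
    request = GetAggregatedTensorsRequest(header=self.header, tensor_specs=specs)
    response = self.stub.GetAggregatedTensors(request)  # one RPC for the whole batch
    # Each NamedTensor is then decoded via deserialize_tensor (metadata handling elided).
    return list(response.tensors)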

Cost/Benefit Analysis

One downside of this PR is that the potential bandwidth savings achieved through delta updates are lost.
The upside is a significantly simplified mental model of how tensors are processed on both ends, higher confidence in correctness, and higher maintainability.

Part 2 will close this gap by bringing delta updates back while keeping the mental model simple.

Reviewer note(s): This PR has a lot of cleanup. It is best understood by looking at the proposed changes directly rather than the diffs.

@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from cef758c to 19c29e2 Compare May 15, 2025 17:59
@MasterSkepticista MasterSkepticista changed the title Batched fetching of tensors to reduce RPC calls Batched fetching of tensors to reduce RPC calls [1/2] May 15, 2025

@teoparvanov teoparvanov left a comment

I have a couple of comments from my first read-through:

@kminhta kminhta left a comment

Thanks for taking this up @MasterSkepticista - this is a great fundamental adjustment to how we transport the model across the network. I don't see any major concerns; I'm going to go ahead and approve this after my review and our offline discussion, with the understanding that:

  1. delta updates get reintroduced in the follow-up PR
  2. streaming gets reintroduced in a more generic manner (as you called out in #1575 (comment)) in some subsequent PR as well, or otherwise some manner of handling large models on both the aggregator and collaborator sides

I know you called out point 1 in the PR, but do you have a mental model for how point 2 can be achieved, or is the general implementation still an open question? I ask mainly because it would be good to close the gap for first-class LLM support (thinking in terms of pretraining), but it is also true that LLM fine-tuning has a large library of parameter-efficient methods that may still allow for expanding OpenFL-LLM capabilities in the interim.

kminhta commented May 18, 2025

FYI, a couple of the tests are failing - secure aggregation in particular https://github.com/securefederatedai/openfl/actions/runs/15063524782/job/42343165534?pr=1575

kminhta commented May 18, 2025

Also, going to tag @ishaileshpant to keep him in the loop as there are modifications to RPC calls that may affect #1500

@porteratzo porteratzo left a comment

LGTM

@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from cdbe2d6 to dbb6eca Compare May 19, 2025 13:34
payalcha and others added 2 commits May 19, 2025 19:05
@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from 34949fd to fcbfff4 Compare May 19, 2025 15:11

@teoparvanov teoparvanov left a comment

LGTM, thanks @MasterSkepticista !

@theakshaypant theakshaypant left a comment

Nice!
I do think we should wait for the 1.9 cut-off before merging this, given the temporary removal of use_delta_updates.

@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from 8b1b27b to 6bce94f Compare May 21, 2025 07:01
@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from a56ce95 to 6bce94f Compare May 22, 2025 05:37
ishaileshpant added a commit to ishaileshpant/openfl that referenced this pull request May 28, 2025
…efederatedai#1575

 - rename the function to get_aggregated_tensors
 - adjust both grpc and rest client for name and signature change in api
 - for the protobuf changes of TensorSpec and the batched response, adjust both client and server in line with grpc
 - fix the test cases

Signed-off-by: Shailesh Pant <[email protected]>