@MasterSkepticista MasterSkepticista commented Apr 30, 2025

Motivation

Collaborators today fetch each model tensor from the aggregator via a dedicated RPC call. The aggregator has limited resources to serve requests, and it is not uncommon to have hundreds (if not thousands) of model tensors waiting to be served to each collaborator. A federation itself may have hundreds (if not thousands) of participants making these requests.

Example

Consider the torch/histology experiment.

  • Model size is ~64MB with about 20 tensors.
  • In a federation with 8 collaborators, a total of 160 RPC calls are made to the aggregator each round just to fetch the latest model tensors.
  • In contrast, only (get_tasks + 3 x send_local_task_results) x 8 collaborators = 32 RPC calls are made for other purposes in each round.
  • >80% of all requests are consumed in serving tensors. This also leads to thread contention: the aggregator generally cannot afford more than 8-32 threads, assuming modest CPU/memory availability.
  • Had the tensors been batched per request, each collaborator could be served by a single RPC thread, with no aggregator-side contention.

The problem gets worse when models have hundreds of tensors and experiments span more collaborators.
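
For concreteness, here is the arithmetic above as a tiny Python snippet (numbers from the torch/histology example; purely illustrative, not code from this PR):

num_tensors = 20                # tensors in the ~64MB model
num_collaborators = 8

per_tensor_rpcs = num_tensors * num_collaborators    # 20 x 8 = 160 fetch calls per round
other_rpcs = (1 + 3) * num_collaborators             # (get_tasks + 3 x send_local_task_results) x 8 = 32
batched_rpcs = 1 * num_collaborators                 # 8: one batched fetch call per collaborator

print(per_tensor_rpcs / (per_tensor_rpcs + other_rpcs))  # ~0.83, i.e. >80% of all requests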

Goal of this PR

This is Part 1 of 2 PRs that add support for batched fetching of tensors over gRPC. This PR makes the following major changes:

  1. Remove send_model_deltas and use_delta_updates support.
  2. Simplify nparray_to_named_tensor (henceforth called serialize_tensor) and named_tensor_to_nparray (henceforth called deserialize_tensor), removing brittle conditional checks; a sketch of the straightened path follows below.
  3. Batch gRPC requests for tensors on the collaborator and aggregator side.

Delta update support will be added back in Part 2. The reason for removing it first is to disentangle the logic of the communication pipeline (de/compression, de/serialization) from the aggregator/collaborator components.
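
To make change 2 concrete, here is a minimal, self-contained sketch of the straightened (de)serialization path. NamedTensor is a stand-in dataclass for the protobuf message, and dtype/shape are passed explicitly to keep the sketch runnable; in the actual pipeline they travel through the compression metadata:

from dataclasses import dataclass

import numpy as np


@dataclass
class NamedTensor:  # stand-in for the protobuf message
    name: str
    data_bytes: bytes


def serialize_tensor(name: str, array: np.ndarray) -> NamedTensor:
    # One flat, unconditional path: raw bytes in, no delta/compression branches.
    return NamedTensor(name=name, data_bytes=array.tobytes())


def deserialize_tensor(tensor: NamedTensor, dtype, shape) -> np.ndarray:
    return np.frombuffer(tensor.data_bytes, dtype=dtype).reshape(shape)


# Round-trip check.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
assert (deserialize_tensor(serialize_tensor("w", arr), np.float32, (2, 3)) == arr).all()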

Implementation

A new RPC call, get_aggregated_tensors (note the plural 's'), replaces get_aggregated_tensor. Collaborators request a batch of tensors using a new TensorSpec message:

message TensorSpec {
  string tensor_name = 1;
  int32 round_number = 2;
  bool report = 3;
  repeated string tags = 4;
  bool require_lossless = 5;
}

You may observe the resemblance to TensorKey, except that the origin field is missing.
The RPC request and response formats are shown below:

message GetAggregatedTensorsRequest {
  MessageHeader header = 1;
  repeated TensorSpec tensor_specs = 2;
}

message GetAggregatedTensorsResponse {
  MessageHeader header = 1;
  repeated NamedTensor tensors = 2;
}
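
For illustration, here is a hedged sketch of how the aggregator-side servicer could serve an entire batch on a single RPC thread, reusing the serialize_tensor sketch above; lookup_tensor and get_header are hypothetical helper names, not the PR's actual ones:

def GetAggregatedTensors(self, request, context):
    """Serve an entire batch of tensors within one RPC call."""
    tensors = []
    for spec in request.tensor_specs:
        nparray = self.lookup_tensor(  # hypothetical: resolve from the aggregator's tensor cache
            spec.tensor_name, spec.round_number, spec.report, tuple(spec.tags)
        )
        tensors.append(serialize_tensor(spec.tensor_name, nparray))
    return GetAggregatedTensorsResponse(header=self.get_header(), tensors=tensors)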

On the collaborator side, tensors are fetched via self.fetch_tensors_from_aggregator(tensor_keys), where tensor_keys is of type List[TensorKey].
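
A sketch of that call path, assuming TensorKey is the usual (tensor_name, origin, round_number, report, tags) named tuple; the stub and header handling are illustrative assumptions:

def fetch_tensors_from_aggregator(self, tensor_keys):
    """Build one batched request from a List[TensorKey] and issue a single RPC."""
    specs = [
        TensorSpec(
            tensor_name=tensor_name,
            round_number=round_number,
            report=report,
            tags=list(tags),
            require_lossless=True,  # assumption: request exact, lossless bytes
        )
        # `origin` is deliberately dropped, mirroring the TensorSpec message above.
        for tensor_name, origin, round_number, report, tags in tensor_keys
    ]
    request = GetAggregatedTensorsRequest(header=self.header, tensor_specs=specs)
    response = self.stub.GetAggregatedTensors(request)  # one RPC for the whole batch
    # Each NamedTensor is then decoded via deserialize_tensor (metadata handling elided).
    return list(response.tensors)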

Cost/Benefit Analysis

One downside of this PR is that the potential bandwidth savings achieved through delta updates are lost.
The upside is a significantly simplified mental model of how tensors are processed on both ends, higher confidence in correctness, and higher maintainability.

Part 2 will close this gap by bringing delta updates back while keeping the mental model simple.

Reviewer note(s): This PR has a lot of cleanup. It is best understood by looking at the proposed changes directly rather than the diffs.

@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from cef758c to 19c29e2 Compare May 15, 2025 17:59
@MasterSkepticista MasterSkepticista changed the title Batched fetching of tensors to reduce RPC calls Batched fetching of tensors to reduce RPC calls [1/2] May 15, 2025

@teoparvanov teoparvanov left a comment

I have a couple of comments from my first read-through:

@kminhta kminhta left a comment

Thanks for taking this up @MasterSkepticista - this is a great fundamental adjustment to how we transport the model across the network. I don't see any major concerns; I'm going to go ahead and approve this after my review and our offline discussion, with the understanding that:

  1. delta updates get reintroduced in the follow-up PR
  2. streaming gets reintroduced in a more generic manner (as you called out in #1575 (comment)) in some subsequent PR as well, or otherwise some manner of handling large models on both the aggregator and collaborator sides

I know you called out point 1 in the PR, but do you have a mental model for how point 2 can be achieved, or is the general implementation still an open question? I ask mainly because it would be good to close the gap for first-class LLM support (thinking in terms of pretraining), but it is also true that LLM fine-tuning has a large library of parameter-efficient methods that may still allow for expanding OpenFL-LLM capabilities in the interim.

kminhta commented May 18, 2025

FYI, a couple of the tests are failing - secure aggregation in particular https://github.com/securefederatedai/openfl/actions/runs/15063524782/job/42343165534?pr=1575

kminhta commented May 18, 2025

Also, going to tag @ishaileshpant to keep him in the loop as there are modifications to RPC calls that may affect #1500

@porteratzo porteratzo left a comment

LGTM

@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from cdbe2d6 to dbb6eca Compare May 19, 2025 13:34
payalcha and others added 2 commits May 19, 2025 19:05
@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from 34949fd to fcbfff4 Compare May 19, 2025 15:11

@teoparvanov teoparvanov left a comment

LGTM, thanks @MasterSkepticista !

@theakshaypant theakshaypant left a comment

Nice!
I do think we should wait for the 1.9 cut-off before merging this, given the temporary removal of use_delta_updates.

@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from 8b1b27b to 6bce94f Compare May 21, 2025 07:01
@MasterSkepticista MasterSkepticista force-pushed the karansh1/batched_fetch branch from a56ce95 to 6bce94f Compare May 22, 2025 05:37
ishaileshpant added a commit to ishaileshpant/openfl that referenced this pull request May 28, 2025
…efederatedai#1575

 - rename the function to get_aggregated_tensors
 - adjust both grpc and rest client for name and signature change in api
 - for the protobuf changes of TensorSpec and the batched response, adjust both client and server in line with grpc
 - fix the test cases

Signed-off-by: Shailesh Pant <[email protected]>