Batched fetching of tensors to reduce RPC calls [1/2] #1575
Conversation
I have a couple of comments from my first read-through.
Thanks for taking this up @MasterSkepticista - this is a great fundamental adjustment to how we transport the model across the network. I don't see any major concerns, so I'm going to go ahead and approve this after my review and our offline discussion, with the understanding that:
- delta updates get reintroduced in the follow-up PR
- streaming gets reintroduced in a more generic manner (as you called out in #1575 (comment)) in some subsequent PR as well - or otherwise some manner of handling large models on both the aggregator and collaborator sides

I know you called out point 1 in the PR, but do you have a mental model for how point 2 can be achieved, or is the general implementation still an open question? I ask mainly because it would be good to close the gap for first-class LLM support (thinking in terms of pretraining), but it is also true that LLM fine-tuning has a large library of parameter-efficient methods that may still allow for expanding OpenFL-LLM capabilities in the interim.
FYI, a couple of the tests are failing - secure aggregation in particular: https://github.com/securefederatedai/openfl/actions/runs/15063524782/job/42343165534?pr=1575
Also, going to tag @ishaileshpant to keep him in the loop, as there are modifications to RPC calls that may affect #1500.
LGTM
LGTM, thanks @MasterSkepticista!
Nice!
I do think we should wait for the 1.9 cut-off before merging this, given the temporary removal of `use_delta_updates`.
…efederatedai#1575
- rename the function to get_aggregated_tensors
- adjust both gRPC and REST clients for the name and signature change in the API
- adjust both client and server for the protobuf changes to TensorSpec and the batched response, in line with gRPC
- fix the test cases
Motivation
Collaborators today fetch each model tensor from the aggregator via a dedicated RPC call. The aggregator has limited resources to serve requests, and it is not uncommon to have hundreds (if not thousands) of model tensors waiting to be served to each collaborator. A federation itself may have hundreds (if not thousands) of participants making these requests.
Example
Consider the `torch/histology` experiment with 8 collaborators. Beyond the per-tensor fetches, (`get_tasks` + 3 x `send_local_task_results`) x 8 collaborators = 32 RPC calls are made for other purposes in each round. The problem gets worse when models have hundreds of tensors and experiments span more collaborators.
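To make the scale concrete, here is a back-of-envelope comparison of per-round RPC counts for per-tensor versus batched fetching. The tensor count is an assumed illustrative figure, not a measurement from this experiment.

```python
# Back-of-envelope RPC counts per round. `num_tensors` is an assumed,
# illustrative figure; only `num_collaborators` matches the example above.
num_tensors = 300       # hypothetical model with a few hundred tensors
num_collaborators = 8   # as in the torch/histology example

per_tensor_fetch_rpcs = num_tensors * num_collaborators  # one RPC per tensor, per collaborator
batched_fetch_rpcs = 1 * num_collaborators               # one batched RPC per collaborator
other_rpcs = (1 + 3) * num_collaborators                 # get_tasks + 3 x send_local_task_results

print(per_tensor_fetch_rpcs)  # 2400
print(batched_fetch_rpcs)     # 8
print(other_rpcs)             # 32
```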
Goal of this PR
This is Part 1 of 2 PRs adding support for batched fetching of tensors over gRPC. This PR makes the following major changes:
- Removes `send_model_deltas` and `use_delta_updates` support.
- Simplifies `nparray_to_named_tensor` (henceforth called `serialize_tensor`) and `named_tensor_to_nparray` (henceforth called `deserialize_tensor`) so that neither needs any brittle conditional checks (see the sketch after this list).

In Part 2 of this PR, delta update support will be added back. The reason for removal here is to straighten out the logic for actions that concern the communication pipeline (de/compression, de/serialization) and actions that concern the aggregator/collaborator components.
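As a rough illustration of the renamed pair, here is a minimal sketch. The names `serialize_tensor` and `deserialize_tensor` come from this PR; the signatures, the compression `pipeline` interface, and the `NamedTensor` fields are assumptions, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Any, List

import numpy as np


@dataclass
class NamedTensor:
    """Stand-in for the NamedTensor protobuf message (fields assumed)."""
    name: str
    round_number: int
    report: bool
    tags: List[str]
    transformer_metadata: Any
    data_bytes: bytes


def serialize_tensor(tensor_key, nparray: np.ndarray, pipeline) -> NamedTensor:
    """Compress an ndarray and wrap it as a NamedTensor, with no
    delta/conditional branches in the hot path (sketch)."""
    data_bytes, transformer_metadata = pipeline.forward(nparray)
    return NamedTensor(
        name=tensor_key.tensor_name,
        round_number=tensor_key.round_number,
        report=tensor_key.report,
        tags=list(tensor_key.tags),
        transformer_metadata=transformer_metadata,
        data_bytes=data_bytes,
    )


def deserialize_tensor(named_tensor: NamedTensor, pipeline) -> np.ndarray:
    """Inverse of serialize_tensor: decompress back to an ndarray (sketch)."""
    return pipeline.backward(
        named_tensor.data_bytes, named_tensor.transformer_metadata
    )
```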
Implementation
A new RPC call, `get_aggregated_tensors` (plural 's'), replaces `get_aggregated_tensor`. Collaborators request a batch of tensors in a new `TensorSpec` format. You may notice the resemblance to `TensorKey`, except that the origin field is missing. A sketch of the RPC request and response formats is shown below.
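The exact protobuf definitions are not reproduced here, so the following Python-side sketch only illustrates what the request and response plausibly carry. `TensorSpec` and the `get_aggregated_tensors` name come from this PR; the message names and fields below are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TensorSpec:
    """Resembles TensorKey, minus the `origin` field (fields assumed)."""
    name: str
    round_number: int
    report: bool
    tags: List[str]


@dataclass
class GetAggregatedTensorsRequest:
    """Hypothetical batched request: many specs in one RPC."""
    sender: str                     # requesting collaborator
    tensor_specs: List[TensorSpec]  # batch of tensors to fetch


@dataclass
class GetAggregatedTensorsResponse:
    """Hypothetical batched response: one serialized tensor per spec."""
    round_number: int
    tensors: List["NamedTensor"]    # serialized (possibly compressed) tensors
```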
On the collaborator side, tensors are fetched via `self.fetch_tensors_from_aggregator(tensor_keys)`, where `tensor_keys` are of type `List[TensorKey]`; a sketch of a call site follows.
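A minimal sketch of what a call site might look like, assuming `TensorKey` is OpenFL's usual `(tensor_name, origin, round_number, report, tags)` named tuple; the helper name comes from this PR, while the surrounding method and variables are hypothetical.

```python
from openfl.utilities import TensorKey  # OpenFL's TensorKey named tuple


def fetch_round_tensors(self, tensor_names, round_number):
    """Hypothetical Collaborator method: fetch all named tensors for a
    round in a single batched RPC instead of one RPC per tensor."""
    tensor_keys = [
        TensorKey(name, self.aggregator_uuid, round_number, False, ("model",))
        for name in tensor_names
    ]
    # One get_aggregated_tensors RPC replaces len(tensor_names) calls.
    return self.fetch_tensors_from_aggregator(tensor_keys)
```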
Cost/Benefit Analysis
One downside of this PR is that the potential bandwidth savings from delta updates are lost.
The upside is a significantly simplified mental model of how tensors are processed on both ends, higher confidence in correctness, and higher maintainability.
The second part of this work will close that gap by bringing delta updates back while keeping the mental model simple.
Reviewer note(s): This PR includes a lot of cleanup. It is best understood by reading the proposed changes directly rather than the diffs.