Changes from 30 commits (32 commits on branch):
- `3373551` Merge relative tensorkeys loop (MasterSkepticista, Apr 15, 2025)
- `4d1ae93` Remove round_number field: never used, also named_tensor has it (MasterSkepticista, Apr 15, 2025)
- `f6ac53e` Disable streaming on send task results for consistency (MasterSkepticista, Apr 30, 2025)
- `8ca8d46` Task callback rebase (MasterSkepticista, Apr 30, 2025)
- `3449009` formatting (MasterSkepticista, Apr 30, 2025)
- `5da4cba` Raise exception if tensor not found (MasterSkepticista, Apr 30, 2025)
- `f7a1c0f` Merge branch 'develop' into karansh1/batched_fetch (MasterSkepticista, Apr 30, 2025)
- `d0f1345` Fix line too long (MasterSkepticista, Apr 30, 2025)
- `3d6922a` Formatting (MasterSkepticista, Apr 30, 2025)
- `5912127` Batched fetch (MasterSkepticista, Apr 30, 2025)
- `2722e23` Merge branch 'develop' into karansh1/batched_fetch (MasterSkepticista, May 13, 2025)
- `9149a17` MR changes (MasterSkepticista, May 13, 2025)
- `5c149d2` Merge branch 'develop' into karansh1/batched_fetch (MasterSkepticista, May 15, 2025)
- `c915b16` Simplify collaborator side send/receive with no delta calculations (MasterSkepticista, May 15, 2025)
- `19c29e2` Simplify aggregator with no deltas and simple compress/decompress (MasterSkepticista, May 15, 2025)
- `b13f19e` Add sleep between collab start (#1628) (payalcha, May 15, 2025)
- `1bfbe6d` Fix image shape (#1626) (MasterSkepticista, May 15, 2025)
- `be498d7` Edar/datasets in experiment (#1576) (Efrat1, May 15, 2025)
- `b993310` remove mnist_fed_eval workspace (#1587) (kminhta, May 15, 2025)
- `ff7af4f` Use common serialize/deserialize functions for both components (MasterSkepticista, May 16, 2025)
- `8317d43` Revert "Disable streaming on send task results for consistency" (MasterSkepticista, May 19, 2025)
- `ff86573` Update RPC call (MasterSkepticista, May 19, 2025)
- `4ae2b31` Formatting (MasterSkepticista, May 19, 2025)
- `2c413d7` Update RPC docstring (MasterSkepticista, May 19, 2025)
- `fcbfff4` Update infosec-overview.md (MasterSkepticista, May 19, 2025)
- `d76cec1` Merge branch 'develop' into karansh1/batched_fetch (MasterSkepticista, May 19, 2025)
- `31ec500` Update tests (MasterSkepticista, May 20, 2025)
- `f18aeb1` Merge branch 'karansh1/batched_fetch' of https://github.com/securefed… (MasterSkepticista, May 20, 2025)
- `fbb4fb7` Merge branch 'develop' into karansh1/batched_fetch (payalcha, May 20, 2025)
- `11f07b0` Merge branch 'develop' into karansh1/batched_fetch (MasterSkepticista, May 21, 2025)
- `01923ac` Set default True for require_lossless (MasterSkepticista, May 21, 2025)
- `6bce94f` Remove dead code (MasterSkepticista, May 21, 2025)
52 changes: 30 additions & 22 deletions docs/infosec-overview.md
@@ -1,33 +1,41 @@
# InfoSec Overview

_Last updated: 19 May 2025_

### Purpose
This document provides the information needed to evaluate OpenFL for real-world deployment in highly sensitive environments. The target audience is InfoSec reviewers who need detailed information about code contents, communication traffic, and potential exploit vectors.

### Overview: Network Connectivity
OpenFL federations use a hub-and-spoke topology between `collaborator` clients that generate model parameter updates from their data and the `aggregator` server that combines their training updates into new models[^1]. Key details about this functionality are:

* Connections are made using request/response gRPC[^2] connections.
* The `aggregator` listens for connections on a single port (usually decided by the experiment admin), which is explicitly defined in the FL plan (e.g. `50051`), so all `collaborator`s must be able to send outgoing traffic to this port.
* All connections are initiated by the `collaborator`, i.e., a [`pull`](https://karlchris.github.io/data-engineering/data-ingestion/push-pull/#pull) architecture.
* The `collaborator` does not open any listening sockets.
* Connections are secured using mTLS[^3].
* Each request/response pair is done on a new TLS connection.
* The PKI for federations is created using the [`fx aggregator/collaborator certify`](https://openfl.readthedocs.io/en/latest/fx.html) CLI command. OpenFL internally leverages Python's cryptography module. The organization hosting the `aggregator` usually acts as the Certificate Authority (CA) and verifies each identity before signing.
* Currently, the `collaborator` polls the `aggregator` at a fixed interval. We have had a request to enable client-side configuration of this interval and hope to support that feature soon.
* Connection timeouts are set to gRPC defaults.
* If the `aggregator` is not available, the `collaborator` will retry connections indefinitely. This is currently useful so that we can take the aggregator down for bugfixes without `collaborator` processes exiting.
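The fixed-interval polling and indefinite retry described above can be sketched as a small client-side loop. This is an illustrative simplification, not OpenFL's implementation; `fetch_tasks` is a hypothetical stand-in for the `GetTasksRequest` RPC.

```python
import time


def poll_aggregator(fetch_tasks, interval_s=10, max_attempts=None):
    """Poll the aggregator at a fixed interval, retrying on failure.

    `fetch_tasks` is a callable standing in for the GetTasksRequest RPC;
    it raises ConnectionError while the aggregator is unreachable.
    `max_attempts=None` means retry indefinitely, mirroring the behavior
    described above (the aggregator may be taken down for bugfixes).
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            return fetch_tasks()
        except ConnectionError:
            # Aggregator unreachable; wait one interval and poll again.
            time.sleep(interval_s)
    raise RuntimeError(f"aggregator unreachable after {attempts} attempts")
```

Because the collaborator never opens a listening socket, this pull loop is the only traffic pattern an InfoSec reviewer needs to model on the client side.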

### Contents of Network Messages
Network messages are well defined protobufs which can be found in the following files:
- [`aggregator.proto`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/aggregator.proto)
- [`base.proto`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/base.proto)

Key points about the network messages/protocol:
* No executable code is ever sent to the collaborator. All code to be executed is contained within the OpenFL package and the custom FL workspace. The code, along with the FL plan file that specifies the classes and initial parameters to be used, is available for review prior to the FL plan's execution. This ensures that all potential operations are understood before they take place.

* The `collaborator` typically requests the FL tasks to execute from the aggregator via a [`GetTasksRequest`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/aggregator.proto#L34) message.
* The `aggregator`, based on the FL plan, returns a [`GetTasksResponse`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/aggregator.proto#L45) which includes [`Tasks`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/aggregator.proto#L38) - metadata about the Python functions to be invoked by the collaborator. All code is available locally to each collaborator as part of a pre-distributed workspace bundle.
* The `collaborator` then uses its TaskRunner framework to execute the FL tasks on the locally available data, producing output tensors such as model weights or metrics.
* During task execution, the `collaborator` may require tensors that are not available locally. For example, a federated training task requires globally averaged model weights from the `aggregator`. Collaborators gather a list of tensor keys to be fetched from the aggregator and download them via the [`GetAggregatedTensors`](https://openfl.readthedocs.io/en/latest/reference/_autosummary/openfl.transport.grpc.aggregator_server.AggregatorGRPCServer.html#openfl.transport.grpc.aggregator_server.AggregatorGRPCServer.GetAggregatedTensor) RPC method.
* Upon task completion, the `collaborator` transmits the results by emitting a [`SendLocalTaskResults`](https://openfl.readthedocs.io/en/latest/reference/_autosummary/openfl.transport.grpc.aggregator_server.AggregatorGRPCServer.html#openfl.transport.grpc.aggregator_server.AggregatorGRPCServer.SendLocalTaskResults) RPC method which contains [`NamedTensor`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/base.proto#L11) objects that encode results (like model weight updates or metrics such as loss or accuracy).
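The full request/response sequence above can be outlined as an in-memory simulation. The stub class and method names below mirror the RPC names but are illustrative assumptions, not OpenFL's actual client API.

```python
class StubAggregatorClient:
    """In-memory stand-in for the collaborator's gRPC client (illustration only)."""

    def __init__(self, plan_tasks, global_tensors):
        self.plan_tasks = plan_tasks          # task metadata from the FL plan
        self.global_tensors = global_tensors  # aggregator-held tensors
        self.results = []                     # what the aggregator receives

    def get_tasks(self):
        # ~ GetTasksRequest / GetTasksResponse: metadata only, never code.
        return self.plan_tasks

    def get_aggregated_tensors(self, keys):
        # ~ GetAggregatedTensors: batched download of globally held tensors.
        return {k: self.global_tensors[k] for k in keys}

    def send_local_task_results(self, named_tensors):
        # ~ SendLocalTaskResults: NamedTensor-like results go back to the hub.
        self.results.append(named_tensors)


def run_round(client):
    """One collaborator round: fetch tasks, fetch tensors, run, report."""
    tasks = client.get_tasks()
    weights = client.get_aggregated_tensors(["conv1.weight"])
    # "Execute" each task locally; the real work happens in the TaskRunner.
    outputs = {t: {"loss": 0.5, **weights} for t in tasks}
    client.send_local_task_results(outputs)
    return outputs
```

Note that the only inbound payloads in this flow are task metadata and tensors; the executable logic (`run_round` here, the TaskRunner in OpenFL) is already installed locally.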

### Testing a Collaborator
There is a "no-op" workspace template in OpenFL (available in versions `>=1.9`) which can be used to test the network connection between the `aggregator` and each `collaborator` without performing any computational task. More details can be found [here](https://github.com/securefederatedai/openfl/tree/develop/openfl-workspace/no-op#overview).


[^1]: [OpenFL TaskRunner Overview](https://openfl.readthedocs.io/en/latest/about/features_index/taskrunner.html)
[^2]: [gRPC Overview](https://grpc.io/docs/what-is-grpc/core-concepts/)
[^3]: [mTLS Overview](https://www.cloudflare.com/learning/access-management/what-is-mutual-tls/)
2 changes: 1 addition & 1 deletion openfl-workspace/gandlf_seg_test/src/dataloader.py
@@ -39,4 +39,4 @@ def get_feature_shape(self):
"""
# Define a fixed feature shape for this specific application
# Use standard 3D patch size for medical imaging segmentation
return self.feature_shape
3 changes: 2 additions & 1 deletion openfl/callbacks/secure_aggregation.py
@@ -295,5 +295,6 @@ def _fetch_from_aggregator(self, key_name):
Returns:
bytes: The aggregated tensor data in bytes.
"""
key = TensorKey(key_name, self.name, -1, False, ("secagg",))
tensor = self.client.get_aggregated_tensors([key], require_lossless=True)[0]
return json.loads(tensor.data_bytes)
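The change above replaces a per-tensor RPC with a batched fetch keyed by `TensorKey`. The batching idea can be sketched as follows; the stub client is hypothetical (the real `TensorKey` lives in OpenFL's utilities and the real client speaks gRPC), but the key shape `(name, origin, round, report, tags)` mirrors the diff.

```python
from collections import namedtuple

# Illustrative stand-in for OpenFL's TensorKey.
TensorKey = namedtuple("TensorKey", "name origin round_number report tags")


class StubClient:
    """Counts round trips to show the batching win (illustration only)."""

    def __init__(self, store):
        self.store = store
        self.rpc_calls = 0

    def get_aggregated_tensors(self, keys, require_lossless=True):
        # One round trip for the whole batch instead of one call per key.
        self.rpc_calls += 1
        return [self.store[k.name] for k in keys]


keys = [TensorKey(n, "collab1", -1, False, ("secagg",)) for n in ("a", "b")]
client = StubClient({"a": b"1", "b": b"2"})
tensors = client.get_aggregated_tensors(keys)
```

With N keys, the per-tensor API cost N request/response TLS connections; the batched call costs one, which also matches the PR's `require_lossless=True` default applying uniformly to the batch.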