Commit 735e593
Documentation for verifiable datasets and data sources (#1636)
* Verifiable data set and data sources overview doc
* README.md for the histology_s3 workspace template
* Adding copyright headers to the Mermaid files
* Mermaid syntax fixes
* Added styling to the class diagrams
* Moved the verifiable dataset page above the data splitters
* Cosmetic fix
* Addressing review comments

Signed-off-by: Teodor Parvanov <[email protected]>
1 parent a08dacf commit 735e593

File tree

5 files changed: +179 additions, -1 deletion


docs/developer_guide/utilities.rst

Lines changed: 6 additions & 1 deletion

```diff
@@ -6,6 +6,9 @@ The following are utilities available in Open Federated Learning (OpenFL).
 :doc:`utilities/pki`
     Use the Public Key Infrastructure (PKI) solution workflows to certify the nodes in your federation.
 
+:doc:`utilities/verifiable_datasets`
+    Build and verify datasets composed of multiple data sources.
+
 :doc:`utilities/splitters_data`
     Split your data to run your federation from a single dataset.
 
@@ -17,5 +20,7 @@ The following are utilities available in Open Federated Learning (OpenFL).
    :hidden:
 
    utilities/pki
+   utilities/verifiable_datasets
    utilities/splitters_data
-   utilities/timeouts
+   utilities/timeouts
+
```
Lines changed: 36 additions & 0 deletions

```rst
.. # Copyright (C) 2025 Intel Corporation
.. # SPDX-License-Identifier: Apache-2.0

************************************
Verifiable Datasets and Data Sources
************************************

.. _verifiable_datasets_overview:

To accommodate the proliferation of data sources and the need for trusted datasets, OpenFL provides a hierarchy of utility classes to build and verify datasets.
This includes an extensible class hierarchy that enables the creation of datasets from various data sources, such as the local file system, object storage, and others.

The central abstraction is the :code:`VerifiableDatasetInfo` class, which encapsulates the dataset's metadata and provides a method for verifying the integrity of the dataset.
A dataset can be built from multiple data sources (not necessarily of the same type):

.. mermaid:: ../../mermaid/verifiable_dataset_info.mmd
   :caption: Verifiable Dataset with Multiple Data Sources
   :align: center

The :code:`VerifiableDatasetInfo` class can then be used to create higher-order dataset classes that enable iterating through multiple data sources, while verifying integrity if required.
The :code:`root_hash` is used as a reference for integrity when loading items from the data sources in the :code:`VerifiableDatasetInfo` object.

OpenFL comes with a toolbox of dataset layout classes per ML framework. For PyTorch's :code:`torch.utils.data.Dataset`, OpenFL currently provides:

- :code:`FolderDataset` - represents an iterable folder-layout dataset from a single data source, implementing the :code:`__getitem__` method.
- :code:`ImageFolder` - a specialization of :code:`FolderDataset` that can load binary images from a folder-like structure.
- :code:`VerifiableMapStyleDataset` - a base class for map-style datasets that can be built from multiple data sources (as specified by a :code:`VerifiableDatasetInfo` object), including integrity checks.
- :code:`VerifiableImageFolder` - a specialization of :code:`VerifiableMapStyleDataset` encapsulating a collection of :code:`ImageFolder` datasets.

Note that all those classes (directly or indirectly) extend :code:`torch.utils.data.Dataset`, and are therefore compatible with all PyTorch utilities for pre-processing datasets.
A similar class hierarchy can be created for other ML frameworks that offer dataset utilities, such as TensorFlow.

.. mermaid:: ../../mermaid/verifiable_image_folder.mmd
   :caption: Dataset hierarchy
   :align: center

A practical example of the :code:`VerifiableImageFolder` backed by a mix of :code:`LocalDataSource` and :code:`S3DataSource` objects is provided in the `s3_histology <https://github.com/securefederatedai/openfl/tree/develop/openfl-workspace/torch/histology_s3>`_ workspace template.
```
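The root-hash idea behind this page can be illustrated with a small, self-contained Python sketch. Everything here is hypothetical: `InMemoryDataSource`, `compute_root_hash`, and the SHA-384 choice are stand-ins rather than the actual OpenFL API. The sketch only demonstrates the core property that any byte-level change to any file, in any data source, changes the dataset's root hash.

```python
import hashlib


class InMemoryDataSource:
    """Stand-in for a data source; real ones would read from disk or S3."""

    def __init__(self, name, files):
        self.name = name
        self.files = files  # dict: relative path -> file bytes

    def enumerate_files(self):
        # Deterministic order so the combined hash is reproducible.
        yield from sorted(self.files)

    def read_blob(self, path):
        return self.files[path]

    def compute_file_hash(self, path):
        return hashlib.sha384(self.read_blob(path)).hexdigest()


def compute_root_hash(data_sources):
    """Fold per-file hashes, in a fixed order, into a single digest."""
    digest = hashlib.sha384()
    for ds in data_sources:
        for path in ds.enumerate_files():
            digest.update(ds.compute_file_hash(path).encode())
    return digest.hexdigest()


def verify_dataset(data_sources, expected_root_hash):
    """Re-hash everything and compare against the trusted reference."""
    return compute_root_hash(data_sources) == expected_root_hash
```

In this model, the reference root hash is computed once at dataset-preparation time; later, before training, the same fold is recomputed and compared, so tampering anywhere in any source is detected.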
Lines changed: 56 additions & 0 deletions

```mermaid
%% Copyright 2025 Intel Corporation
%% SPDX-License-Identifier: Apache-2.0

classDiagram
    class VerifiableDatasetInfo {
        +label: str
        +data_sources: DataSource[]
        +metadata: dict[str, str]
        +root_hash: HASH

        +verify_dataset(root_hash: HASH)
        +verify_single_file(file_path: str, file_hash: HASH)
        +to_json() str
        +from_json(json_str: str) VerifiableDatasetInfo
    }

    class DataSource {
        <<abstract>>
        +name: str
        +type: DataSourceType
        +compute_file_hash(path: str) str
        +enumerate_files() Generator~str~
        +read_blob(path: str) bytes
        +from_dict(ds_dict: dict) DataSource
        +is_valid_hash_function(func) bool
        +to_dict() dict
    }

    class LocalDataSource {
        +base_path: str
        ...
    }

    class S3DataSource {
        +uri: str
        +endpoint: str
        ...
    }

    class AzureBlobDataSource {
        +name: str
        +container_string: str
        +folder_prefix: str
        ...
    }

    VerifiableDatasetInfo "1" o-- "*" DataSource
    DataSource <|-- LocalDataSource
    DataSource <|-- S3DataSource
    DataSource <|-- AzureBlobDataSource

    style VerifiableDatasetInfo fill:#FFFFE0,stroke:#000,stroke-width:1px
    style DataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
    style LocalDataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
    style S3DataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
    style AzureBlobDataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
```
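The data-source hierarchy in the diagram above can be sketched in Python as an abstract base class with concrete backends. Names follow the diagram, but this is a hedged illustration only: the real OpenFL implementations differ in signatures and details, and `compute_file_hash` here uses SHA-256 as an arbitrary stand-in.

```python
import hashlib
import os
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Abstract file-enumeration and blob-reading interface."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def enumerate_files(self):
        """Yield relative paths of all files in this source."""

    @abstractmethod
    def read_blob(self, path):
        """Return the raw bytes of one file."""

    def compute_file_hash(self, path):
        # Shared behavior: hash the blob contents, whatever the backend.
        return hashlib.sha256(self.read_blob(path)).hexdigest()


class LocalDataSource(DataSource):
    """Concrete backend that walks a directory on the local file system."""

    def __init__(self, name, base_path):
        super().__init__(name)
        self.base_path = base_path

    def enumerate_files(self):
        for root, _dirs, files in os.walk(self.base_path):
            for fname in sorted(files):
                yield os.path.relpath(os.path.join(root, fname), self.base_path)

    def read_blob(self, path):
        with open(os.path.join(self.base_path, path), "rb") as fh:
            return fh.read()
```

An S3 or Azure Blob backend would implement the same two abstract methods against its object store, which is what lets `VerifiableDatasetInfo` aggregate a mix of source types behind one interface.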
Lines changed: 50 additions & 0 deletions

```mermaid
%% Copyright 2025 Intel Corporation
%% SPDX-License-Identifier: Apache-2.0

classDiagram
    class torch_utils_data_Dataset {
        +__len__() int
        +__getitem__(index: int) Any
    }

    class VerifiableDatasetInfo {
        +verify_dataset(root_hash: HASH)
        +verify_single_file(file_path: str, file_hash: HASH)
        +from_json(json_str: str) VerifiableDatasetInfo
    }

    class VerifiableMapStyleDataset {
        <<abstract>>
        +__len__() int
        +__getitem__(index: int) Any
        +create_datasets() void*
    }

    class VerifiableImageFolder {
        +__len__() int
        +__getitem__(index: int) Any
        +create_datasets() void
    }

    class FolderDataset {
        <<abstract>>
        +__len__() int
        +__getitem__(index: int) Any
        +load_file(file_path: str) void*
    }

    class ImageFolder {
        +__len__() int
        +__getitem__(index: int) Any
        +load_file(file_path: str) void
    }

    torch_utils_data_Dataset <|.. VerifiableMapStyleDataset
    torch_utils_data_Dataset <|.. FolderDataset
    VerifiableMapStyleDataset o-- VerifiableDatasetInfo
    VerifiableMapStyleDataset <|-- VerifiableImageFolder
    VerifiableMapStyleDataset o-- FolderDataset
    FolderDataset <|-- ImageFolder

    style torch_utils_data_Dataset fill:#D3D3D3,stroke:#000,stroke-width:1px
    style VerifiableDatasetInfo fill:#FFFFE0,stroke:#000,stroke-width:1px
```
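The composition shown above — one map-style dataset delegating across several per-source datasets behind a single index space — can be sketched as follows. This is an illustrative toy, not the OpenFL code: the real classes extend `torch.utils.data.Dataset` and add integrity checks, while this sketch keeps only the indexing logic so it runs without PyTorch.

```python
class FolderDataset:
    """Toy per-source dataset over an in-memory list of samples."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


class VerifiableMapStyleDataset:
    """Concatenates several per-source datasets behind one index space."""

    def __init__(self, datasets):
        self.datasets = datasets

    def __len__(self):
        return sum(len(d) for d in self.datasets)

    def __getitem__(self, index):
        # Walk the sub-datasets, translating the global index
        # into a (dataset, local index) pair.
        for d in self.datasets:
            if index < len(d):
                return d[index]
            index -= len(d)
        raise IndexError(index)
```

Because the combined class exposes only `__len__` and `__getitem__`, any PyTorch `DataLoader` (shuffling, batching, workers) works on it unchanged, regardless of how many sources back it.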
Lines changed: 31 additions & 0 deletions

## Overview

This workspace template showcases the use of `VerifiableImageFolder` in a federation of two collaborators with a diverse mix of data sources (`LocalDataSource` and `S3DataSource`). The data source JSON descriptors are located under `./data/collaborator1` and `./data/collaborator2`. In a distributed setup, those would be located on separate machines at each collaborator's premises; here they are kept together to make the entire experiment executable locally.

It is important to note that the integrity verification is based on a hash of each dataset that is calculated _prior_ to the experiment via the `fx collaborator calchash` command. Doing so provides protection against various data integrity attacks that could occur between the dataset preparation and the actual federated learning process.

## Steps to run the experiment

1. Set up a MinIO server which hosts the S3 data sources from the JSON descriptors.
2. Configure the credentials:
   ```shell
   export YOUR_ACCESS_KEY=<your_access_key>
   export YOUR_SECRET_KEY=<your_secret_key>
   ```
   For example:
   ```shell
   export MINIO_ROOT_PASSWORD=minioadmin
   export MINIO_ROOT_USER=minioadmin
   ```
3. Create a workspace folder from the `histology_s3` template:
   ```shell
   fx workspace create --prefix ~/hist_s3 --template torch/histology_s3
   cd ~/hist_s3/
   pip install -r requirements.txt
   ```
4. Optional: Calculate the dataset hashes for both collaborators. The hash will be saved at `<data_path>/hash.txt`. Later on, when the federated learning process starts, the data loader will check for this file and verify the dataset's integrity against the hash stored inside:
   ```shell
   fx collaborator calchash --data_path plan/collaborator1
   fx collaborator calchash --data_path plan/collaborator2
   ```
5. For the rest of the steps, follow the [quickstart guide](https://openfl.readthedocs.io/en/latest/tutorials/taskrunner.html).
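The hash-then-verify workflow of step 4 can be sketched conceptually in Python. This is a hypothetical illustration of the idea, not the `fx` implementation: `compute_tree_hash`, `calc_hash`, and `verify` are made-up names, and the real tool's hashing scheme and file layout may differ.

```python
import hashlib
import os


def compute_tree_hash(data_path):
    """Deterministically hash all file contents under data_path."""
    digest = hashlib.sha256()
    for root, _dirs, files in sorted(os.walk(data_path)):
        for name in sorted(files):
            if name == "hash.txt":
                continue  # the stored reference hash is not part of the data
            with open(os.path.join(root, name), "rb") as fh:
                digest.update(fh.read())
    return digest.hexdigest()


def calc_hash(data_path):
    """Preparation time: record the reference hash next to the data."""
    with open(os.path.join(data_path, "hash.txt"), "w") as fh:
        fh.write(compute_tree_hash(data_path))


def verify(data_path):
    """Training time: recompute and compare against the stored reference."""
    with open(os.path.join(data_path, "hash.txt")) as fh:
        return fh.read() == compute_tree_hash(data_path)
```

The protection described in the overview comes from the time gap: the reference hash is fixed at preparation time, so any modification of the data between then and the start of federated training makes `verify` fail.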
