|
| 1 | +.. # Copyright (C) 2025 Intel Corporation |
| 2 | +.. # SPDX-License-Identifier: Apache-2.0 |
| 3 | +
|
| 4 | +************************************* |
| 5 | +Verifiable Datasets and Data Sources |
| 6 | +************************************* |
| 7 | + |
| 8 | +.. _verifiable_datasets_overview: |
| 9 | + |
| 10 | +To accommodate the proliferation of data sources and the need for trusted datasets, OpenFL provides a hierarchy of utility classes for building and verifying datasets. |
| 11 | +This extensible hierarchy enables the creation of datasets from various data sources, such as a local file system or object storage. |
| 12 | + |
| 13 | +The central abstraction is the :code:`VerifiableDatasetInfo` class that encapsulates the dataset's metadata and provides a method for verifying the integrity of the dataset. |
| 14 | +A dataset can be built from multiple data sources (not necessarily of the same type): |
| 15 | + |
| 16 | +.. mermaid:: ../../mermaid/verifiable_dataset_info.mmd |
| 17 | + :caption: Verifiable Dataset with Multiple Data Sources |
| 18 | + :align: center |
| 19 | + |
| 20 | +The :code:`VerifiableDatasetInfo` class can then be used to create higher-order dataset classes that enable iterating through multiple data sources, while verifying integrity if required. |
| 21 | +The :code:`root_hash` is used as the integrity reference when loading items from the data sources in the :code:`VerifiableDatasetInfo` object. |
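The idea of checking loaded items against a single reference hash can be illustrated with a minimal sketch. It does not reproduce OpenFL's actual hashing scheme, which this page does not describe: the function names, the use of SHA-256, and the way per-item hashes are folded into a root hash are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Mapping


def item_hash(data: bytes) -> str:
    """Hash the raw bytes of a single item (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def compute_root_hash(item_hashes: Mapping[str, str]) -> str:
    """Fold per-item hashes into one root hash (hypothetical scheme).

    Items are visited in sorted name order so the result does not depend
    on the order in which the items were listed.
    """
    digest = hashlib.sha256()
    for name in sorted(item_hashes):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and hash
        digest.update(item_hashes[name].encode("ascii"))
        digest.update(b"\0")  # separator between entries
    return digest.hexdigest()


def verify_item(name: str, data: bytes, item_hashes: Mapping[str, str]) -> bool:
    """Check a freshly loaded item against its recorded hash."""
    return item_hashes.get(name) == item_hash(data)


# Record hashes once, when the dataset is registered.
items: Dict[str, bytes] = {"a.png": b"alpha", "b.png": b"beta"}
recorded = {name: item_hash(data) for name, data in items.items()}
root_hash = compute_root_hash(recorded)
```

At load time, a consumer would first confirm that the recorded hashes still reproduce :code:`root_hash`, and then call :code:`verify_item` for each item it reads.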
| 22 | + |
| 23 | +OpenFL comes with a toolbox of dataset layout classes per ML framework. For PyTorch's :code:`torch.utils.data.Dataset`, OpenFL currently provides: |
| 24 | +- :code:`FolderDataset` - represents an iterable folder-layout dataset from a single data source, by implementing the :code:`__getitem__` method. |
| 25 | +- :code:`ImageFolder` - a specialization of the :code:`FolderDataset` that is able to load binary images from a folder-like structure. |
| 26 | +- :code:`VerifiableMapStyleDataset` - a base class for map-style datasets that can be built from multiple data sources (as specified by a :code:`VerifiableDatasetInfo` object), including integrity checks. |
| 27 | +- :code:`VerifiableImageFolder` - a specialization of the :code:`VerifiableMapStyleDataset` encapsulating a collection of :code:`ImageFolder` datasets. |
| 28 | + |
| 29 | +Note that all of these classes (directly or indirectly) extend :code:`torch.utils.data.Dataset`, and are therefore compatible with PyTorch utilities for loading and pre-processing datasets, such as :code:`torch.utils.data.DataLoader`. |
| 30 | +A similar class hierarchy can be created for other ML frameworks that offer dataset utilities, such as TensorFlow. |
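To make the map-style pattern concrete, here is a framework-free sketch of a dataset that spans several data sources and optionally checks each item on access. It is not OpenFL's implementation: :code:`InMemorySource`, :code:`MultiSourceDataset`, the :code:`read` method, and the SHA-256 check are hypothetical names and choices. In a PyTorch setting the dataset class would additionally subclass :code:`torch.utils.data.Dataset`.

```python
import bisect
import hashlib
from typing import Dict, List, Optional, Sequence, Tuple


class InMemorySource:
    """Hypothetical stand-in for a data source holding (name, bytes) items."""

    def __init__(self, items: Sequence[Tuple[str, bytes]]):
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def read(self, index: int) -> Tuple[str, bytes]:
        return self._items[index]


class MultiSourceDataset:
    """Map-style dataset over several sources (illustrative sketch only)."""

    def __init__(self, sources: Sequence[InMemorySource],
                 expected_hashes: Optional[Dict[str, str]] = None):
        self._sources = list(sources)
        self._expected = expected_hashes
        # Cumulative item counts, e.g. source lengths [2, 1] give [2, 3].
        self._offsets: List[int] = []
        total = 0
        for source in self._sources:
            total += len(source)
            self._offsets.append(total)

    def __len__(self) -> int:
        return self._offsets[-1] if self._offsets else 0

    def __getitem__(self, index: int) -> Tuple[str, bytes]:
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Locate the source that owns this global index.
        src = bisect.bisect_right(self._offsets, index)
        start = self._offsets[src - 1] if src > 0 else 0
        name, data = self._sources[src].read(index - start)
        # Optional integrity check against the recorded hash.
        if self._expected is not None:
            if self._expected.get(name) != hashlib.sha256(data).hexdigest():
                raise ValueError(f"integrity check failed for {name}")
        return name, data


local = InMemorySource([("a.png", b"alpha"), ("b.png", b"beta")])
remote = InMemorySource([("c.png", b"gamma")])
hashes = {name: hashlib.sha256(data).hexdigest()
          for source in (local, remote)
          for name, data in (source.read(i) for i in range(len(source)))}
dataset = MultiSourceDataset([local, remote], expected_hashes=hashes)
```

Because the class implements :code:`__len__` and :code:`__getitem__`, callers can index it without knowing which source an item comes from, and a tampered item surfaces as an error at access time rather than silently entering training.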
| 31 | + |
| 32 | +.. mermaid:: ../../mermaid/verifiable_image_folder.mmd |
| 33 | + :caption: Dataset hierarchy |
| 34 | + :align: center |
| 35 | + |
| 36 | +A practical example of a :code:`VerifiableImageFolder` backed by a mix of :code:`LocalDataSource` and :code:`S3DataSource` objects is provided in the `histology_s3 <https://github.com/securefederatedai/openfl/tree/develop/openfl-workspace/torch/histology_s3>`_ workspace template. |