The DEEL dataset manager project aims to ease the installation and usage of self-hosted or public datasets. It is an open framework to manage any dataset through a plugin mechanism.
The latest release can be installed from PyPI. All required Python packages will also be installed as dependencies:
pip install deel-datasets

The package can also be installed directly from this GitHub repository, via SSH (with SSH keys properly set up) or via HTTPS by entering your credentials manually:
# SSH version (with proper SSH key setup):
pip install git+ssh://[email protected]/deel-ai/deel_dataset_manager.git
# HTTPS version:
pip install git+https://github.com/deel-ai/deel_dataset_manager.git

The configuration file specifies how the datasets should be downloaded, or whether the datasets need to be downloaded at all (e.g., on Google Cloud).
It defines a list of dataset providers.
By default, the configuration file is expected at $HOME/.deel/config.yml:
- On Windows systems it is `C:\Users\$USERNAME\.deel\config.yml`, unless you have set the `HOME` environment variable.
- The `DEEL_CONFIGURATION_FILE` environment variable can be used to specify the location of the configuration file if you do not want to use the default one.
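For instance, pointing the manager at a custom configuration file from a shell could look like this (the path is illustrative):

# Use a custom configuration file instead of $HOME/.deel/config.yml:
export DEEL_CONFIGURATION_FILE=/path/to/custom-config.yml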
The configuration file is a YAML file.
Two root nodes are mandatory in the configuration file:

- `providers`: the list of provider configurations.
- `path`: the local destination directory path (by default `${HOME}/.deel/datasets`).

providers:
  |- provider1
  |- provider2
  .
  .
  |- providerN
path: local destination path
`providers` is the root node of the list of provider configurations.
Each child node of the `providers` node defines a provider configuration.
The name of the child node is the name of the provider.
It may be used on the command line to specify the provider (e.g., with the `-p` option of the `download` command).
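For instance, assuming a provider named `provider1` is declared in the configuration file, the `list` command described below can target it:

python -m deel.datasets list -p provider1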
Currently, the following provider types are implemented: `webdav`, `ftp`, `http`, `local` and `gcloud`.
- The `webdav` provider will fetch datasets from a WebDAV server and needs at least the `url` configuration parameter. The WebDAV provider supports basic authentication (see example below). If the datasets are not at the root of the WebDAV server, the `folder` configuration parameter can be used to specify the remote path (see example below).
- The `ftp` provider is similar to the `webdav` provider, except that it fetches datasets from an FTP server instead of a WebDAV one; it also needs at least the `url` configuration parameter.
- The `local` provider does not require any extra configuration and simply fetches data from the specified `path`. The `copy` configuration parameter (true or false) specifies whether the dataset must be copied from `path` to the destination `path`; `copy` is false by default.
- The `gcloud` provider is similar to the `local` provider, except that it tries to locate the dataset storage location automatically based on a mounted drive. The `disk` configuration parameter is mandatory and specifies the name of the GCloud drive.
The `path` parameter indicates where the datasets should be stored locally when using a remote provider such as `webdav`, `http` or `ftp`.
Below is an example of a configuration for the DEEL dataset manager:
# Providers for the datasets - the names of the providers do not have to match
# their types:
providers:
  # A GCloud provider with a shared GCloud drive named "my-disk-name".
  gcloud:
    type: gcloud
    disk: my-disk-name
  # A local storage at "/data/dataset".
  local:
    type: local
    path: /data/dataset/
    copy: true
  # An FTP provider.
  ftp:
    type: ftp
    # The "url" parameter contains the full path (server + folder):
    url: ftp://<server_name>/<dataset path on ftp server>
    # or use "folder" to set the path to the remote dataset directory:
    # folder: <dataset path on ftp server>
    # The "auth" parameter is optional if the FTP server is public:
    auth:
      method: "simple"
      username: "${username}"
      password: "${password}"
  # A public WebDAV server.
  webdav_public:
    type: webdav
    url: https://my-public-webdav.com
  # A private WebDAV server where the datasets are not at the root.
  # Note: This example can be used with Cloud storage such as Nextcloud with
  # a shared "datasets" drive.
  webdav_private:
    type: webdav
    url: https://my-cloud-provider.com/remote.php/webdav
    folder: datasets
    auth:
      method: "simple"
      username: "${username}"
      password: "${password}"

# The local path where datasets are stored when they come from a remote provider
# (by default ${HOME}/.deel/datasets):
path: ${HOME}/.deel/datasets

You can name a provider `default` to make it the default one.
When looking for a dataset, the manager first looks it up in the `default` provider,
then in the other providers, trying them one by one in the order they are declared
until it finds one that provides the dataset.
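As a sketch, combining providers from the example above, the provider named `default` is looked up first, then the others in declaration order:

providers:
  # Looked up first because of its name:
  default:
    type: local
    path: /data/dataset/
  # Tried next, in declaration order:
  webdav_public:
    type: webdav
    url: https://my-public-webdav.com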
To uninstall the DEEL dataset manager package, simply run pip uninstall:
pip uninstall deel-datasets
Without plugins, the DEEL dataset manager can only download a dataset and return the path to the local folder containing it (after download). By installing plugins, you gain access to automatic ways of loading datasets or pre-processing data.
Plugins are available on PyPI and can be installed using pip. For example, for public
datasets:
pip install public_datasets

Plugins can also be installed from github.com/deel-ai. For example, you can install the public datasets plugin:
# SSH version (with proper SSH key setup):
pip install git+ssh://[email protected]/deel-ai/public_datasets.git
# HTTPS version:
pip install git+https://github.com/deel-ai/public_datasets.git

For DEEL project members, private plugins for the DEEL project datasets are available on the DEEL forge, where the list of available datasets can also be browsed.
# SSH version (with proper SSH key setup):
pip install git+ssh://[email protected]:<port>/DevOps/datasets/all.git
# HTTPS version:
pip install git+https://forge.deel.ai/DevOps/datasets/all.git

To load a dataset, you can simply do:
import deel.datasets
# Load the default mode of the dataset-a dataset:
dataset_a_path = deel.datasets.load("dataset-a")

# The dataset-b plugin is installed with a tensorflow mode implemented.
# Load the tensorflow version of the dataset-b dataset (default mode for dataset-b):
dataset_b_tf = deel.datasets.load("dataset-b", mode="tensorflow")
# or, since tensorflow is the default mode for dataset-b:
dataset_b_tf = deel.datasets.load("dataset-b")

# If the dataset-c plugin is installed with a pytorch mode implemented,
# load the pytorch version of the dataset-c dataset:
dataset_c_pt = deel.datasets.load("dataset-c", mode="pytorch")

The deel.datasets.load function is the basic entry point for accessing the datasets.
By passing with_info=True, extra information can be retrieved as a Python
dictionary. The information is not standardized, so each dataset may provide
different entries.
The mode argument can be used to load different "versions" of the dataset. By default,
only the path mode is available; it returns the path to the local folder
containing the dataset.
By installing plugins, new modes can be made available for each dataset (see the plugin
implementation below):
import deel.datasets
# Load the tensorflow version of the dataset-b dataset:
dataset_b_tf, info = deel.datasets.load("dataset-b", mode="tensorflow", with_info=True)
print(info["classes"])

The load function can take extra parameters depending on the chosen dataset and mode,
for instance, you can specify the percentage of training data for the blink
dataset:
import deel.datasets
# Load the tensorflow version of the blink dataset:
blink = deel.datasets.load("blink", mode="tensorflow", percent_train=60)

The deel-datasets package comes with some command line utilities that can be accessed using:
python -m deel.datasets ARGS...
The --help option can be used to view the full capabilities of the command line program.
By default, the program uses the configuration at $HOME/.deel/config.yml, but the -c
argument can be used to specify a custom configuration file.
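For instance, a run against a custom configuration might look like this (the path is illustrative, and this assumes -c is given before the sub-command):

python -m deel.datasets -c /path/to/custom-config.yml list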
The following commands are available (not exhaustive):
`list` — List the datasets available on all providers in the configuration file. If the `-p` option is used to specify a provider, this lists only the datasets available on that provider.
To list the datasets already downloaded, you can use the `--local` option.
$ python -m deel.datasets list
Listing datasets at https://datasets.server1:
dataset-a: 3.0.1 [latest], 3.0.0
dataset-b: 1.0 [latest]
dataset-c: 1.0 [latest]
$ python -m deel.datasets list --local
Listing datasets at ${HOME}/.deel/datasets:
dataset-a: 3.0.1 [latest], 3.0.0
dataset-c: 1.0 [latest]

`download NAME[:VERSION]` — Download the specified dataset. If the configuration does not specify a remote provider, this does nothing except output some information. The `:VERSION` part can be omitted, in which case `:latest` is implied. To force the re-download of a dataset, the `--force` option can be used.
$ python -m deel.datasets download dataset-a:3.0.0
Fetching dataset-a:3.0.0...
dataset-a-3.0.0-20191004.zip: 100%|█████████████████████████████████████████| 122M/122M [00:03<00:00, 39.3Mbytes/s]
Dataset dataset-a:3.0.0 stored at '${HOME}/.deel/datasets/dataset-a/3.0.0'.

`remove NAME[:VERSION]` — Remove the specified dataset from the local storage (if possible). If `:VERSION` is omitted, the whole dataset corresponding to `NAME` is deleted. If the `--all` option is used, all datasets are removed from the local storage.
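For example, sketches of removal commands built from the syntax above (dataset name and version are illustrative):

# Remove a single version of a dataset:
$ python -m deel.datasets remove dataset-a:3.0.0
# Remove all locally stored datasets:
$ python -m deel.datasets remove --all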
A DEEL dataset plugin is an extension of the Dataset class defined in the DEEL dataset manager project.
It provides access to a specific dataset's files through the load methods of the modes it defines.
Below is an example implementation of a dataset class ExampleDataset.
The load_XXX methods define the various modes, e.g., load_pytorch adds
a pytorch mode to the dataset.
The default mode (used when none is specified) can be set via the
_default_mode class attribute.
import h5py
import pathlib
import typing

from deel.datasets.dataset import Dataset
from deel.datasets.settings import Settings


class ExampleDataset(Dataset):

    # Default mode:
    _default_mode: str = "numpy"

    def __init__(
        self, version: str = "latest", settings: typing.Optional[Settings] = None
    ):
        """
        Args:
            version: Version of the dataset.
            settings: The settings to use for this dataset, or `None` to use the
                default settings.
        """
        # `data_name` is the name of the folder containing the dataset on the
        # provider (remote or local).
        super().__init__("data_name", version, settings)

    def load_numpy(self, path: pathlib.Path):
        """
        Numpy mode for this dataset.
        """
        # Dataset-specific code building `data` from the files under `path`:
        data = ...
        return data

    def load_csv(self, path: pathlib.Path):
        """
        CSV mode for this dataset.
        """
        import pandas as pd

        return pd.read_csv(path, sep=";", index_col=0)

    def load_pytorch(
        self,
        path: pathlib.Path,
        nstack: int = 4,
        transform: typing.Optional[typing.Callable] = None,
    ):
        """
        Pytorch mode for this dataset, with extra arguments that can
        be passed to the `deel.datasets.load` method using named parameters.
        """
        from .torch import SourceDataSet

        return SourceDataSet(self.load_path(path), nstack, transform)

By default, the with_info option will return a dictionary containing the name and
the version of the dataset.
If you want to provide extra information, you can return a dictionary from the load_XXX methods, e.g.:
def load_pytorch(self, path: pathlib.Path):
    # Load a pytorch dataset:
    dataset = ...
    return dataset, {"classes": ["foo", "bar"]}
- The `deel.datasets.utils` package contains utility functions to load or split `numpy`, `pytorch` and `tensorflow` image datasets in a consistent way.
- The `Dataset` class contains some utility methods to generate the information dictionary from the return values of these methods.
Here is a very simple example for loading a dataset:
def load_pytorch(self, path: pathlib.Path, image_size: typing.Tuple[int, int]):
    # Use a relative import only if you are inside the deel package:
    from ..utils import load_pytorch_image_dataset

    # Load the dataset using the utility function:
    dataset, idx_to_class = load_pytorch_image_dataset(
        self.load_path(path),  # This is required only if `load_path` modifies the path.
        image_size=image_size,
        train_split=0.7,
    )

    # The `_make_class_info` method is provided by `Dataset`:
    return dataset, self._make_class_info(idx_to_class)

To be found by the dataset manager, the ExampleDataset class must be put in
a package with a specific entry point (defined in setup.py).
The entry point allows the plugin to be discovered and used by the DEEL dataset
manager.
The name of the DEEL dataset manager entry point is unique:
plugins.deel.dataset.
It is possible to define several aliases for the same plugin by adding multiple
`alias = package:PluginClass` entries to the entry points list.
# Assuming `ExampleDataset` is in `my_dataset/__init__.py`:
from setuptools import setup

setup(
    # Other `setup` arguments:
    ...
    # Entry points:
    entry_points={
        "plugins.deel.dataset": [
            "example = my_dataset:ExampleDataset",
            "my_dataset.example = my_dataset:ExampleDataset",
        ]
    },
)

A single plugin can expose multiple datasets through different entry points.
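For instance, a sketch of a plugin exposing two datasets (the names `FooDataset`, `BarDataset` and the package `my_plugin` are illustrative):

# In the plugin's setup.py:
entry_points={
    "plugins.deel.dataset": [
        # Each entry point maps a dataset name to its class:
        "foo = my_plugin:FooDataset",
        "bar = my_plugin:BarDataset",
    ]
}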
Contributions are welcome! Feel free to report an issue or open a pull request. Take a look at our guidelines here.
Copyright IRT Antoine de Saint Exupéry et Université Paul Sabatier Toulouse III - All rights reserved. DEEL is a research program operated by IVADO, IRT Saint Exupéry, CRIAQ and ANITI - https://www.deel.ai/
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL project.