Dataset Conversion Library

This library provides tools for efficient conversion and reading of large image datasets. It supports multiple formats (e.g., Memmap, HDF5, Parquet) and is designed for both fixed-size and flexible-size datasets. With distributed processing capabilities, it ensures fast and scalable dataset preparation for machine learning workflows.

Features

Multi-format Support:
- Fixed-size: Memory-mapped (Memmap), HDF5
- Flexible-size: Parquet, Tar, PyArrow IPC (Inter-Process Communication)
Dataset Compatibility:
- Supports conversion from image folders (e.g., ImageNet1k structure) or any torch.Dataset.
Distributed Processing:
- Accelerates dataset conversion using distributed processing.
Custom Data Loaders:
- Efficient reading of converted datasets with PyTorch-compatible loaders.
Image Format Support:
- Handles various image formats and modes (e.g., RGB, grayscale).

Main Components

Converters:
- MemmapConverter, HDF5Converter, ParquetConverter, IpcConverter, TarConverter
Readers:
- MemmapReader, HDF5Reader, ParquetReader, IpcReader, TarReader
Utilities:
- Custom samplers, preprocessing tools, and distributed processing decorators.

Installation

Local Setup

Ensure Python 3 is installed on your system.

Clone this repository:

git clone https://github.com/your-repo/dataloader_bachelor_project.git
cd dataloader_bachelor_project

Run the setup script:
```
bash setup_local.sh
```
This will:
- Create a virtual environment (dataloadenv).
- Install required dependencies (e.g., PyTorch, torchvision, tqdm, etc.).
Activate the virtual environment:
```
source dataloadenv/bin/activate
```

Cluster Setup

Connect to your cluster and navigate to the project directory.
Run the cluster setup script:
```
source setup_cluster.sh
```
This will:
- Load necessary modules (e.g., PyTorch, CUDA).
- Create and activate a virtual environment.
- Install required dependencies.

Note: The setup_cluster.sh script uses the ml command to load modules. Adjust it based on your cluster's configuration.

Usage

You can provide either:

Folder Structure

The input dataset must be organized into a root directory with one subfolder per class. Each subfolder contains the images belonging to that class. Example:

root/
  ├── class1/
  │     ├── img1.jpg
  │     ├── img2.jpg
  ├── class2/
  │     ├── img3.jpg
  │     ├── img4.jpg

The folder name is used as the class label.
Supported image formats: .jpg, .png, etc.

Pytorch Dataset

The library also supports input as a PyTorch torch.utils.data.Dataset object. However, when using a FixedSizeConverter (e.g., MemmapConverter, HDF5Converter), preprocessing is required to ensure that all images have the same size and mode (e.g., RGB).

In this case, it is mandatory to apply the provided ResizeAndConvert transformation:

from utils.utils import ResizeAndConvert

transform = ResizeAndConvert(size=(224, 224), mode="RGB")

This guarantees that the dataset is compatible with fixed-size storage formats.

Fixed-Size Datasets

Example: Using `MemmapConverter`

from datasets.memmap.memmap_converter import MemmapConverter

# Initialize the converter
converter = MemmapConverter(
    input_data="path/to/your/dataset",
    output_path="path/to/output",
    images_per_file=1000,
    batch_size=100,
    num_workers=4,
    shape=(224, 224),
    im_mode="RGB"
)

# Convert the dataset
converter.convert()

Notes:

Ensure the batch size divides evenly into the number of images per file.

Example: Using a Torch Dataset

from datasets.memmap.memmap_converter import MemmapConverter
from utils.utils import ResizeAndConvert
from torchvision.datasets import MNIST

# Prepare the dataset
dataset = MNIST(
    root="path/to/mnist",
    train=True,
    download=True,
    transform=ResizeAndConvert((224, 224), "RGB")
)

# Initialize the converter
converter = MemmapConverter(
    input_data=dataset,
    output_path="path/to/output",
    images_per_file=1000,
    batch_size=100,
    num_workers=4,
    shape=(224, 224),
    im_mode="RGB"
)

# Convert the dataset
converter.convert()

Flexible-Size Datasets

Example: Using `IpcConverter`

from datasets.pyarrow_ipc.pyarrow_ipc_converter import IpcConverter

# Initialize the converter
converter = IpcConverter(
    input_data="path/to/your/dataset",
    output_path="path/to/output",
    images_per_file=1000,
    batch_size=32,
    num_workers=4
)

# Convert the dataset
converter.convert()

Note: For flexible-size datasets, image size and type are automatically inferred from the original dataset.

Reading Converted Datasets

Example: Using `MemmapReader`

from datasets.memmap.memmap_reader import MemmapReader
from torch.utils.data import DataLoader
from torchvision import transforms

# Define transformations
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Initialize the reader
reader = MemmapReader(dataset_path="path/to/converted/dataset")

# Create a DataLoader
dataloader = DataLoader(
    reader,
    batch_size=32,
    num_workers=4
)

# Iterate through the dataset
for images, labels in dataloader:
    # Your processing code here
    pass

Note: On Windows, use the worker_init_fn method to handle file handler pickling:

dataloader = DataLoader(
    reader,
    batch_size=32,
    num_workers=4,
    worker_init_fn=MemmapReader.worker_init_fn
)

Distributed Conversion

Python Script

import torch.distributed as dist

def main():
    rank = int(os.environ['SLURM_PROCID'])
    world_size = int(os.environ['SLURM_NPROCS'])
    
    dist.init_process_group('nccl', world_size=world_size, rank=rank)
    # Initialize the MemmapReader with distributed support
    dataset = MemmapConverter(dataset_path="path/to/dataset", dist=True, **kwargs)

Batch Script

#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=12
#SBATCH --gres=gpu:0

MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
MASTER_ADDR="${MASTER_ADDR}i"
export MASTER_ADDR="$(nslookup "$MASTER_ADDR" | grep -oP '(?<=Address: ).*')"
export MASTER_PORT=6000

Extending the Library

To add support for new formats:

Subclass ConverterFixedSize or ConverterFlexibleSize.
Implement the _custom_collate_fn method for preprocessing batches.
Define methods for saving data to disk (_write_data_to_disk) or initializing arrays (_init_arrays).

Example: Extention to Dense Segmentation Labels

To support datasets with dense segmentation labels (e.g., per-pixel class annotations), we provide an example extension of the library. We created two new subclasses:

FixedSizeConverter:
- DenseMemmapConverter
- DenseMemmapReader
FlexibleSizeConverter
- DenseIpcConverter
- DenseIpcReader

These classes are adapted versions of the standard MemmapConverter and MemmapReader, with the following changes:

Label Shape:

Instead of storing labels as single integers (shape: (num_instances,)), dense labels are stored as arrays with shape (num_instances, height, width, channels).

Target Transform Support:

A target_transform can be applied to the labels during loading to handle preprocessing (e.g., resizing, normalization).

Usage Examples

Refer to the examples directory for detailed use cases:

Caltech256: Local testing.
ImageNet1k: Cluster environments.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Dataset Conversion Library

Features

Main Components

Installation

Local Setup

Cluster Setup

Usage

Folder Structure

Pytorch Dataset

Fixed-Size Datasets

Example: Using `MemmapConverter`

Example: Using a Torch Dataset

Flexible-Size Datasets

Example: Using `IpcConverter`

Reading Converted Datasets

Example: Using `MemmapReader`

Distributed Conversion

Python Script

Batch Script

Extending the Library

Example: Extention to Dense Segmentation Labels

Usage Examples

License

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
datasets		datasets
examples		examples
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup_cluster.sh		setup_cluster.sh
setup_local.sh		setup_local.sh

License

helmholtz-analytics/hpc_pytorch_loader

Folders and files

Latest commit

History

Repository files navigation

Dataset Conversion Library

Features

Main Components

Installation

Local Setup

Cluster Setup

Usage

Folder Structure

Pytorch Dataset

Fixed-Size Datasets

Example: Using MemmapConverter

Example: Using a Torch Dataset

Flexible-Size Datasets

Example: Using IpcConverter

Reading Converted Datasets

Example: Using MemmapReader

Distributed Conversion

Python Script

Batch Script

Extending the Library

Example: Extention to Dense Segmentation Labels

Usage Examples

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Example: Using `MemmapConverter`

Example: Using `IpcConverter`

Example: Using `MemmapReader`

Packages