Data flow and management within a FRIDGE #107

@craddm

Description


We need to work out what the actual data flow is inside the FRIDGE.

For performance reasons, we will typically want the data to be on storage attached to the node where the workflows will run.

Here's an example:

Data Owner A has 4 TB of data to put in the FRIDGE.
Data Owner A is asked to upload it using MinIO.
MinIO thus needs to have 4 TB of space available to receive the data.
Where is that disk space?
Assume that MinIO is running on a non-GPU node on Dawn - that node needs 4 TB of additional storage attached to it just for MinIO to use.
Note that scaling the amount of storage up and down is not trivial once it's deployed: you have to add/remove additional storage pools.
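The sizing implications of this example can be sketched as a quick back-of-envelope calculation. This is a hypothetical helper (the function name and the headroom assumptions are mine, not part of any FRIDGE component), assuming a full copy of the dataset is staged onto the compute node as in option 1 below:

```python
# Hypothetical storage-planning helper; all numbers in TB.
# Assumes one full copy of the data is staged onto the compute node
# for each workflow run, in addition to the copy held by MinIO.

def required_capacity_tb(dataset_tb: float, staged_copies: int = 1) -> dict:
    """Estimate where disk space is needed for one dataset.

    dataset_tb    -- size of the data the owner uploads
    staged_copies -- full copies staged onto compute nodes for workflows
    """
    return {
        "minio_bucket_tb": dataset_tb,                  # to receive the upload
        "compute_node_tb": dataset_tb * staged_copies,  # ephemeral copies
        "total_tb": dataset_tb * (1 + staged_copies),
    }

# Data Owner A's 4 TB dataset, staged once for Researcher A's workflow:
print(required_capacity_tb(4))
```

The point the numbers make: even one researcher running one workflow can double the footprint of a dataset, and that doubling lands on whichever node the workflow is scheduled to.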

Researcher A then wants to run a workflow using that data.
How is the data made available to the Workflow pod?
Options:

  1. The Workflow requests the data as an input from the repository.
    This makes a copy of the data on ephemeral storage.
    This means there needs to be enough space to make a complete copy of the data on the GPU node where the workflow is running.
  2. Mount a volume containing the data
    This will require the data to be moved out of the MinIO bucket and onto a standard volume that can then be mounted on the workflow pod (thus we essentially need double the space?)
  3. Mount bucket as a volume
    May be possible using something like https://github.com/s3fs-fuse/s3fs-fuse
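Whichever option we pick, option 1 implies a pre-flight check: refuse to start the workflow if a complete copy of the data won't fit on the node's ephemeral storage. A minimal sketch of that check, using only the standard library (the function name and the 10% headroom default are my assumptions, not existing FRIDGE code):

```python
# Sketch of the space check option 1 would need: before pulling a dataset
# out of the object store onto the workflow node's ephemeral storage,
# verify that a complete copy will actually fit.
import shutil

def can_stage_dataset(dataset_bytes: int, scratch_path: str,
                      headroom: float = 0.1) -> bool:
    """Return True if scratch_path has room for the dataset plus headroom."""
    free = shutil.disk_usage(scratch_path).free
    return free >= dataset_bytes * (1 + headroom)

# e.g. fail fast rather than part-way through a 4 TB copy:
if not can_stage_dataset(4 * 1024**4, "/tmp"):
    print("not enough ephemeral storage for a full copy")
```

Failing fast here matters because a mid-copy failure leaves a partial dataset on the node that something then has to clean up.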

Once the workflow is complete, where does the output go?

  1. Into a MinIO bucket? There needs to be sufficient space available to MinIO.
  2. Into a volume? That output then needs to be transferred to MinIO.

Upshot: we likely need a mechanism for transferring things from MinIO buckets to volumes and vice versa
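A minimal sketch of what that transfer mechanism has to do, copy a file tree between the bucket side and the volume side while preserving structure. A real implementation would talk to MinIO itself (e.g. via the `minio` Python SDK or `mc mirror`); here local directories stand in for the bucket so the logic is runnable, and the function name and paths are hypothetical:

```python
# Filesystem stand-in for the MinIO bucket <-> volume transfer step.
import shutil
from pathlib import Path

def transfer(src: Path, dst: Path) -> int:
    """Copy every file under src into dst, preserving relative paths.

    Returns the number of files transferred.
    """
    count = 0
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy contents and metadata
            count += 1
    return count

# bucket -> volume before a workflow runs; volume -> bucket for outputs
# (paths are illustrative only):
# transfer(Path("/mnt/minio-export/input-bucket"), Path("/mnt/workflow-vol"))
```

The same routine works in both directions, which is the attraction of making it a single shared mechanism rather than two ad hoc scripts.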
