Skip to content

Conversation

nikhil153
Copy link

@nikhil153 nikhil153 commented Sep 2, 2025

This is a continuation of discussion on BIDS DatasetType started in #1972 and #2185. The proposed DatasetTypes include: bids-study, raw, and derivative.

I think this is a fantastic idea with the capacity to curate modular and nested standardized datasets.

However, the directory layout for bids-study proposed in #2185 is suboptimal from the data visibility perspective. Currently, the bids-study does not have a root-level subdirectory for bids-raw dataset. Based on previous conversations it was suggested that bids-raw dataset can be stowed inside sourcedata dir. This hidden location for a valid and probably the most common DatasetType is probably not ideal and will confuse new users.

So, I would like to make a case for treating sourcedata, bids-raw, and bids-derivative with equal importance by putting them on the same root-level inside the bids-study directory tree. In my experience this is more intuitive and helpful for data management of neuroimaging studies, where potentially different people will handle these three DatasetTypes.

The suggested bids- prefix for these directories is mostly to avoid possible confusion between source vs raw data connotations from past discussions and to indicate each of them can be stand-alone BIDS datasets.

Happy to hear your thoughts @michellewang, @jbpoline, @yarikoptic, @effigies, @mathdugre, @julia-pfarr, @nburgos, @AliceJoubert, @Adam-Ismaili-92, @surchs

Copy link

codecov bot commented Sep 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.71%. Comparing base (73281ae) to head (e0fe3ce).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2191   +/-   ##
=======================================
  Coverage   82.71%   82.71%           
=======================================
  Files          20       20           
  Lines        1608     1608           
=======================================
  Hits         1330     1330           
  Misses        278      278           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@nikhil153 nikhil153 changed the title [WIP] Suggested modifications to directory layout of the bids-study Dataset… [WIP] Suggested modifications to directory layout of the bids-study DatasetType Sep 2, 2025
@jbpoline
Copy link
Contributor

jbpoline commented Sep 3, 2025

This makes a lot of sense to me, and would help everyone to more straightforwardly differentiate source-data and raw-bids much more easily.

@AliceJoubert
Copy link

This solution feels much clearer, I think it's a good idea !

Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two notes on the schema. I don't have a strong opinion about bids-raw/, except to note that this will be the first use of a hyphen that doesn't separate key-value entities. If you want to be consistent with older versions of the spec, you could call it rawdata/. Many people seemed to think that was supposed to be valid...

# No naming convention applies, and the requirement level and opacity would be superfluous.
#
study:
bids-study:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These top-level keys are DatasetType values. If you want them to be more mnemonic, you'll need some alternative way to match to DatasetType.

I would also recommend against hyphens in keys, as we have heretofore used valid identifiers, which makes traversing the schema using dot notation (rules.directories.study) possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry this was not the right place to suggest changes to DatasetType values.
The idea was to introduce bids- prefix in the directory names as an easier way to infer DatasetType similar to sub- and ses- directories. But that would possibly only work for bids-raw and bids-derivative subdirs within a study layout. Not sure if that's ideal at this point, so will simply revert to original naming.

Comment on lines 37 to 38
bids-raw:
name: bids-raw
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here name would contain the actual intended directory name, while the key is just a value that shows up in subdirs. You could do something like:

Suggested change
bids-raw:
name: bids-raw
bids_raw:
name: bids-raw

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will drop the bids prefix idea and go back to calling it rawdata and rely on docs to explain that this is a raw bids Datatype clarify any confusion with the sourcedata. Does this work?

Comment on lines 35 to 36
derivatives:
name: derivatives
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I oppose this change on the grounds that derivatives/ already has a defined meaning, and a different name would imply a different meaning. For example bids-derivatives/ would probably be expected to contain only BIDS-Derivatives datasets, and no nonstandard derivatives (like freesurfer/). If you're not changing the meaning, don't change the name.

Further, using the same directory names as other BIDS dataset types allows for code that recursively indexes datasets to care very little about the specific DatasetType of the current dataset. The proposal to add a raw directory would be an addition, but I think it's sensible to be able to crawl sourcedata/ and derivatives/ the same as any other dataset.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point which I mostly agree. Although, I think what bids prefix would imply is a bit unclear.
Currently since the spec allows nonstandard pipeline output inside a valid BIDS dataset, I didn't think bids-derivatives DatasetType would imply change in that policy / meaning.
But based on earlier point, the proposal of bids suffix doesn't seem ideal here, so will revert to derivatives

Comment on lines +53 to +56
phenotype:
name: phenotype
level: optional
opaque: false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was removed after this discussion: #2185 (comment)

I didn't see a justification for re-adding it, so I want to make sure this isn't slipped in with little consideration.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, should have explained this addition based on offline conversations.
The reason for removing phenotype directory was that it would presumably appear in the raw dataset. But isn't that true for derivatives directory as well? It's a bit confusing having derivatives part of all three DatasetTypes but enforcing phenotype only be part of raw or derivative. In practice, phenotype data curation is likely to be more independent than that of derivatives, which is one of the reasons we prefer it to be at the top level.

@effigies effigies added the discussion ongoing discussion label Sep 5, 2025
@nikhil153
Copy link
Author

@effigies - Thanks for the feedback and patiently dealing with these multiple intertwined open issues and hurried PR from my side - a result of mixed-up online and offline conversations. Will respond individually to the inline comments.

Copy link
Member

@julia-pfarr julia-pfarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts:

  1. I think we need to be more clear about what bids-study is. If I was a naive user, I would think bids-study is a layout for organizing my study in "the bids way". Meaning that as soon as I move from unorganized sourcedata to some sort of organized data, it should be BIDS compliant. Thus, I do not need a bids prefix for raw or derivatives since my assumption is "everything BIDS". If we want to expand bids-study to allow it to be consumed by other schemas or standards (e.g., Nipoppy), I think we can find a different way than the folder-naming to make them compatible and validate. E.g., adding expanded validation rules in the sense of „if type bids-study is used, everything that BIDS has a spec on (raw, derivatives, phenotype etc.) needs to be in BIDS standard“ or „„if type bids-study is used, at least X% of data needs to be BIDS“ or add an output-validator file that lists which of the data was validated by BIDS. (idk if any of those suggestions make sense, it’s just wild ideas :-D)

  2. It is true that phenotype is supposed to be on the level of the raw data. However, the <measurement_tool_name>.tsv/.json are connected to the (recommended) sessions.tsv/.json file which should be kept at the root level. Thus, it would be more convenient to not have too deep nesting of phenotype. Also, as measurement data can also evolve from source to raw, the <measurement_tool_name>.tsv/.json might contain information for both types.

  3. in offline conversation we also had the discussion about making bids-study not a DatasetType but a StudyType. Reasoning for that would be to have a clear specification for StudyType, maybe the validation would be wasier than changing the nesting logic of existing DatasetTypes and it is expandable to e.g. Meta or Mega Analysis etc. Wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion ongoing discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants