-
Notifications
You must be signed in to change notification settings - Fork 187
[WIP] Suggested modifications to directory layout of the bids-study DatasetType
#2191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #2191 +/- ##
=======================================
Coverage 82.71% 82.71%
=======================================
Files 20 20
Lines 1608 1608
=======================================
Hits 1330 1330
Misses 278 278 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
DatasetType
This makes a lot of sense to me, and would help everyone to more straightforwardly differentiate source-data and raw-bids much more easily. |
This solution feels much clearer, I think it's a good idea ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two notes on the schema. I don't have a strong opinion about bids-raw/
, except to note that this will be the first use of a hyphen that doesn't separate key-value entities. If you want to be consistent with older versions of the spec, you could call it rawdata/
. Many people seemed to think that was supposed to be valid...
src/schema/rules/directories.yaml
Outdated
# No naming convention applies, and the requirement level and opacity would be superfluous. | ||
# | ||
study: | ||
bids-study: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These top-level keys are DatasetType
values. If you want them to be more mnemonic, you'll need some alternative way to match to DatasetType
.
I would also recommend against hyphens in keys, as we have heretofore used valid identifiers, which makes traversing the schema using dot notation (rules.directories.study
) possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry this was not the right place to suggest changes to DatasetType
values.
The idea was to introduce bids-
prefix in the directory names as an easier way to infer DatasetType
similar to sub-
and ses-
directories. But that would possibly only work for bids-raw
and bids-derivative
subdirs within a study
layout. Not sure if that's ideal at this point, so will simply revert to original naming.
src/schema/rules/directories.yaml
Outdated
bids-raw: | ||
name: bids-raw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here name
would contain the actual intended directory name, while the key is just a value that shows up in subdirs
. You could do something like:
bids-raw: | |
name: bids-raw | |
bids_raw: | |
name: bids-raw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will drop the bids
prefix idea and go back to calling it rawdata
and rely on docs to explain that this is a raw
bids Datatype
clarify any confusion with the sourcedata
. Does this work?
derivatives: | ||
name: derivatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I oppose this change on the grounds that derivatives/
already has a defined meaning, and a different name would imply a different meaning. For example bids-derivatives/
would probably be expected to contain only BIDS-Derivatives datasets, and no nonstandard derivatives (like freesurfer/
). If you're not changing the meaning, don't change the name.
Further, using the same directory names as other BIDS dataset types allows for code that recursively indexes datasets to care very little about the specific DatasetType
of the current dataset. The proposal to add a raw directory would be an addition, but I think it's sensible to be able to crawl sourcedata/
and derivatives/
the same as any other dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point which I mostly agree. Although, I think what bids
prefix would imply is a bit unclear.
Currently since the spec allows nonstandard pipeline output inside a valid BIDS dataset, I didn't think bids-derivatives
DatasetType
would imply change in that policy / meaning.
But based on earlier point, the proposal of bids
suffix doesn't seem ideal here, so will revert to derivatives
phenotype: | ||
name: phenotype | ||
level: optional | ||
opaque: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was removed after this discussion: #2185 (comment)
I didn't see a justification for re-adding it, so I want to make sure this isn't slipped in with little consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, should have explained this addition based on offline conversations.
The reason for removing phenotype
directory was that it would presumably appear in the raw dataset. But isn't that true for derivatives
directory as well? It's a bit confusing having derivatives
part of all three DatasetType
s but enforcing phenotype
only be part of raw
or derivative
. In practice, phenotype
data curation is likely to be more independent than that of derivatives
, which is one of the reasons we prefer it to be at the top level.
@effigies - Thanks for the feedback and patiently dealing with these multiple intertwined open issues and hurried PR from my side - a result of mixed-up online and offline conversations. Will respond individually to the inline comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few thoughts:
-
I think we need to be more clear about what
bids-study
is. If I was a naive user, I would thinkbids-study
is a layout for organizing my study in "the bids way". Meaning that as soon as I move from unorganizedsourcedata
to some sort of organized data, it should be BIDS compliant. Thus, I do not need a bids prefix forraw
orderivatives
since my assumption is "everything BIDS". If we want to expand bids-study to allow it to be consumed by other schemas or standards (e.g., Nipoppy), I think we can find a different way than the folder-naming to make them compatible and validate. E.g., adding expanded validation rules in the sense of „if type bids-study is used, everything that BIDS has a spec on (raw, derivatives, phenotype etc.) needs to be in BIDS standard“ or „„if type bids-study is used, at least X% of data needs to be BIDS“ or add an output-validator file that lists which of the data was validated by BIDS. (idk if any of those suggestions make sense, it’s just wild ideas :-D) -
It is true that
phenotype
is supposed to be on the level of theraw
data. However, the<measurement_tool_name>.tsv/.json
are connected to the (recommended)sessions.tsv/.json
file which should be kept at the root level. Thus, it would be more convenient to not have too deep nesting ofphenotype
. Also, as measurement data can also evolve from source to raw, the<measurement_tool_name>.tsv/.json
might contain information for both types. -
in offline conversation we also had the discussion about making
bids-study
not aDatasetType
but aStudyType
. Reasoning for that would be to have a clear specification forStudyType
, maybe the validation would be wasier than changing the nesting logic of existing DatasetTypes and it is expandable to e.g. Meta or Mega Analysis etc. Wdyt?
This is a continuation of discussion on BIDS
DatasetType
started in #1972 and #2185. The proposedDatasetType
s include: bids-study
,raw
, andderivative
.I think this is a fantastic idea with the capacity to curate modular and nested standardized datasets.
However, the directory layout for
bids-study
proposed in #2185 is suboptimal from the data visibility perspective. Currently, thebids-study
does not have a root-level subdirectory forbids-raw
dataset. Based on previous conversations it was suggested thatbids-raw
dataset can be stowed insidesourcedata
dir. This hidden location for a valid and probably the most commonDatasetType
is probably not ideal and will confuse new users.So, I would like to make a case for treating
sourcedata
,bids-raw
, andbids-derivative
with equal importance by putting them on the same root-level inside thebids-study
directory tree. In my experience this is more intuitive and helpful for data management of neuroimaging studies, where potentially different people will handle these threeDatasetType
s.The suggested
bids-
prefix for these directories is mostly to avoid possible confusion betweensource
vsraw
data connotations from past discussions and to indicate each of them can be stand-alone BIDS datasets.Happy to hear your thoughts @michellewang, @jbpoline, @yarikoptic, @effigies, @mathdugre, @julia-pfarr, @nburgos, @AliceJoubert, @Adam-Ismaili-92, @surchs