Skip to content
7 changes: 4 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@
<img alt="JZFS Logo" src="https://github.com/GitDataAI/jzfs/blob/main/docs/jzfs-logo-words.png" width="400px">
</picture>
</p>

<h2 align="center">Git Based & Version Control & Joint Management <br/>for code, data, model and their relationship</h2>

> Delivers distributed data management system that keeps track of your data from code to PB scale dataset and ensures reproducibility.
Expand Down Expand Up @@ -58,7 +57,9 @@

JZFS adapts principles of open-source software development and distribution to address the technical challenges of data management, data sharing, and digital provenance collection across the life cycle of digital objects.

<div style="text-align: center"><img src="./docs/jzfs-joint-management.png" width="400" /></div>
<p align="center" width="100%">
<img style="align: center" alt="joint management of code,data,model,..." src="./docs/jzfs-joint-management.png" width="400" />
</p>

Compared with code in software development, data tend not to be as precisely
identified because data versioning is rarely or only coarsely practiced. Scientific computation
Expand Down Expand Up @@ -93,7 +94,7 @@ all experimental metadata about the complete timeline of longitudinal and multim
including MRI, histology, electrophysiology, and behavior.

![](./docs/jzfs-research-flow.png)
Project planning and experimental details are recorded in an in-house relational cloud-based database.
Project planning and experimental details are recorded in an in-house relational cloud-based database of GitData.AI.

A key element for both the database and the data storage is the
identifier, the study ID for each animal, used in a standardized fle name structure to make the data findable.
Expand Down
Binary file modified docs/jzfs-logo-words-light.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/jzfs-logo-words.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified docs/jzfs-research-flow.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
75 changes: 75 additions & 0 deletions docs/jzfs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
DataLad is a Python-based tool for the joint management of code, data, and their relationship,
built on top of a versatile system for data logistics (git-annex) and the most popular distributed
version control system (Git). It adapts principles of open-source software development and
distribution to address the technical challenges of data management, data sharing, and digital
provenance collection across the life cycle of digital objects. DataLad aims to make data
management as easy as managing code. It streamlines procedures to consume, publish, and
update data, for data of any size or type, and to link them as precisely versioned, lightweight
dependencies. DataLad helps to make science more reproducible and FAIR (Wilkinson et al.,
2016). It can capture complete and actionable process provenance of data transformations to
enable automatic re-computation. The DataLad project (datalad.org) delivers a completely
open, pioneering platform for flexible decentralized research data management (RDM) (Hanke,
Pestilli, et al., 2021). It features a Python and a command-line interface, an extensible
architecture, and does not depend on any centralized services but facilitates interoperability
with a plurality of existing tools and services. In order to maximize its utility and target audience, DataLad is available for all major operating systems, and can be integrated into
established workflows and environments with minimal friction.


Statement of Need
Code, data and computing environments are core components of scientific projects. While
the collaborative development and use of research software and code is streamlined with established procedures and infrastructures, such as software distributions, distributed version
control systems, and social coding portals like GitHub, other components of scientific projects
are not as transparently managed or accessible. Data consumption is complicated by disconnected data portals that require a large variety of different data access and authentication
methods. Compared with code in software development, data tend not to be as precisely
identified because data versioning is rarely or only coarsely practiced. Scientific computation
is not reproducible enough, because data provenance, the information of how a digital file
came to be, is often incomplete and rarely automatically captured. Last but not least, in
the absence of standardized data packages, there is no uniform way to declare actionable
data dependencies and derivative relationships between inputs and outputs of a computation. DataLad aims to solve these issues by providing streamlined, transparent management
of code, data, computing environments, and their relationship. It provides targeted interfaces
and interoperability adapters to established scientific and commercial tools and services to
set up unobstructed, unified access to all elements of scientific projects. This unique set of
features enables workflows that are particularly suited for reproducible science, such as actionable process provenance capture for arbitrary command execution that affords automatic
re-execution. To this end, it builds on and extends two established tools for version control
and transport logistics, Git and git-annex.


Why Git and git-annex?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Headings rendered as plain text – prefix with Markdown ##

Statement of Need, Why Git and git-annex?, and subsequent section titles are missing the #/## prefix, so they render as body text rather than headings. This hurts document structure and TOC generation.

-Statement of Need
+## Statement of Need
...
-Why Git and git-annex?
+## Why Git and git-annex?

Repeat for other section titles.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Statement of Need
Code, data and computing environments are core components of scientific projects. While
the collaborative development and use of research software and code is streamlined with established procedures and infrastructures, such as software distributions, distributed version
control systems, and social coding portals like GitHub, other components of scientific projects
are not as transparently managed or accessible. Data consumption is complicated by disconnected data portals that require a large variety of different data access and authentication
methods. Compared with code in software development, data tend not to be as precisely
identified because data versioning is rarely or only coarsely practiced. Scientific computation
is not reproducible enough, because data provenance, the information of how a digital file
came to be, is often incomplete and rarely automatically captured. Last but not least, in
the absence of standardized data packages, there is no uniform way to declare actionable
data dependencies and derivative relationships between inputs and outputs of a computation. DataLad aims to solve these issues by providing streamlined, transparent management
of code, data, computing environments, and their relationship. It provides targeted interfaces
and interoperability adapters to established scientific and commercial tools and services to
set up unobstructed, unified access to all elements of scientific projects. This unique set of
features enables workflows that are particularly suited for reproducible science, such as actionable process provenance capture for arbitrary command execution that affords automatic
re-execution. To this end, it builds on and extends two established tools for version control
and transport logistics, Git and git-annex.
Why Git and git-annex?
## Statement of Need
Code, data and computing environments are core components of scientific projects. While
the collaborative development and use of research software and code is streamlined with established procedures and infrastructures, such as software distributions, distributed version
control systems, and social coding portals like GitHub, other components of scientific projects
are not as transparently managed or accessible. Data consumption is complicated by disconnected data portals that require a large variety of different data access and authentication
methods. Compared with code in software development, data tend not to be as precisely
identified because data versioning is rarely or only coarsely practiced. Scientific computation
is not reproducible enough, because data provenance, the information of how a digital file
came to be, is often incomplete and rarely automatically captured. Last but not least, in
the absence of standardized data packages, there is no uniform way to declare actionable
data dependencies and derivative relationships between inputs and outputs of a computation. DataLad aims to solve these issues by providing streamlined, transparent management
of code, data, computing environments, and their relationship. It provides targeted interfaces
and interoperability adapters to established scientific and commercial tools and services to
set up unobstructed, unified access to all elements of scientific projects. This unique set of
features enables workflows that are particularly suited for reproducible science, such as actionable process provenance capture for arbitrary command execution that affords automatic
re-execution. To this end, it builds on and extends two established tools for version control
and transport logistics, Git and git-annex.
## Why Git and git-annex?
🧰 Tools
🪛 LanguageTool

[style] ~26-~26: ‘Last but not least’ might be wordy. Consider a shorter alternative.
Context: ...lete and rarely automatically captured. Last but not least, in the absence of standardized data pa...

(EN_WORDINESS_PREMIUM_LAST_BUT_NOT_LEAST)

🤖 Prompt for AI Agents
In docs/jzfs.md between lines 17 and 37, the section titles like "Statement of
Need" and "Why Git and git-annex?" are missing Markdown heading prefixes (# or
##), causing them to render as plain text instead of headings. To fix this, add
appropriate Markdown heading prefixes (e.g., ##) before each section title to
ensure they render correctly as headings and improve document structure and TOC
generation. Repeat this fix for all other section titles throughout the
document.

Git is the most popular version control system for software development1
. It is a distributed
content management system, specifically tuned towards managing and collaborating on text
files, and excels at making all committed content reliably and efficiently available to all clones
of a repository. At the same time, Git is not designed to efficiently handle large (e.g., over
a gigabyte) or binary files (see, e.g., Kenlon, 2016). This makes it hard or impossible to
use Git directly for distributed data storage with tailored access to individual files. Gitannex takes advantage of Git’s ability to efficiently manage textual information to overcome
this limitation. File content handled by git-annex is placed into a managed repository annex,
which avoids committing the file content directly to Git. Instead, git-annex commits a compact
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix footnote artefact and git-annex spelling

  • The “1” after “software development” looks like a dangling footnote marker – either add the reference or remove it.
  • “Gitannex” is misspelled; the project name is “git-annex”.
- system for software development1 .
+ system for software development.
...
- Gitannex takes advantage of Git’s ability
+ git-annex takes advantage of Git’s ability
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Git is the most popular version control system for software development1
. It is a distributed
content management system, specifically tuned towards managing and collaborating on text
files, and excels at making all committed content reliably and efficiently available to all clones
of a repository. At the same time, Git is not designed to efficiently handle large (e.g., over
a gigabyte) or binary files (see, e.g., Kenlon, 2016). This makes it hard or impossible to
use Git directly for distributed data storage with tailored access to individual files. Gitannex takes advantage of Git’s ability to efficiently manage textual information to overcome
this limitation. File content handled by git-annex is placed into a managed repository annex,
which avoids committing the file content directly to Git. Instead, git-annex commits a compact
Git is the most popular version control system for software development.
It is a distributed
content management system, specifically tuned towards managing and collaborating on text
files, and excels at making all committed content reliably and efficiently available to all clones
of a repository. At the same time, Git is not designed to efficiently handle large (e.g., over
a gigabyte) or binary files (see, e.g., Kenlon, 2016). This makes it hard or impossible to
use Git directly for distributed data storage with tailored access to individual files. git-annex takes advantage of Git’s ability to efficiently manage textual information to overcome
this limitation. File content handled by git-annex is placed into a managed repository annex,
which avoids committing the file content directly to Git. Instead, git-annex commits a compact
🧰 Tools
🪛 LanguageTool

[grammar] ~38-~38: Ensure spelling is correct
Context: ...lar version control system for software development1 . It is a distributed content management ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~44-~44: Ensure spelling is correct
Context: ...th tailored access to individual files. Gitannex takes advantage of Git’s ability to eff...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In docs/jzfs.md around lines 38 to 46, remove the dangling footnote marker "1"
after "software development" by either deleting it or adding the appropriate
reference, and correct the spelling of "Gitannex" to "git-annex" wherever it
appears to match the official project name.

reference, typically derived from the checksum of a file’s content, that enables identification
and association of a file name with the content. Using these identifiers, git-annex tracks
content availability across all repository clones and external resources such as URLs pointing
to individual files on the web. Upon user request, git-annex automatically manages data
transport to and from a local repository annex at a granularity of individual files. With this
simple approach, git-annex enables separate and optimized implementations for identification
and transport of arbitrarily large files, using an extensible set of protocols, while retaining the
distributed nature and compatibility with versatile workflows for versioning and collaboration
provided by Git.


What does DataLad add to Git and git-annex?
Easy to use modularization. Research workflows impose additional demands for an efficient research data management platform besides version control and data transport. Many
research datasets contain millions of files, but a large number of files precludes managing
such a dataset in a single Git repository, even if the total storage demand is not huge. Partitioning such datasets into smaller, linked components (e.g., one subdataset per sample in
a dataset comprising thousands) allows for scalable management. Research datasets and
projects can also be heterogeneous, comprising different data sources or evolving data across
different processing stages, and with different pace. Beyond scalability, modularization into
homogeneous components also enables efficient reuse of a selected subset of datasets and for
recording a derivative relationship between datasets. Git’s submodule mechanism provides a
way to nest individual repositories via unambiguously versioned linkage, but Git operations
must still be performed within each individual repository. To achieve modularity without impeding usability, DataLad simplifies working with the resulting hierarchies of Git repositories
via recursive operations across dataset boundaries. With this, DataLad provides a “monorepo”-like user experience in datasets with arbitrarily deep nesting, and makes it trivial to
operate on individual files deep in the hierarchy or entire trees of datasets. A testament of
this is datasets.datalad.org, created as the project’s initial goal to provide a data distribution
with unified access to already available public data archives in neuroscience, such as crcns.org
and openfmri.org. It is curated by the DataLad team and provides, at the time of publication,
streamlined access to over 260 TBs of data across over 5,000 subdatasets from a wide range
of projects and dozens of archives in a fully modularized way.