Commit 43f4c52

Merge pull request #72 from vtraag/revise/OCDsprint
Revised data/code indicators
2 parents c4fc73d + 5726fae commit 43f4c52

File tree

4 files changed: +99 −29 lines


indicator_templates/quarto/1_open_science/prevalence_open_fair_data_practices.qmd

Lines changed: 15 additions & 15 deletions
@@ -66,31 +66,27 @@ PLOS also provides API's to search its database. This [page](https://api.plos.or

### Level of FAIRness of data

- Metrics on the level of FAIRness of data (sources) can support in establishing the prevalence of open/FAIR data practices. This metric attempts to show in a more nuanced manner where FAIR data practices are used and in some cases even to what extent they are used. Assessing whether or not a data source practices FAIR principles is not trivial with a quick glance, but there are initiatives that developed methodologies that assist to determine this for (a large number of) data sources.
+ Metrics on the level of FAIRness of data (sources) can support in establishing the prevalence of open/FAIR data practices. This metric attempts to show in a more nuanced manner where FAIR data practices are used and in some cases even to what extent they are used. Assessing whether or not a data source practices FAIR principles is not trivial with a quick glance, but there are some initiatives that developed methodologies that assist to determine this for (a large number of) data sources.

#### Measurement.

##### Existing methodologies

###### Research Data Alliance

- The Research Data Alliance developed a [FAIR Data Maturity Model](https://www.rd-alliance.org/group/fair-data-maturity-model-wg/outcomes/fair-data-maturity-model-specification-and-guidelines-0) that can help to assess whether or not data adheres to the FAIR principles. This document is not meant to be a normative model, but provide guidelines for informed assessment.
+ The Research Data Alliance developed a FAIR Data Maturity Model [@group_fair_2020] that can help to assess whether or not data adheres to the FAIR principles. This document is not meant to be a normative model, but to provide guidelines for informed assessment.

- The [document](https://www.rd-alliance.org/system/files/FAIR%20Data%20Maturity%20Model_%20specification%20and%20guidelines_v1.00.pdf) includes a set of indicators for each of the four FAIR principles that can be used to assess whether or not the principles are met. Each indicator is described in detail and its relevance is annotated (essential, important or useful). The model recommends to evaluate the maturity of each indicator with the following set of maturity categories:
+ The FAIR Data Maturity Model includes a set of indicators for each of the four FAIR principles that can be used to assess whether or not the principles are met. Each indicator is described in detail and its relevance is annotated (essential, important or useful). The model recommends evaluating the maturity of each indicator with the following set of maturity categories:

- 0 – not applicable
- 1 – not being considered yet
- 2 – under consideration or in planning phase
- 3 – in implementation phase
- 4 – fully implemented
+ 0. Not applicable
+ 1. Not being considered yet
+ 2. Under consideration or in planning phase
+ 3. In implementation phase
+ 4. Fully implemented

By following this methodology, one could assess to what extent the FAIR data practices are adhered to and create comprehensive overviews, for instance by showing the scores in radar charts.
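
To illustrate the kind of overview mentioned above, the sketch below plots per-principle maturity scores on the 0–4 scale as a radar chart. The scores and the aggregation to one value per FAIR principle are assumptions made for the example; they are not prescribed by the RDA model.

```python
# Illustrative sketch: plot (hypothetical) maturity scores (0-4) per FAIR
# principle as a radar chart. The scores below are made-up examples.
import numpy as np
import matplotlib.pyplot as plt

scores = {"Findable": 3.2, "Accessible": 2.5, "Interoperable": 1.8, "Reusable": 2.0}

labels = list(scores)
values = list(scores.values())
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
# Close the polygon by repeating the first value/angle.
values += values[:1]
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 4)  # maturity categories range from 0 to 4
ax.set_title("FAIR data maturity (RDA model)")
plt.show()
```
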

- Data life cycle assessment
+ ###### Data life cycle assessment

Determining the level of FAIR data practices can involve assessing how well data adheres to the FAIR principles at each stage of the data lifecycle, from creation to sharing and reuse [@jacob2019].

@@ -100,9 +96,13 @@ Evaluate adherence to FAIR principles at each stage: For each stage of the data

Determine the overall level of FAIR data practices: Once the scores for each principle and stage have been assigned, determine the overall level of FAIR data practices. This can be done by using a summary score that takes into account the scores for each principle and stage, or by assigning a level of FAIR data practices based on the average score across the principles and stages.
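
As a minimal sketch of the summary-score approach described above: the stages, principles and scores below are hypothetical placeholders, and the unweighted mean is only one possible aggregation.

```python
# Minimal sketch of a summary score across lifecycle stages and FAIR
# principles. Stages, principles and scores are hypothetical placeholders.
scores = {
    "creation": {"F": 2, "A": 3, "I": 1, "R": 2},
    "sharing":  {"F": 3, "A": 4, "I": 2, "R": 3},
    "reuse":    {"F": 2, "A": 3, "I": 2, "R": 2},
}

# Average per principle across stages.
principles = {p for stage in scores.values() for p in stage}
per_principle = {
    p: sum(stage[p] for stage in scores.values()) / len(scores)
    for p in sorted(principles)
}

# Overall level: unweighted mean over all principle/stage combinations.
overall = sum(v for stage in scores.values() for v in stage.values()) / sum(
    len(stage) for stage in scores.values()
)

print(per_principle)
print(f"Overall FAIR level: {overall:.2f}")
```
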

+ ###### Automated detection of FAIRness

+ There are some attempts at trying to establish the FAIRness of data automatically. One such tool, F-UJI, developed by @devaraju_f-uji_2024, is available from <https://www.f-uji.net>. The accuracy of the tool is not reported.
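
For orientation, a self-hosted F-UJI instance also exposes a REST evaluation endpoint. The sketch below is a rough illustration only: the endpoint path, payload fields and default credentials are assumptions about a standard local deployment and should be checked against the F-UJI documentation.

```python
# Hedged sketch: query a locally running F-UJI instance for a FAIR assessment
# of a dataset identified by a DOI. The endpoint path, payload fields and the
# basic-auth credentials are assumptions; adapt them to your deployment.
import requests

FUJI_URL = "http://localhost:1071/fuji/api/v1/evaluate"  # assumed local endpoint
payload = {
    "object_identifier": "https://doi.org/10.5281/zenodo.3243963",  # example DOI
    "test_debug": True,
    "use_datacite": True,
}

response = requests.post(FUJI_URL, json=payload, auth=("marvel", "wonderwoman"))
response.raise_for_status()
result = response.json()

# The response is expected to contain per-metric results and summary scores.
print(result.get("summary", result))
```
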

### Availability of data statement

- A data availability statement in a publication describes how the reader could get access to the data of the research. Having such a statement in place improves transparency on data availability and can thus be considered as an Open Data practice. However, having a data availability statement in place does not necessarily imply that the data is openly available or that it is more likely that the data can be shared [@gabelica2022]. Nevertheless, a description of how to access an Open Data repository, how to make a request for data access or an explanation why some data cannot be shared due to ethical considerations are all examples of Open Data practices that make data reuse more accessible and transparent [@federer2018]. The availability of a data statement can therefore be considered as an Open Data practice.
+ A data availability statement in a publication describes how the reader could get access to the data of the research. Having such a statement in place improves transparency on data availability and can be considered as an Open Data practice. However, having a data availability statement in place does not necessarily imply that the data is openly available or that it is more likely that the data can be shared [@gabelica2022]. Nevertheless, a description of how to access an Open Data repository, how to make a request for data access or an explanation why some data cannot be shared due to ethical considerations are all examples of Open Data practices that make data reuse more accessible and transparent [@federer2018]. Indeed, even if data itself cannot be shared, metadata can typically be shared.

#### Measurement

@@ -112,4 +112,4 @@ All PLOS journals require publications to include a data availability statement.

## Known correlates

- Some research suggests that openly sharing data is positively related to the citation rate of publications [@piwowar2007;@piwowar2013].
+ Some research suggests that openly sharing data is positively related to the citation rate of publications [@piwowar2007; @piwowar2013].

indicator_templates/quarto/2_academic_impact/use_of_code_in_research.qmd

Lines changed: 12 additions & 8 deletions
@@ -27,15 +27,17 @@ affiliations:

## Description

- Many, if not most, scientific analyses involve the use of code or software in one way or another. Code and software can be used for data handling, statistical estimation, visualisation, or various other tasks. Both open-source and closed-source software may be used for research. For instance, MATLAB and Mathematica are two commercial software packages that may be used in research, whereas Octave and SageMath are open-source alternatives. We here try to provide metrics that can serve as an indicator of the use of code in research, where code refers to any type of software (e.g. computer library, tool, package) or any set of computer instructions (e.g. like an R or Python script) used in the research cycle.
+ Many, if not most, scientific analyses involve the use of code or software in one way or another. Code and software can be used for data handling, statistical estimation, visualisation, or various other tasks. Both open-source and closed-source software may be used for research. For instance, MATLAB and Mathematica are two commercial software packages that may be used in research, whereas Octave and SageMath are open-source alternatives. We here try to provide metrics that can serve as an indicator of the use of code in research, where "code" refers to any type of software (e.g. computer library, tool, package) or any set of computer instructions (e.g. an R or Python script) used in the research cycle.

- One challenge is that we are typically interested in the use of “research software”, not in all software per se. Defining what this encompasses is not straightforward. [@gruenpeter2021] defines it as code “that \[was\] created during the research process or for a research purpose. Software components (e.g., operating systems, libraries, dependencies, packages, scripts, etc.) that are used for research but were not created during or with a clear research intent should be considered software in research and not Research Software” (Gruenpeter et al., 2021, p. 16) As this clarifies, this might also involve the creation of new software that is released for other researchers to work with., However, this is not considered in this indicator, but in the indicator on open code. Almost any code depends on other code to work properly. Some of these dependencies might constitute research software themselves, but this is not necessarily the case. Instead of trying to classify software as “research software” or not, we will take a more agnostic approach in the description of this indicator, and simply try to describe approaches to uncover the use of some code in research, regardless of whether it constitutes “research software” or not.
+ One challenge is that we are typically interested in the use of "research software", not in all software per se. Defining what this encompasses is not straightforward. [@gruenpeter2021] defines it as code "that \[was\] created during the research process or for a research purpose. Software components (e.g., operating systems, libraries, dependencies, packages, scripts, etc.) that are used for research but were not created during or with a clear research intent should be considered software in research and not Research Software" [@gruenpeter2021, p. 16]. As this clarifies, this might also involve the creation of new software that is released for other researchers to work with. However, this is not considered in this indicator, but in the indicator on open code. Almost any code depends on other code to work properly. Some of these dependencies might constitute research software themselves, but this is not necessarily the case. Instead of trying to classify software as "research software" or not, we will take a more agnostic approach in the description of this indicator, and simply try to describe approaches to uncover the use of some code in research, regardless of whether it constitutes "research software" or not.

+ Sometimes a distinction is made between "reuse" and "use", where "reuse" refers explicitly to the use of openly released software, whereas "use" refers to the use of software more generally. We do not make such a distinction here.

This indicator can be useful to provide a more comprehensive view of the impact of the contributions by researchers. Some researchers might be more involved in publishing, whereas others might be more involved in developing and maintaining research software (and possibly a myriad other activities).

## Metrics

- Most research software is not properly indexed. There are initiatives to have research software properly indexed and identified, such as the [Research Software Directory,](https://research-software-directory.org/) but these are far from comprehensive at the moment. Many repositories support uploading research software. For instance, Zenodo currently holds about 116,000 records of research software. However, there are also reports of the absence of support for including research software in repositories [@carlin2023].
+ Most research software is not properly indexed. There are initiatives to have research software properly indexed and identified, such as the [Research Software Directory](https://research-software-directory.org/), but these are far from comprehensive at the moment, and this is the topic of ongoing research [@malviya-thakur_scicat_2023]. Many repositories support uploading research software. For instance, Zenodo currently holds about 116,000 records of research software. However, there are also reports of the absence of support for including research software in repositories [@carlin2023].

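As a rough way to reproduce counts such as the Zenodo figure above, the Zenodo search API can be filtered by resource type. In the sketch below, the `type=software` filter and the `hits.total` response field are assumptions that should be verified against the current Zenodo API documentation.

```python
# Hedged sketch: count records of type "software" on Zenodo via its REST API.
# The `type` filter and the response layout (`hits.total`) are assumptions
# based on the documented search API; verify against developers.zenodo.org.
import requests

response = requests.get(
    "https://zenodo.org/api/records",
    params={"q": "", "type": "software", "size": 1},
    timeout=30,
)
response.raise_for_status()
total = response.json()["hits"]["total"]
print(f"Zenodo records of type 'software': {total}")
```
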
### Number of times code is cited/mentioned in scientific publications

@@ -45,9 +47,9 @@ The biggest limitation is that not all researchers report all research software

In addition, software might not be cited explicitly, and instead the paper associated with the software might be cited. The association between papers and software can be retrieved in various ways. Sometimes, software repositories are mentioned in papers, while vice-versa, the software repository may include citation information. This may take various forms, such as a [`CITATION.cff`](https://citation-file-format.github.io/) file in a GitHub repository, or a [`CITATION`](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/citation.html) file in an R package. The association between papers and code is also being tracked by <https://paperswithcode.com/>. However, it is difficult to distinguish between citations to a publication for the software it introduced, or other advances made in the paper. Nonetheless, it might be relevant to combine citations statistics to the paper with explicit citations or mentions of the research software.

- Measurement.
+ #### Measurement

- ##### Existing datasources:
+ ##### Existing datasources

###### Bibliometric databases

@@ -59,10 +61,12 @@ Not all bibliometric databases actively track research software, and therefore n

###### Extract software mentions from full text

- Especially because of the limited explicit references to software, it is important to also explore other possibilities to track the use of code in research. One possibility is to try to extract the mentions of a software package or tool from the full-text. This is done by [@istrate] who have trained a machine learning model to extract references to software from full-text. They rely on the manual annotation of software mentions in PDFs by [@du2021]. The resulting dataset of software mentions is available from <https://doi.org/10.5061/dryad.6wwpzgn2c>.
+ Especially because of the limited explicit references to software, it is important to also explore other possibilities to track the use of code in research. One possibility is to try to extract the mentions of a software package or tool from the full-text. This is done by [@istrate], who have trained a machine learning model to extract references to software from full-text. They rely on the manual annotation of software mentions in PDFs by [@du2021]. The resulting dataset of software mentions is made available publicly [@istrate_cz_2022].

Although the dataset of software mentions might provide a useful resource, it is a static dataset, and at the moment, there do not yet seem to be initiatives to continuously monitor and scan the full-text of publications. Additionally, its coverage is limited to mostly biomedical literature. For that reason, it might be necessary to run the proposed machine learning algorithm itself. The code is available from <https://github.com/chanzuckerberg/software-mention-extraction>.

+ A common "gold standard" dataset for training software mention extraction from full text is the so-called SoftCite dataset [@howison_softcite_2023].

### Repository statistics (# Forks/Clones/Stars/Downloads/Views)

Much (open-source) software is shared in version control repositories on online platforms. Various types of usage statistics can be derived from these online platforms that relate in some way to the general level of interest in the software. These metrics vary from how many other users have copies of those repositories (often called forks), to how many people downloaded a particular release from this platform.

@@ -71,15 +75,15 @@ There are some clear limitations to this approach. Firstly, not all research sof

The most common version control system at the moment is [Git](https://git-scm.com/), which itself is open-source. There are other version control systems, such as Subversion or Mercurial, but these are less popular. The most common platform on which Git repositories are shared is GitHub, which is not open-source itself. There are also other repository platforms, such as [CodeBerg](https://codeberg.org/) (built on [Forgejo](https://forgejo.org/)) and [GitLab](https://gitlab.com/), which are themselves open-source, but they have not yet managed to reach the popularity of GitHub. We therefore limit ourselves to describing GitHub, although we might extend this in the future.

- #### Measurement.
+ #### Measurement

We propose three concrete metrics based on the GitHub API: the number of forks, the number of stars and the number of downloads of releases. There are additional metrics about traffic available from [GitHub API metrics](https://docs.github.com/en/rest/metrics), but these unfortunately require permissions from a specific repository.\

##### Existing methodologies

###### Forks/Stars (GitHub API)

- On GitHub, people can make a personal copy of a repository, which is called a fork. In addition, they can star a repository, in order to "save it in their list of favourite repositories. The number of forks of a repository hence provides a metric of how many people have made personal copies of a repository, and the number of stars provides a metric of how many people have marked it as a favourite.
+ On GitHub, people can make a personal copy of a repository, which is called a fork. In addition, they can "star" a repository, in order to "save" it in their list of "favourite" repositories. The number of forks of a repository hence provides a metric of how many people have made personal copies of a repository, and the number of stars provides a metric of how many people have marked it as a "favourite".

The calculation of the number of forks and the number of stars is really straightforward. For a particular `repo` from a particular `owner`, one can get the count from <https://api.github.com/repos/owner/repo>. For instance, for the repository `openalex-guts` from `ourresearch`, one can get the information from the URL <https://api.github.com/repos/ourresearch/openalex-guts>. The number of forks is then listed in the field `forks_count` and the number of stars is listed in `stargazers_count`. See the API documentation for more details.

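A minimal sketch of retrieving these counts via the GitHub REST API for the example repository mentioned above; the fields `forks_count`, `stargazers_count` and the releases endpoint are as documented by GitHub, while summing `download_count` over release assets is one way to obtain the number of downloads of releases mentioned earlier.

```python
# Minimal sketch: retrieve fork, star and release-download counts for a
# repository via the public GitHub REST API.
import requests

OWNER, REPO = "ourresearch", "openalex-guts"  # example repository from the text

repo = requests.get(f"https://api.github.com/repos/{OWNER}/{REPO}", timeout=30)
repo.raise_for_status()
data = repo.json()
print("forks:", data["forks_count"])
print("stars:", data["stargazers_count"])

# Release downloads: sum the download counts of all assets of all releases
# (only the first page of releases is fetched here).
releases = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/releases", timeout=30
)
releases.raise_for_status()
downloads = sum(
    asset["download_count"]
    for release in releases.json()
    for asset in release["assets"]
)
print("release downloads:", downloads)
```

Note that unauthenticated requests are rate-limited and that the releases endpoint is paginated, so larger analyses may require authentication and pagination handling.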