
Commit d06f2d4: Initial commit

1 parent: 5ec4c9d


63 files changed (+63032, -1 lines)

.flake8

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+[flake8]
+exclude=.git,.gitignore,__pycache__,.ipynb_checkpoints,__init__.py
+ignore=W503,W605,E501
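The three ignored rules are W503 (line break before binary operator), W605 (invalid escape sequence) and E501 (line too long). As an illustration only, this INI-style file can be read with Python's standard `configparser` (a sketch; flake8 uses its own configuration loader internally):

```python
import configparser

# Parse the .flake8 file content shown in the diff above.
config = configparser.ConfigParser()
config.read_string("""\
[flake8]
exclude=.git,.gitignore,__pycache__,.ipynb_checkpoints,__init__.py
ignore=W503,W605,E501
""")

ignored = config["flake8"]["ignore"].split(",")
print(ignored)  # ['W503', 'W605', 'E501']
```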

.gitignore

Lines changed: 84 additions & 0 deletions

@@ -0,0 +1,84 @@
+# personal
+.DS_Store
+
+# IDE
+.idea/
+.nfs*
+venv/
+.venv/
+.venv_annotation/
+.vscode
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+*.ipynb
+.ipynb_checkpoints
+__target__/
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+init
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*cover
+.hypothesis/
+.pytest_cache/
+
+# IPython Notebook
+*.ipynb_checkpoints
+**/.ipynb_checkpoints
+*/.ipynb_checkpoints/*
+*.ipynb
+*ipynb_checkpoints*
+*-checkpoint*
+notebooks/annotation/*
+*.ipynb_checkpoints*
+cse_210033/.ipynb_checkpoints
+cse_210033/__pycache__
+
+# Data
+*.csv
+*.xls
+*.xlsx
+*.pickle
+*.pkl
+*.html
+*.pdf
+
+# Kernel
+kernel.json
+
+# Logs
+**/lightning_logs
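Most of these entries are glob patterns. Their matching behaves much like Python's standard-library `fnmatch` (an approximation only: Git's ignore rules differ in details such as `**` handling and directory anchoring):

```python
from fnmatch import fnmatch

# Illustrative only: Git's ignore matching is close to, but not identical to,
# Python's fnmatch globbing.
print(fnmatch("module.pyc", "*.py[cod]"))                 # True: class matches c, o or d
print(fnmatch("figure.xlsx", "*.xls"))                    # False: trailing x not matched
print(fnmatch("data-checkpoint.ipynb", "*-checkpoint*"))  # True
```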

.pre-commit-config.yaml

Lines changed: 38 additions & 0 deletions

@@ -0,0 +1,38 @@
+# See https://pre-commit.com for more information
+# See https://pre-commit.com/hooks.html for more hooks
+
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.2.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+  - repo: https://github.com/pycqa/isort
+    rev: 5.10.1
+    hooks:
+      - id: isort
+        name: isort (python)
+        args: ["--profile", "black"]
+      - id: isort
+        name: isort (cython)
+        types: [cython]
+        args: ["--profile", "black"]
+      - id: isort
+        name: isort (pyi)
+        types: [pyi]
+        args: ["--profile", "black"]
+
+  - repo: https://github.com/psf/black
+    rev: 23.1.0
+    hooks:
+      - id: black
+  - repo: https://github.com/asottile/blacken-docs
+    rev: v1.12.1
+    hooks:
+      - id: blacken-docs
+        exclude: notebooks/
+  - repo: https://github.com/pycqa/flake8
+    rev: '4.0.1'
+    hooks:
+      - id: flake8

LICENSE

Lines changed: 2 additions & 1 deletion

@@ -1,6 +1,7 @@
 BSD 3-Clause License

-Copyright (c) 2023, aphp-datascience
+Copyright (c) 2022, Assistance Publique - Hôpitaux de Paris
+All rights reserved.

 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:

README.md

Lines changed: 171 additions & 0 deletions

@@ -0,0 +1,171 @@
+# Adjusting for the progressive digitization of health records: working examples on a multi-hospital clinical data warehouse
+
+
+<div align="center">
+<img src="logo.svg" alt="EDS-TeVa">
+
+<p align="center">
+<a href="https://github.com/psf/black" target="_blank">
+<img src="https://img.shields.io/badge/code%20style-black-000000.svg" alt="Black">
+</a>
+<a href="https://python-poetry.org/" target="_blank">
+<img src="https://img.shields.io/badge/reproducibility-poetry-blue" alt="Poetry">
+</a>
+<a href="https://www.python.org/" target="_blank">
+<img src="https://img.shields.io/badge/python-%3E%3D%203.7.1%20%7C%20%3C%203.8-brightgreen" alt="Supported Python versions">
+</a>
+<a href="https://spark.apache.org/docs/2.4.8/" target="_blank">
+<img src="https://img.shields.io/badge/spark-2.4-brightgreen" alt="Supported Spark version">
+</a>
+</p>
+</div>
+
+## Study
+
+This repository contains the computer code that was executed to generate the results of the article:
+```
+@unpublished{edsteva,
+  author = {Adam Remaki and Benoît Playe and Paul Bernard and Simon Vittoz and Matthieu Doutreligne and Gilles Chatellier and Etienne Audureau and Emmanuelle Kempf and Raphaël Porcher and Romain Bey},
+  title = {Adjusting for the progressive digitization of health records: working examples on a multi-hospital clinical data warehouse},
+  note = {Manuscript submitted for publication},
+  year = {2023}
+}
+```
+The code was executed on the OMOP database of the clinical data warehouse of the <a href="https://eds.aphp.fr/" target="_blank">Greater Paris University Hospitals</a>.
+
+- IRB number: CSE210033
+- This study stands on the shoulders of the library [EDS-TeVa](https://github.com/aphp/edsteva) (an open-source library providing tools to model the adoption of Electronic Health Records over time and across space).
+## Version 1.0.0
+
+- Submission of the article for review.
+## Setup
+
+- To process large-scale data, the study uses [Spark 2.4](https://spark.apache.org/docs/2.4.8/index.html) (an open-source engine for large-scale data processing), which requires you to:
+
+  - Install a version of Python $\geq 3.7.1$ and $< 3.8$.
+  - Install Java 8 (you can install [OpenJDK 8](https://openjdk.org/projects/jdk8/), an open-source reference implementation of Java 8).
+
+- Clone the repository:
+
+```shell
+git clone https://gitlab.eds.aphp.fr/equipedatascience/cse_210033.git
+```
+
+- Create a virtual environment with a suitable Python version (**>= 3.7.1 and < 3.8**):
+
+```shell
+cd cse_210033
+python -m venv .venv
+source .venv/bin/activate
+```
+
+- Install [Poetry](https://python-poetry.org/) (a tool for dependency management and packaging in Python) with the following command:
+  - Linux, macOS, Windows (WSL):
+
+```shell
+curl -sSL https://install.python-poetry.org | python3 -
+```
+
+  - Windows (PowerShell):
+
+```shell
+(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
+```
+
+For more details, check the [installation guide](https://python-poetry.org/docs/#installation).
+
+- Install dependencies:
+
+```shell
+pip install pypandoc==1.7.5
+pip install pyspark==2.4.8
+poetry install
+pip uninstall pypandoc
+```
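The Python version constraint stated in the setup steps (>= 3.7.1 and < 3.8) can be expressed as a small sanity check. This helper is hypothetical and not part of the repository:

```python
import sys

def is_supported(version=sys.version_info[:3]):
    """Return True if `version` satisfies the study's constraint (>= 3.7.1, < 3.8)."""
    return (3, 7, 1) <= tuple(version) < (3, 8)

print(is_supported((3, 7, 9)))  # True
print(is_supported((3, 8, 0)))  # False
```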
+## How to run the code on AP-HP's data platform
+### 1. Install EDS-Toolbox:
+
+EDS-Toolbox is a Python library that provides an efficient way of submitting PySpark scripts on AP-HP's data platform. As it is AP-HP-specific, it is not available on PyPI:
+
+```shell
+pip install git+ssh://[email protected]:2224/datasciencetools/[email protected]
+```
+### 2. Pre-processing: Compute and save models and data:
+
+:warning: Depending on your resources, this step can take some time.
+
+```shell
+cd scripts
+eds-toolbox spark submit --config ../conf/config.cfg --log-path ../logs/ehr_modeling ehr_modeling.py
+eds-toolbox spark submit --config ../conf/config.cfg --log-path ../logs/cohort_selection cohort_selection.py
+```
+
+### 3. Post-processing: Main statistical analysis
+
+```shell
+pip install pyarrow==12.0.1
+python statistical_analysis.py --config-name config.cfg
+```
+
+### 4. Generate figures
+
+- **Option 1**: Generate all figures at once from the terminal:
+
+```shell
+python generate_figures.py --config-name config.cfg
+```
+
+- **Option 2**: Generate figures one at a time from a notebook:
+
+  - Create a Spark-enabled kernel with your environment:
+
+```shell
+eds-toolbox kernel --spark --hdfs
+```
+
+  - Convert the markdown file into a Jupyter notebook:
+
+```shell
+cd notebooks
+jupytext --set-formats md,ipynb 'generate_figures.md'
+```
+
+  - Open *generate_figures.ipynb* and start the kernel you've just created.
+  - Run the cells to obtain every figure.
+
+### 5. Generate HTML report
+
+- Create a Spark-enabled kernel with your environment (if you have not done so already):
+
+```shell
+eds-toolbox kernel --spark --hdfs
+```
+
+- Convert the markdown file into a Jupyter notebook:
+
+```shell
+cd notebooks
+jupytext --set-formats md,ipynb 'report.md'
+```
+
+- Open *report.ipynb*, start the kernel you've created and run the cells.
+
+- Convert the notebook to HTML:
+```shell
+eds-toolbox report report.ipynb --output report.html
+```
+
+#### Note
+If you would like to run the scripts on a database other than the AP-HP database, you will have to adapt the Python scripts to the configuration of the desired database.
+## Project structure
+
+- `conf`: Configuration files.
+- `data`: Saved processed data and models.
+- `figures`: Saved results.
+- `notebooks`: Notebooks that generate figures.
+- `cse_210033`: Source code.
+- `scripts`: Typer applications to process data and generate figures.
+
+## Acknowledgement
+
+We would like to thank [Assistance Publique – Hôpitaux de Paris](https://www.aphp.fr/) and [AP-HP Foundation](https://fondationrechercheaphp.fr/) for funding this project.
