
Commit 4adaaf7

Merge pull request #432 from datamol-io/caching
Caching logic improvement
2 parents beaf954 + cc91bfa commit 4adaaf7


43 files changed: +361 additions, −220 deletions

README.md

Lines changed: 17 additions & 0 deletions
@@ -97,6 +97,23 @@ graphium-train --config-path [PATH] --config-name [CONFIG]
 ```
 Thanks to the modular nature of `hydra` you can reuse many of our config settings for your own experiments with Graphium.
 
+## Preparing the data in advance
+The data preparation including the featurization (e.g., of molecules from smiles to pyg-compatible format) is embedded in the pipeline and will be performed when executing `graphium-train [...]`.
+
+However, when working with larger datasets, it is recommended to perform data preparation in advance using a machine with sufficient allocated memory (e.g., ~400GB in the case of `LargeMix`). Preparing data in advance is also beneficial when running lots of concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict reading/writing in the same directory.
+
+The following command-line will prepare the data and cache it, then use it to train a model.
+```bash
+# First prepare the data and cache it in `path_to_cached_data`
+graphium-prepare-data datamodule.args.processed_graph_data_path=[path_to_cached_data]
+
+# Then train the model on the prepared data
+graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
+```
+
+**Note** that `datamodule.args.processed_graph_data_path` can also be specified at `expts/hydra_configs/`.
+
+**Note** that, every time the configs of `datamodule.args.featurization` changes, you will need to run a new data preparation, which will automatically be saved in a separate directory that uses a hash unique to the configs.
 
 ## License
 
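The README note above says the cache path can also be set directly in the hydra configs instead of on the command line. Below is a minimal sketch of what such a block might look like; the key names `processed_graph_data_path` and `dataloading_from` are taken from the config diffs later in this commit, the nesting under `datamodule.args` is inferred from the command-line override path, and the directory value is a placeholder.

```yaml
# Hypothetical excerpt of a config under expts/hydra-configs/ (a sketch, not a
# verbatim file from the repo): key names come from the diffs in this commit,
# and the `args` nesting is inferred from the `datamodule.args.*` override path.
datamodule:
  args:
    # Where graphium-prepare-data writes the featurized graphs and where
    # graphium-train reads them back from; placeholder path.
    processed_graph_data_path: "../datacache/my_dataset/"
    # Key added in this commit alongside the cache path; `ram` is the value
    # used throughout the updated configs.
    dataloading_from: ram
```

With this in place, `graphium-prepare-data` and `graphium-train` can be run without repeating the cache path on the command line.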
docs/tutorials/feature_processing/choosing_parallelization.ipynb

Lines changed: 11 additions & 11 deletions
@@ -14,7 +14,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 1,
+ "execution_count": 3,
 "id": "b5df2ac6-2ded-4597-a445-f2b5fb106330",
 "metadata": {
 "tags": []
@@ -24,8 +24,8 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "INFO: Pandarallel will run on 240 workers.\n",
- "INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.\n"
+ "The autoreload extension is already loaded. To reload it, use:\n",
+ " %reload_ext autoreload\n"
 ]
 }
 ],
@@ -39,9 +39,9 @@
 "import datamol as dm\n",
 "import pandas as pd\n",
 "\n",
- "from pandarallel import pandarallel\n",
+ "# from pandarallel import pandarallel\n",
 "\n",
- "pandarallel.initialize(progress_bar=True, nb_workers=joblib.cpu_count())"
+ "# pandarallel.initialize(progress_bar=True, nb_workers=joblib.cpu_count())"
 ]
 },
 {
@@ -54,7 +54,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 2,
+ "execution_count": 4,
 "id": "0f31e18d-bdd9-4d9b-8ba5-81e5887b857e",
 "metadata": {
 "tags": []
@@ -70,7 +70,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 7,
 "id": "a1197c31-7dbc-4fd7-a69a-5215e1a96b8e",
 "metadata": {
 "tags": []
@@ -109,7 +109,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 10,
 "id": "2f8ce5c3-4232-4279-8ea3-7a74832303be",
 "metadata": {
 "tags": []
@@ -129,7 +129,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 5,
+ "execution_count": 11,
 "id": "a246cdcf-b5ea-4c9e-9ccc-dd3c544587bb",
 "metadata": {
 "tags": []
@@ -138,7 +138,7 @@
 {
 "data": {
 "application/vnd.jupyter.widget-view+json": {
- "model_id": "3e939cd3a24742038b804bbfd961377d",
+ "model_id": "cc396220c7144c8d8b195fb87694bbfe",
 "version_major": 2,
 "version_minor": 0
 },
@@ -489,7 +489,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.8.10"
+ "version": "3.10.12"
 },
 "widgets": {
 "application/vnd.jupyter.widget-state+json": {

expts/configs/config_gps_10M_pcqm4m.yaml

Lines changed: 0 additions & 1 deletion
@@ -112,7 +112,6 @@ datamodule:
 pos_type: rw_return_probs
 ksteps: 16
 
-# cache_data_path: .
 num_workers: 0 # -1 to use all
 persistent_workers: False # if use persistent worker at the start of each epoch.
 # Using persistent_workers false might make the start of each epoch very long.

expts/configs/config_gps_10M_pcqm4m_mod.yaml

Lines changed: 0 additions & 1 deletion
@@ -81,7 +81,6 @@ datamodule:
 # Data handling-related
 batch_size_training: 64
 batch_size_inference: 16
-# cache_data_path: .
 num_workers: 0 # -1 to use all
 persistent_workers: False # if use persistent worker at the start of each epoch.
 # Using persistent_workers false might make the start of each epoch very long.

expts/configs/config_mpnn_10M_b3lyp.yaml

Lines changed: 1 addition & 1 deletion
@@ -93,6 +93,7 @@ datamodule:
 featurization_progress: True
 featurization_backend: "loky"
 processed_graph_data_path: "../datacache/b3lyp/"
+dataloading_from: ram
 featurization:
 # OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
 # 'possible_number_radical_e', 'possible_is_aromatic', 'possible_is_in_ring',
@@ -123,7 +124,6 @@ datamodule:
 pos_type: rw_return_probs
 ksteps: 16
 
-# cache_data_path: .
 num_workers: 0 # -1 to use all
 persistent_workers: False # if use persistent worker at the start of each epoch.
 # Using persistent_workers false might make the start of each epoch very long.

expts/configs/config_mpnn_pcqm4m.yaml

Lines changed: 1 addition & 2 deletions
@@ -30,8 +30,8 @@ datamodule:
 featurization_n_jobs: 20
 featurization_progress: True
 featurization_backend: "loky"
-cache_data_path: "./datacache"
 processed_graph_data_path: "graphium/data/PCQM4Mv2/"
+dataloading_from: ram
 featurization:
 # OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
 # 'possible_number_radical_e', 'possible_is_aromatic', 'possible_is_in_ring',
@@ -58,7 +58,6 @@ datamodule:
 # Data handling-related
 batch_size_training: 64
 batch_size_inference: 16
-# cache_data_path: .
 num_workers: 40 # -1 to use all
 persistent_workers: False # if use persistent worker at the start of each epoch.
 # Using persistent_workers false might make the start of each epoch very long.
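A similar cleanup repeats across the remaining config files in this commit: leftover `cache_data_path` entries are removed, and where a `processed_graph_data_path` is set, `dataloading_from` is added next to it. A before/after sketch of just the affected keys, with values copied from this file's diff:

```yaml
# Before this commit (two separate cache-related keys):
# cache_data_path: "./datacache"
# processed_graph_data_path: "graphium/data/PCQM4Mv2/"

# After this commit (the remaining cache directory key, plus the new loading mode):
processed_graph_data_path: "graphium/data/PCQM4Mv2/"
dataloading_from: ram
```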

expts/hydra-configs/architecture/toymix.yaml

Lines changed: 1 addition & 0 deletions
@@ -79,6 +79,7 @@ datamodule:
 featurization_progress: True
 featurization_backend: "loky"
 processed_graph_data_path: "../datacache/neurips2023-small/"
+dataloading_from: ram
 num_workers: 30 # -1 to use all
 persistent_workers: False
 featurization:

expts/neurips2023_configs/base_config/large.yaml

Lines changed: 0 additions & 1 deletion
@@ -168,7 +168,6 @@ datamodule:
 pos_type: rw_return_probs
 ksteps: 16
 
-# cache_data_path: .
 num_workers: 32 # -1 to use all
 persistent_workers: True # if use persistent worker at the start of each epoch.
 # Using persistent_workers false might make the start of each epoch very long.

expts/neurips2023_configs/base_config/small.yaml

Lines changed: 0 additions & 1 deletion
@@ -132,7 +132,6 @@ datamodule:
 pos_type: rw_return_probs
 ksteps: 16
 
-# cache_data_path: .
 num_workers: 30 # -1 to use all
 persistent_workers: False # if use persistent worker at the start of each epoch.
 # Using persistent_workers false might make the start of each epoch very long.

expts/neurips2023_configs/baseline/config_small_gcn_baseline.yaml

Lines changed: 0 additions & 1 deletion
@@ -131,7 +131,6 @@ datamodule:
 pos_type: rw_return_probs
 ksteps: 16
 
-# cache_data_path: .
 num_workers: 30 # -1 to use all
 persistent_workers: False # if use persistent worker at the start of each epoch.
 # Using persistent_workers false might make the start of each epoch very long.
