
Commit 0d24634

Merge pull request #449 from datamol-io/caching
Various updates to graphium
2 parents c211dac + 3cf2fb5 commit 0d24634

40 files changed (+945 -36 lines)

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ datacache/
 tests/temp_cache*
 predictions/
 draft/
+scripts-expts/
 
 # Data and predictions
 graphium/data/ZINC_bench_gnn/

README.md

Lines changed: 48 additions & 2 deletions
@@ -65,9 +65,55 @@ The above step needs to be done once. After that, enable the SDK and the environ
 source enable_ipu.sh .graphium_ipu
 ```
 
-## The Graphium CLI
+## Training a model
 
-Installing `graphium` makes two CLI tools available: `graphium` and `graphium-train`. These CLI tools make it easy to access advanced functionality, such as _training a model_, _extracting fingerprints from a pre-trained model_ or _precomputing the dataset_. For more information, visit [the documentation](https://graphium-docs.datamol.io/stable/cli/reference.html).
+To learn how to train a model, we invite you to look at the documentation, or the Jupyter notebooks available [here](https://github.com/datamol-io/graphium/tree/master/docs/tutorials/model_training).
+
+If you are not familiar with [PyTorch](https://pytorch.org/docs) or [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/), we highly recommend going through their tutorials first.
+
+## Running an experiment
+We have set up Graphium with `hydra` to manage config files. To run an experiment, go to the `expts/` folder. For example, to benchmark a GCN on the ToyMix dataset, run
+```bash
+graphium-train dataset=toymix model=gcn
+```
+To change parameters specific to this experiment, such as switching from `fp16` to `fp32` precision, you can either override them directly on the CLI via
+```bash
+graphium-train dataset=toymix model=gcn trainer.trainer.precision=32
+```
+or change them permanently in the dedicated experiment config under `expts/hydra-configs/toymix_gcn.yaml`.
+Integrating `hydra` also allows you to quickly switch between accelerators. E.g., running
+```bash
+graphium-train dataset=toymix model=gcn accelerator=gpu
+```
+automatically selects the correct configs to run the experiment on GPU.
+Finally, you can also run a fine-tuning loop:
+```bash
+graphium-train +finetuning=admet
+```
+
+To use a config file you built from scratch, you can run
+```bash
+graphium-train --config-path [PATH] --config-name [CONFIG]
+```
+Thanks to the modular nature of `hydra`, you can reuse many of our config settings for your own experiments with Graphium.
+
+## Preparing the data in advance
+Data preparation, including featurization (e.g., converting molecules from SMILES to a pyg-compatible format), is embedded in the pipeline and is performed when executing `graphium-train [...]`.
+
+However, when working with larger datasets, it is recommended to prepare the data in advance on a machine with sufficient memory (e.g., ~400GB in the case of `LargeMix`). Preparing the data in advance is also beneficial when running many concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict while reading from and writing to the same directory.
+
+The following commands prepare the data and cache it, then use it to train a model.
+```bash
+# First prepare the data and cache it in `path_to_cached_data`
+graphium data prepare ++datamodule.args.processed_graph_data_path=[path_to_cached_data]
+
+# Then train the model on the prepared data
+graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
+```
+
+**Note** that `datamodule.args.processed_graph_data_path` can also be specified in the configs under `expts/hydra-configs/`.
+
+**Note** that every time the `datamodule.args.featurization` config changes, you will need to run a new data preparation, which will automatically be saved in a separate directory that uses a hash unique to the config.
 
 ## License
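For illustration, the cached-data path mentioned in the **Note** above can be pinned in an experiment config instead of being passed on the CLI. A minimal sketch, assuming a hypothetical config placed under `expts/hydra-configs/`; the cache directory is a placeholder, not a path from this commit:

```yaml
# Hypothetical experiment-config snippet: set the cache location once instead
# of overriding datamodule.args.processed_graph_data_path on every command.
datamodule:
  args:
    processed_graph_data_path: "../datacache/my-dataset/"  # placeholder path
```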

expts/hydra-configs/README.md

Lines changed: 2 additions & 2 deletions
@@ -33,7 +33,7 @@ constants:
 
 trainer:
   model_checkpoint:
-    dirpath: models_checkpoints/neurips2023-small-gin/
+    dirpath: models_checkpoints/neurips2023-small-gin/${now:%Y-%m-%d_%H-%M-%S}/
 ```
 We can now utilize `hydra` to e.g., run a sweep over our models on the ToyMix dataset via
 
@@ -43,7 +43,7 @@ graphium-train -m model=gcn,gin
 where the ToyMix dataset is pre-configured in `main.yaml`. Read on to find out how to define new datasets and architectures for pre-training and fine-tuning.
 
 ## Pre-training / Fine-tuning
-Say you trained a model with the following command:
+Say you trained a model with the following command:
 ```bash
 graphium-train --config-name "main"
 ```
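The change above appends a timestamp to the checkpoint directory so repeated runs do not overwrite each other's checkpoints. A minimal sketch of the same pattern applied to one's own experiment config, using hydra's built-in `now` resolver; the experiment name is a placeholder, not a config from this commit:

```yaml
# Sketch only: "my-experiment" is a hypothetical name; ${now:...} expands to
# the launch time, giving each run its own checkpoint directory.
trainer:
  model_checkpoint:
    dirpath: models_checkpoints/my-experiment/${now:%Y-%m-%d_%H-%M-%S}/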
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
+# @package _global_
+
+architecture:
+  model_type: FullGraphMultiTaskNetwork
+  mup_base_path: null
+  pre_nn:   # Set as null to avoid a pre-nn network
+    out_dim: 64
+    hidden_dims: 256
+    depth: 2
+    activation: relu
+    last_activation: none
+    dropout: &dropout 0.1
+    normalization: &normalization layer_norm
+    last_normalization: *normalization
+    residual_type: none
+
+  pre_nn_edges: null
+
+  pe_encoders:
+    out_dim: 32
+    pool: "sum" # "mean", "max"
+    last_norm: None # "batch_norm", "layer_norm"
+    encoders: # la_pos | rw_pos
+      la_pos:  # Set as null to avoid a pre-nn network
+        encoder_type: "laplacian_pe"
+        input_keys: ["laplacian_eigvec", "laplacian_eigval"]
+        output_keys: ["feat"]
+        hidden_dim: 64
+        out_dim: 32
+        model_type: 'DeepSet' # 'Transformer' or 'DeepSet'
+        num_layers: 2
+        num_layers_post: 1 # Num. layers to apply after pooling
+        dropout: 0.1
+        first_normalization: "none" # "batch_norm" or "layer_norm"
+      rw_pos:
+        encoder_type: "mlp"
+        input_keys: ["rw_return_probs"]
+        output_keys: ["feat"]
+        hidden_dim: 64
+        out_dim: 32
+        num_layers: 2
+        dropout: 0.1
+        normalization: "layer_norm" # "batch_norm" or "layer_norm"
+        first_normalization: "layer_norm" # "batch_norm" or "layer_norm"
+
+  gnn:  # Set as null to avoid a post-nn network
+    in_dim: 64 # or otherwise the correct value
+    out_dim: &gnn_dim 768
+    hidden_dims: *gnn_dim
+    depth: 4
+    activation: gelu
+    last_activation: none
+    dropout: 0.1
+    normalization: "layer_norm"
+    last_normalization: *normalization
+    residual_type: simple
+    virtual_node: 'none'
+
+  graph_output_nn:
+    graph:
+      pooling: [sum]
+      out_dim: *gnn_dim
+      hidden_dims: *gnn_dim
+      depth: 1
+      activation: relu
+      last_activation: none
+      dropout: *dropout
+      normalization: *normalization
+      last_normalization: "none"
+      residual_type: none
+    node:
+      pooling: [sum]
+      out_dim: *gnn_dim
+      hidden_dims: *gnn_dim
+      depth: 1
+      activation: relu
+      last_activation: none
+      dropout: *dropout
+      normalization: *normalization
+      last_normalization: "none"
+      residual_type: none
+
+datamodule:
+  module_type: "MultitaskFromSmilesDataModule"
+  args:
+    prepare_dict_or_graph: pyg:graph
+    featurization_n_jobs: 20
+    featurization_progress: True
+    featurization_backend: "loky"
+    processed_graph_data_path: "../datacache/large-dataset/"
+    dataloading_from: "disk"
+    num_workers: 20 # -1 to use all
+    persistent_workers: True
+    featurization:
+      atom_property_list_onehot: [atomic-number, group, period, total-valence]
+      atom_property_list_float: [degree, formal-charge, radical-electron, aromatic, in-ring]
+      edge_property_list: [bond-type-onehot, stereo, in-ring]
+      add_self_loop: False
+      explicit_H: False # whether H is included
+      use_bonds_weights: False
+      pos_encoding_as_features: # encoder dropout 0.18
+        pos_types:
+          lap_eigvec:
+            pos_level: node
+            pos_type: laplacian_eigvec
+            num_pos: 8
+            normalization: "none" # normalization already applied on the eigenvectors
+            disconnected_comp: True # whether eigenvalues/vectors of disconnected graphs are included
+          lap_eigval:
+            pos_level: node
+            pos_type: laplacian_eigval
+            num_pos: 8
+            normalization: "none" # normalization already applied on the eigenvectors
+            disconnected_comp: True # whether eigenvalues/vectors of disconnected graphs are included
+          rw_pos: # use same name as pe_encoder
+            pos_level: node
+            pos_type: rw_return_probs
+            ksteps: 16
expts/hydra-configs/experiment/toymix_mpnn.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,4 +10,4 @@ constants:
 
 trainer:
   model_checkpoint:
-    dirpath: models_checkpoints/neurips2023-small-mpnn/
+    dirpath: models_checkpoints/neurips2023-small-mpnn/${now:%Y-%m-%d_%H-%M-%S}/
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# @package _global_
+
+architecture:
+  pre_nn_edges:   # Set as null to avoid a pre-nn network
+    out_dim: 32
+    hidden_dims: 128
+    depth: 2
+    activation: relu
+    last_activation: none
+    dropout: ${architecture.pre_nn.dropout}
+    normalization: ${architecture.pre_nn.normalization}
+    last_normalization: ${architecture.pre_nn.normalization}
+    residual_type: none
+
+  gnn:
+    out_dim: &gnn_dim 704
+    hidden_dims: *gnn_dim
+    layer_type: 'pyg:gine'
+
+  graph_output_nn:
+    graph:
+      out_dim: *gnn_dim
+      hidden_dims: *gnn_dim
+    node:
+      out_dim: *gnn_dim
+      hidden_dims: *gnn_dim
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# NOTE: We cannot have a single config, since for fine-tuning we will
+# only want to override the loss_metrics_datamodule, whereas for training we will
+# want to override both.
+
+defaults:
+  - task_heads: l1000_mcf7
+  - loss_metrics_datamodule: l1000_mcf7
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# NOTE: We cannot have a single config, since for fine-tuning we will
+# only want to override the loss_metrics_datamodule, whereas for training we will
+# want to override both.
+
+defaults:
+  - task_heads: l1000_vcap
+  - loss_metrics_datamodule: l1000_vcap
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# NOTE: We cannot have a single config, since for fine-tuning we will
+# only want to override the loss_metrics_datamodule, whereas for training we will
+# want to override both.
+
+defaults:
+  - task_heads: largemix
+  - loss_metrics_datamodule: largemix
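As the NOTE in these files explains, `task_heads` and `loss_metrics_datamodule` are kept as separate config groups so that fine-tuning can swap only the data/loss side while training composes both. A hedged sketch of how such a fine-tuning config might compose them; the group option name below is hypothetical and not a file added in this commit:

```yaml
# Sketch only: "my_finetune_set" is a hypothetical config-group option.
defaults:
  - task_heads: l1000_mcf7                    # keep the existing task heads
  - loss_metrics_datamodule: my_finetune_set  # swap only the loss/metrics/datamodule group
```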
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# @package _global_
+
+predictor:
+  metrics_on_progress_bar:
+    l1000_mcf7: []
+  metrics_on_training_set:
+    l1000_mcf7: []
+  loss_fun:
+    l1000_mcf7:
+      name: hybrid_ce_ipu
+      n_brackets: 3
+      alpha: 0.5
+
+metrics:
+  l1000_mcf7:
+    - name: auroc
+      metric: auroc
+      num_classes: 3
+      task: multiclass
+      target_to_int: True
+      target_nan_mask: -1000
+      ignore_index: -1000
+      multitask_handling: mean-per-label
+      threshold_kwargs: null
+    - name: avpr
+      metric: averageprecision
+      num_classes: 3
+      task: multiclass
+      target_to_int: True
+      target_nan_mask: -1000
+      ignore_index: -1000
+      multitask_handling: mean-per-label
+      threshold_kwargs: null
+
+datamodule:
+  args: # Matches that in the test_multitask_datamodule.py case.
+    task_specific_args: # To be replaced by a new class "DatasetParams"
+      l1000_mcf7:
+        df: null
+        df_path: ../data/graphium/large-dataset/LINCS_L1000_MCF7_0-2_th2.csv.gz
+        # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/LINCS_L1000_MCF7_0-4.csv.gz
+        # or set path as the URL directly
+        smiles_col: "SMILES"
+        label_cols: geneID-* # geneID-* means all columns starting with "geneID-"
+        # sample_size: 2000 # use sample_size for test
+        task_level: graph
+        splits_path: ../data/graphium/large-dataset/l1000_mcf7_random_splits.pt # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/l1000_mcf7_random_splits.pt`
+        # split_names: [train, val, test_seen]
+        epoch_sampling_fraction: 1.0