The above step needs to be done once. After that, enable the SDK and the environment:

```bash
source enable_ipu.sh .graphium_ipu
```

## The Graphium CLI

Installing `graphium` makes two CLI tools available: `graphium` and `graphium-train`. These CLI tools make it easy to access advanced functionality, such as _training a model_, _extracting fingerprints from a pre-trained model_ or _precomputing the dataset_. For more information, visit [the documentation](https://graphium-docs.datamol.io/stable/cli/reference.html).

## Training a model

To learn how to train a model, we invite you to look at the documentation, or the Jupyter notebooks available [here](https://github.com/datamol-io/graphium/tree/master/docs/tutorials/model_training).
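
For a quick look at what each tool offers, their built-in help pages are a good starting point (shown here as a generic sketch; the authoritative list of sub-commands is in the documentation linked above):

```bash
# Print the available sub-commands and options of each CLI tool
graphium --help
graphium-train --help
```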

If you are not familiar with [PyTorch](https://pytorch.org/docs) or [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/), we highly recommend going through their tutorials first.

## Running an experiment

We have set up Graphium with `hydra` for managing config files. To run an experiment, go to the `expts/` folder. For example, to benchmark a GCN on the ToyMix dataset, run:

```bash
graphium-train dataset=toymix model=gcn
```

To change parameters specific to this experiment, like switching from `fp16` to `fp32` precision, you can either override them directly in the CLI or change them permanently in the experiment configs under `expts/hydra_configs/`.
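
For example, assuming the precision setting is exposed under the trainer section of the config (the exact key below is an assumption about the config layout), a CLI override could look like:

```bash
# Hypothetical override; adjust the key to match the trainer config in your setup
graphium-train dataset=toymix model=gcn trainer.trainer.precision=32
```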

Thanks to the modular nature of `hydra`, you can reuse many of our config settings for your own experiments with Graphium.
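
Since `graphium-train` is a standard hydra application, one way to do this is to keep your own configs in a separate directory and compose Graphium's existing config groups from there (the directory and config name below are hypothetical):

```bash
# `my_configs/my_experiment.yaml` is a hypothetical user config whose hydra
# `defaults` list reuses Graphium's dataset/model groups and overrides a few settings
graphium-train --config-dir my_configs --config-name my_experiment
```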

## Preparing the data in advance

The data preparation, including featurization (e.g., converting molecules from SMILES to a PyG-compatible format), is embedded in the pipeline and will be performed when executing `graphium-train [...]`.

However, when working with larger datasets, it is recommended to perform the data preparation in advance on a machine with sufficient memory (e.g., ~400GB in the case of `LargeMix`). Preparing the data in advance is also beneficial when running many concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict while reading from and writing to the same directory.

The following commands will prepare the data and cache it, then use it to train a model.

```bash
# First prepare the data and cache it in `path_to_cached_data`
graphium data prepare ++datamodule.args.processed_graph_data_path=[path_to_cached_data]
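
# Then train the model on the cached data (a sketch: the exact training command was
# elided here, but it presumably reuses the same hydra key to point at the cache)
graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
```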

**Note** that `datamodule.args.processed_graph_data_path` can also be specified in the configs under `expts/hydra_configs/`.

**Note** that every time the `datamodule.args.featurization` config changes, you will need to run the data preparation again; the result will automatically be saved in a separate directory named with a hash unique to that config.
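
For instance, two preparation runs that differ only in a featurization option (the key below is purely illustrative) will each be cached under their own hash-named directory:

```bash
# Hypothetical featurization override; the real key names depend on your config
graphium data prepare ++datamodule.args.processed_graph_data_path=[path_to_cached_data] \
    ++datamodule.args.featurization.add_self_loop=true
```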

We can now utilize `hydra` to, e.g., run a sweep over our models on the ToyMix dataset via

```bash
graphium-train -m model=gcn,gin
```

where the ToyMix dataset is pre-configured in `main.yaml`. Read on to find out how to define new datasets and architectures for pre-training and fine-tuning.

## Pre-training / Fine-tuning

Say you trained a model with the following command: