
Commit 64c1cde

Merge pull request #438 from datamol-io/website
Updating the website
2 parents 697d7f2 + 56b4a43 commit 64c1cde

File tree: 10 files changed (+167 −126 lines)

docs/baseline.md

Lines changed: 7 additions & 0 deletions
One can observe that the smaller datasets (`Zinc12k` and `Tox21`) benefit from …
| **Tox21** | GCN | 0.202 ± 0.005 | 0.773 ± 0.006 | 0.334 ± 0.03 | **0.176 ± 0.001** | **0.850 ± 0.006** | 0.446 ± 0.01 |
| | GIN | 0.200 ± 0.002 | 0.789 ± 0.009 | 0.350 ± 0.01 | 0.176 ± 0.001 | 0.841 ± 0.005 | 0.454 ± 0.009 |
| | GINE | 0.201 ± 0.007 | 0.783 ± 0.007 | 0.345 ± 0.02 | 0.177 ± 0.0008 | 0.836 ± 0.004 | **0.455 ± 0.008** |
# LargeMix Baseline

Coming soon!

# UltraLarge Baseline

Coming soon!

docs/contribute.md

Lines changed: 23 additions & 7 deletions
# Contribute

We are happy to see that you want to contribute 🤗.
Feel free to open an issue or pull request at any time. But first, follow this page to install Graphium in dev mode.

## Installation for developers

### For CPU and GPU developers

Use [`mamba`](https://github.com/mamba-org/mamba), a preferred alternative to conda, to create your environment:

```bash
# Install Graphium's dependencies in a new environment named `graphium`
mamba env create -f env.yml -n graphium

# Install Graphium in dev mode
mamba activate graphium
pip install --no-deps -e .
```
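
To confirm that the editable install worked, a quick sanity check (a sketch, assuming the package exposes `__version__` like most Python packages):

```python
# Run inside the activated `graphium` environment.
import graphium

print(graphium.__version__)  # assumes a __version__ attribute is exposed
```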

### For IPU developers

Download the SDK and use PyPI to create your environment:

```bash
# Install Graphcore's SDK and Graphium dependencies in a new environment called `.graphium_ipu`
./install_ipu.sh .graphium_ipu
```

The above step only needs to be done once. After that, enable the SDK and the environment as follows:

```bash
source enable_ipu.sh .graphium_ipu
```

## Build the documentation
You can build and serve the documentation locally with:

```bash
# Build and serve the doc
mkdocs serve
```

docs/dataset_abstract.png

241 KB

docs/datasets.md

Lines changed: 68 additions & 17 deletions
# Graphium Datasets

Graphium datasets are hosted on Zenodo at [this link](https://zenodo.org/record/8206704).

Instead of providing datasets as a single entity, our aim is to provide dataset mixes containing a variety of datasets that are meant to be predicted simultaneously using multi-tasking.
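
For reference, a minimal sketch of downloading one of the archives from that Zenodo record with the standard library; the file name below is a placeholder and must be replaced with one of the files actually listed on the record page:

```python
import urllib.request

# Placeholder file name -- replace with an archive listed on the Zenodo record.
record = "https://zenodo.org/record/8206704/files"
filename = "some_dataset_archive.zip"
urllib.request.urlretrieve(f"{record}/{filename}?download=1", filename)
```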

They are visually described in the image below, with detailed descriptions in the sections that follow.

![Visual description of the ToyMix, LargeMix, UltraLarge datasets](dataset_abstract.png)

## ToyMix (QM9 + Tox21 + Zinc12K)

The ***ToyMix*** dataset combines the ***QM9***, ***Tox21***, and ***Zinc12K*** datasets. These datasets are well-known in the literature and used as toy datasets, or very simple datasets, in various contexts to enable fast iteration on models. By regrouping toy datasets from quantum ML, drug discovery, and GNN expressivity, we hope that the learned model will be representative of the performance we can expect on the larger datasets.

### Train/Validation/Test Splits
All the datasets in ***ToyMix*** are split randomly with a ratio of 0.8/0.1/0.1. Random splitting is used since it is the simplest and fits the idea of a toy dataset well.
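
As an illustration, a minimal sketch of such a random 0.8/0.1/0.1 split (not Graphium's own splitting code):

```python
import numpy as np

def random_split(n_samples: int, seed: int = 0):
    """Shuffle indices and split them 0.8/0.1/0.1 into train/validation/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.8 * n_samples), int(0.1 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = random_split(12_000)  # e.g. ~12k molecules in ZINC12k
```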

### QM9
***QM9*** is a well-known dataset in the field of 3D GNNs. It consists of 19 graph-level quantum properties associated with an energy-minimized 3D conformation of the molecules [1]. It is considered a simple dataset since all the molecules have at most 9 heavy atoms. We chose QM9 for our ***ToyMix*** since it is very similar to the larger proposed quantum datasets, PCQM4M\_multitask and PM6\_83M, but with smaller molecules.
### Tox21
***Tox21*** is a well-known dataset for researchers in machine learning for drug discovery [2]. It consists of a multi-label classification task with 12 labels, with most labels missing and a strong imbalance towards the negative class. We chose ***Tox21*** for our ***ToyMix*** since it is very similar to the larger proposed bioassay dataset, ***PCBA\_1328\_1564k***, both in terms of sparsity and imbalance, and to the ***L1000*** datasets in terms of imbalance.
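
To make the label sparsity concrete, here is a minimal sketch of a loss that ignores missing labels encoded as NaN (an illustration, not Graphium's own loss implementation):

```python
import torch
import torch.nn.functional as F

def masked_bce(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over an (n_mols, 12) label matrix where missing labels are NaN."""
    mask = ~torch.isnan(labels)
    return F.binary_cross_entropy_with_logits(logits[mask], labels[mask])

logits = torch.randn(4, 12)
labels = torch.full((4, 12), float("nan"))
labels[:, 0] = torch.tensor([1.0, 0.0, 1.0, 0.0])  # only one of the 12 labels is observed
print(masked_bce(logits, labels))
```
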
### ZINC12k
***ZINC12k*** is a well-known dataset for researchers in GNN expressivity [3]. We include it in our ***ToyMix*** since GNN expressivity is very important for performance on large-scale data. Hence, we hope that the performance on this task will correlate well with the performance when scaling.
## LargeMix (PCQM4M + PCBA1328 + L1000)
In this section, we present the ***LargeMix*** dataset, comprising four different datasets with tasks taken from quantum chemistry (***PCQM4M***), bio-assays (***PCBA***) and transcriptomics.

### Train/validation/test/test\_seen splits
For ***PCQM4M\_G25\_N4***, we create a 0.92/0.04/0.04 split. Then, for all the other datasets in ***LargeMix***, we first create a "test\_seen" split by taking the set of molecules from ***L1000*** and ***PCBA1328*** that are also present in the training set of ***PCQM4M\_G25\_N4***, such that we can evaluate whether having the quantum properties of a molecule helps generalize to biological properties. For the remaining parts, we split randomly with a ratio of 0.92/0.04/0.04.
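
As a sketch of how such a "test\_seen" split can be built (illustrative only; assumes each dataset is a pandas DataFrame with a canonical `smiles` column):

```python
import pandas as pd

def split_test_seen(dataset: pd.DataFrame, pcqm4m_train: pd.DataFrame):
    """Molecules also present in the PCQM4M training set form `test_seen`."""
    seen = dataset["smiles"].isin(set(pcqm4m_train["smiles"]))
    # The remaining molecules are then split randomly into train/val/test (0.92/0.04/0.04).
    return dataset[seen], dataset[~seen]
```
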
### L1000 VCAP and MCF7
The ***LINCS L1000*** is a database of high-throughput transcriptomics that screened more than 30,000 perturbations on a set of 978 landmark genes [4] from multiple cell lines. ***VCAP*** and ***MCF7*** are, respectively, prostate cancer and human breast cancer cell lines. In ***L1000***, most of the perturbagens are chemical, meaning that small drug-like molecules are added to the cell lines to observe how the gene expressions change. This makes it possible to generate biological signatures of the molecules, which are known to correlate with drug activity and side effects.

To process the data into our two datasets comprising the ***VCAP*** and ***MCF7*** cell lines, we used their "level 5" data, composed of the cleaned-up data converted to z-scores and filtered to keep only chemical perturbagens. However, we were left with multiple data points per molecule since some variables could change (e.g., incubation time) and generate a new measurement. Given our objective of generating a single signature per molecule, we decided to take the measurement with the strongest global activity, i.e., the one whose variance over the 978 genes is maximal. Then, since these signatures are generally noisy, we binned them into five classes corresponding to z-scores based on the thresholds $\{-4, -2, 2, 4\}$.

The cell lines ***VCAP*** and ***MCF7*** were selected since they have a higher number of unique molecular perturbagens than other cell lines. They also have a relatively lower data imbalance, with ~92% of the labels falling in the "neutral class" where the z-score is between -2 and 2.
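
For concreteness, a minimal sketch of that five-class binning with `numpy` (the 0–4 class ordering is an assumption for illustration):

```python
import numpy as np

# Thresholds {-4, -2, 2, 4} map a z-score to 5 classes:
# 0: z < -4, 1: -4 <= z < -2, 2: -2 <= z < 2 (the "neutral" class), 3: 2 <= z < 4, 4: z >= 4
z_scores = np.array([-5.1, -2.7, 0.3, 3.2, 6.0])
classes = np.digitize(z_scores, bins=[-4.0, -2.0, 2.0, 4.0])
print(classes)  # [0 1 2 3 4]
```
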
### PCBA1328
This dataset is very similar to the ***OGBG-PCBA*** dataset [5], but instead of being limited to 128 assays and 437k molecules, it comprises 1,328 assays and 1.56M molecules. This dataset is very interesting for pre-training molecular models since it contains information about a molecule's behavior in various settings relevant to biochemists, with evidence that it improves binding predictions. Analogous to the gene expression, we obtain a bio-assay expression of each molecule.

To gather the data, we looped over the PubChem index of bioassays [6] and collected every bioassay containing more than 6,000 molecules annotated with either "Active" or "Inactive" and at least 10 of each. Then, we converted all the molecular IDs to canonical SMILES and used them to merge all of the bioassays into a single dataset.
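
A sketch of that filtering criterion (illustrative only; assumes one pandas DataFrame per bioassay with an `activity` column):

```python
import pandas as pd

def keep_bioassay(assay: pd.DataFrame) -> bool:
    """Keep bioassays with >6,000 annotated molecules and at least 10 'Active' and 10 'Inactive'."""
    counts = assay["activity"].value_counts()
    n_active, n_inactive = counts.get("Active", 0), counts.get("Inactive", 0)
    return (n_active + n_inactive > 6000) and n_active >= 10 and n_inactive >= 10
```
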
### PCQM4M\_G25\_N4
This dataset comes from the same data source as the ***OGBG-PCQM4M*** dataset, famously known for being part of the OGB large-scale challenge [7] and for being one of the only graph datasets where pure Transformers have proven successful. The data source is the PubChemQC project [8], which computed DFT properties on the energy-minimized conformation of 3.8M small molecules from PubChem.

Contrary to the OGB challenge, we aim to provide enough data for pre-training GNNs, so we do not limit ourselves to the HOMO-LUMO gap prediction [7]. Instead, we gather properties directly given by the DFT (e.g., energies) and compute other 3D descriptors from the conformation (e.g., inertia, the plane of best fit). We also gather node-level properties, namely the Mulliken and Lowdin charges at each atom. Furthermore, about half of the molecules have time-dependent DFT results to help inform about the molecule's excited states. Looking forward, we plan on adding edge-level tasks to enable the prediction of bond properties, such as their lengths and the gradient of the charges.
## UltraLarge Dataset
### PM6\_83M
This dataset is similar to ***PCQM4M*** and comes from the same PubChemQC project. However, it uses the PM6 semi-empirical computation of the quantum properties, which is orders of magnitude faster than DFT computation at the expense of accuracy [8, 9].

This dataset covers 83M unique molecules, 62 graph-level tasks, and 7 node-level tasks. To our knowledge, this is the largest dataset available for training 2D-GNNs in terms of the number of unique molecules. The various tasks come from four different molecular states, namely "S0" for the ground state, "T0" for the lowest-energy triplet excited state, "cation" for the positively charged state, and "anion" for the negatively charged state. In total, there are 221M PM6 computations.
## References
[1] https://www.nature.com/articles/sdata201422/

[2] https://europepmc.org/article/MED/23603828

[3] https://arxiv.org/abs/2003.00982v3

[4] https://pubmed.ncbi.nlm.nih.gov/29195078/

[5] https://arxiv.org/abs/2005.00687

[6] https://pubmed.ncbi.nlm.nih.gov/26400175/

[7] https://arxiv.org/abs/2103.09430

[8] https://pubs.acs.org/doi/10.1021/acs.jcim.7b00083

[9] https://arxiv.org/abs/1904.06046

docs/design.md

Lines changed: 33 additions & 71 deletions
---

The library is designed with 3 things in mind:

- High modularity and configurability with *YAML* files
- Containing the state-of-the-art GNNs, including positional encodings and graph Transformers
- Massive multitasking across diverse and sparse datasets

The current page will walk you through the different aspects of the design that enable this.

### Diagram for data processing in Graphium

First, when working with molecules, there are tons of options regarding atomic and bond featurisation that can be extracted from the periodic table, from empirical results, or from simulated 3D structures.

Second, when working with graph Transformers, there are plenty of options regarding the positional and structural encodings (PSE), which are fundamental in driving the accuracy and the generalization of the models.

With this in mind, we propose a very versatile chemical and PSE encoding, alongside an encoder manager, that can be fully configured from the YAML files. The idea is to assign matching *input keys* to both the features and the encoders, then pool the outputs according to the *output keys*. It is better summarized in the image below; an illustrative sketch follows the figure.

<img src="images/datamodule.png" alt="Data Processing Chart" width="100%" height="100%">
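
As a rough illustration of the key-matching idea, here is a hypothetical configuration written as a Python dict; the real Graphium YAML schema and key names may differ:

```python
# Hypothetical encoder configuration: each encoder consumes the features whose names
# match its `input_keys`, and its result is pooled into the tensor named by `output_key`.
encoders = {
    "atom_encoder": {"input_keys": ["feat"], "output_key": "feat"},
    "bond_encoder": {"input_keys": ["edge_feat"], "output_key": "edge_feat"},
    "rwse_encoder": {"input_keys": ["rw_pos_enc"], "output_key": "feat"},  # a PSE pooled into node features
}
```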

### Diagram for Multi-task network in Graphium

As mentioned, we want to be able to perform massive multi-tasking to enable us to work across a huge diversity of datasets. The idea is to use a combination of multiple task heads, where a different MLP is applied to each task's predictions. However, it is also designed such that each task can have as many labels as desired, thus enabling labels to be grouped according to whether they should share weights/losses.

The design is better explained in the image below; notice how the *keys* output by GraphDict are used differently across the different GNN layers. A small sketch of the task-head idea follows the figure.

<img src="images/full_graph_network.png" alt="Full Graph Multi-task Network" width="100%" height="100%">
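
To make the task-head idea concrete, here is a minimal PyTorch sketch (not Graphium's actual implementation; dimensions and task names are illustrative):

```python
import torch
from torch import nn

class SimpleTaskHeads(nn.Module):
    """One small MLP per task, all applied to the same shared graph embedding."""

    def __init__(self, hidden_dim: int, tasks: dict):
        super().__init__()
        self.heads = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, n_labels),
            )
            for name, n_labels in tasks.items()
        })

    def forward(self, graph_embedding: torch.Tensor) -> dict:
        # One prediction tensor per task, all sharing the same trunk embedding.
        return {name: head(graph_embedding) for name, head in self.heads.items()}

# ToyMix-like example: QM9 (19 regression labels), Tox21 (12 classification labels), ZINC (1 label)
heads = SimpleTaskHeads(hidden_dim=64, tasks={"qm9": 19, "tox21": 12, "zinc": 1})
predictions = heads(torch.randn(8, 64))
```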

## Structure of the code

The code is built to rapidly iterate on different architectures of neural networks (NN) and graph neural networks (GNN) with PyTorch. The main focus of this work is molecular tasks, and we use the package `rdkit` to transform molecular SMILES into graphs.
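
For intuition, a toy sketch of turning a SMILES string into graph arrays with `rdkit` (a simplified version of what the data pipeline does, not Graphium's featurisation code):

```python
import numpy as np
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Return the adjacency matrix and a trivial per-atom feature (the atomic number)."""
    mol = Chem.MolFromSmiles(smiles)
    adjacency = Chem.GetAdjacencyMatrix(mol)
    atom_features = np.array([atom.GetAtomicNum() for atom in mol.GetAtoms()])
    return adjacency, atom_features

adj, feats = smiles_to_graph("CCO")  # ethanol: 3 heavy atoms
```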

Below is a list of the directories and their respective documentation:

- cli
- [config](https://github.com/datamol-io/graphium/blob/main/graphium/config/README.md)
- [data](https://github.com/datamol-io/graphium/blob/main/graphium/data/README.md)
- [features](https://github.com/datamol-io/graphium/tree/main/graphium/features/README.md)
- finetuning
- [ipu](https://github.com/datamol-io/graphium/tree/main/graphium/ipu/README.md)
- [nn](https://github.com/datamol-io/graphium/tree/main/graphium/nn/README.md)
- [trainer](https://github.com/datamol-io/graphium/tree/main/graphium/trainer/README.md)
- [utils](https://github.com/datamol-io/graphium/tree/main/graphium/features/README.md)
- [visualization](https://github.com/datamol-io/graphium/tree/main/graphium/visualization/README.md)

## Structure of the configs

Making the library very modular requires configuration files of more than 200 lines, which becomes intractable, especially when we only want minor changes between configurations.
Hence, we use [hydra](https://hydra.cc/docs/intro/) to enable splitting the configurations into smaller and composable configuration files.

Examples of possibilities include:
- Switching between accelerators (CPU, GPU and IPU)
- Benchmarking different models on the same dataset
- Fine-tuning a pre-trained model on a new dataset

[In this document](https://github.com/datamol-io/graphium/tree/main/expts/hydra-configs#readme), we describe in detail how each of the above is achieved and how users can benefit from this design to get the most out of Graphium with as little configuration as possible.

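To give a feel for how such composed configurations can be loaded programmatically, here is a small sketch using hydra's compose API; the config path, config name, and override key are placeholders that depend on the actual files under `expts/hydra-configs`:

```python
from hydra import compose, initialize

# Compose a full configuration from smaller files and override a few choices,
# e.g. swapping the accelerator. All names below are placeholders.
with initialize(version_base=None, config_path="expts/hydra-configs"):
    cfg = compose(config_name="main", overrides=["accelerator=gpu"])

print(cfg)
```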