```bash
# Automated setup
python setup_script.py

# Or manual installation
pip install torch torch-geometric rdkit numpy pandas scikit-learn matplotlib seaborn tqdm
```
```bash
# Download all benchmark datasets
python download_datasets.py

# Verify datasets
python download_datasets.py --verify_only
```
```bash
# Run individual datasets
python main.py --dataset esol           # Expected: 0.7844 RMSE
python main.py --dataset freesolv       # Expected: 1.0124 RMSE
python main.py --dataset lipophilicity  # Expected: 1.0221 RMSE
python main.py --dataset bace           # Expected: 0.9586 RMSE

# Run ALL experiments from the paper
python run_all_experiments.py

# Quick validation (reduced epochs)
python run_all_experiments.py --quick
```
```python
# models/gsr.py - TChemGNN architecture
class GSR(nn.Module):
    """
    35 features → 5 GAT layers → Node predictions
    Key: tanh activation, 28 hidden dims, ~3.7K params
    """
```
```python
# data/preprocessing.py - Paper-specified features
def create_atom_features(smiles_list):
    """
    Returns 35 features per atom:
    - 14 atomic-level (degree, atomic_number, etc.)
    - 15 molecular-level (rings, aromaticity, etc.)
    - 6 global 3D (volume, dipole, angle, etc.)
    Note: logP excluded (key paper finding)
    """
```
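To make the 14 + 15 + 6 = 35 split concrete, here is a dependency-free sketch of how the three feature groups might be concatenated per atom. The helper name and argument names are hypothetical; the actual descriptor computation (via RDKit) is in `data/preprocessing.py`:

```python
def assemble_atom_features(atomic_feats, molecular_feats, global_3d_feats):
    """Concatenate the three feature groups into one 35-dim vector per atom.

    atomic_feats:    14 atomic-level descriptors (e.g. degree, atomic number)
    molecular_feats: 15 molecular-level descriptors (e.g. rings, aromaticity)
    global_3d_feats:  6 global 3D descriptors (e.g. volume, dipole, angle)
    """
    assert len(atomic_feats) == 14
    assert len(molecular_feats) == 15
    assert len(global_3d_feats) == 6
    return list(atomic_feats) + list(molecular_feats) + list(global_3d_feats)
```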
```python
# Paper-aligned training settings
SETTINGS = {
    'optimizer': 'RMSprop',
    'lr': 0.00075,
    'scheduler': None,      # the paper does not use one
    'grad_clipping': None,  # the paper does not use it
    'activation': 'tanh',
    'hidden_dim': 28,
    'epochs': 5000,
}
```
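These settings translate into a deliberately plain PyTorch loop. The model and data below are stand-ins just to show the optimizer wiring; note there is no LR scheduler and no gradient clipping, matching the paper:

```python
import torch
import torch.nn as nn

SETTINGS = {'optimizer': 'RMSprop', 'lr': 0.00075, 'hidden_dim': 28}

# Stand-in model/data; the real model is the GAT network in models/gsr.py.
model = nn.Sequential(nn.Linear(35, SETTINGS['hidden_dim']), nn.Tanh(),
                      nn.Linear(SETTINGS['hidden_dim'], 1))
optimizer = torch.optim.RMSprop(model.parameters(), lr=SETTINGS['lr'])
loss_fn = nn.MSELoss()

x, y = torch.randn(16, 35), torch.randn(16, 1)
for epoch in range(10):  # the paper trains for 5000 epochs; 10 for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()      # no gradient clipping
    optimizer.step()     # no scheduler step
```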
- Command: `python main.py --dataset {esol,freesolv,lipophilicity,bace}`
- Purpose: Reproduce state-of-the-art results
- Expected: Match or exceed large foundation models
- Command: `python ablation_studies.py --experiment ablation`
- Purpose: Analyze GNN architecture impact
- Key Finding: GAT + 3D features = optimal
- Command: `python ablation_studies.py --experiment node_positions`
- Purpose: Compare pooling vs. no-pooling strategies
- Key Finding: Last node best for ESOL/FreeSolv
- Command: compare runs with and without the global 3D features
- Purpose: Quantify the 3D feature contribution
- Key Finding: 3D features crucial for performance
- Command: `python ablation_studies.py --experiment random_forest`
- Purpose: Validate against classical ML
- Key Finding: Expert features competitive with deep learning
- Global 3D Features: Molecular geometry critical for GNN performance
- No-Pooling Strategy: SMILES-guided node selection outperforms pooling
- Efficiency: 3.7K parameters match/beat million-parameter models
- Expert Knowledge: Chemistry-informed design beats pure ML approaches
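The no-pooling finding can be illustrated with a toy readout comparison. Here `node_preds` is a hypothetical list of per-node predictions, ordered as the atoms appear in the SMILES string:

```python
def mean_pool_readout(node_preds):
    """Conventional readout: average the predictions of all nodes."""
    return sum(node_preds) / len(node_preds)


def last_node_readout(node_preds):
    """No-pooling readout: keep the prediction of the last SMILES-ordered node."""
    return node_preds[-1]


node_preds = [0.2, 0.4, 0.9]  # toy per-node outputs for a 3-atom molecule
```

Mean pooling blends every node into one value, while the SMILES-guided strategy selects a single node; the ablation above finds the last node works best on ESOL and FreeSolv.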
```python
# Modify for your research
class CustomGSR(GSR):
    def __init__(self, custom_features=None):
        super().__init__()
        self.custom_features = custom_features
        # Add your custom molecular features,
        # experiment with different architectures,
        # and test on new datasets.
```
- ✅ Can small models beat large foundation models?
- ✅ How important are 3D molecular features?
- ✅ When is pooling detrimental to GNN performance?
- ✅ What's the optimal GNN architecture for chemistry?
| Dataset | Paper (RMSE) | Implementation |
|---|---|---|
| ESOL | 0.7844 | ✅ Reproducible |
| FreeSolv | 1.0124 | ✅ Reproducible |
| Lipophilicity | 1.0221 | ✅ Reproducible |
| BACE | 0.9586 | ✅ Reproducible |
- Training Time: ~30 minutes per dataset (CPU)
- Memory: <2GB RAM
- Storage: <1GB for all datasets
- Hardware: Runs on laptop (GPU optional)
```bash
# Different architectures
python main.py --dataset esol --hidden_dim 64 --heads 4

# Training variations
python main.py --dataset esol --epochs 1000 --lr 0.001

# Pooling experiments
python main.py --dataset esol --pooling_strategy mean
```
- New Features: Modify `create_atom_features()` in `data/preprocessing.py`
- New Architectures: Extend the `GSR` class in `models/gsr.py`
- New Datasets: Add a config to `DATASET_CONFIG` in `main.py`
- New Experiments: Follow the patterns in `ablation_studies.py`
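Adding a dataset might look like the following. The keys shown here are illustrative guesses at the config shape, not the exact schema; check `DATASET_CONFIG` in `main.py` for the real keys:

```python
# Hypothetical entry shape for DATASET_CONFIG in main.py.
DATASET_CONFIG = {
    'esol': {
        'csv_path': 'data/esol.csv',       # assumed path
        'smiles_column': 'smiles',         # assumed column name
        'target_column': 'solubility',     # assumed column name
        'task': 'regression',
    },
}

# Register a new dataset by adding a config entry with the same shape.
DATASET_CONFIG['my_dataset'] = {
    'csv_path': 'data/my_dataset.csv',
    'smiles_column': 'smiles',
    'target_column': 'target',
    'task': 'regression',
}
```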
- Training Curves: Loss/RMSE progression
- Predictions: Molecule-level results with errors
- Visualizations: Interactive training animations
- Analysis: Node-level prediction breakdowns
- Reports: Comprehensive experiment summaries
- All tables from paper reproducible
- Statistical significance validated
- Performance comparisons with SOTA
- Ablation studies quantified
- Runtime/efficiency analysis included
- No LR Scheduler: Paper specifically avoids this
- No Gradient Clipping: Paper finds it unnecessary
- logP Exclusion: The logP feature is deliberately removed as too predictive (a key paper finding)
- SMILES Ordering: Leverages encoding structure for node selection
- 3D Computation: Uses RDKit for molecular geometry
- Fixed random seeds throughout
- Deterministic operations
- Version-controlled dependencies
- Paper-exact hyperparameters
This implementation provides a complete research platform for exploring GNN-based molecular property prediction with chemistry-informed design principles.
This code was written by Tetiana Lutchyn and refactored by Sebastian Iversen. The models were designed by Tetiana Lutchyn and Benjamin Ricaud. Model training and experiments were performed by Tetiana Lutchyn. We thank Claude 3.7 for its help cleaning and refactoring the code.
The results will be published soon and a link to the paper will be added when available.