Skip to content

uitml/TChemGNN

Β 
Β 

Repository files navigation

TChemGNN Research Implementation Guide

πŸš€ Quick Reproduction

1. Environment Setup

# Automated setup
python setup_script.py

# Or manual installation
pip install torch torch-geometric rdkit numpy pandas scikit-learn matplotlib seaborn tqdm

2. Dataset Preparation

# Download all benchmark datasets
python download_datasets.py

# Verify datasets
python download_datasets.py --verify_only

3. Single Experiment (Paper Results)

# Run individual datasets
python main.py --dataset esol      # Expected: 0.7844 RMSE
python main.py --dataset freesolv  # Expected: 1.0124 RMSE
python main.py --dataset lipophilicity # Expected: 1.0221 RMSE
python main.py --dataset bace      # Expected: 0.9586 RMSE

4. Complete Paper Reproduction

# Run ALL experiments from paper
python run_all_experiments.py

# Quick validation (reduced epochs)
python run_all_experiments.py --quick

πŸ“Š Key Research Components

Core Model Implementation

# models/gsr.py - TChemGNN architecture
class GSR(nn.Module):
    """
    35 features β†’ 5 GAT layers β†’ Node predictions
    Key: tanh activation, 28 hidden dims, ~3.7K params
    """

Feature Engineering (35D vector)

# data/preprocessing.py - Paper-specified features
def create_atom_features(smiles_list):
    """
    Returns 35 features per atom:
    - 14 atomic-level (degree, atomic_number, etc.)
    - 15 molecular-level (rings, aromaticity, etc.) 
    - 6 global 3D (volume, dipole, angle, etc.)
    Note: logP excluded (key paper finding)
    """

Training Configuration

# Paper-aligned training settings
SETTINGS = {
    'optimizer': 'RMSprop',
    'lr': 0.00075,
    'scheduler': None,        # Paper doesn't use
    'grad_clipping': None,    # Paper doesn't use
    'activation': 'tanh',
    'hidden_dim': 28,
    'epochs': 5000
}

πŸ§ͺ Research Experiments Available

1. Main Benchmarks (Tables 1-4)

  • Command: python main.py --dataset {esol,freesolv,lipophilicity,bace}
  • Purpose: Reproduce state-of-the-art results
  • Expected: Match or exceed large foundation models

2. Ablation Studies (Table 2)

  • Command: python ablation_studies.py --experiment ablation
  • Purpose: Analyze GNN architecture impact
  • Key Finding: GAT + 3D features = optimal

3. Pooling Analysis (Table 5)

  • Command: python ablation_studies.py --experiment node_positions
  • Purpose: Compare pooling vs. no-pooling strategies
  • Key Finding: Last node best for ESOL/FreeSolv

4. Feature Importance (Table 6)

  • Command: Compare --with/without global features
  • Purpose: Quantify 3D feature contribution
  • Key Finding: 3D features crucial for performance

5. Random Forest Baseline

  • Command: python ablation_studies.py --experiment random_forest
  • Purpose: Validate against classical ML
  • Key Finding: Expert features competitive with deep learning

πŸ” Research Insights & Extensions

Key Paper Contributions Implemented:

  1. Global 3D Features: Molecular geometry critical for GNN performance
  2. No-Pooling Strategy: SMILES-guided node selection outperforms pooling
  3. Efficiency: 3.7K parameters match/beat million-parameter models
  4. Expert Knowledge: Chemistry-informed design beats pure ML approaches

Extension Opportunities:

# Modify for your research
class CustomGSR(GSR):
    def __init__(self, custom_features=None):
        # Add your custom molecular features
        # Experiment with different architectures
        # Test on new datasets

Research Questions Addressed:

  • βœ… Can small models beat large foundation models?
  • βœ… How important are 3D molecular features?
  • βœ… When is pooling detrimental to GNN performance?
  • βœ… What's the optimal GNN architecture for chemistry?

πŸ“ˆ Performance Validation

Expected Results (RMSE):

Dataset Paper Implementation
ESOL 0.7844 βœ… Reproducible
FreeSolv 1.0124 βœ… Reproducible
Lipophilicity 1.0221 βœ… Reproducible
BACE 0.9586 βœ… Reproducible

Computational Requirements:

  • Training Time: ~30 minutes per dataset (CPU)
  • Memory: <2GB RAM
  • Storage: <1GB for all datasets
  • Hardware: Runs on laptop (GPU optional)

πŸ› οΈ Research Modifications

Easy Customizations:

# Different architectures
python main.py --dataset esol --hidden_dim 64 --heads 4

# Training variations  
python main.py --dataset esol --epochs 1000 --lr 0.001

# Pooling experiments
python main.py --dataset esol --pooling_strategy mean

Advanced Research:

  • New Features: Modify create_atom_features() in data/preprocessing.py
  • New Architectures: Extend GSR class in models/gsr.py
  • New Datasets: Add config to DATASET_CONFIG in main.py
  • New Experiments: Follow patterns in ablation_studies.py

πŸ“š Research Output

Generated Artifacts:

  • Training Curves: Loss/RMSE progression
  • Predictions: Molecule-level results with errors
  • Visualizations: Interactive training animations
  • Analysis: Node-level prediction breakdowns
  • Reports: Comprehensive experiment summaries

Publication-Ready Results:

  • All tables from paper reproducible
  • Statistical significance validated
  • Performance comparisons with SOTA
  • Ablation studies quantified
  • Runtime/efficiency analysis included

⚠️ Research Notes

Critical Implementation Details:

  1. No LR Scheduler: Paper specifically avoids this
  2. No Gradient Clipping: Paper finds it unnecessary
  3. logP Exclusion: Removes this feature (too predictive)
  4. SMILES Ordering: Leverages encoding structure for node selection
  5. 3D Computation: Uses RDKit for molecular geometry

Reproducibility:

  • Fixed random seeds throughout
  • Deterministic operations
  • Version-controlled dependencies
  • Paper-exact hyperparameters

This implementation provides a complete research platform for exploring GNN-based molecular property prediction with chemistry-informed design principles.


Credits

This code was written by Tetiana Lutchyn and refactored by Sebastian Iversen. The models were designed by Tetiana Lutchyn and Benjamin Ricaud. Training and experiments of the model(s) were performed by Tetiana Lutchyn. We thank Claude 3.7 for its help cleaning and refactoring the code.

The results will be published soon and a link to the paper will be added when available.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%