```bash
# Automated setup
python setup_script.py

# Or manual installation
pip install torch torch-geometric rdkit numpy pandas scikit-learn matplotlib seaborn tqdm
```
```bash
# Download all benchmark datasets
python download_datasets.py

# Verify datasets
python download_datasets.py --verify_only
```
```bash
# Run individual datasets
python main.py --dataset esol           # Expected: 0.7844 RMSE
python main.py --dataset freesolv       # Expected: 1.0124 RMSE
python main.py --dataset lipophilicity  # Expected: 1.0221 RMSE
python main.py --dataset bace           # Expected: 0.9586 RMSE

# Run ALL experiments from the paper
python run_all_experiments.py

# Quick validation (reduced epochs)
python run_all_experiments.py --quick
```
```python
# models/gsr.py - TChemGNN architecture
class GSR(nn.Module):
    """
    35 features → 5 GAT layers → Node predictions
    Key: tanh activation, 28 hidden dims, ~3.7K params
    """
```
```python
# data/preprocessing.py - Paper-specified features
def create_atom_features(smiles_list):
    """
    Returns 35 features per atom:
    - 14 atomic-level (degree, atomic_number, etc.)
    - 15 molecular-level (rings, aromaticity, etc.)
    - 6 global 3D (volume, dipole, angle, etc.)
    Note: logP excluded (key paper finding)
    """
```
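To make the 14 + 15 + 6 = 35 split concrete, here is a dependency-free sketch of how the three feature groups might be concatenated per atom. The helper name and argument names are hypothetical; the actual descriptor computation (via RDKit) is in `data/preprocessing.py`:

```python
def assemble_atom_features(atomic_feats, molecular_feats, global_3d_feats):
    """Concatenate the three feature groups into one 35-dim vector per atom.

    atomic_feats:    14 atomic-level descriptors (e.g. degree, atomic number)
    molecular_feats: 15 molecular-level descriptors (e.g. rings, aromaticity)
    global_3d_feats:  6 global 3D descriptors (e.g. volume, dipole, angle)
    """
    assert len(atomic_feats) == 14
    assert len(molecular_feats) == 15
    assert len(global_3d_feats) == 6
    return list(atomic_feats) + list(molecular_feats) + list(global_3d_feats)
```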
```python
# Paper-aligned training settings
SETTINGS = {
    'optimizer': 'RMSprop',
    'lr': 0.00075,
    'scheduler': None,      # the paper does not use one
    'grad_clipping': None,  # the paper does not use it
    'activation': 'tanh',
    'hidden_dim': 28,
    'epochs': 5000,
}
```
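These settings translate into a deliberately plain PyTorch loop. The model and data below are stand-ins just to show the optimizer wiring; note there is no LR scheduler and no gradient clipping, matching the paper:

```python
import torch
import torch.nn as nn

SETTINGS = {'optimizer': 'RMSprop', 'lr': 0.00075, 'hidden_dim': 28}

# Stand-in model/data; the real model is the GAT network in models/gsr.py.
model = nn.Sequential(nn.Linear(35, SETTINGS['hidden_dim']), nn.Tanh(),
                      nn.Linear(SETTINGS['hidden_dim'], 1))
optimizer = torch.optim.RMSprop(model.parameters(), lr=SETTINGS['lr'])
loss_fn = nn.MSELoss()

x, y = torch.randn(16, 35), torch.randn(16, 1)
for epoch in range(10):  # the paper trains for 5000 epochs; 10 for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()      # no gradient clipping
    optimizer.step()     # no scheduler step
```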
- Command: `python main.py --dataset {esol,freesolv,lipophilicity,bace}`
- Purpose: Reproduce state-of-the-art results
- Expected: Match or exceed large foundation models
- Command: `python ablation_studies.py --experiment ablation`
- Purpose: Analyze GNN architecture impact
- Key Finding: GAT + 3D features = optimal
- Command: `python ablation_studies.py --experiment node_positions`
- Purpose: Compare pooling vs. no-pooling strategies
- Key Finding: Last node best for ESOL/FreeSolv
- Command: compare runs with and without the global 3D features
- Purpose: Quantify the 3D feature contribution
- Key Finding: 3D features crucial for performance
- Command: `python ablation_studies.py --experiment random_forest`
- Purpose: Validate against classical ML
- Key Finding: Expert features competitive with deep learning
- Global 3D Features: Molecular geometry critical for GNN performance
- No-Pooling Strategy: SMILES-guided node selection outperforms pooling
- Efficiency: 3.7K parameters match/beat million-parameter models
- Expert Knowledge: Chemistry-informed design beats pure ML approaches
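The no-pooling finding can be illustrated with a toy readout comparison. Here `node_preds` is a hypothetical list of per-node predictions, ordered as the atoms appear in the SMILES string:

```python
def mean_pool_readout(node_preds):
    """Conventional readout: average the predictions of all nodes."""
    return sum(node_preds) / len(node_preds)


def last_node_readout(node_preds):
    """No-pooling readout: keep the prediction of the last SMILES-ordered node."""
    return node_preds[-1]


node_preds = [0.2, 0.4, 0.9]  # toy per-node outputs for a 3-atom molecule
```

Mean pooling blends every node into one value, while the SMILES-guided strategy selects a single node; the ablation above finds the last node works best on ESOL and FreeSolv.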
```python
# Modify for your research
class CustomGSR(GSR):
    def __init__(self, custom_features=None):
        super().__init__()
        self.custom_features = custom_features
        # Add your custom molecular features,
        # experiment with different architectures,
        # and test on new datasets.
```
- ✅ Can small models beat large foundation models?
- ✅ How important are 3D molecular features?
- ✅ When is pooling detrimental to GNN performance?
- ✅ What's the optimal GNN architecture for chemistry?
| Dataset | Paper (RMSE) | Implementation |
|---|---|---|
| ESOL | 0.7844 | ✅ Reproducible |
| FreeSolv | 1.0124 | ✅ Reproducible |
| Lipophilicity | 1.0221 | ✅ Reproducible |
| BACE | 0.9586 | ✅ Reproducible |
- Training Time: ~30 minutes per dataset (CPU)
- Memory: <2GB RAM
- Storage: <1GB for all datasets
- Hardware: Runs on laptop (GPU optional)
```bash
# Different architectures
python main.py --dataset esol --hidden_dim 64 --heads 4

# Training variations
python main.py --dataset esol --epochs 1000 --lr 0.001

# Pooling experiments
python main.py --dataset esol --pooling_strategy mean
```
- New Features: Modify `create_atom_features()` in `data/preprocessing.py`
- New Architectures: Extend the `GSR` class in `models/gsr.py`
- New Datasets: Add a config to `DATASET_CONFIG` in `main.py`
- New Experiments: Follow the patterns in `ablation_studies.py`
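Adding a dataset might look like the following. The keys shown here are illustrative guesses at the config shape, not the exact schema; check `DATASET_CONFIG` in `main.py` for the real keys:

```python
# Hypothetical entry shape for DATASET_CONFIG in main.py.
DATASET_CONFIG = {
    'esol': {
        'csv_path': 'data/esol.csv',       # assumed path
        'smiles_column': 'smiles',         # assumed column name
        'target_column': 'solubility',     # assumed column name
        'task': 'regression',
    },
}

# Register a new dataset by adding a config entry with the same shape.
DATASET_CONFIG['my_dataset'] = {
    'csv_path': 'data/my_dataset.csv',
    'smiles_column': 'smiles',
    'target_column': 'target',
    'task': 'regression',
}
```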
- Training Curves: Loss/RMSE progression
- Predictions: Molecule-level results with errors
- Visualizations: Interactive training animations
- Analysis: Node-level prediction breakdowns
- Reports: Comprehensive experiment summaries
- All tables from paper reproducible
- Statistical significance validated
- Performance comparisons with SOTA
- Ablation studies quantified
- Runtime/efficiency analysis included
- No LR Scheduler: Paper specifically avoids this
- No Gradient Clipping: Paper finds it unnecessary
- logP Exclusion: The logP feature is deliberately removed as too predictive (a key paper finding)
- SMILES Ordering: Leverages encoding structure for node selection
- 3D Computation: Uses RDKit for molecular geometry
- Fixed random seeds throughout
- Deterministic operations
- Version-controlled dependencies
- Paper-exact hyperparameters
This implementation provides a complete research platform for exploring GNN-based molecular property prediction with chemistry-informed design principles.
This code was written by Tetiana Lutchyn and refactored by Sebastian Iversen. The models were designed by Tetiana Lutchyn and Benjamin Ricaud. Model training and experiments were performed by Tetiana Lutchyn. We thank Claude 3.7 for its help cleaning and refactoring the code.
The results will be published soon and a link to the paper will be added when available.