|
1 | 1 | <p align="center"> |
2 | | -<p align="center"> |
3 | | - <img alt="logo" src="https://raw.githubusercontent.com/scapeML/scape/main/docs/assets/scape_logo.png" height="200"> |
4 | | -</p> |
5 | | -<h1 align="center" margin=0px> |
6 | | -ScAPE: Single-cell Analysis of Perturbational Effects |
7 | | -</h1> |
| 2 | + <img alt="ScAPE Logo" src="https://raw.githubusercontent.com/scapeML/scape/main/docs/assets/scape_logo.png" height="200"> |
8 | 3 | </p> |
9 | 4 |
|
10 | | -[](https://colab.research.google.com/drive/1-o_lT-ttoKS-nbozj2RQusGoi-vm0-XL?usp=sharing) |
11 | | - |
12 | | -ScAPE is a package implementing the neural network model used in the _Open Problems – Single-Cell Perturbations challenge_, part of the NeurIPS 2023 Competition Track, hosted by Kaggle. The model won one of the $10,000 Judges' Prizes and achieved top 2% performance on the final [test set](https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/leaderboard) (16th position out of 1097 teams). |
13 | | - |
14 | | -## Description |
| 5 | +<h1 align="center">ScAPE: Single-cell Analysis of Perturbational Effects</h1> |
15 | 6 |
|
16 | | -In this Kaggle competition, the main objective was to predict the effect of drug perturbations on peripheral blood mononuclear cells (PBMCs) from several patient samples. |
17 | | - |
18 | | -Similar to most problems in biological research via omics data, we encountered a high-dimensional feature space (~18k genes) and a low-dimensional observation space (~614 cell/drug combinations) with a low signal-to-noise ratio, where most of the genes show random fluctuations after perturbation. The main data modality to be predicted consisted of signed and log-transformed P-values from differential expression (DE) analysis. In the DE analysis, pseudo-bulk expression profiles from drug-treated cells were compared against the profiles of cells treated with Dimethyl Sulfoxide (DMSO). |
19 | | - |
20 | | -<p align="center"> |
21 | 7 | <p align="center"> |
22 | | - <img alt="description" src="docs/assets/nn-architecture.png" width="720" style="max-width: 100%; height: auto;"> |
23 | | -</p> |
24 | | -<p align="center" margin=0px> |
25 | | -Neural network architecture used for the challenge (ScAPE model). |
26 | | -</p> |
| 8 | + <strong>Predict drug perturbation effects on single-cell gene expression</strong> |
27 | 9 | </p> |
28 | 10 |
|
| 11 | +<p align="center"> |
| 12 | + <a href="https://colab.research.google.com/drive/1-o_lT-ttoKS-nbozj2RQusGoi-vm0-XL?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a> |
| 13 | + <a href="https://zenodo.org/records/10617221"><img src="https://img.shields.io/badge/Data-Zenodo-blue.svg" alt="Data"></a> |
| 14 | + <a href="https://github.com/scapeML/scape/blob/main/LICENSE"><img src="https://img.shields.io/github/license/scapeML/scape.svg" alt="License"></a> |
| 15 | +</p> |
29 | 16 |
|
30 | | -We used a Neural Network that takes as inputs drug and cell features and produces signed log-pvalues. Features were computed as the median of the signed log-pvalues grouped by drugs and cells, calculated from the `de_train.parquet` file (on the training data). Additionally, we also estimated log fold-changes from pseudo bulk gene expression, to produce a matrix of the same shape as the de_train data but containing log fold changes (LFCs) on gene expression. We computed also the median per cell/drug as features for LFCs. |
| 17 | +--- |
31 | 18 |
|
32 | | -Similar to a Conditional Variational Autoencoder (CVAE), we used cell features both in the encoding part and the decoding part of the NN. Initially, the model consisted of a CVAE that was trained using the cell features as the conditional features to learn an encoding/decoding function conditioned on the particular cell type. However, after testing different ways to train the CVAE (similar to a beta-VAE with different annealing strategies for the Kullback-Leibler divergence term), we finally considered a non probabilistic NN since we did not find any practical advantage or better generalizations in this case with respect to a simpler non-probabilistic NN, much easier to train. |
| 19 | +## 🏆 Highlights |
33 | 20 |
|
| 21 | +**Award-winning solution** from the NeurIPS 2023 Single-Cell Perturbations Challenge: |
| 22 | +- 🥇 **$10,000 Judges' Prize** for performance and methodology |
| 23 | +- 🥈 **2nd place** in post-hoc analysis |
| 24 | +- 📊 **Top 2%** overall (16th/1097 teams) |
34 | 25 |
|
35 | | -## Install the package |
| 26 | +## 🚀 Quick Start |
36 | 27 |
|
37 | | -``` |
| 28 | +```bash |
38 | 29 | pip install git+https://github.com/scapeML/scape.git |
39 | 30 | ``` |
40 | 31 |
|
41 | | -## Data |
| 32 | +```python |
| 33 | +import scape |
42 | 34 |
|
43 | | -In addition to the data provided by the challenge, we estimated log fold changes from pseudobulk data that we used as additional features. All the data, including the files from the challenge, can be downloaded from the following link: |
| 35 | +# Train model with drug cross-validation |
| 36 | +result = scape.train( |
| 37 | + de_file="de_train.parquet", |
| 38 | + lfc_file="lfc_train.parquet", |
| 39 | + cv_drug="Belinostat", |
| 40 | + n_genes=64 |
| 41 | +) |
44 | 42 |
|
45 | | -- https://zenodo.org/records/10617221 |
| 43 | +# Visualize performance vs baselines |
| 44 | +scape.plot_result(result) |
| 45 | +``` |
46 | 46 |
|
47 | | -## Usage |
| 47 | +## 📋 Overview |
48 | 48 |
|
| 49 | +ScAPE is a lightweight neural network (~9.6M parameters) that predicts differential gene expression in response to drug perturbations. It is built with **Keras 3**, so it can run on the TensorFlow, JAX, or PyTorch backend. |
49 | 50 |
|
50 | | -### Training |
| 51 | +### Key Features |
51 | 52 |
|
52 | | -ScAPE can be used also as a command line tool. The following command can be used to train a model: |
| 53 | +- 🎯 **Single or Multi-Task Learning**: Predict p-values only or jointly with fold changes |
| 54 | +- 🔄 **Multi-Backend Support**: Choose between TensorFlow, JAX, or PyTorch |
| 55 | +- 🎲 **Built-in Ensemble Methods**: Simple blending for robust predictions |
| 56 | +- 📊 **Cross-Validation**: Cell-type and drug-based validation strategies |
| 57 | +- ⚡ **Efficient**: Handles ~18,000 genes with median-based feature engineering |
53 | 58 |
|
54 | | -``` |
55 | | - python -m scape train --epochs <num-epochs> --n-genes <num-genes> --cv-cell <cell-type> --cv-drug <sm-name> --output-dir <directory> <de-file> <lfc-file> |
56 | | -``` |
57 | | -For example, in order to leave Belinostat out as a drug for cross-validation (using NK cells by default), we can run the following command: |
58 | | - |
59 | | -``` |
60 | | -python -m scape train --n-genes 64 --cv-drug Belinostat --output-dir models de_train.parquet lfc_train.parquet |
61 | | -``` |
| 59 | +### Architecture |
| 60 | + |
| 61 | +The model uses median-based feature engineering: for each drug and for each cell type, we compute the median signed log p-value and the median log fold change of every gene over the training data. These medians act as compact drug and cell signatures for the ~18,000-gene output space while preserving biological signal. |
| 62 | +<p align="center"> |
| 63 | + <img alt="Architecture" src="docs/assets/nn-architecture.png" width="600"> |
| 64 | +</p> |
| 65 | +Key design choices: |
62 | 66 |
|
63 | | -### Development and Testing |
| 67 | +- **Dual conditioning**: Cell features are used in both encoder and decoder (similar to CVAEs) |
| 68 | +- **Non-probabilistic**: After testing VAE variants, we found that a simpler deterministic NN performed equally well and was much easier to train |
| 69 | +- **Multi-source features**: Combines signed log p-values and log fold changes for richer representations |
64 | 70 |
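The median-based features described in the Architecture section can be sketched with pandas. This is a minimal illustration under stated assumptions, not ScAPE's actual implementation: the toy frame, the column names `cell_type` and `sm_name`, the gene names, and the helper `median_signatures` are all hypothetical.

```python
import pandas as pd


def median_signatures(df: pd.DataFrame, by: str, gene_cols: list[str]) -> pd.DataFrame:
    """Median of each gene column within each group of `by` (hypothetical helper)."""
    return df.groupby(by)[gene_cols].median()


# Toy stand-in for the training table: one row per cell/drug pair, one column per gene.
de = pd.DataFrame({
    "cell_type": ["NK cells", "NK cells", "B cells", "B cells"],
    "sm_name": ["Belinostat", "Prednisolone", "Belinostat", "Prednisolone"],
    "GENE_A": [1.0, 3.0, -2.0, 0.0],
    "GENE_B": [0.5, -0.5, 2.5, 1.5],
})
genes = ["GENE_A", "GENE_B"]

drug_features = median_signatures(de, "sm_name", genes)    # one row per drug
cell_features = median_signatures(de, "cell_type", genes)  # one row per cell type
print(drug_features)
print(cell_features)
```

The same grouping can be applied to a log fold change table of the same shape to obtain the second feature source.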
|
65 | | -ScAPE uses [pixi](https://pixi.sh/) for dependency management. To set up the development environment: |
66 | 71 |
|
67 | | -```bash |
68 | | -# Install dependencies |
69 | | -pixi install |
| 72 | +## 💻 Usage |
70 | 73 |
|
71 | | -# Activate development environment |
72 | | -pixi shell -e dev |
| 74 | +### Basic Training |
73 | 75 |
|
74 | | -# Run tests (requires JAX backend) |
75 | | -KERAS_BACKEND=jax pixi run -e dev test |
| 76 | +```bash |
| 77 | +# Command line |
| 78 | +python -m scape train --n-genes 64 --cv-drug Belinostat de_train.parquet lfc_train.parquet |
| 79 | + |
| 80 | +# Python API |
| 81 | +import scape |
| 82 | + |
| 83 | +model = scape.model.create_default_model( |
| 84 | + n_genes=64, |
| 85 | + df_de=de_data, |
| 86 | + df_lfc=lfc_data |
| 87 | +) |
| 88 | + |
| 89 | +results = model.train( |
| 90 | + val_cells=['NK cells'], |
| 91 | + val_drugs=['Belinostat'], |
| 92 | + epochs=600 |
| 93 | +) |
| 94 | +``` |
76 | 95 |
|
77 | | -# Run specific test file with verbose output |
78 | | -KERAS_BACKEND=jax pixi run -e dev test tests/test_multitask.py -v |
| 96 | +### Multi-Task Learning |
79 | 97 |
|
80 | | -# Run linting and formatting |
81 | | -pixi run lint |
82 | | -pixi run format |
| 98 | +Configure the model to jointly predict both p-values and fold changes: |
| 99 | + |
| 100 | +```python |
| 101 | +# Multi-task configuration with optimal weights |
| 102 | +model.model.compile( |
| 103 | + optimizer=optimizer, |
| 104 | + loss={'slogpval': mrrmse, 'lfc': mrrmse}, |
| 105 | + loss_weights={'slogpval': 0.8, 'lfc': 0.2} |
| 106 | +) |
83 | 107 | ``` |
84 | 108 |
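As a rough illustration of what the `loss_weights` setting does: Keras minimizes the weighted sum of the per-output losses. The sketch below reproduces that arithmetic in plain Python. The per-output loss values are made up for the example, and `weighted_total_loss` is a hypothetical helper, not part of ScAPE or Keras.

```python
def weighted_total_loss(losses: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-output losses (hypothetical helper mirroring loss_weights)."""
    return sum(weights[name] * losses[name] for name in weights)


# Weights taken from the multi-task configuration; loss values are illustrative only.
loss_weights = {"slogpval": 0.8, "lfc": 0.2}
losses = {"slogpval": 1.2, "lfc": 0.4}

total = weighted_total_loss(losses, loss_weights)
print(f"total loss: {total:.2f}")
```

With these weights, the signed log p-value output dominates the training signal, and the fold change output acts as an auxiliary task.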
|
| 109 | +### Backend Selection |
85 | 110 |
|
86 | | -## Interpreting error plots |
| 111 | +```bash |
| 112 | +# Use JAX backend (recommended for performance) |
| 113 | +KERAS_BACKEND=jax python -m scape train ... |
87 | 114 |
|
88 | | -The method `scape.util.plot_result(result, legend=True)` can be used to plot the CV results after training a model, as shown in the [quick-start notebook](https://github.com/scapeML/scape/blob/main/docs/notebooks/quick-start.ipynb). The following figure shows an example of the output of this method: |
| 115 | +# Use TensorFlow backend |
| 116 | +KERAS_BACKEND=tensorflow python -m scape train ... |
89 | 117 |
|
90 | | -<p align="center"> |
91 | | - <img alt="prednisolone-cv-nk" src="docs/assets/example-nk-prednisolone.png" width="720" style="max-width: 100%; height: auto;"> |
92 | | -</p> |
| 118 | +# Use PyTorch backend |
| 119 | +KERAS_BACKEND=torch python -m scape train ... |
| 120 | +``` |
93 | 121 |
|
| 122 | +### Ensemble Predictions |
94 | 123 |
|
95 | | -The plot shows two different baselines. The top dotted line shows the performance of a model that always predicts 0s, as the one used as baseline in the Kaggle challenge. The bottom dotted line shows the performance of a model that always predicts the median of the training data (grouped by drug type). This baseline is useful to compare the performance of the model with a simple model that does not learn anything. The solid line indicates the best validation error. |
| 124 | +Improve robustness with simple ensemble blending: |
96 | 125 |
|
97 | | -The title of the plot indicates how much better the model is with respect to the baselines. The percentages are computed as follows: |
| 126 | +```python |
| 127 | +from sklearn.model_selection import KFold |
| 128 | +import numpy as np |
98 | 129 |
|
99 | | -``` |
100 | | -improvement = 100 * (1 - (current / baseline_error)) |
101 | | -``` |
| 130 | +# Train multiple models with K-fold |
| 131 | +predictions = [] |
| 132 | +for train_idx, val_idx in KFold(n_splits=5).split(all_combinations): |
| 133 | + model = scape.model.create_default_model(...) |
| 134 | + model.train(...) |
| 135 | + predictions.append(model.predict(test_combinations)) |
102 | 136 |
|
103 | | -For example, in the figure above, the trained model is 25.31% better than the baseline that always predicts 0s, and only 5.48% better than the baseline that always predicts the median of the signed log p-values across drugs in the training data. |
| 137 | +# Blend predictions (median) |
| 138 | +ensemble_pred = np.median([p.values for p in predictions], axis=0) |
| 139 | +``` |
104 | 140 |
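The blending step can be tried in isolation with toy data. The sketch below assumes that every model returns a DataFrame with the same index and columns; the row labels, gene names, and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Three hypothetical models predicting two cell/drug pairs over two genes.
index = ["NK cells|Belinostat", "B cells|Belinostat"]
columns = ["GENE_A", "GENE_B"]
predictions = [
    pd.DataFrame([[0.0, 1.0], [2.0, -1.0]], index=index, columns=columns),
    pd.DataFrame([[0.2, 0.8], [1.0, -0.5]], index=index, columns=columns),
    pd.DataFrame([[5.0, 0.9], [1.5, -3.0]], index=index, columns=columns),  # one divergent model
]

# Stack to shape (n_models, n_rows, n_genes) and take the element-wise median.
stacked = np.stack([p.values for p in predictions])
ensemble_pred = pd.DataFrame(np.median(stacked, axis=0), index=index, columns=columns)
print(ensemble_pred)
```

The median is less sensitive than the mean to a single model that diverges on a given entry, which is why it is a reasonable default for simple blending.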
|
105 | | -## Notebooks |
| 141 | +### Advanced Configuration |
| 142 | + |
| 143 | +```python |
| 144 | +# Custom architecture |
| 145 | +config = { |
| 146 | + "encoder_hidden_layer_sizes": [128, 128], |
| 147 | + "decoder_hidden_layer_sizes": [128, 512], |
| 148 | + "outputs": { |
| 149 | + "slogpval": (64, "linear"), |
| 150 | + "lfc": (64, "linear"), # Multi-task |
| 151 | + }, |
| 152 | + "noise": 0.01, |
| 153 | + "dropout": 0.05 |
| 154 | +} |
106 | 155 |
|
107 | | -- Basic usage: https://github.com/scapeML/scape/blob/main/docs/notebooks/quick-start.ipynb |
108 | | -- Training pipeline: https://github.com/scapeML/scape/blob/main/docs/notebooks/solution.ipynb |
109 | | -- Performance with different top genes: https://github.com/scapeML/scape/blob/main/docs/notebooks/subset-genes.ipynb |
| 156 | +model = scape.model.create_model( |
| 157 | + n_genes=64, |
| 158 | + df_de=de_data, |
| 159 | + df_lfc=lfc_data, |
| 160 | + config=config |
| 161 | +) |
| 162 | +``` |
110 | 163 |
|
111 | | -## Final report |
| 164 | +## 📊 Performance Visualization |
112 | 165 |
|
113 | | -Prior to the refactor and creation of the ScAPE package, we used a simplified version of the model to explore different questions and to do hyperparameter tuning. The notebook used to generate the final report can be found in the following link: |
| 166 | +<p align="center"> |
| 167 | + <img alt="Performance Example" src="docs/assets/example-nk-prednisolone.png" width="600"> |
| 168 | +</p> |
114 | 169 |
|
115 | | -- https://github.com/scapeML/scape/blob/main/docs/report.pdf |
| 170 | +Track model improvement over baselines: |
| 171 | +- **Zero baseline**: Always predicts 0 (competition baseline) |
| 172 | +- **Median baseline**: Predicts drug-specific medians |
116 | 173 |
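The comparison against the two baselines can be sketched as follows. The improvement percentage uses the formula from the earlier README text, `100 * (1 - current / baseline_error)`. The `mrrmse` function here is a NumPy reimplementation of mean rowwise RMSE written for this example, and all arrays are toy values, so treat it as an assumption-laden sketch rather than ScAPE's own code.

```python
import numpy as np


def mrrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean rowwise RMSE: RMSE over genes for each row, averaged over rows."""
    return float(np.mean(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=1))))


def improvement(current: float, baseline_error: float) -> float:
    """Percentage improvement of `current` over `baseline_error`."""
    return 100.0 * (1.0 - current / baseline_error)


# Toy validation targets (rows: cell/drug pairs, columns: genes).
y_true = np.array([[3.0, 4.0], [0.0, 0.0]])
y_model = np.array([[3.0, 1.0], [0.0, 0.0]])   # hypothetical model output
y_median = np.array([[1.0, 2.0], [1.0, 2.0]])  # hypothetical drug-wise medians

zero_error = mrrmse(y_true, np.zeros_like(y_true))  # zero baseline
median_error = mrrmse(y_true, y_median)             # median baseline
model_error = mrrmse(y_true, y_model)

print(f"vs zero baseline:   {improvement(model_error, zero_error):.2f}%")
print(f"vs median baseline: {improvement(model_error, median_error):.2f}%")
```

A positive percentage means the model's validation error is lower than the baseline's; a value near zero means the model adds little over that baseline.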
|
| 174 | +## 📚 Resources |
117 | 175 |
|
118 | | -## Reproducibility |
| 176 | +- 📓 [Quick Start Tutorial](https://github.com/scapeML/scape/blob/main/docs/notebooks/quick-start.ipynb) |
| 177 | +- 📓 [Training Pipeline](https://github.com/scapeML/scape/blob/main/docs/notebooks/solution.ipynb) |
| 178 | +- 📓 [Google Colab Demo](https://colab.research.google.com/drive/1-o_lT-ttoKS-nbozj2RQusGoi-vm0-XL?usp=sharing) |
| 179 | +- 📄 [Technical Report](https://github.com/scapeML/scape/blob/main/docs/report.pdf) |
| 180 | +- 💾 [Dataset (Zenodo)](https://zenodo.org/records/10617221) |
119 | 181 |
|
120 | | -The following notebook can be used to reproduce the results of our submission: https://github.com/scapeML/scape/blob/main/docs/notebooks/solution.ipynb. |
| 182 | +## 🛠️ Development |
121 | 183 |
|
122 | | -In addition, we've created a [Google Colab](https://colab.research.google.com/drive/1-o_lT-ttoKS-nbozj2RQusGoi-vm0-XL?usp=sharing) notebook showing how to install, train and predict using the ScAPE package. |
| 184 | +```bash |
| 185 | +# Setup with pixi |
| 186 | +pixi install |
| 187 | +pixi shell -e dev |
123 | 188 |
|
124 | | -## Citation |
| 189 | +# Run tests (JAX backend recommended) |
| 190 | +KERAS_BACKEND=jax pixi run -e dev test |
125 | 191 |
|
| 192 | +# Lint & format |
| 193 | +pixi run lint |
| 194 | +pixi run format |
126 | 195 | ``` |
| 196 | + |
| 197 | +## 📖 Citation |
| 198 | + |
| 199 | +```bibtex |
127 | 200 | @misc{rodriguezmier24scape, |
128 | | - author = {Rodriguez-Mier, Pablo and Garrido-Rodriguez, Martin}, |
129 | | - title = {ScAPE: Single-cell Analysis of Perturbational Effects}, |
130 | | - year = {2024}, |
131 | | - url = {https://github.com/scapeML/scape} |
| 201 | + author = {Rodriguez-Mier, Pablo and Garrido-Rodriguez, Martin}, |
| 202 | + title = {ScAPE: Single-cell Analysis of Perturbational Effects}, |
| 203 | + year = {2024}, |
| 204 | + url = {https://github.com/scapeML/scape} |
132 | 205 | } |
133 | 206 | ``` |
| 207 | + |