Commit a6ccdfb

docs: improve README and logo
1 parent 57fc272 commit a6ccdfb

2 files changed: 153 additions and 79 deletions

README.md

Lines changed: 153 additions & 79 deletions
@@ -1,133 +1,207 @@
11
<p align="center">
2-
<p align="center">
3-
<img alt="logo" src="https://raw.githubusercontent.com/scapeML/scape/main/docs/assets/scape_logo.png" height="200">
4-
</p>
5-
<h1 align="center" margin=0px>
6-
ScAPE: Single-cell Analysis of Perturbational Effects
7-
</h1>
2+
<img alt="ScAPE Logo" src="https://raw.githubusercontent.com/scapeML/scape/main/docs/assets/scape_logo.png" height="200">
83
</p>
94

10-
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-o_lT-ttoKS-nbozj2RQusGoi-vm0-XL?usp=sharing)
11-
12-
ScAPE is a package implementing the neural network model used in the _Open Problems – Single-Cell Perturbations challenge_, part of the NeurIPS 2023 Competition Track, hosted by Kaggle. The model won one of the $10,000 Judges' Prizes and achieved top 2% performance on the final [test set](https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/leaderboard) (16th position out of 1097 teams).
13-
14-
## Description
5+
<h1 align="center">ScAPE: Single-cell Analysis of Perturbational Effects</h1>
156

16-
In this Kaggle competition, the main objective was to predict the effect of drug perturbations on peripheral blood mononuclear cells (PBMCs) from several patient samples.
17-
18-
Similar to most problems in biological research via omics data, we encountered a high-dimensional feature space (~18k genes) and a low-dimensional observation space (~614 cell/drug combinations) with a low signal-to-noise ratio, where most of the genes show random fluctuations after perturbation. The main data modality to be predicted consisted of signed and log-transformed P-values from differential expression (DE) analysis. In the DE analysis, pseudo-bulk expression profiles from drug-treated cells were compared against the profiles of cells treated with Dimethyl Sulfoxide (DMSO).
19-
20-
<p align="center">
217
<p align="center">
22-
<img alt="description" src="docs/assets/nn-architecture.png" width="720" style="max-width: 100%; height: auto;">
23-
</p>
24-
<p align="center" margin=0px>
25-
Neural network architecture used for the challenge (ScAPE model).
26-
</p>
8+
<strong>Predict drug perturbation effects on single-cell gene expression</strong>
279
</p>
2810

11+
<p align="center">
12+
<a href="https://colab.research.google.com/drive/1-o_lT-ttoKS-nbozj2RQusGoi-vm0-XL?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
13+
<a href="https://zenodo.org/records/10617221"><img src="https://img.shields.io/badge/Data-Zenodo-blue.svg" alt="Data"></a>
14+
<a href="https://github.com/scapeML/scape/blob/main/LICENSE"><img src="https://img.shields.io/github/license/scapeML/scape.svg" alt="License"></a>
15+
</p>
2916

30-
We used a neural network that takes drug and cell features as inputs and produces signed log p-values. Features were computed as the median of the signed log p-values grouped by drug and by cell type, calculated from the `de_train.parquet` file (the training data). We also estimated log fold changes (LFCs) from pseudo-bulk gene expression, producing a matrix with the same shape as the `de_train` data, and computed the per-cell and per-drug medians of the LFCs as additional features.
17+
---
3118

32-
Similar to a Conditional Variational Autoencoder (CVAE), we used cell features in both the encoding and the decoding part of the NN. Initially, the model was a CVAE trained with the cell features as the conditional features, to learn an encoding/decoding function conditioned on the particular cell type. After testing different ways to train the CVAE (similar to a beta-VAE, with different annealing strategies for the Kullback-Leibler divergence term), we settled on a non-probabilistic NN: the CVAE showed no practical advantage or better generalization in this case, and the simpler model is much easier to train.
19+
## 🏆 Highlights
3320

21+
**Award-winning solution** from the NeurIPS 2023 Single-Cell Perturbations Challenge:
22+
- 🥇 **$10,000 Judges' Prize** for performance and methodology
23+
- 🥈 **2nd place** in post-hoc analysis
24+
- 📊 **Top 2%** overall (16th/1097 teams)
3425

35-
## Install the package
26+
## 🚀 Quick Start
3627

37-
```
28+
```bash
3829
pip install git+https://github.com/scapeML/scape.git
3930
```
4031

41-
## Data
32+
```python
33+
import scape
4234

43-
In addition to the data provided by the challenge, we estimated log fold changes from pseudobulk data that we used as additional features. All the data, including the files from the challenge, can be downloaded from the following link:
35+
# Train model with drug cross-validation
36+
result = scape.train(
37+
de_file="de_train.parquet",
38+
lfc_file="lfc_train.parquet",
39+
cv_drug="Belinostat",
40+
n_genes=64
41+
)
4442

45-
- https://zenodo.org/records/10617221
43+
# Visualize performance vs baselines
44+
scape.plot_result(result)
45+
```
4646

47-
## Usage
47+
## 📋 Overview
4848

49+
ScAPE is a lightweight neural network (~9.6M parameters) that predicts differential gene expression in response to drug perturbations. Built with **Keras 3** for multi-backend support (TensorFlow, JAX, PyTorch).
4950

50-
### Training
51+
### Key Features
5152

52-
ScAPE can also be used as a command-line tool. The following command trains a model:
53+
- 🎯 **Single or Multi-Task Learning**: Predict p-values only or jointly with fold changes
54+
- 🔄 **Multi-Backend Support**: Choose between TensorFlow, JAX, or PyTorch
55+
- 🎲 **Built-in Ensemble Methods**: Simple blending for robust predictions
56+
- 📊 **Cross-Validation**: Cell-type and drug-based validation strategies
57+
- **Efficient**: Handles ~18,000 genes with median-based feature engineering
5358

54-
```
55-
python -m scape train --epochs <num-epochs> --n-genes <num-genes> --cv-cell <cell-type> --cv-drug <sm-name> --output-dir <directory> <de-file> <lfc-file>
56-
```
57-
For example, in order to leave Belinostat out as a drug for cross-validation (using NK cells by default), we can run the following command:
58-
59-
```
60-
python -m scape train --n-genes 64 --cv-drug Belinostat --output-dir models de_train.parquet lfc_train.parquet
61-
```
59+
### Architecture
60+
61+
The model uses median-based feature engineering: for each drug and cell type, we compute median differential expression values across the dataset. This reduces ~18,000 genes to manageable drug/cell signatures while preserving biological signal.
62+
<p align="center">
63+
<img alt="Architecture" src="docs/assets/nn-architecture.png" width="600">
64+
</p>
65+
Key design choices:
6266

63-
### Development and Testing
67+
- **Dual conditioning**: Cell features are used in both encoder and decoder (similar to CVAEs)
68+
- **Non-probabilistic**: After testing VAE variants, we found a simpler deterministic NN performed equally well.
69+
- **Multi-source features**: Combines signed log p-values and log fold changes for richer representations
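
The median-based features described in this section can be sketched with pandas. This is a minimal illustration, not the package's implementation: the toy table, the column names `cell_type` and `sm_name`, and the helper `median_signatures` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for a differential expression table: one row per
# (cell type, drug) pair and one column per gene. Values are invented.
de = pd.DataFrame(
    {
        "cell_type": ["NK cells", "NK cells", "T cells", "T cells"],
        "sm_name": ["Belinostat", "Prednisolone", "Belinostat", "Prednisolone"],
        "GENE_A": [1.0, -2.0, 3.0, 0.0],
        "GENE_B": [0.5, 0.5, -1.5, 2.5],
    }
)


def median_signatures(df, by, gene_cols):
    """Hypothetical helper: per-gene median within each group of `by`."""
    return df.groupby(by)[gene_cols].median()


genes = ["GENE_A", "GENE_B"]
drug_features = median_signatures(de, "sm_name", genes)    # one row per drug
cell_features = median_signatures(de, "cell_type", genes)  # one row per cell type

print(drug_features)
print(cell_features)
```

Each drug and each cell type ends up with one vector of per-gene medians, which is the kind of signature the model can take as input for a given drug/cell pair.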
6470

65-
ScAPE uses [pixi](https://pixi.sh/) for dependency management. To set up the development environment:
6671

67-
```bash
68-
# Install dependencies
69-
pixi install
72+
## 💻 Usage
7073

71-
# Activate development environment
72-
pixi shell -e dev
74+
### Basic Training
7375

74-
# Run tests (requires JAX backend)
75-
KERAS_BACKEND=jax pixi run -e dev test
76+
```bash
77+
# Command line
78+
python -m scape train --n-genes 64 --cv-drug Belinostat de_train.parquet lfc_train.parquet
79+
80+
# Python API
81+
import scape
82+
83+
model = scape.model.create_default_model(
84+
n_genes=64,
85+
df_de=de_data,
86+
df_lfc=lfc_data
87+
)
88+
89+
results = model.train(
90+
val_cells=['NK cells'],
91+
val_drugs=['Belinostat'],
92+
epochs=600
93+
)
94+
```
7695

77-
# Run specific test file with verbose output
78-
KERAS_BACKEND=jax pixi run -e dev test tests/test_multitask.py -v
96+
### Multi-Task Learning
7997

80-
# Run linting and formatting
81-
pixi run lint
82-
pixi run format
98+
Configure the model to jointly predict both p-values and fold changes:
99+
100+
```python
101+
# Multi-task configuration with optimal weights
102+
model.model.compile(
103+
optimizer=optimizer,
104+
loss={'slogpval': mrrmse, 'lfc': mrrmse},
105+
loss_weights={'slogpval': 0.8, 'lfc': 0.2}
106+
)
83107
```
84108

109+
### Backend Selection
85110

86-
## Interpreting error plots
111+
```bash
112+
# Use JAX backend (recommended for performance)
113+
KERAS_BACKEND=jax python -m scape train ...
87114

88-
The method `scape.util.plot_result(result, legend=True)` can be used to plot the CV results after training a model, as shown in the [quick-start notebook](https://github.com/scapeML/scape/blob/main/docs/notebooks/quick-start.ipynb). The following figure shows an example of the output of this method:
115+
# Use TensorFlow backend
116+
KERAS_BACKEND=tensorflow python -m scape train ...
89117

90-
<p align="center">
91-
<img alt="prednisolone-cv-nk" src="docs/assets/example-nk-prednisolone.png" width="720" style="max-width: 100%; height: auto;">
92-
</p>
118+
# Use PyTorch backend
119+
KERAS_BACKEND=torch python -m scape train ...
120+
```
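
The same choice can be made from Python, for example in a notebook. A minimal sketch, relying on the fact that Keras 3 reads `KERAS_BACKEND` when it is first imported, so the variable has to be set before `keras` (or any package that imports it, which we assume includes `scape`) is loaded:

```python
import os

# Set the backend before the first `import keras`; Keras 3 does not
# allow switching backends once the package has been imported.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" / "torch"

# Import scape (and therefore keras) only after this point, e.g.:
# import scape
print(os.environ["KERAS_BACKEND"])
```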
93121

122+
### Ensemble Predictions
94123

95-
The plot shows two different baselines. The top dotted line shows the performance of a model that always predicts 0s, as the one used as baseline in the Kaggle challenge. The bottom dotted line shows the performance of a model that always predicts the median of the training data (grouped by drug type). This baseline is useful to compare the performance of the model with a simple model that does not learn anything. The solid line indicates the best validation error.
124+
Improve robustness with simple ensemble blending:
96125

97-
The title of the plot indicates how much better the model is with respect to the baselines. The percentages are computed as follows:
126+
```python
127+
from sklearn.model_selection import KFold
128+
import numpy as np
98129

99-
```
100-
improvement = 100 * (1 - (current / baseline_error))
101-
```
130+
# Train multiple models with K-fold
131+
predictions = []
132+
for train_idx, val_idx in KFold(n_splits=5).split(all_combinations):
133+
model = scape.model.create_default_model(...)
134+
model.train(...)
135+
predictions.append(model.predict(test_combinations))
102136

103-
For example, in the figure above, the trained model is 25.31% better than the baseline that always predicts 0s, and only 5.48% better than the baseline that always predicts the median of the signed log p-values across drugs in the training data.
137+
# Blend predictions (median)
138+
ensemble_pred = np.median([p.values for p in predictions], axis=0)
139+
```
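
To see why a median blend is robust, here is a self-contained toy version of the blending step. The three prediction tables are invented, and the row and column labels are placeholders; only the `np.median(..., axis=0)` call mirrors the snippet above.

```python
import numpy as np
import pandas as pd

index = ["combo_1", "combo_2"]   # placeholder cell/drug combinations
columns = ["GENE_A", "GENE_B"]   # placeholder gene names

# Three models agree closely, except that the third has two outliers.
predictions = [
    pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=index, columns=columns),
    pd.DataFrame([[1.2, 2.2], [3.2, 4.2]], index=index, columns=columns),
    pd.DataFrame([[9.0, 2.1], [3.1, -5.0]], index=index, columns=columns),
]

# Stack to shape (n_models, n_rows, n_genes) and reduce over the models.
stacked = np.stack([p.values for p in predictions])
median_blend = pd.DataFrame(np.median(stacked, axis=0), index=index, columns=columns)
mean_blend = pd.DataFrame(np.mean(stacked, axis=0), index=index, columns=columns)

print(median_blend)
print(mean_blend)
```

The median ignores the outlying values from the third model, whereas the mean is pulled toward them.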
104140

105-
## Notebooks
141+
### Advanced Configuration
142+
143+
```python
144+
# Custom architecture
145+
config = {
146+
"encoder_hidden_layer_sizes": [128, 128],
147+
"decoder_hidden_layer_sizes": [128, 512],
148+
"outputs": {
149+
"slogpval": (64, "linear"),
150+
"lfc": (64, "linear"), # Multi-task
151+
},
152+
"noise": 0.01,
153+
"dropout": 0.05
154+
}
106155

107-
- Basic usage: https://github.com/scapeML/scape/blob/main/docs/notebooks/quick-start.ipynb
108-
- Training pipeline: https://github.com/scapeML/scape/blob/main/docs/notebooks/solution.ipynb
109-
- Performance with different top genes: https://github.com/scapeML/scape/blob/main/docs/notebooks/subset-genes.ipynb
156+
model = scape.model.create_model(
157+
n_genes=64,
158+
df_de=de_data,
159+
df_lfc=lfc_data,
160+
config=config
161+
)
162+
```
110163

111-
## Final report
164+
## 📊 Performance Visualization
112165

113-
Prior to the refactor and creation of the ScAPE package, we used a simplified version of the model to explore different questions and to do hyperparameter tuning. The notebook used to generate the final report can be found in the following link:
166+
<p align="center">
167+
<img alt="Performance Example" src="docs/assets/example-nk-prednisolone.png" width="600">
168+
</p>
114169

115-
- https://github.com/scapeML/scape/blob/main/docs/report.pdf
170+
Track model improvement over baselines:
171+
- **Zero baseline**: Always predicts 0 (competition baseline)
172+
- **Median baseline**: Predicts drug-specific medians
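
How the two baselines turn into an improvement percentage can be sketched as follows. The percentage uses the formula `100 * (1 - current / baseline_error)`. The `mrrmse` function is an assumed NumPy reimplementation of the metric (mean rowwise root mean squared error), written for illustration rather than taken from the package, and all arrays are invented.

```python
import numpy as np


def mrrmse(y_true, y_pred):
    """Assumed metric: RMSE across genes for each row, averaged over rows."""
    return float(np.mean(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=1))))


def improvement(current, baseline_error):
    """Percentage improvement of `current` over a baseline error."""
    return 100.0 * (1.0 - current / baseline_error)


# Held-out rows for one drug (rows: cell/drug combinations, columns: genes).
y_true = np.array([[3.0, -4.0], [6.0, 8.0]])

# Zero baseline: always predict 0.
zero_pred = np.zeros_like(y_true)

# Median baseline: per-gene median of training rows for the same drug.
train_same_drug = np.array([[2.0, -2.0], [4.0, 6.0], [4.0, 2.0]])
median_pred = np.tile(np.median(train_same_drug, axis=0), (len(y_true), 1))

# Stand-in "model" whose error is exactly half of the true values.
model_pred = y_true / 2.0

model_err = mrrmse(y_true, model_pred)
vs_zero = improvement(model_err, mrrmse(y_true, zero_pred))
vs_median = improvement(model_err, mrrmse(y_true, median_pred))
print(f"vs zero baseline: {vs_zero:.2f}%, vs median baseline: {vs_median:.2f}%")
```

The median baseline is usually the harder one to beat, so the improvement over it tends to be the smaller of the two numbers.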
116173

174+
## 📚 Resources
117175

118-
## Reproducibility
176+
- 📓 [Quick Start Tutorial](https://github.com/scapeML/scape/blob/main/docs/notebooks/quick-start.ipynb)
177+
- 📓 [Training Pipeline](https://github.com/scapeML/scape/blob/main/docs/notebooks/solution.ipynb)
178+
- 📓 [Google Colab Demo](https://colab.research.google.com/drive/1-o_lT-ttoKS-nbozj2RQusGoi-vm0-XL?usp=sharing)
179+
- 📄 [Technical Report](https://github.com/scapeML/scape/blob/main/docs/report.pdf)
180+
- 💾 [Dataset (Zenodo)](https://zenodo.org/records/10617221)
119181

120-
The following notebook can be used to reproduce the results of our submission: https://github.com/scapeML/scape/blob/main/docs/notebooks/solution.ipynb.
182+
## 🛠️ Development
121183

122-
In addition, we've created a [Google Colab](https://colab.research.google.com/drive/1-o_lT-ttoKS-nbozj2RQusGoi-vm0-XL?usp=sharing) notebook showing how to install, train and predict using the ScAPE package.
184+
```bash
185+
# Setup with pixi
186+
pixi install
187+
pixi shell -e dev
123188

124-
## Citation
189+
# Run tests (JAX backend recommended)
190+
KERAS_BACKEND=jax pixi run -e dev test
125191

192+
# Lint & format
193+
pixi run lint
194+
pixi run format
126195
```
196+
197+
## 📖 Citation
198+
199+
```bibtex
127200
@misc{rodriguezmier24scape,
128-
author = {Rodriguez-Mier, Pablo and Garrido-Rodriguez, Martin},
129-
title = {ScAPE: Single-cell Analysis of Perturbational Effects},
130-
year = {2024},
131-
url = {https://github.com/scapeML/scape}
201+
author = {Rodriguez-Mier, Pablo and Garrido-Rodriguez, Martin},
202+
title = {ScAPE: Single-cell Analysis of Perturbational Effects},
203+
year = {2024},
204+
url = {https://github.com/scapeML/scape}
132205
}
133206
```
207+

docs/assets/scape_logo.png

363 KB