This repository contains the code, experiments, and datasets associated with the paper "Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers". The research focuses on stabilizing machine learning models in finance against distribution shifts and sudden macroeconomic shocks in developing economies by leveraging synthetic outliers.
- Notebooks and scripts for generating synthetic outliers, training models, and evaluating stability.
- Datasets:
  - Preprocessed open dataset: Lending Club.
  - Synthetic data, with and without outliers, generated by zGAN.
- Metrics:
  - Stabilization Score (SS) – measures the relative performance drop under shocks, normalized by covariate drift.
  - Stabilization Uplift (SU) – weight-adjusted metric for comparing two models pre- and post-shock.
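The exact definitions of SS and SU are given in the paper; purely as an illustration, the sketch below assumes SS is the relative AUC drop divided by a Population Stability Index (PSI) drift measure. The function names and the normalization are assumptions for this sketch, not the paper's formulas.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index: a common scalar measure of covariate drift
    between a reference sample and a shifted sample of one feature."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range
    def frac(sample, b):
        in_bin = sum(
            1 for x in sample
            if lo + b * width <= x < lo + (b + 1) * width or (b == bins - 1 and x == hi)
        )
        return max(in_bin / len(sample), 1e-6)  # floor to keep log() finite
    return sum(
        (frac(actual, b) - frac(expected, b)) * math.log(frac(actual, b) / frac(expected, b))
        for b in range(bins)
    )

def stabilization_score(auc_pre, auc_post, drift):
    """Illustrative SS: relative AUC drop after a shock, normalized by drift."""
    return ((auc_pre - auc_post) / auc_pre) / max(drift, 1e-6)
```

A PSI near zero means the feature distribution is unchanged; larger values indicate stronger drift, so the same AUC drop yields a smaller (better) score under heavy drift.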
- Stability metrics for model drift evaluation under shocks: introduces a two-level evaluation framework (SS and SU) to quantify model performance under sudden distribution shifts.
- Synthetic outliers for model drift mitigation: demonstrates that carefully generated synthetic outliers improve model stability when combined with real and synthetic data.
- Focus on developing economies: experiments are conducted on datasets from markets where macroeconomic shocks are frequent and unpredictable.
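In this repository the synthetic outliers come from zGAN, whose API is not shown here. As a generic stand-in only, the sketch below inflates a random fraction of tabular rows into the distribution tails; the function name, the multiplicative scheme, and all parameters are assumptions for illustration.

```python
import random

def inject_outliers(rows, frac=0.05, scale=5.0, seed=0):
    """Return a copy of `rows` (lists of floats) in which a random fraction of
    rows is inflated into the distribution tails to act as synthetic outliers."""
    rng = random.Random(seed)
    out = [list(r) for r in rows]
    k = max(1, int(frac * len(out)))          # how many rows become outliers
    for i in rng.sample(range(len(out)), k):  # pick distinct row indices
        out[i] = [x * scale * rng.choice([-1.0, 1.0]) for x in out[i]]
    return out
```

Training data augmented this way exposes the model to tail events before a real shock occurs, which is the intuition behind outlier-based stabilization.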
1. `pre-processing & research.ipynb` – data preprocessing, exploratory analysis, saving prepared datasets.
2. `experiments_on_open_data_catboost.ipynb` – experiments with CatBoost on open data.
3. `experiments_on_open_data_tabpfn.ipynb` – experiments with TabPFN on open data.
4. `experiments_on_data_stability_other_models.ipynb` – data stability and experiments with other models (LightGBM, NGBoost, TabNet, etc.).

- `artifacts/` – experiment results (csv files).
- `data/` – raw, synthetic, and preprocessed data.
- `src/` – utility scripts and modules.
- Install dependencies: `pip install -r src/requirements.txt`
- Run Jupyter Notebook: `jupyter notebook`, then open the desired notebook and follow the cell instructions.
- Open and synthetic datasets for evaluating uplift model stability.
- Example data paths:
  - Preprocessed: `/data/preprocessed/`
  - Synthetic: `/data/synthetic/`
- Comparison of various models (CatBoost, TabPFN, LightGBM, NGBoost, TabNet, HGBoosting, XGBoost, FT-Transformer).
- Analysis of model stability under data changes.
- Experiment results are saved as csv files in the `artifacts/` folder.
Authors:
Ilyas Varshavskiy¹, Bonu Boboeva¹, Shuhrat Khalilbekov¹, Azizjon Azimi¹, Sergey Shulgin¹, Akhlitdin Nizamitdinov¹, Haitz Sáez de Ocáriz Borde²,³
Affiliations:
¹ zypl.ai, ² University of Oxford, ³ University of Cambridge
Questions and suggestions: issues or pull requests are welcome!
NB: Python 3.11+ and Jupyter Notebook support are required.
This repository is licensed under the Creative Commons Attribution (CC BY) License.