This repo is a personal laboratory for training autoregressive text-audio models.
Assume everything will change; quality right now is pretty mediocre, but it will get better.
A distillation of Kokoro TTS into the RQ Transformer architecture, released at 70M and 150M parameter scales.
For MLX inference on Apple Silicon, you'll need a working Python installation. See the mlx_inference folder for setup docs!
# tl;dr
```bash
uvx --from smoltts_mlx smoltts-server
```

Candle.rs docs coming soon.
As of February 2025, this project uses Kyutai's pretrained Mimi codec, chosen for its low framerate (12.5 Hz), high compression ratio, and streaming support.
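For orientation, here is a minimal sketch of encoding audio into Mimi codes using the Hugging Face `transformers` port of the codec. The `kyutai/mimi` checkpoint name and the calls below are assumptions about that library, not this repo's own data pipeline.

```python
# Sketch: raw audio -> Mimi codes via transformers' Mimi port.
# Assumes the `kyutai/mimi` checkpoint; the repo's pipeline may differ.
import torch
from transformers import MimiModel, AutoFeatureExtractor

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of 24 kHz mono audio (silence) as a stand-in for a real clip
waveform = torch.zeros(24000)
inputs = feature_extractor(
    raw_audio=waveform.numpy(), sampling_rate=24000, return_tensors="pt"
)

with torch.no_grad():
    encoded = model.encode(inputs["input_values"])

# (batch, num_quantizers, frames); at 12.5 Hz, one second is ~12-13 frames
print(encoded.audio_codes.shape)
```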
`projectgutenberg-kokoro_v1-mimi`:
- ~5500 hours of synthetic audio generated with Kokoro v1 for US and UK English.
- 3 million utterances of sentences from Project Gutenberg, mostly 3-15 s each; 3.29 GB compressed with Mimi.
- 11 speakers.
For convenience, we serialize popular open TTS benchmark datasets as Mimi codes, which provides training targets directly and compresses the file size by ~500x (see the loading sketch after the list):
- LibriTTS-R encoded with Mimi codec. ~460 hours of data.
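As a rough sketch of what consuming one of these serialized datasets looks like, assuming it is published on the Hugging Face Hub (the repo id below is a placeholder, not a real dataset name):

```python
# Sketch: loading a Mimi-serialized dataset; the repo id is hypothetical.
from datasets import load_dataset

ds = load_dataset("your-username/projectgutenberg-kokoro_v1-mimi", split="train")
row = ds[0]
# Rows carry precomputed Mimi codes instead of raw audio, so nothing here
# needs librosa or any audio decoding at training time.
print(row.keys())
```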
Unfortunately, HuggingFace Datasets with audio columns require librosa, which has a hard Python 3.9 dependency for inexplicable reasons. If you are only working with Mimi codes rather than creating a new dataset from raw audio, feel free to ignore this.
Please use uv.
```bash
# If you are not making new audio datasets, feel free to use a sane Python version instead
uv sync
uv pip install -e .
```

Create a `.env` file and add:

```
HUGGINGFACE_TOKEN=sk-placeholder
```

For the dataset and init, see data_pipeline/README.md.
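One way the token could be picked up at runtime, assuming `python-dotenv` and `huggingface_hub` are available (an illustration, not necessarily how this repo's scripts load it):

```python
# Sketch: reading HUGGINGFACE_TOKEN from .env and authenticating with the Hub.
# Assumes python-dotenv and huggingface_hub; the repo's scripts may differ.
import os
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()  # reads .env from the current working directory
login(token=os.environ["HUGGINGFACE_TOKEN"])
```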
This architecture is best known as the neural-codec seq2seq backbone of:
- Fish Speech TTS (described in their paper as "DualAR", or dual-autoregressive)
- Kyutai's Moshi model, early in pretraining, before its adaptation to duplex audio.
Models trained here will be compatible with my DualAR fish-speech.rs inference engine.
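For intuition, here is a heavily simplified sketch of the dual-autoregressive layout: a slow transformer runs over frames (time), and a fast transformer runs over the residual codebooks within each frame. All sizes and layer choices below are illustrative, not this repo's actual model.

```python
# Illustrative DualAR / RQ Transformer skeleton; hyperparameters are made up
# and causal masking / text conditioning are omitted for brevity.
import torch
import torch.nn as nn

class DualARSketch(nn.Module):
    def __init__(self, n_codebooks=8, codebook_size=2048, d_model=512):
        super().__init__()
        self.n_codebooks = n_codebooks
        # One embedding table per residual codebook
        self.code_emb = nn.ModuleList(
            nn.Embedding(codebook_size, d_model) for _ in range(n_codebooks)
        )
        slow_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        fast_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.slow = nn.TransformerEncoder(slow_layer, num_layers=6)  # over frames (time)
        self.fast = nn.TransformerEncoder(fast_layer, num_layers=2)  # over codebooks
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, codes):
        # codes: (batch, time, n_codebooks) integer Mimi codes
        b, t, _ = codes.shape
        # Each frame is embedded as the sum of its per-codebook embeddings
        frames = sum(emb(codes[..., i]) for i, emb in enumerate(self.code_emb))
        h = self.slow(frames)  # (batch, time, d_model): one hidden state per frame
        # Fast-stage input per frame: [frame state, codebook 0, ..., codebook C-2],
        # so position k predicts codebook k (teacher-forcing layout).
        fast_in = torch.stack(
            [h] + [emb(codes[..., i]) for i, emb in enumerate(self.code_emb[:-1])],
            dim=2,
        )  # (batch, time, n_codebooks, d_model)
        fast_out = self.fast(fast_in.reshape(b * t, self.n_codebooks, -1))
        logits = self.head(fast_out)  # (batch*time, n_codebooks, codebook_size)
        return logits.reshape(b, t, self.n_codebooks, -1)
```

At inference time, the fast stage would instead decode each frame's codebooks one at a time, feeding each sampled code back in, while the slow stage advances once per frame.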