Vox Box

A text-to-speech and speech-to-text server compatible with the OpenAI API, backed by Whisper, FunASR, Bark, Dia, and CosyVoice.

Requirements

Python 3.10 or greater
NVIDIA GPU support requires the following NVIDIA libraries to be installed:
- cuBLAS for CUDA 12
- cuDNN 9 for CUDA 12

Installation

You can install the project using pip:

pip install vox-box

# On macOS, manually install `openfst`, `pynini`, and `wetextprocessing` after installing `vox-box` so that `cosyvoice` works:
brew install openfst
export CPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include
export LIBRARY_PATH=$(brew --prefix openfst)/lib
pip install pynini==2.1.6
pip install wetextprocessing==1.0.4.1

Usage

vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir ./cache/data-dir --host 0.0.0.0 --port 80

# Windows
vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\Users\michelia\AppData\Roaming\vox-box --host 0.0.0.0 --port 8082

Options

-d, --debug: Enable debug mode.
--host: Host to bind the server to. Default is 0.0.0.0.
--port: Port to bind the server to. Default is 80.
--model: Path to the model.
--device: Device to run the model on, e.g., cuda:0. Default is cpu.
--huggingface-repo-id: Hugging Face repo ID for the model.
--model-scope-model-id: ModelScope model ID for the model.
--data-dir: Directory to store downloaded model data. Default is OS specific.
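
When the server is launched from a script rather than a shell, the options above can be assembled into an argument list. The following is a minimal sketch and not part of vox-box itself: the helper name `build_start_command` is hypothetical, it uses only the flags listed above, and it assumes that at least one model source must be given and that `--debug` is accepted after `start`.

```python
from typing import List, Optional


def build_start_command(
    huggingface_repo_id: Optional[str] = None,
    model_scope_model_id: Optional[str] = None,
    model: Optional[str] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
    device: Optional[str] = None,
    data_dir: Optional[str] = None,
    debug: bool = False,
) -> List[str]:
    """Build an argv list for `vox-box start` (hypothetical helper)."""
    # Assumption: the server needs at least one model source.
    if not (huggingface_repo_id or model_scope_model_id or model):
        raise ValueError("give a Hugging Face repo ID, a ModelScope ID, or a model path")

    cmd = ["vox-box", "start"]
    if debug:
        # Assumption: the debug flag is accepted after the subcommand.
        cmd.append("--debug")

    options = {
        "--huggingface-repo-id": huggingface_repo_id,
        "--model-scope-model-id": model_scope_model_id,
        "--model": model,
        "--host": host,
        "--port": port,
        "--device": device,
        "--data-dir": data_dir,
    }
    # Flags left as None are omitted so the CLI falls back to its defaults.
    for flag, value in options.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list instead of a single string avoids shell quoting problems with paths such as the Windows data directory shown under Usage.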

Supported Models

Model	Type	Link	Verified Platforms
Faster-whisper-large-v3	speech-to-text	Hugging Face, ModelScope	Linux ✅, Windows ✅, macOS ✅
Faster-whisper-large-v2	speech-to-text	Hugging Face, ModelScope	Linux ✅, Windows ✅, macOS ✅
Faster-whisper-large-v1	speech-to-text	Hugging Face, ModelScope
Faster-whisper-medium	speech-to-text	Hugging Face, ModelScope	Linux ✅, Windows ✅, macOS ✅
Faster-whisper-medium.en	speech-to-text	Hugging Face, ModelScope
Faster-whisper-small	speech-to-text	Hugging Face, ModelScope	Linux ✅, Windows ✅, macOS ✅
Faster-whisper-small.en	speech-to-text	Hugging Face, ModelScope
Faster-distil-whisper-large-v3	speech-to-text	Hugging Face, ModelScope	macOS ✅
Faster-distil-whisper-large-v2	speech-to-text	Hugging Face, ModelScope	macOS ✅
Faster-distil-whisper-medium.en	speech-to-text	Hugging Face, ModelScope
Faster-whisper-tiny	speech-to-text	Hugging Face, ModelScope
Faster-whisper-tiny.en	speech-to-text	Hugging Face, ModelScope
Paraformer-zh	speech-to-text	Hugging Face, ModelScope
Paraformer-zh-streaming	speech-to-text	Hugging Face, ModelScope	Linux ✅, macOS ✅
Paraformer-en	speech-to-text	Hugging Face, ModelScope
Conformer-en	speech-to-text	Hugging Face, ModelScope
SenseVoiceSmall	speech-to-text	Hugging Face, ModelScope	Linux ✅, Windows ✅, macOS ✅
Bark	text-to-speech	Hugging Face	Linux ✅, Windows, macOS ✅
Bark-small	text-to-speech	Hugging Face	Linux ✅, Windows, macOS ✅
CosyVoice2-0.5B	text-to-speech	Hugging Face, ModelScope	Linux (ARM not supported) ✅, Windows (not supported), macOS ✅
CosyVoice-300M-Instruct	text-to-speech	Hugging Face, ModelScope	Linux (ARM not supported) ✅, Windows (not supported), macOS ✅
CosyVoice-300M-SFT	text-to-speech	Hugging Face, ModelScope	Linux (ARM not supported) ✅, Windows (not supported), macOS ✅
CosyVoice-300M	text-to-speech	Hugging Face, ModelScope	Linux (ARM not supported) ✅, Windows (not supported), macOS ✅
CosyVoice-300M-25Hz	text-to-speech	ModelScope	Linux (ARM not supported) ✅, Windows (not supported), macOS ✅
Dia-1.6B	text-to-speech	Hugging Face, ModelScope	Linux (ARM not supported) ✅, Windows (not supported), macOS ✅

Supported APIs

Create speech

Endpoint: POST /v1/audio/speech

Generates audio from the input text. Compatible with the OpenAI audio/speech API.

Example Request:

curl http://localhost/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cosyvoice",
    "input": "Hello world",
    "voice": "English Female"
  }' \
  --output speech.mp3

Response: The audio file content.
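
The same request can be sent from Python with only the standard library. This is a minimal sketch mirroring the curl example above: the helper names `build_speech_request` and `save_speech` are hypothetical, the base URL assumes a server on port 80 of localhost, and whether the server actually checks the `Authorization` header is not stated here, so the key is optional.

```python
import json
import urllib.request


def build_speech_request(base_url, model, text, voice, api_key=""):
    """Build a POST request for /v1/audio/speech (hypothetical helper)."""
    payload = {"model": model, "input": text, "voice": voice}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: a bearer token is sent only when one is provided.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def save_speech(request, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

For example, `save_speech(build_speech_request("http://localhost", "cosyvoice", "Hello world", "English Female"), "speech.mp3")` corresponds to the curl command above and requires a running server.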

Create transcription

Endpoint: POST /v1/audio/transcriptions

Transcribes audio into the input language. Compatible with the OpenAI audio/transcriptions API.

Example Request:

curl http://localhost/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="whisper-large-v3"

Response:

{
  "text": "Hello world."
}
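
A transcription upload can also be sketched in Python using only the standard library, which means encoding the multipart form by hand. The helper names below (`encode_multipart`, `build_transcription_request`, `transcribe`) are hypothetical, the base URL is an assumption, and the file part is sent as `application/octet-stream` as a simplification; the response is assumed to be the JSON object shown above.

```python
import json
import uuid
import urllib.request
from pathlib import Path


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(base_url, model, audio_path, api_key=""):
    """Build a POST request for /v1/audio/transcriptions (hypothetical helper)."""
    path = Path(audio_path)
    body, content_type = encode_multipart(
        {"model": model}, "file", path.name, path.read_bytes()
    )
    headers = {"Content-Type": content_type}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers=headers,
        method="POST",
    )


def transcribe(request):
    """Send the request and return the "text" field of the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

Unlike the curl example, the `Content-Type` header here must carry the boundary string, because the body is assembled manually rather than by the HTTP client.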

List Models

Endpoint: GET /v1/models

Returns the currently running models.

Get Model

Endpoint: GET /v1/models/{model_id}

Returns the currently running model with the given ID.

Get Voices

Endpoint: GET /v1/voices

Returns the voices supported by the currently running model.

Health Check

Endpoint: GET /health

Returns the health check result of Vox Box.
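
The four GET endpoints above (models, model, voices, health) can be queried with a few lines of Python. This is a minimal sketch: the helper names `endpoint_url` and `fetch` are hypothetical, the base URL is an assumption, and because the response formats are not described here the body is returned as raw bytes rather than parsed.

```python
import urllib.request

# Paths taken from the endpoint list above.
ENDPOINTS = {
    "models": "/v1/models",
    "voices": "/v1/voices",
    "health": "/health",
}


def endpoint_url(base_url, name, model_id=None):
    """Return the full URL for one of the read-only endpoints."""
    base = base_url.rstrip("/")
    if name == "model":
        if not model_id:
            raise ValueError("model_id is required for the 'model' endpoint")
        return f"{base}/v1/models/{model_id}"
    return base + ENDPOINTS[name]


def fetch(url, timeout=10.0):
    """Send a GET request and return (HTTP status, raw response body)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read()
```

For instance, `fetch(endpoint_url("http://localhost", "health"))` could serve as a readiness probe before sending speech or transcription requests.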
