gpustack/vox-box

Vox Box

A text-to-speech and speech-to-text server compatible with the OpenAI API, backed by Whisper, FunASR, Bark, Dia, and CosyVoice.

Requirements

Installation

You can install the project using pip:

pip install vox-box

# On macOS, manually install `openfst`, `pynini`, and `wetextprocessing` after installing `vox-box` so that `cosyvoice` works:
brew install openfst
export CPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include
export LIBRARY_PATH=$(brew --prefix openfst)/lib
pip install pynini==2.1.6
pip install wetextprocessing==1.0.4.1

Usage

vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir ./cache/data-dir --host 0.0.0.0 --port 80

# Windows
vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\Users\michelia\AppData\Roaming\vox-box --host 0.0.0.0 --port 8082

Options

  • -d, --debug: Enable debug mode.
  • --host: Host to bind the server to. Default is 0.0.0.0.
  • --port: Port to bind the server to. Default is 80.
  • --model: Path to the model.
  • --device: Device to run the model on, e.g., cuda:0. Default is cpu.
  • --huggingface-repo-id: Hugging Face repo ID of the model.
  • --model-scope-model-id: ModelScope model ID of the model.
  • --data-dir: Directory to store downloaded model data. The default is OS-specific.
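
When the server is launched from a script, the options above can be assembled programmatically. The following is a minimal sketch, not part of vox-box itself: `build_start_command` is a hypothetical helper, and the rule that exactly one model source must be given is an assumption rather than documented behavior. It only uses flags listed in this README.

```python
import shlex


def build_start_command(repo_id=None, model_scope_id=None, model=None,
                        host="0.0.0.0", port=80, device="cpu",
                        data_dir=None, debug=False):
    """Return an argv list for `vox-box start` (hypothetical helper)."""
    sources = [
        ("--huggingface-repo-id", repo_id),
        ("--model-scope-model-id", model_scope_id),
        ("--model", model),
    ]
    given = [(flag, value) for flag, value in sources if value]
    # Assumption: exactly one model source is expected per server instance.
    if len(given) != 1:
        raise ValueError("specify exactly one of repo_id, model_scope_id, model")

    argv = ["vox-box", "start"]
    if debug:
        argv.append("--debug")
    argv += list(given[0])
    argv += ["--host", host, "--port", str(port), "--device", device]
    if data_dir:
        argv += ["--data-dir", data_dir]
    return argv


cmd = build_start_command(repo_id="Systran/faster-whisper-small",
                          data_dir="./cache/data-dir")
# Print a shell-ready command line; pass `cmd` to subprocess.run to launch it.
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps paths with spaces (common for Windows data directories) intact when the list is handed to `subprocess.run`.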

Supported Models

| Model | Type | Link | Verified Platforms |
| --- | --- | --- | --- |
| Faster-whisper-large-v3 | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Faster-whisper-large-v2 | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Faster-whisper-large-v1 | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-medium | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Faster-whisper-medium.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-small | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Faster-whisper-small.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-distil-whisper-large-v3 | speech-to-text | Hugging Face, ModelScope | macOS ✅ |
| Faster-distil-whisper-large-v2 | speech-to-text | Hugging Face, ModelScope | macOS ✅ |
| Faster-distil-whisper-medium.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-tiny | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-tiny.en | speech-to-text | Hugging Face, ModelScope | |
| Paraformer-zh | speech-to-text | Hugging Face, ModelScope | |
| Paraformer-zh-streaming | speech-to-text | Hugging Face, ModelScope | Linux ✅, macOS ✅ |
| Paraformer-en | speech-to-text | Hugging Face, ModelScope | |
| Conformer-en | speech-to-text | Hugging Face, ModelScope | |
| SenseVoiceSmall | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, macOS ✅ |
| Bark | text-to-speech | Hugging Face | Linux ✅, Windows, macOS ✅ |
| Bark-small | text-to-speech | Hugging Face | Linux ✅, Windows, macOS ✅ |
| CosyVoice2-0.5B | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| CosyVoice-300M-Instruct | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| CosyVoice-300M-SFT | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| CosyVoice-300M | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| CosyVoice-300M-25Hz | text-to-speech | ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |
| Dia-1.6B | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported) ✅, Windows (not supported), macOS ✅ |

Supported APIs

Create speech

Endpoint: POST /v1/audio/speech

Generates audio from the input text. Compatible with the OpenAI audio/speech API.

Example Request:

curl http://localhost/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cosyvoice",
    "input": "Hello world",
    "voice": "English Female"
  }' \
  --output speech.mp3

Response: The audio file content.
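
The same request can be issued from Python using only the standard library. This is a sketch under stated assumptions: the server is reachable at `http://localhost` (the default port 80), the function names `build_speech_request` and `synthesize` are illustrative, and the `Authorization` header is optional here because this README does not say whether the server checks it.

```python
import json
import urllib.request


def build_speech_request(text, voice, model="cosyvoice",
                         base_url="http://localhost", api_key=None):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = json.dumps({"model": model, "input": text, "voice": voice})
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Mirrors the curl example; whether the server validates it is not documented.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=payload.encode("utf-8"),
        headers=headers,
        method="POST",
    )


def synthesize(text, voice, out_path="speech.mp3", **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(text, voice, **kwargs)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path


# Usage, with a running server:
#   synthesize("Hello world", "English Female")
```

Building the request separately from sending it makes the payload easy to inspect or log before any audio is generated.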

Create transcription

Endpoint: POST /v1/audio/transcriptions

Transcribes audio into the input language. Compatible with the OpenAI audio/transcriptions API.

Example Request:

curl http://localhost/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="whisper-large-v3"

Response:

{
  "text": "Hello world."
}
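
For callers that cannot shell out to curl, the multipart upload can be built by hand with the Python standard library. The sketch below assumes the server listens at `http://localhost` and returns the JSON shape shown above; `build_transcription_request` and `transcribe` are illustrative names, and the part content type `application/octet-stream` is a generic choice, not something this README specifies.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(filename, audio_bytes, model="whisper-large-v3",
                                base_url="http://localhost", api_key=None):
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    # Two form fields, matching the curl example: `model` and `file`.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers=headers,
        method="POST",
    )


def transcribe(path, **kwargs):
    """Upload the file at `path` and return the `text` field of the response."""
    with open(path, "rb") as f:
        audio = f.read()
    req = build_transcription_request(os.path.basename(path), audio, **kwargs)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]


# Usage, with a running server:
#   print(transcribe("/path/to/file/audio.mp3"))
```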

List Models

Endpoint: GET /v1/models

Returns the currently running models.

Get Model

Endpoint: GET /v1/models/{model_id}

Returns the currently running model.

Get Voices

Endpoint: GET /v1/voices

Returns the voices supported by the currently running model.

Health Check

Endpoint: GET /health

Returns the health check result for Vox Box.
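
The read-only endpoints (List Models, Get Model, Get Voices, Health Check) can be queried with a small helper. This is a sketch only: the base URL `http://localhost` is an assumption, `endpoint_url` and `get` are illustrative names, and because this README does not document the response bodies, the helper returns parsed JSON when the body is JSON and the raw text otherwise.

```python
import json
import urllib.request

# Paths taken from the endpoint listings in this README.
PATHS = {
    "models": "/v1/models",
    "model": "/v1/models/{model_id}",
    "voices": "/v1/voices",
    "health": "/health",
}


def endpoint_url(name, base_url="http://localhost", **params):
    """Return the full URL for a named endpoint; `model` needs model_id=..."""
    return base_url.rstrip("/") + PATHS[name].format(**params)


def get(name, **kwargs):
    """Send a GET request and return parsed JSON, or raw text if not JSON."""
    with urllib.request.urlopen(endpoint_url(name, **kwargs)) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


# Usage, with a running server:
#   print(get("health"))
#   print(get("models"))
#   print(get("voices"))
```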
