Self-Hosted OpenAI Whisper API Alternative: Drop-in /v1/audio/transcriptions with FunASR

FunASR Blog · 2026-06-18 · Tutorial

OpenAI's /v1/audio/transcriptions endpoint (the Whisper API) bills per minute, requires uploading your audio to OpenAI's servers, and is rate-limited. If you just need audio turned into text, you can run a server with the same interface on your own machine, and existing code switches over by changing a single base_url. FunASR ships exactly that as funasr-server. Every snippet below is tested.

Start an OpenAI-compatible server in one line

pip install -U funasr
funasr-server --model sensevoice --device cuda     # listens on localhost:8000

You now have the same two endpoints OpenAI exposes: POST /v1/audio/transcriptions and GET /v1/models.
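A quick way to confirm the server is up is to hit GET /v1/models without any SDK. The following is a minimal sketch using only the Python standard library. It assumes the default localhost:8000 address from the command above and an OpenAI-style response body of the form {"data": [{"id": ...}]}; the helper names parse_model_ids and list_models are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default address from the command above


def parse_model_ids(payload: bytes) -> list[str]:
    """Extract model ids from an OpenAI-style model list response body."""
    doc = json.loads(payload)
    return [entry["id"] for entry in doc.get("data", [])]


def list_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """GET {base_url}/models and return the advertised model ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(resp.read())


# Usage, once funasr-server is running:
#   print(list_models())
```

If the call raises a connection error, the server is probably still loading the model or is listening on a different port.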

Call it with the official OpenAI SDK (just change base_url)

Your existing OpenAI code stays almost unchanged. Point base_url at the local server and pass any non-empty api_key:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("audio.wav", "rb") as f:
    r = client.audio.transcriptions.create(model="sensevoice", file=f)
print(r.text)
# -> 欢迎大家来体验达摩院推出的语音识别模型   (output for a Chinese sample clip; roughly "Welcome, everyone, to try the speech recognition model released by DAMO Academy")

Listing models works the same way:

print([m.id for m in client.models.list().data])
# -> ['fun-asr-nano', 'sensevoice', 'paraformer']

Or use curl / any HTTP client

curl http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=sensevoice"
# -> {"text": "..."}

Because the API is OpenAI-compatible, the Node openai package, LangChain, and most existing SDKs can connect to it once their base URL points at the server.
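For an environment where neither curl nor an SDK is available, the same multipart upload can be sketched with the Python standard library alone. This is an illustration, not part of FunASR: build_multipart and transcribe are hypothetical helper names, the field names file and model mirror the curl call, and the {"text": ...} response shape is taken from the curl output shown earlier.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields: dict[str, str], filename: str,
                    file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one part named "file" as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, model: str = "sensevoice",
               base_url: str = "http://localhost:8000/v1") -> str:
    """POST an audio file to /audio/transcriptions and return the "text" field."""
    with open(path, "rb") as f:
        body, content_type = build_multipart(
            {"model": model}, os.path.basename(path), f.read()
        )
    req = urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["text"]


# Usage, once funasr-server is running:
#   print(transcribe("audio.wav"))
```

In practice a library such as requests or httpx handles the multipart encoding for you; the hand-rolled version only shows what goes over the wire.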

OpenAI Whisper API vs self-hosted FunASR

Aspect	OpenAI Whisper API	FunASR self-hosted
Cost	$0.006 / min	Free (your hardware)
Privacy	Audio leaves your machine	Stays local
Rate limits	Yes	None
Chinese accuracy	Mediocre	Higher (benchmark)
Speed	—	SenseVoice is non-autoregressive, far faster than Whisper
Interface	/v1/audio/transcriptions	Identical
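
To put the cost row in concrete terms, the arithmetic is a single multiplication. The sketch below uses the $0.006 per minute rate from the table; the 500-hour monthly workload and the 600 USD hardware price are made-up inputs for illustration, not measurements.

```python
OPENAI_PRICE_PER_MIN = 0.006  # USD per audio minute, taken from the table above


def api_cost_usd(audio_minutes: float,
                 price_per_min: float = OPENAI_PRICE_PER_MIN) -> float:
    """Metered cost of transcribing the given number of audio minutes."""
    return round(audio_minutes * price_per_min, 2)


def break_even_minutes(hardware_cost_usd: float,
                       price_per_min: float = OPENAI_PRICE_PER_MIN) -> float:
    """Audio minutes at which metered fees equal a one-off hardware spend.

    Ignores electricity and maintenance, so the result is a lower bound.
    """
    return hardware_cost_usd / price_per_min


monthly_minutes = 500 * 60                   # hypothetical workload: 500 hours per month
print(api_cost_usd(monthly_minutes))         # 180.0
print(round(break_even_minutes(600.0)))      # hypothetical 600 USD GPU -> 100000
```

At that hypothetical volume, the metered fees would match the hardware price after a little over three months.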

Which model to pick

sensevoice: default; non-autoregressive, very fast, 50+ languages, with emotion/events. Best everyday choice.
fun-asr-nano: flagship LLM decoder for zh/en/ja plus Chinese dialects/accents (needs vLLM). For 31 languages, deploy the separate FunAudioLLM/Fun-ASR-MLT-Nano-2512 checkpoint.
paraformer: classic, production-grade Chinese.

Just set model= in the request — no server restart needed.
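
Since the model is chosen per request, comparing models on the same clip is a short loop. The helper below is a sketch: transcribe_with_each is a hypothetical name, and it accepts any object that exposes the client.audio.transcriptions.create(...) call used earlier, so it works with the OpenAI SDK client shown above.

```python
from typing import Any, Iterable


def transcribe_with_each(client: Any, path: str,
                         models: Iterable[str]) -> dict[str, str]:
    """Transcribe one file once per model name and collect the text results."""
    results = {}
    for model in models:
        # Reopen the file for every request so each upload starts at byte 0.
        with open(path, "rb") as f:
            response = client.audio.transcriptions.create(model=model, file=f)
        results[model] = response.text
    return results


# Usage (requires the openai package and a running funasr-server):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
#   for name, text in transcribe_with_each(client, "audio.wav",
#                                          ["sensevoice", "paraformer"]).items():
#       print(name, "->", text)
```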

What changes when migrating from OpenAI

base_url: point it at your funasr-server.
api_key: not validated locally — any non-empty string works.
model: pass one of the names the server lists, such as sensevoice.
Everything else: the client.audio.transcriptions.create(...) call is the same.
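
One way to keep both backends available during a migration is to choose the settings from the environment. The sketch below is only an illustration: FUNASR_BASE_URL is a hypothetical variable name invented for this example, and whisper-1 is the model name used with the hosted OpenAI endpoint.

```python
import os
from typing import Mapping, Optional


def transcription_settings(
    env: Optional[Mapping[str, str]] = None,
) -> tuple[dict[str, str], str]:
    """Return (OpenAI client kwargs, model name) for the selected backend."""
    env = os.environ if env is None else env
    base_url = env.get("FUNASR_BASE_URL")  # hypothetical variable name
    if base_url:
        # Local funasr-server: the key is not validated, any non-empty string works.
        return {"base_url": base_url, "api_key": "not-needed"}, "sensevoice"
    # Hosted OpenAI: empty kwargs let the SDK read OPENAI_API_KEY itself.
    return {}, "whisper-1"


# Usage (requires the openai package):
#   from openai import OpenAI
#   kwargs, model = transcription_settings()
#   client = OpenAI(**kwargs)
#   with open("audio.wav", "rb") as f:
#       print(client.audio.transcriptions.create(model=model, file=f).text)
```

Unsetting the variable switches back to the hosted API without touching the call site.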

Use FunASR as the speech-to-text in Open-WebUI

Open-WebUI (the 150k-star self-hosted LLM interface) can send its speech-to-text (STT) requests to any OpenAI-compatible endpoint. Since funasr-server is OpenAI-compatible, you only need to point Open-WebUI at it, with no plugin and no Open-WebUI code changes, and you get more accurate Chinese transcription than the built-in Whisper, fully local and free.

1. Start FunASR (as above):

funasr-server --model sensevoice --device cuda   # listens on localhost:8000

2. In Open-WebUI: Admin Panel → Settings → Audio → Speech-to-Text, set Engine to OpenAI and fill in:

API Base URL: http://localhost:8000/v1 (if Open-WebUI runs in Docker and FunASR on the host, use http://host.docker.internal:8000/v1)
API Key: any non-empty string (funasr-server does not validate it)
STT Model: sensevoice (or paraformer / fun-asr-nano)

With Docker Compose, the same settings can be supplied as environment variables:

AUDIO_STT_ENGINE=openai
AUDIO_STT_OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1
AUDIO_STT_OPENAI_API_KEY=funasr
AUDIO_STT_MODEL=sensevoice

Now Open-WebUI's microphone / voice input runs through FunASR, with better Chinese accuracy and audio that never leaves your machine. The same config works for any tool that supports a custom OpenAI STT endpoint (LibreChat, AnythingLLM, etc.).
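
If voice input fails after this setup, the usual suspect is that the container cannot reach the server at the configured base URL. A small probe run from the same network location can narrow that down. The sketch below is an assumption-laden helper, not part of FunASR or Open-WebUI: models_endpoint_ok and wait_until_ready are hypothetical names, and the probe simply checks that GET /v1/models answers with HTTP 200.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        url = f"{base_url.rstrip('/')}/models"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 10,
                     delay: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final failed attempt
    return False


# Usage from inside the Open-WebUI container:
#   url = "http://host.docker.internal:8000/v1"
#   print(wait_until_ready(lambda: models_endpoint_ok(url)))
```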

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit.
