Self-Hosted Speech-to-Text — A Free, Open-Source Alternative to Google / AWS / Azure Cloud Speech APIs

Cloud speech APIs such as Google Cloud Speech-to-Text, AWS Transcribe, and Azure Speech are convenient: fully managed, auto-scaling, zero ops. But as volume grows or compliance starts to matter, the downsides show: per-minute billing that gets expensive at scale, audio uploaded to the vendor's cloud (a privacy, compliance, and data-residency concern), rate limits, a hard dependency on internet access, and limited customization.

FunASR (open-sourced by Tongyi Lab) is a mature self-hosted alternative. It is free and open-source (MIT, commercial-friendly), runs on your own machine (fully offline-capable), has no per-minute billing, and keeps your audio inside your network. It is especially strong on Chinese and other Asian languages, and it ships an OpenAI-compatible API, so migration takes little effort. Here's runnable code.

Transcribe locally in a few lines (real output)

pip install funasr

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)  # uses CPU if no GPU
res = model.generate(input="audio.wav", language="auto", use_itn=True)

print(rich_transcription_postprocess(res[0]["text"]))
# 欢迎大家来体验达摩院推出的语音识别模型。  (output for a Chinese sample)

The model downloads on first run, then everything is local inference — no API key, no per-minute bill, no audio uploaded anywhere.
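
To run the same local call over a whole folder of recordings, a small wrapper is usually enough. The sketch below is an illustration rather than part of FunASR: `transcribe_folder` and its `transcribe` parameter are hypothetical names, and the recognizer is passed in as a plain callable so the loop does not depend on any particular model.

```python
from pathlib import Path
from typing import Callable, Dict


def transcribe_folder(
    folder: str,
    transcribe: Callable[[str], str],
    pattern: str = "*.wav",
) -> Dict[str, str]:
    """Map each matching file name in `folder` to its transcript.

    `transcribe` is any function that takes a file path and returns text.
    Files are processed in sorted order so results are reproducible.
    """
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).glob(pattern)):
        results[path.name] = transcribe(str(path))
    return results


# Wiring it to the model loaded above (assumes `model` and
# `rich_transcription_postprocess` from the earlier snippet are in scope):
#
# def sensevoice(path: str) -> str:
#     res = model.generate(input=path, language="auto", use_itn=True)
#     return rich_transcription_postprocess(res[0]["text"])
#
# transcripts = transcribe_folder("recordings", sensevoice)
```

Keeping the recognizer as a parameter also makes the loop easy to test with a stub before the model has been downloaded.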

Already on a cloud API? Migrate by changing the base_url

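FunASR ships an OpenAI-compatible transcription server. If your app already talks to an OpenAI-style /v1/audio/transcriptions endpoint, migrating is often just a matter of pointing base_url at your own service; if it uses a Google, AWS, or Azure SDK, you replace that one transcription call with the client below: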

# start a local server (OpenAI-compatible /v1/audio/transcriptions)
funasr-server --model sensevoice --device cuda

# client: the OpenAI SDK, just change base_url to your server
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
# open the file in a context manager so the handle is closed after the upload
with open("audio.wav", "rb") as f:
    text = client.audio.transcriptions.create(model="sensevoice", file=f).text
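
If the same code has to run against either a hosted endpoint or your own server, one option is to read the endpoint from the environment instead of hard-coding it. This is a minimal sketch under assumptions: `stt_client_kwargs`, `STT_BASE_URL`, and `STT_API_KEY` are names made up for this example, and the default URL is simply the local server address used above.

```python
import os
from typing import Dict, Mapping


def stt_client_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build keyword arguments for OpenAI(...) from environment variables.

    STT_BASE_URL and STT_API_KEY are hypothetical variable names chosen
    for this sketch; unset values fall back to the local server above.
    """
    base_url = env.get("STT_BASE_URL", "http://localhost:8000/v1")
    api_key = env.get("STT_API_KEY", "not-needed")
    # Strip a trailing slash so both ".../v1" and ".../v1/" behave the same.
    return {"base_url": base_url.rstrip("/"), "api_key": api_key}


# Usage with the OpenAI SDK (requires `pip install openai`):
# from openai import OpenAI
# client = OpenAI(**stt_client_kwargs())
```

With this in place, switching between environments is a deployment setting rather than a code change.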

FunASR vs cloud speech APIs

Feature | FunASR (self-hosted) | Cloud STT (Google/AWS/Azure)
Pricing | open-source & free; fixed machine cost only | per-minute, scales linearly with usage
Data | stays on your servers (offline-capable) | audio uploaded to the vendor cloud
Deployment | you run it (pip / Docker) | fully managed, zero ops
Languages | 50+, very strong Chinese/Asian | many (varies by vendor)
Speaker diarization | ✅ built-in (cam++) | ✅ (extra cost/config)
Emotion / audio events | ✅ (SenseVoice) | mostly ❌
Customization / fine-tune | ✅ full model access | limited
Offline / air-gapped | ✅ | ❌
License | open-source, commercial-friendly | proprietary

To be fair: if what you want is zero ops, elastic scale, and no machines to manage, a managed cloud API is still the easy choice. FunASR's trade-off is that you run one machine of your own in exchange for controllable cost, private data, offline capability, and full customization.

When self-hosting pays off

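Cloud STT typically bills per audio minute, which works out to roughly $0.6–$1.5 per audio hour depending on vendor and tier. At 1,000 hours/month that is about $600–$1,500/month, and the bill grows linearly with volume, whereas a self-hosted FunASR instance is a fixed machine cost, largely independent of volume. Self-hosting usually wins once monthly volume is high enough for per-minute fees to exceed that fixed cost, or when privacy, data-residency, offline, or customization requirements rule out a cloud API.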
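
The cost comparison above is simple arithmetic, and a few lines make the break-even point explicit. The hourly rates are the rough figures quoted in this section; the $300/month machine cost is a hypothetical placeholder, so substitute your own hardware or rental price.

```python
def cloud_cost(hours: float, rate_per_hour: float) -> float:
    """Monthly cloud bill for `hours` of audio at a per-audio-hour rate."""
    return hours * rate_per_hour


def break_even_hours(machine_cost_per_month: float, rate_per_hour: float) -> float:
    """Monthly audio hours at which the cloud bill equals the fixed machine cost."""
    return machine_cost_per_month / rate_per_hour


if __name__ == "__main__":
    MACHINE_COST = 300.0  # hypothetical monthly machine cost in USD
    for rate in (0.6, 1.5):  # rough per-audio-hour range quoted above
        print(
            f"${rate:.2f}/h: 1,000 h costs ${cloud_cost(1000, rate):,.0f}; "
            f"break-even at {break_even_hours(MACHINE_COST, rate):,.0f} h/month"
        )
    # $0.60/h: 1,000 h costs $600; break-even at 500 h/month
    # $1.50/h: 1,000 h costs $1,500; break-even at 200 h/month
```

Above the break-even volume the fixed machine is cheaper; below it, per-minute billing costs less, before counting the operations effort of running your own server.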

See also: the self-hosted OpenAI Whisper API alternative, the Deepgram/AssemblyAI alternative, the FunASR vs Whisper benchmark, and the model selection guide.

The whole FunASR stack is open-source (MIT): industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, self-hosted and commercial-friendly.