Self-Hosted Speech-to-Text — A Free, Open-Source Alternative to Google / AWS / Azure Cloud Speech APIs
Cloud speech APIs like Google Cloud Speech-to-Text, AWS Transcribe, and Azure Speech are convenient: fully managed, auto-scaling, zero ops. But as volume grows or compliance starts to matter, their downsides show: per-minute billing that gets expensive at scale, audio uploaded to the vendor's cloud (a privacy, compliance, and data-residency concern), rate limits, dependence on an internet connection, and limited customization.
FunASR (open-sourced by Tongyi Lab) is a mature self-hosted alternative: open-source and free (MIT, commercial-friendly), runs on your own machine (fully offline-capable), no per-minute billing, and your audio never leaves your network. It's especially strong on Chinese and Asian languages, and ships an OpenAI-compatible API so migration is nearly free. Here's runnable code.
Transcribe locally in 4 lines (real output)
pip install funasr

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)  # uses CPU if no GPU
res = model.generate(input="audio.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))
# 欢迎大家来体验达摩院推出的语音识别模型。 (output for a Chinese sample)
The model downloads on first run, then everything is local inference — no API key, no per-minute bill, no audio uploaded anywhere.
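Because the model stays in memory after loading, the same object can be reused across many files. The sketch below assumes only the `generate(...)` call and the `res[0]["text"]` result shape shown above; `transcribe_folder`, `AUDIO_SUFFIXES`, and the `postprocess` parameter are hypothetical names introduced for illustration, not part of FunASR.

```python
from pathlib import Path
from typing import Callable, Dict

# Illustrative set of extensions to pick up; an assumption, not a FunASR list.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".m4a"}


def transcribe_folder(
    model,
    folder: str,
    postprocess: Callable[[str], str] = lambda text: text,
) -> Dict[str, str]:
    """Run one already-loaded model over every audio file in `folder`.

    `model` is any object exposing the generate() call used in the
    snippet above; `postprocess` would typically be
    rich_transcription_postprocess.
    """
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in AUDIO_SUFFIXES:
            continue  # skip non-audio files
        res = model.generate(input=str(path), language="auto", use_itn=True)
        results[path.name] = postprocess(res[0]["text"])
    return results


# Possible usage, reusing the objects from the snippet above:
#   model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
#   texts = transcribe_folder(model, "recordings/", rich_transcription_postprocess)
```

Loading the model once and looping over files avoids paying the model start-up cost for every recording.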
Already on a cloud API? Migrate by changing the base_url
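FunASR ships an OpenAI-compatible transcription server, so if your app already talks to an OpenAI-style transcription endpoint, you often only need to point base_url at your own service. An app built on a vendor-specific SDK (Google, AWS, Azure) would need its transcription call swapped for a client call like the one below: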
# start a local server (OpenAI-compatible /v1/audio/transcriptions)
funasr-server --model sensevoice --device cuda
# client: the OpenAI SDK, just change base_url to your server
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
with open("audio.wav", "rb") as f:  # close the file once the upload finishes
    text = client.audio.transcriptions.create(model="sensevoice", file=f).text
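For apps that would rather not depend on the OpenAI SDK, the same call can be sketched with the standard library alone. This assumes the server follows the OpenAI transcription convention: a `multipart/form-data` POST to `/audio/transcriptions` with `model` and `file` fields, answered by JSON containing a `text` key. The helper names (`build_multipart`, `build_request`, `transcribe`) are hypothetical.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields: dict, filename: str, file_bytes: bytes, boundary: str) -> bytes:
    """Encode plain form fields plus one file part as multipart/form-data."""
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts)


def build_request(base_url: str, model: str, filename: str, file_bytes: bytes,
                  api_key: str = "not-needed") -> urllib.request.Request:
    """Build (but do not send) the transcription request."""
    boundary = uuid.uuid4().hex
    body = build_multipart({"model": model}, filename, file_bytes, boundary)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Authorization": f"Bearer {api_key}",
        },
    )


def transcribe(base_url: str, model: str, path: str) -> str:
    """Send the file and return the `text` field of the JSON response."""
    with open(path, "rb") as f:
        req = build_request(base_url, model, os.path.basename(path), f.read())
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]


# Possible usage against the local server started above:
#   print(transcribe("http://localhost:8000/v1", "sensevoice", "audio.wav"))
```

Keeping request construction separate from sending makes the wire format easy to inspect, and switching between a cloud endpoint and your own server comes down to the `base_url` argument.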
FunASR vs cloud speech APIs
| | FunASR (self-hosted) | Cloud STT (Google/AWS/Azure) |
|---|---|---|
| Pricing | open-source & free; fixed machine cost only | per-minute, scales linearly with usage |
| Data | stays on your servers (offline-capable) | audio uploaded to the vendor cloud |
| Deployment | you run it (pip / Docker) | fully managed, zero ops |
| Languages | 50+, very strong Chinese/Asian | many (varies by vendor) |
| Speaker diarization | ✅ built-in (cam++) | ✅ (extra cost/config) |
| Emotion / audio events | ✅ (SenseVoice) | mostly ❌ |
| Customization / fine-tune | ✅ full model access | limited |
| Offline / air-gapped | ✅ | ❌ |
| License | open-source, commercial-friendly | proprietary |
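To use the emotion and audio-event labels from the table in your own code, one option is to read them from the raw SenseVoice text before `rich_transcription_postprocess` cleans it up. The sketch below assumes the raw text carries labels as `<|...|>` tags ahead of the transcript. The sample string is made up to illustrate that layout, not captured output, and `split_tags` and `first_of` are hypothetical helpers.

```python
import re
from typing import Dict, Iterable, List, Optional, Union

# Matches tags of the assumed form <|LABEL|>.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw: str) -> Dict[str, Union[List[str], str]]:
    """Separate <|...|> tags from the transcript text."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw).strip()
    return {"tags": tags, "text": text}


def first_of(tags: Iterable[str], candidates: Iterable[str]) -> Optional[str]:
    """Return the first tag that appears in `candidates`, else None."""
    wanted = set(candidates)
    for tag in tags:
        if tag in wanted:
            return tag
    return None


# Illustrative input only; real output may differ in tag names and order.
sample = "<|en|><|NEUTRAL|><|Speech|>hello world"
parsed = split_tags(sample)
emotion = first_of(parsed["tags"], {"HAPPY", "SAD", "ANGRY", "NEUTRAL"})
print(parsed["text"], emotion)
```

The candidate label set is supplied by the caller, so the helper does not hard-code which emotions or events a given model version emits.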
To be fair: if what you want is zero ops, elastic scale, and no machines to manage, a managed cloud API is still the easy choice. FunASR's trade-off is that you run one machine of your own in exchange for controllable cost, private data, offline capability, and full customization.
When self-hosting pays off
Cloud STT typically bills per audio minute (roughly $0.6–$1.5 per audio hour depending on vendor/tier). So 1,000 hours/month is about $600–$1,500/month and grows linearly, whereas a self-hosted FunASR instance is a fixed machine cost, largely independent of volume. Self-hosting usually wins when any of these hold:
- High volume: lots of audio per month, where per-minute bills dwarf one machine's cost;
- Sensitive data: healthcare/finance/government, where audio can't leave your network;
- Offline / air-gapped deployment environments;
- Chinese / dialects / Asian languages as the primary need, where you want stronger local accuracy and customization.
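The volume argument above can be turned into a quick break-even estimate. This is a minimal sketch: the per-hour rates are the rough range quoted above, while the machine cost is a made-up placeholder to be replaced with your own figure.

```python
def cloud_monthly_cost(audio_hours: float, usd_per_audio_hour: float) -> float:
    """Per-minute billing, expressed per audio hour: cost grows linearly."""
    return audio_hours * usd_per_audio_hour


def break_even_hours(machine_usd_per_month: float, usd_per_audio_hour: float) -> float:
    """Monthly audio hours at which the cloud bill equals the fixed machine cost."""
    return machine_usd_per_month / usd_per_audio_hour


HOURS_PER_MONTH = 1_000
MACHINE_USD_PER_MONTH = 400.0  # hypothetical; substitute your real hardware/hosting cost

for rate in (0.6, 1.5):  # rough range quoted above, USD per audio hour
    bill = cloud_monthly_cost(HOURS_PER_MONTH, rate)
    threshold = break_even_hours(MACHINE_USD_PER_MONTH, rate)
    print(f"${rate}/h: cloud bill ${bill:,.0f}/month, break-even at {threshold:,.0f} h/month")
```

Below the break-even volume the managed API is the cheaper option on price alone; above it, the fixed machine cost wins, before counting the ops time that self-hosting adds.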
See also: the self-hosted OpenAI Whisper API alternative, the Deepgram/AssemblyAI alternative, the FunASR vs Whisper benchmark, and the model selection guide.
The whole FunASR stack is open-source (MIT) — industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, self-hosted and commercial-friendly. If it helps, a GitHub Star really supports the project 👇
Also star: SenseVoice · Fun-ASR · FunClip