Fish Speech S2: The Open-Source TTS That Beat the Closed-Source Giants

Fish Audio's 4B-parameter Dual-AR model, trained on 10 million hours of audio, achieves the lowest word error rate on Seed-TTS Eval of any system tested, open or closed. Inline emotion tags, 50-language support, and sub-100ms streaming make it the new benchmark.

fishaudio/fish-speech · 12 min read

A mechanical fish outswims the competition. The open-source TTS race just got a new leader.

Key Takeaways

Fish Audio S2 achieves the lowest word error rate on Seed-TTS Eval of any system tested, beating Qwen3-TTS, MiniMax Speech-02, and ByteDance's own Seed-TTS.
The Dual-Autoregressive architecture splits a 4B Slow AR (time axis) from a 400M Fast AR (depth axis), making production streaming at 0.195 RTF possible on a single H200.
Free-form natural-language inline tags like [whisper], [laugh], and [super happy] let you control emotion at the word level without a predefined tag vocabulary.
Trained on 10M+ hours across ~50 languages, S2 won best WER in 11 of 24 languages and best speaker similarity in 17 of 24 on MiniMax's multilingual testset.

From Dorm Room GPUs to State of the Art

Shijia Liao graduated from the University of Maryland in 2023 and joined NVIDIA to scale Vision Foundation Models across GPU clusters. Then he quit. He spent countless nights prototyping Fish Audio on a home rig of 4090 GPUs with a four-person Gen Z founding team.

That team had serious open-source credentials. Before Fish Audio existed as a company, Liao led or contributed to So-VITS-SVC, GPT-SoVITS, and Bert-VITS2. These projects are still widely used in creative coding and research communities. Fish Speech was always meant to be the system that turned those experiments into something production-ready.

The growth numbers tell the story. Between January and April 2025, Fish Audio scaled annualized revenue from $400,000 to over $5 million. Monthly active users jumped from 50,000 to 420,000. All of this happened before S2 was even released.

From a home GPU rig to state-of-the-art benchmarks. The founding story of Fish Audio.

What Makes S2 Different

Most TTS systems treat voice generation as a monolithic pipeline: text goes in, audio comes out, and you hope the middle does something reasonable. Fish Audio S2 rethinks the architecture from the ground up with a technique called Dual-Autoregressive generation.

The insight is that speech has two axes that matter. The time axis determines what gets said and when. The depth axis determines the acoustic richness of each moment. S2 assigns separate models to each axis, with radically different parameter budgets.

The Dual-AR Architecture

The Slow AR is a 4-billion-parameter decoder-only transformer. It operates along the time axis and predicts tokens from the primary semantic codebook at roughly 21 frames per second. This is the expensive, high-stakes model. It decides the content, rhythm, and prosody of the speech.

The Fast AR is a 400-million-parameter model that runs at each time step. Given the semantic token from the Slow AR, it autoregressively generates the remaining 9 codebooks along the depth axis. These residual codebooks reconstruct fine-grained acoustic detail: timbre, breathiness, room characteristics.

This asymmetric split is the key architectural decision. The time axis gets the heavy model because semantic content is hard. The depth axis gets the lightweight model because acoustic detail, given good semantics, is comparatively easy. The result is an efficient system that does not sacrifice fidelity.
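The two nested loops can be sketched in a few lines of Python. This is only a toy illustration of the control flow described above, not Fish Audio's implementation: the two step functions are random stand-ins for the real transformers, and the codebook size is a hypothetical value chosen for the example.

```python
import random

NUM_CODEBOOKS = 10      # 1 semantic + 9 residual, per the description above
CODEBOOK_SIZE = 1024    # hypothetical vocabulary size per codebook


def slow_ar_step(text, frames, rng):
    """Stand-in for the 4B Slow AR: pick the next semantic token.

    A real model would condition on the text and on all previous frames.
    """
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_token, residuals, rng):
    """Stand-in for the 400M Fast AR: pick the next residual token.

    A real model would condition on the semantic token and on the
    residual tokens already generated for this frame.
    """
    return rng.randrange(CODEBOOK_SIZE)


def generate(text, num_frames, seed=0):
    """Run the outer time-axis loop and the inner depth-axis loop."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):                 # time axis (Slow AR)
        semantic = slow_ar_step(text, frames, rng)
        residuals = []
        for _ in range(NUM_CODEBOOKS - 1):      # depth axis (Fast AR)
            residuals.append(fast_ar_step(semantic, residuals, rng))
        frames.append([semantic] + residuals)
    return frames


frames = generate("hello world", num_frames=21)  # about one second at ~21 fps
print(len(frames), len(frames[0]))
```

The point of the sketch is the cost profile: the expensive model runs once per frame, while the cheap model runs nine times per frame.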

Under the hood, the audio codec is based on Descript Audio Codec (DAC) with 10 codebook layers using Residual Vector Quantization (RVQ). The entire setup is structurally isomorphic to a standard autoregressive LLM. That seemingly academic point has massive practical consequences.
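For readers unfamiliar with Residual Vector Quantization, the idea can be shown with a minimal pure-Python sketch. It is not the codec's actual code: the vector dimension, codebook size, and random codebooks are toy assumptions, and only the count of 10 stages mirrors the article. Each stage quantizes whatever error the previous stages left behind.

```python
import random


def nearest(codebook, vector):
    """Return the index of the codeword closest to `vector` (squared L2)."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the leftover residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
DIM, SIZE, STAGES = 4, 16, 10   # toy sizes; only the 10 stages mirror the article
# Index 0 is an all-zero codeword (a toy convenience) so that a stage can
# never make the error worse. Later stages hold smaller corrections.
codebooks = [
    [[0.0] * DIM]
    + [[rng.uniform(-1, 1) / (2 ** s) for _ in range(DIM)] for _ in range(SIZE - 1)]
    for s in range(STAGES)
]
x = [rng.uniform(-1, 1) for _ in range(DIM)]
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print("indices:", codes)
print("squared error:", squared_error(x, x_hat))
```

In the Dual-AR setup, the first index of each frame corresponds to the semantic codebook predicted by the Slow AR, and the remaining nine are the residual indices the Fast AR fills in.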

Two tracks, two budgets. The Dual-AR architecture in a nutshell.

GRPO: Reinforcement Learning for Voice

Pre-training on 10 million hours of audio gets you a strong base model. But the gap between "good" and "indistinguishable from human" requires alignment. Fish Audio S2 uses Group Relative Policy Optimization (GRPO) for this final push.

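GRPO is a reinforcement-learning algorithm originally developed for LLMs. It is a variant of PPO that drops the separate value model and instead scores each sampled output relative to the other outputs sampled for the same prompt. The clever adaptation here is that the same models used to filter and annotate training data during pre-training are directly reused as reward models during RL. According to the technical report, this eliminates the distribution mismatch between pre-training data and post-training objectives.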

The reward signal combines four components: semantic accuracy (measured via ASR word error rate), instruction adherence (does it actually whisper when you say [whisper]?), acoustic preference scoring (overall audio quality), and timbre similarity (does the cloned voice match the reference?). Each signal pulls the model toward a different dimension of quality.
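A rough sketch of how four such signals might feed a GRPO-style update is shown below. The weights, the score ranges, and the sample values are all hypothetical, since the article does not say how the signals are mixed. Only the group-relative standardization reflects how GRPO is generally described.

```python
from statistics import mean, pstdev

# Hypothetical weights; the article lists the four signals but not their mix.
WEIGHTS = {"semantic": 0.4, "instruction": 0.2, "acoustic": 0.2, "timbre": 0.2}


def combined_reward(scores):
    """Weighted sum of the four reward signals, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical samples generated for the same prompt.
# "semantic" is assumed to be 1 minus the ASR word error rate.
group = [
    {"semantic": 0.99, "instruction": 0.9, "acoustic": 0.8, "timbre": 0.85},
    {"semantic": 0.95, "instruction": 0.4, "acoustic": 0.7, "timbre": 0.80},
    {"semantic": 0.90, "instruction": 0.9, "acoustic": 0.9, "timbre": 0.90},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.2f}")
```

Samples that score above their group's average get a positive advantage and are reinforced, while those below it are pushed down. No separate value network is needed.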

"S2 uses GRPO for post-training alignment. The same models used to filter and annotate training data are directly reused as reward models during RL, eliminating distribution mismatch between pre-training data and post-training objectives."

-- Fish Audio S2 Technical Report, March 2026

Inline Emotion Control

Most TTS systems that support emotion control give you a dropdown menu: happy, sad, angry, neutral. Maybe a few more if you are lucky. Fish Audio S2 throws out the dropdown entirely.

Instead, you embed free-form natural-language instructions directly into the text at the exact position where you want the emotional shift. Write [whisper in small voice] before a word and that word gets whispered. Write [professional broadcast tone] at the start of a paragraph and the entire paragraph shifts register. Write [laugh] and the model laughs.

This is not a fixed tag vocabulary. The model accepts arbitrary textual descriptions and interprets them. The Fish Instruction Benchmark measures this: S2 scores a 93.3% tag adherence rate with 4.51 out of 5.0 quality.
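To see where tags sit relative to the spoken text, a small helper can split a script into tag and text pieces. This is a hypothetical inspection utility, not part of Fish Speech. The model consumes the tagged string directly, so a parser like this is only useful for linting or previewing scripts.

```python
import re

# Matches one bracketed instruction such as [whisper] or [super happy].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_text(text):
    """Split text into ("tag", value) and ("text", value) pieces, in order."""
    pieces = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            pieces.append(("text", plain))
        pieces.append(("tag", match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        pieces.append(("text", tail))
    return pieces


script = (
    "[professional broadcast tone] Good evening. "
    "[whisper in small voice] Do not tell anyone. [laugh]"
)
for kind, value in split_tagged_text(script):
    print(kind, value)
```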

Free-form emotion tags replace rigid dropdown menus. Write what you want and the model interprets it.

The Benchmark Blitz

Numbers do not lie, and S2's numbers are striking. On the Seed-TTS Eval benchmark, it achieves a Chinese WER of 0.54% and an English WER of 0.99%. Both are the best scores recorded by any system, open or closed.

For context, Alibaba's Qwen3-TTS scores 0.77% / 1.24%. MiniMax Speech-02 scores 0.99% / 1.90%. ByteDance's own Seed-TTS, the benchmark's namesake, scores 1.12% / 2.25%. S2 does not just edge ahead. It sets a new floor.

Benchmark	Fish Audio S2	Best Competitor	Margin
Seed-TTS Eval WER (Chinese)	0.54%	Qwen3-TTS (0.77%)	30% lower
Seed-TTS Eval WER (English)	0.99%	Qwen3-TTS (1.24%)	20% lower
Audio Turing Test	0.515	Seed-TTS (0.417)	+24%
EmergentTTS-Eval Win Rate	81.88%	Next best	Highest overall
Instruction Adherence (TAR)	93.3%	N/A	Fish benchmark
Multilingual Best WER	11/24 langs	Varies	Most wins
Multilingual Best SIM	17/24 langs	Varies	Most wins
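The relative margins can be recomputed from the figures quoted in the prose above. The snippet below is a simple arithmetic check, with the helper names chosen here for illustration.

```python
def relative_reduction(ours, theirs):
    """Percent by which `ours` is lower than `theirs` (for error rates)."""
    return (theirs - ours) / theirs * 100


def relative_gain(ours, theirs):
    """Percent by which `ours` is higher than `theirs` (for scores)."""
    return (ours - theirs) / theirs * 100


# Figures as quoted in this article.
print(f"Chinese WER vs Qwen3-TTS: {relative_reduction(0.54, 0.77):.1f}% lower")
print(f"English WER vs Qwen3-TTS: {relative_reduction(0.99, 1.24):.1f}% lower")
print(f"Audio Turing Test vs Seed-TTS: {relative_gain(0.515, 0.417):.1f}% higher")
```

The results come out at roughly 29.9%, 20.2%, and 23.5%, which round to the 30% and 24% margins reported for Chinese WER and the Audio Turing Test.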

The Audio Turing Test is particularly telling. A score of 0.515 suggests that human evaluators frequently could not tell whether they were listening to a real human or to S2. The model surpasses Seed-TTS by 24% and MiniMax-Speech by 33% on this metric.

On EmergentTTS-Eval, S2 shines brightest in paralinguistics (91.61% win rate), questions (84.41%), and syntactic complexity (83.39%). These are not easy categories. They test whether the model handles the subtle prosodic demands of real speech.

The benchmark results are not close. S2 leads across the board.

Production Streaming via SGLang

Here is where the "isomorphic to LLMs" property pays off enormously. Because the Dual-AR architecture looks like a standard autoregressive language model to the serving infrastructure, S2 directly inherits every optimization that SGLang provides for LLM inference.

That means continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching. None of these needed custom implementation. They just work.

The production numbers on a single NVIDIA H200 GPU: a Real-Time Factor of 0.195 (meaning audio generates roughly 5x faster than real-time playback), time-to-first-audio under 100 milliseconds, and throughput of 3,000+ acoustic tokens per second while maintaining RTF below 0.5.
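Real-Time Factor is simply generation time divided by the duration of the audio produced, so the speed-up is its reciprocal. A quick check of the quoted figure, with helper names chosen here for illustration:

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    return generation_seconds / audio_seconds


def generation_time(audio_seconds, rtf):
    """Wall-clock seconds needed to synthesize `audio_seconds` at a given RTF."""
    return audio_seconds * rtf


RTF = 0.195  # figure quoted above for a single H200
print(f"speed-up over real time: {1 / RTF:.2f}x")
print(f"time to synthesize 60 s of audio: {generation_time(60, RTF):.1f} s")
```

At this RTF, a minute of speech takes about 11.7 seconds to generate, which is where the "roughly 5x" figure comes from.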

"Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, S2 directly inherits all LLM-native serving optimizations from SGLang."

-- Fish Audio S2 Technical Report

For anyone who has struggled to deploy custom TTS models in production, this is a revelation. You do not need to build a bespoke serving stack. You use the same battle-tested infrastructure that powers LLM APIs at scale.

Voice Cloning Without Fine-Tuning

S2 supports zero-shot voice cloning from a short reference sample, typically 10 to 30 seconds. The model extracts the speaker's timbre, speaking style, and emotional tendencies from the reference and applies them to any new text.

No fine-tuning required. No per-voice model training. Just provide a clean audio sample and the model handles the rest. The cloned voice maintains consistency across long passages and different emotional registers.
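Before sending a clip for cloning, it can help to confirm that it falls in the typical 10 to 30 second range. The sketch below uses only Python's standard `wave` module and is not part of Fish Speech. The function names are hypothetical, and the silent test clip exists purely to make the example runnable.

```python
import io
import wave

MIN_SECONDS, MAX_SECONDS = 10.0, 30.0  # the typical range quoted above


def wav_duration_seconds(source):
    """Return the duration of a PCM WAV given a path or file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(source):
    """True if the clip length falls inside the typical 10 to 30 second range."""
    return MIN_SECONDS <= wav_duration_seconds(source) <= MAX_SECONDS


def make_silent_wav(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


print(is_usable_reference(make_silent_wav(12)))  # inside the range
print(is_usable_reference(make_silent_wav(3)))   # too short
```

A duration check says nothing about whether the recording is clean, so it complements rather than replaces listening to the sample.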

Ten seconds in, unlimited speech out. Zero-shot voice cloning needs no fine-tuning.

Multi-Speaker, Multi-Turn

S2 natively supports multiple speakers in a single generation pass. Upload reference audio containing multiple voices, and the model assigns each speaker an identifier token like <|speaker:1|> and <|speaker:2|>. You then control which voice speaks each line.

This eliminates the traditional TTS workflow of generating each speaker separately and stitching audio files together. Dialogue, podcasts, and audiobooks become single-pass operations.
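A dialogue script with speaker tokens might be assembled as follows. The token format comes from the description above, but the exact layout is an assumption made for illustration: this sketch places the token directly before each line and puts one turn per line. The helper names are hypothetical.

```python
def speaker_token(index):
    """Format a speaker identifier token in the style quoted above."""
    return f"<|speaker:{index}|>"


def build_dialogue(turns):
    """Join (speaker_index, line) pairs into one script, one turn per line.

    Assumption: the token directly precedes the text of each turn.
    """
    return "\n".join(f"{speaker_token(idx)}{line}" for idx, line in turns)


script = build_dialogue([
    (1, "Welcome back to the show."),
    (2, "[laugh] Thanks for having me."),
    (1, "Let's get started."),
])
print(script)
```

Because the whole conversation is one string, inline emotion tags and speaker switches can be mixed freely within a single request.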

Multi-turn generation extends the context window so that previous audio informs subsequent speech. The model uses earlier output to improve expressiveness and maintain consistency across a conversation. Earlier turns set the emotional and prosodic tone for later ones.

The Competitive Landscape

The open-source TTS space has shifted dramatically over the past year. In 2024, ElevenLabs was the unquestioned quality leader and most open alternatives sounded robotic by comparison. That gap has collapsed.

StyleTTS2 matches ElevenLabs on naturalness. Coqui XTTS matches it on clarity. Bark beats it on raw expressiveness. But none of these projects combine all dimensions the way S2 does: quality, speed, multilingual coverage, emotion control, and production-readiness in a single package.

System	Type	Voice Cloning	Emotion Control	Languages	Streaming
Fish Audio S2	Open source	Zero-shot (10-30s)	Free-form inline tags	~50	Sub-100ms TTFA
ElevenLabs	Closed API	Fine-tune or instant	Style presets	29	Yes
Qwen3-TTS	Open source	Zero-shot	Limited	Multi	Yes
MiniMax Speech-02	API	Zero-shot	Tags	Multi	Yes
Bark (Suno)	Open source	Speaker prompts	Non-verbal tokens	13	No
StyleTTS2	Open source	Reference style	Style transfer	English-focused	Limited

The cost picture is equally telling. Fish Audio's hosted API runs at roughly $0.05 per minute of generated audio. ElevenLabs charges approximately $0.18. That is a cost reduction of roughly 72%. And self-hosting S2 brings the marginal cost down to GPU compute alone.
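To put the per-minute prices in context, here is the arithmetic for a hypothetical workload. The monthly volume is an invented example, and both prices are the approximate figures quoted above rather than current list prices.

```python
FISH_PER_MIN = 0.05        # USD, approximate figure quoted above
ELEVENLABS_PER_MIN = 0.18  # USD, approximate figure quoted above


def monthly_cost(minutes, price_per_minute):
    """Total cost in USD for a given number of generated minutes."""
    return minutes * price_per_minute


saving = (ELEVENLABS_PER_MIN - FISH_PER_MIN) / ELEVENLABS_PER_MIN * 100
print(f"relative saving: {saving:.0f}%")

minutes = 10_000  # hypothetical monthly volume
print(f"Fish Audio: ${monthly_cost(minutes, FISH_PER_MIN):,.2f}")
print(f"ElevenLabs: ${monthly_cost(minutes, ELEVENLABS_PER_MIN):,.2f}")
```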

The TTS ocean. Open-source models now swim alongside and ahead of proprietary APIs.

The Codebase

The repository is written almost entirely in Python (334k lines) with a clean modular structure. The core model code lives in fish_speech/models/, split between text2semantic/ (the Slow AR transformer, LLaMA-based, with LoRA support) and dac/ (the audio codec with RVQ implementation).

Training uses PyTorch Lightning with Hydra for configuration management. The inference engine in fish_speech/inference_engine/ handles reference audio loading, VQ management, and the bridge to SGLang. A Gradio-based WebUI and a standalone API server provide two ways to interact with the model locally.

Dependencies are managed through uv with optional extras for different CUDA versions (cu126, cu128, cu129) and CPU-only mode. Docker support is first-class with both development and production Dockerfiles.

fish-speech/
  fish_speech/
    models/
      text2semantic/    # Slow AR transformer (LLaMA-based)
        llama.py        # Core model architecture
        lora.py         # LoRA fine-tuning support
        inference.py    # Generation logic
      dac/              # Audio codec
        modded_dac.py   # Modified Descript Audio Codec
        rvq.py          # Residual Vector Quantization
    inference_engine/   # SGLang bridge + streaming
    tokenizer.py        # Text tokenization
    train.py            # PyTorch Lightning training
  tools/
    server/             # Production API server
    run_webui.py        # Gradio interface
    vqgan/              # VQ-GAN utilities

The License Question

S2 ships under the Fish Audio Research License, not a standard open-source license like MIT or Apache. Research and non-commercial use are free and unrestricted. Commercial use requires a separate written license from Fish Audio.

This is a common pattern for state-of-the-art models. Meta's LLaMA started similarly before eventually loosening terms. For hobbyists, researchers, and evaluators, the license is permissive enough. For businesses, contact business@fish.audio for commercial terms.

Getting Started

Installation is straightforward with uv or pip. The project requires Python 3.10+ and PyTorch 2.8. A single command pulls model weights from Hugging Face and launches the WebUI.

# Install with uv (recommended)
pip install uv
uv pip install fish-speech

# Or clone and install
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
uv pip install -e .

# Launch WebUI
python tools/run_webui.py
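Before launching, a short script can confirm that the environment meets the stated requirements. This is a generic sanity check rather than anything shipped with the repository. The function names are hypothetical, and it only reports the PyTorch version instead of enforcing one.

```python
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 10)):
    """True if the interpreter meets the minimum version quoted above."""
    return tuple(version_info[:2]) >= minimum


def torch_version():
    """Return the installed PyTorch version string, or None if it is missing."""
    try:
        import torch
    except ImportError:
        return None
    return torch.__version__


print("Python 3.10+:", python_ok())
print("PyTorch:", torch_version() or "not installed")
```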

For production deployment, the SGLang integration via sglang-omni provides the full serving stack with continuous batching and streaming. Docker Compose files handle the orchestration.

What This Means

Fish Audio S2 is an inflection point for TTS. An open-source model now leads the major benchmarks covered here, beating systems built by companies with orders of magnitude more resources. If the trajectory holds, within the next year the best open-source TTS will consistently outscore closed-source APIs in blind listening tests.

The technical architecture is equally important. By making the model structurally identical to an LLM, Fish Audio unlocked the entire LLM serving ecosystem for free. That is not just an engineering convenience. It means every future improvement to LLM inference infrastructure automatically benefits S2.

For developers, the message is simple. If you are building anything that needs high-quality speech synthesis, self-hostable multilingual TTS, or fine-grained emotional control, Fish Speech S2 is the new starting point.