Dual-Autoregressive Architecture

How Fish Audio S2 splits generation across time (Slow AR) and depth (Fast AR) axes

[Diagram: dual-autoregressive generation pipeline. The time axis runs left to right (timesteps t1 to t5); the depth axis runs top to bottom (CB1, CB2, CB3-10).]

Inputs: Text Input (with inline tags); Reference Audio (10-30 s sample)
Slow AR (4B params, time axis): Decoder-Only Transformer; predicts the primary (semantic) codebook, Codebook 1, at a ~21 Hz frame rate
Fast AR (400M params, depth axis): Depth Prediction Model; predicts Codebooks 2-10 per timestep
DAC Decoder: waveform reconstruction from all 10 codebooks
Audio Output: streaming waveform
GRPO Alignment: multi-reward RL; Reward Models with 4 signal types
Arrow labels: tokens, embed, predict, condition, all 10 CBs, wav, align, score
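The two-axis split in the diagram can be illustrated with a toy generation loop. This is only a sketch under stated assumptions, not Fish Audio S2's implementation: the function names (slow_ar_step, fast_ar_step, generate_frames) are hypothetical, the codebook size of 1024 is assumed because the source does not state it, and random sampling stands in for both models. Only the loop structure reflects the diagram: one primary (CB1) token per timestep along the time axis, then Codebooks 2-10 filled in along the depth axis for that same timestep.

```python
import random

NUM_CODEBOOKS = 10     # from the diagram: CB1 plus Codebooks 2-10
CODEBOOK_SIZE = 1024   # assumption: the source does not give a vocabulary size


def slow_ar_step(history, rng):
    """Toy stand-in for the Slow AR: choose the next primary (CB1) token.

    A real model would attend over text, reference audio, and `history`;
    here the token is drawn at random.
    """
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(codes_so_far, rng):
    """Toy stand-in for the Fast AR: choose the next codebook token for one frame.

    `codes_so_far` holds the tokens already chosen for this timestep
    (CB1 first); a real model would condition on them.
    """
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(num_frames, seed=0):
    """Return `num_frames` frames, each a list of NUM_CODEBOOKS token ids."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):                # time axis (Slow AR)
        codes = [slow_ar_step(frames, rng)]    # CB1 for this timestep
        for _ in range(NUM_CODEBOOKS - 1):     # depth axis (Fast AR)
            codes.append(fast_ar_step(codes, rng))
        frames.append(codes)                   # all 10 codebooks go to the decoder
    return frames


frames = generate_frames(num_frames=21)  # about one second at ~21 Hz
```

The design point the loop makes visible is that the larger model runs once per frame while the smaller model runs nine times per frame, so the expensive step scales with audio duration rather than with duration times codebook count.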