Dual-Autoregressive Architecture
How Fish Audio S2 splits generation across time (Slow AR) and depth (Fast AR) axes
Text Input (+ inline tags)
Reference Audio (10-30 s sample)
SLOW AR (4B params, time axis): decoder-only Transformer that predicts the primary (semantic) codebook
Codebook 1 Output: ~21 Hz frame rate
FAST AR (400M params, depth axis): depth prediction model that predicts codebooks 2-10 at each timestep
Token grid: timesteps t1-t5 (time axis) by codebooks CB1, CB2, CB3-10 (depth axis)
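The split between the two axes can be sketched as a nested loop: an outer loop over timesteps that produces codebook 1, and an inner loop over depth that fills in codebooks 2-10 for that timestep. The sketch below is illustrative only. `slow_ar_step`, `fast_ar_step`, and `VOCAB_SIZE` are hypothetical stand-ins that draw random tokens; they are not Fish Audio S2's API, and the codebook size is not stated in the diagram.

```python
import random

NUM_CODEBOOKS = 10   # codebook 1 plus codebooks 2-10, per the diagram
VOCAB_SIZE = 1024    # hypothetical codebook size, not stated in the source


def slow_ar_step(history, rng):
    """Stand-in for the Slow AR: returns the next codebook-1 token.

    A real model would condition on the text, the reference audio, and
    the previous frames; this stub only draws a random token.
    """
    return rng.randrange(VOCAB_SIZE)


def fast_ar_step(partial_frame, rng):
    """Stand-in for the Fast AR: returns the next codebook token for the
    current timestep, given the codebooks already filled in."""
    return rng.randrange(VOCAB_SIZE)


def generate(num_frames, seed=0):
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):               # time axis (Slow AR)
        frame = [slow_ar_step(frames, rng)]   # codebook 1
        for _ in range(NUM_CODEBOOKS - 1):    # depth axis (Fast AR)
            frame.append(fast_ar_step(frame, rng))
        frames.append(frame)
    return frames


frames = generate(num_frames=5)   # t1..t5, each with 10 codebook tokens
```

Each entry of `frames` corresponds to one column of the token grid, which is why the larger model runs once per timestep while the smaller one runs nine times.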
DAC Decoder: waveform reconstruction
Audio Output: streaming waveform
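The figures in the diagram imply a rough per-second workload for each model. The arithmetic below is a back-of-the-envelope sketch that treats the approximate ~21 Hz frame rate as exactly 21 and assumes one Slow AR step per frame and one Fast AR step per remaining codebook; the helper name `tokens_for` is made up for illustration.

```python
FRAME_RATE_HZ = 21    # approximate value from the diagram, treated as exact here
NUM_CODEBOOKS = 10    # codebook 1 plus codebooks 2-10

# Assumed: one Slow AR step per frame, one Fast AR step per residual codebook.
slow_steps_per_second = FRAME_RATE_HZ
fast_steps_per_second = FRAME_RATE_HZ * (NUM_CODEBOOKS - 1)
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS


def tokens_for(seconds):
    """Total codebook tokens the decoder receives for `seconds` of audio."""
    return round(seconds * FRAME_RATE_HZ) * NUM_CODEBOOKS


ten_second_clip = tokens_for(10)
```

Under these assumptions, one second of audio costs 21 Slow AR steps and 189 Fast AR steps, for 210 tokens in total.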
GRPO Alignment: multi-reward RL
Reward Models: 4 signal types
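GRPO scores each sampled candidate relative to the other candidates in its group rather than against a learned value function. The sketch below shows that normalization step only. The diagram does not say how the four reward signals are combined, so the weighted sum, the equal weights, and the sample scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def combine_rewards(signals, weights):
    """Collapse several reward signals into one scalar (assumed: weighted sum)."""
    return sum(w * s for w, s in zip(weights, signals))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores: three candidates for one prompt, four signals each.
group = [
    [0.9, 0.8, 0.7, 0.6],
    [0.5, 0.4, 0.6, 0.5],
    [0.2, 0.3, 0.1, 0.4],
]
weights = [0.25, 0.25, 0.25, 0.25]   # assumed equal weighting

rewards = [combine_rewards(s, weights) for s in group]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group mean get a positive advantage and those below get a negative one, so the update depends on ranking within the group rather than on absolute reward scale.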
Connection labels: tokens, embed, predict, condition, all 10 CBs, wav, align, score
Axes: time (horizontal, →), depth (vertical, ↓)