The StepFun AI team has released Step-Audio 2 Mini, an 8B parameter speech-to-speech large audio language model (LALM) that delivers expressive, grounded, and real-time audio interaction. Released under the Apache 2.0 license, this open-source model achieves state-of-the-art performance across speech recognition, audio understanding, and speech conversation benchmarks -- surpassing commercial systems such as GPT-4o-Audio.
Unlike cascaded ASR+LLM+TTS pipelines, Step-Audio 2 integrates Multimodal Discrete Token Modeling, where text and audio tokens share a single modeling stream.
This enables:

- End-to-end speech-to-speech interaction in a single autoregressive pass, with no hand-offs between separate ASR, LLM, and TTS components
- The low-latency, real-time responsiveness highlighted above, since there is no chain of sequential models to wait on
- Joint modeling of what is said and how it is said, which underpins the paralinguistic abilities described next (a minimal sketch of the shared token stream follows this list)
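To make the shared stream concrete, here is a minimal, self-contained sketch of the idea: discrete audio codec IDs are mapped into the same vocabulary as text tokens via an offset, so one autoregressive sequence can interleave both. The vocabulary sizes, offsets, and token values below are illustrative assumptions, not StepFun's actual configuration.

```python
# A minimal sketch of multimodal discrete token modeling. All sizes and
# token IDs are illustrative assumptions, not StepFun's real config.

TEXT_VOCAB_SIZE = 32_000        # assumed text vocabulary size
AUDIO_CODEBOOK_SIZE = 4_096     # assumed discrete audio codec vocabulary
AUDIO_OFFSET = TEXT_VOCAB_SIZE  # audio tokens live above the text range

def to_audio_token(codec_id: int) -> int:
    """Map a discrete audio codec ID into the shared token space."""
    return AUDIO_OFFSET + codec_id

def is_audio_token(token: int) -> bool:
    return token >= AUDIO_OFFSET

# One interleaved stream: the model attends over text and audio jointly,
# so a single autoregressive pass can read speech and write speech.
stream = [
    101,                                           # e.g. <user> marker (text)
    *[to_audio_token(c) for c in (7, 52, 988)],    # user's speech, tokenized
    102,                                           # e.g. <assistant> marker
    2041, 77,                                      # assistant reasons in text...
    *[to_audio_token(c) for c in (3, 3081)],       # ...and answers in audio
]

for tok in stream:
    kind = "audio" if is_audio_token(tok) else "text"
    print(f"{tok:>6}  {kind}")
```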
The model doesn't just transcribe speech: it interprets paralinguistic features like pitch, rhythm, emotion, timbre, and style. This allows conversations with realistic expressive delivery, such as whispering, sadness, or excitement. On the StepEval-Audio-Paralinguistic benchmark, Step-Audio 2 reaches 83.1% accuracy, far beyond GPT-4o Audio (43.5%) and Qwen-Omni (44.2%).
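In practice, paralinguistic control surfaces as instructions sent alongside the audio input. The snippet below is a hypothetical chat-style request; the `StepAudio2Client` wrapper and the message schema are assumptions for illustration, so consult the official Step-Audio 2 repository for the real inference API.

```python
# Hypothetical request shape for a paralinguistic instruction. The client
# class and message schema below are assumptions, not the official API.

messages = [
    {"role": "system",
     "content": "Respond with speech. Match the user's emotional tone."},
    {"role": "user",
     "content": [
         {"type": "audio", "path": "user_turn.wav"},  # carries pitch, rhythm,
                                                      # emotion, timbre, style
         {"type": "text", "text": "Answer in a gentle whisper."},
     ]},
]

# client = StepAudio2Client("stepfun-ai/Step-Audio-2-mini")  # hypothetical
# audio_reply, text_reply = client.chat(messages)
```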
Step-Audio 2 incorporates multimodal RAG (Retrieval-Augmented Generation):

- Web search, for grounding responses in up-to-date factual knowledge
- Audio search, which retrieves voices from a large voice library so the model can imitate a retrieved speaker's timbre and style in its reply (a toy version of this retrieval step is sketched below)
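The audio-search half of this pipeline reduces to nearest-neighbor retrieval over voice embeddings. The toy sketch below uses random vectors as stand-ins for a real speaker/voice encoder and makes no claim about StepFun's actual retrieval stack.

```python
import numpy as np

# Toy audio search: retrieve the library voice closest to a query embedding,
# then hand it to the generator as a timbre/style reference. Embeddings are
# random stand-ins; a real system would use a trained voice encoder.

rng = np.random.default_rng(0)
voice_library = {f"voice_{i}": rng.normal(size=256) for i in range(100)}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def audio_search(query_emb: np.ndarray, top_k: int = 1) -> list[str]:
    """Return names of the library voices most similar to the query."""
    ranked = sorted(voice_library,
                    key=lambda name: cosine(query_emb, voice_library[name]),
                    reverse=True)
    return ranked[:top_k]

query = rng.normal(size=256)  # embedding of the requested voice style
reference = audio_search(query)[0]
print(f"Conditioning generation on retrieved reference: {reference}")
```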
The system extends beyond speech synthesis by supporting tool invocation. Benchmarks show that Step-Audio 2 matches textual LLMs in tool selection and parameter accuracy, while uniquely excelling at audio search tool calls -- a capability unavailable in text-only LLMs.
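Functionally, an audio search tool call looks like any other structured tool call: the model selects a tool and fills in its parameters. The schema and raw model output below are illustrative assumptions rather than Step-Audio 2's actual wire format.

```python
import json

# Hedged sketch of tool invocation: the tool schema and the model's raw
# output are assumed formats for illustration only.

TOOLS = [{
    "name": "audio_search",
    "description": "Retrieve a reference voice from the voice library.",
    "parameters": {"query": "string", "top_k": "integer"},
}]

# Suppose the model emits a JSON tool call as part of its token stream:
raw_model_output = (
    '{"tool": "audio_search", '
    '"arguments": {"query": "calm elderly narrator", "top_k": 1}}'
)

call = json.loads(raw_model_output)
assert call["tool"] in {t["name"] for t in TOOLS}  # tool-selection accuracy
print(call["arguments"])                           # parameter accuracy
```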
Large-scale multimodal training allows Step-Audio 2 Mini to retain strong text reasoning (via its Qwen2-Audio and CosyVoice foundation) while mastering fine-grained audio modeling.
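Since the weights are open under Apache 2.0, trying the model starts with pulling the checkpoint. A minimal download sketch, assuming the repo id `stepfun-ai/Step-Audio-2-mini` on Hugging Face (verify the exact id on the model page before running):

```python
from huggingface_hub import snapshot_download

# Fetch the open weights locally. The repo id is assumed from the release
# announcement; confirm it on Hugging Face before running.
local_dir = snapshot_download(repo_id="stepfun-ai/Step-Audio-2-mini")
print(f"Checkpoint downloaded to: {local_dir}")
```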