May 26, 2026 ChainGPT

StepFun's StepAudio 2.5 Realtime: Human-Like, Persona-Stable Voice AI for Web3

StepFun’s new real-time voice AI claims it can both act and listen like a human

Shanghai AI lab StepFun this week released StepAudio 2.5 Realtime, an end-to-end, real-time voice model that takes audio in and returns audio out, with no intermediate text. The model supports Chinese and English and, according to StepFun’s benchmarks, outperforms current live voice systems on several key measures, most notably in reading non-verbal vocal cues.

What StepAudio does differently
- End-to-end realtime audio: audio input → audio output, built for low-latency spoken interactions and longer roleplay sessions.
- Persona stability via roleplay-specific RLHF: StepFun says it trained the model with reinforcement learning from human feedback focused specifically on keeping characters “in character.” Training began with 10,000 human-authored persona seeds that were algorithmically expanded into a million-scale feature matrix so the model can better resist drift during long or adversarial conversations.
- Paralinguistic comprehension: the model extracts non-verbal cues (tone, speaking rate, inferred age, emotion) from raw audio before generating a reply, which the company highlights as a core differentiator.

Benchmark snapshots (StepFun’s reported scores)
- Paralinguistic comprehension (0–100): StepAudio 82.18; GPT Realtime 1.5 80.46; Gemini Live 58.05; DouBao Realtime 16.09.
- Human evaluation (real users via mobile app, 0–100): StepAudio 80.41; GPT Realtime 1.5 68.01; Gemini Live 67.16.
- General dialogue quality (API, 0–100): StepAudio 86.36; GPT Realtime 1.5 81.60.

StepFun notes that these are its own benchmarks, and readers should weigh the results accordingly, but the margins on paralinguistics and live spoken Q&A are substantial enough to be notable.

Company context
- Founded: April 2023 by Jiang Daxin (16 years at Microsoft working on Bing, Cortana, Azure cognitive services).
- Notable prior work: Step 3.5 Flash, a 196-billion-parameter text model that topped four reasoning benchmarks earlier this year against much larger rivals.
- Funding / status: One of China’s “AI Tiger” startups, with roughly $1.7 billion raised so far.

Product and developer access
- Launch includes a flagship persona called Xiao Yue, billed as a “soul-level companion” that’s meant to feel like texting a friend, with opinions, catchphrases, and emotional limits all configurable.
- Developers can create and customize personas via the API. Documentation and access are at platform.stepfun.com; the model is live now.

Why crypto audiences should care
- Voice-native, persona-stable agents matter for web3 use cases: voice interfaces for trading, DAOs, immersive metaverse worlds, game NPCs, and monetized companion avatars could all benefit from lower-latency, character-faithful, emotionally aware voice AI.
- The API-first release signals potential for third-party integrations: NFT voice personas, voice-enabled dApp assistants, and in-game NPCs are practical early adopters.

Bottom line
StepAudio 2.5 Realtime positions StepFun as a contender in live voice AI, with a particular emphasis on persona persistence and acoustic empathy. The company’s claims look strong on its own tests; developers and integrators should test directly to judge how those gains carry into real-world crypto and gaming scenarios.
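
For readers who want to check the benchmark margins themselves, the short Python sketch below computes StepAudio's lead over the best-scoring rival in each category. The scores are copied from StepFun's reported figures above and are not independently verified; the names `REPORTED`, `lead_over_runner_up`, and `margins` are illustrative and are not part of any StepFun tooling.

```python
# Scores as reported by StepFun (0-100 scale), copied from the article.
# They are the vendor's own numbers, not independent measurements.
REPORTED = {
    "Paralinguistic comprehension": {
        "StepAudio 2.5 Realtime": 82.18,
        "GPT Realtime 1.5": 80.46,
        "Gemini Live": 58.05,
        "DouBao Realtime": 16.09,
    },
    "Human evaluation": {
        "StepAudio 2.5 Realtime": 80.41,
        "GPT Realtime 1.5": 68.01,
        "Gemini Live": 67.16,
    },
    "General dialogue quality": {
        "StepAudio 2.5 Realtime": 86.36,
        "GPT Realtime 1.5": 81.60,
    },
}


def lead_over_runner_up(scores, model="StepAudio 2.5 Realtime"):
    """Return (best rival, margin in points) for `model` within one benchmark."""
    others = {name: score for name, score in scores.items() if name != model}
    runner_up = max(others, key=others.get)
    return runner_up, round(scores[model] - others[runner_up], 2)


def margins(reported=REPORTED):
    """Map each benchmark name to StepAudio's lead over its closest rival."""
    return {bench: lead_over_runner_up(scores) for bench, scores in reported.items()}


if __name__ == "__main__":
    for bench, (rival, gap) in margins().items():
        print(f"{bench}: +{gap:.2f} points over {rival}")
```

On these figures the lead over GPT Realtime 1.5 is 1.72 points on paralinguistic comprehension, 12.40 points in the human evaluation, and 4.76 points on general dialogue quality. The paralinguistics gap is far larger against Gemini Live and DouBao Realtime than against GPT Realtime 1.5.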