
This report examines Cartesia's Sonic-3 Text-to-Speech (TTS) system, describing it as a significant advancement built on a State Space Model (SSM) architecture.
According to the report, this design overcomes limitations of older architectures such as Transformers, enabling ultra-low latency (below 150 ms) and highly expressive speech that includes non-speech vocalizations like laughter. The report emphasizes Sonic-3's global strategy, which includes support for 42 languages, and introduces the "Artificial Analysis arena" for automated, objective quality control, moving beyond the traditional Mean Opinion Score (MOS).
The report also devotes significant attention to the ethical responsibilities that accompany such powerful technology, advocating safeguards such as audio watermarking and "Responsible Evaluation" to prevent misuse and deepfake creation. It positions the system to transform conversational AI, media, and customer service applications through its balance of quality, speed, and integrity.