Hibiki, Co-Founded by iliad Group, Revolutionizes Real-Time Voice Translation

Hibiki is a simultaneous voice translation model that uses a decoder to generate both text and audio in real time through an adaptive multistream model.
The Future of Word-by-Word Translation
Today marks a significant leap in word-by-word translation with the launch of Hibiki, a device-based model designed for high-fidelity simultaneous translation. The title of Xavier Niel’s book, “A Real Desire to Shake Things Up,” reflects the approach he has taken in the telecom and tech worlds: challenging traditional rules and innovating beyond them.
Hibiki, introduced by the artificial intelligence research lab Kyutai, which was co-founded by the iliad Group, embodies a radically new approach to voice translation. Rather than making incremental improvements, Hibiki aims to redefine how simultaneous translation operates by overcoming current limitations:
Device-based, not cloud-based: Unlike most models that rely on internet connections and powerful servers, Hibiki operates locally, ensuring speed and privacy.
Smooth and natural translation: Unlike traditional systems that wait until the end of a sentence, Hibiki translates in real time and adjusts to the speaker’s pace, which fundamentally changes the user experience.
An open research model: As a non-profit lab, Kyutai is committed to making its advancements available to everyone, challenging the closed strategies of major AI players.
Hibiki Adapts to Your Pace
Unlike offline translation, which waits for the complete source sentence, Hibiki adjusts its pace to that of the speaker. Once it has gathered sufficient context, it delivers accurate translations chunk by chunk as the speaker continues. “Hibiki generates natural speech in the target language,” explains one of our leading researchers. The model also provides a simultaneous text translation.
An Efficient and Innovative Architecture
Hibiki uses Moshi’s multistream architecture to process source and target speech concurrently. This allows it to consume the input stream continuously while generating target speech. A key feature of Hibiki’s design is that it produces both text and audio tokens at a constant rate of 12.5 Hz (one frame every 80 ms), which yields a continuous audio output stream with a synchronized text translation.
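To make the constant frame rate concrete, here is a minimal Python sketch of the arithmetic behind a 12.5 Hz token stream. Only the 12.5 Hz figure comes from the description above. The `Frame` fields and the helper names are hypothetical and chosen for illustration; they are not Hibiki's actual data structures or API.

```python
import math
from dataclasses import dataclass

FRAME_RATE_HZ = 12.5  # constant token rate stated in the article


@dataclass
class Frame:
    """One time step of a hypothetical multistream sequence (illustrative only)."""
    index: int         # position of the frame in the stream
    start_ms: float    # time at which the frame begins
    text_token: str    # placeholder for the text stream
    audio_token: str   # placeholder for the target audio stream


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration covered by a single frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_hz)


def build_timeline(n_frames: int) -> list:
    """Lay out n_frames placeholder frames on a fixed time grid."""
    step = frame_duration_ms()
    return [
        Frame(index=i, start_ms=i * step, text_token="<pad>", audio_token="<pad>")
        for i in range(n_frames)
    ]


timeline = build_timeline(frames_for_duration(1.0))
print(frame_duration_ms())        # 80.0
print(frames_for_duration(10.0))  # 125
print(len(timeline))              # 13
```

Because text and audio share the same grid, each text token lines up with a fixed 80 ms slice of output audio, which is what keeps the two streams synchronized in this sketch.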
Training Hibiki
Hibiki’s training is based on supervised alignment of source and target speech. Due to a scarcity of suitable data, synthetic data had to be created for training. Word-level alignment is performed between source and target transcripts using a weakly supervised contextual alignment method.
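As a rough illustration of why word-level alignment matters for simultaneous translation, the Python sketch below computes the earliest time at which each target word could be spoken, given the source word it is aligned to. This is a simplified assumption about how such alignments can be used, not Kyutai's training procedure. The function name, the `lag` parameter, the example sentence, and the timestamps are all invented for illustration.

```python
from typing import List, Sequence


def earliest_emission_times(
    source_end_times: Sequence[float],
    alignment: Sequence[int],
    lag: float = 0.0,
) -> List[float]:
    """Return the earliest emission time (seconds) for each target word.

    source_end_times[i] is when source word i finishes being spoken.
    alignment[j] is the index of the source word that target word j
    depends on (hypothetical output of a word-level aligner).
    Emission times are forced to be non-decreasing so the target
    words keep their order.
    """
    times: List[float] = []
    latest = 0.0
    for src_idx in alignment:
        # A target word cannot be spoken before its source word has been heard.
        candidate = source_end_times[src_idx] + lag
        latest = max(latest, candidate)
        times.append(latest)
    return times


# Illustrative example: French "le chat noir dort" -> English "the black cat sleeps".
source_end_times = [0.3, 0.7, 1.1, 1.6]  # invented timestamps, in seconds
alignment = [0, 2, 1, 3]                 # the->le, black->noir, cat->chat, sleeps->dort

print(earliest_emission_times(source_end_times, alignment))
# [0.3, 1.1, 1.1, 1.6]
```

In this toy example "black" has to wait for "noir", which comes after "chat" in French, so "cat" is delayed as well. Reordering of this kind is why a simultaneous system needs enough context before it can produce the next chunk.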
Evaluations and Results of Hibiki
In both objective and subjective evaluations, Hibiki translates effectively from French to English and significantly outperforms existing methods. Its quality, speaker likeness, and naturalness are nearly on par with those of human interpreters. Kyutai aims to expand Hibiki to many more languages, providing a comprehensive solution for live speech translation.
For more details, see the Hibiki repository on Hugging Face.