Researchers at Meta AI recently achieved a major breakthrough in generative AI for speech with the creation and launch of Voicebox, an Artificial Intelligence (AI) model that demonstrates state-of-the-art performance and adapts readily to a range of speech-generation tasks.
Voicebox sets itself apart from other speech-generation tools by employing a pioneering technique called Flow Matching, which delivers a remarkable 20-fold speed improvement. Additionally, it synthesizes speech in six languages and handles tasks such as content editing, noise removal, and style conversion.
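Flow Matching is a training objective for continuous generative models: the network learns a time-dependent vector field that transports noise samples toward data along a simple probability path. The snippet below is a minimal, hypothetical sketch of a conditional flow-matching loss in PyTorch; the `VectorField` module, tensor shapes, and the straight-line interpolation path are assumptions for illustration, not Meta's implementation.

```python
# Minimal conditional flow-matching loss sketch (illustrative, not Meta's code).
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Hypothetical network predicting the velocity of a sample at time t."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def flow_matching_loss(model: VectorField, x1: torch.Tensor) -> torch.Tensor:
    """Regress the model's velocity onto the target velocity of a straight path."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0])                     # random time in [0, 1]
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1   # point on the straight path
    target_velocity = x1 - x0                       # constant velocity of that path
    return ((model(x_t, t) - target_velocity) ** 2).mean()
```

Because the regression target is simple, models trained this way can generate samples in far fewer inference steps than autoregressive approaches, which is the source of the speed advantage described above.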
Voicebox Features and Uses
In the past, generative AI models for speech required extensive task-specific training on carefully curated data. Voicebox overcomes this limitation by learning directly from raw audio and its transcription. It can also modify any segment within a given sample, rather than being restricted to altering only the end of an audio clip.
To train Voicebox, researchers used more than 50,000 hours of public-domain audiobook speech and transcripts in several widely spoken languages. The model learns to predict a segment of speech given the surrounding audio and the transcript. By learning to fill in missing speech, Voicebox can generate portions of an audio recording without having to recreate the entire input.
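In other words, training reduces to a fill-in-the-blank task over audio frames: a span of the target speech is masked, and the model must reconstruct it from the transcript plus the unmasked audio around it. The sketch below illustrates that masking setup in PyTorch; the frame dimensions, span length, and the commented-out `infill_model` interface are hypothetical and only convey the idea.

```python
# Illustrative masked-span infilling setup (hypothetical shapes and model interface).
import torch

def mask_random_span(frames: torch.Tensor, span: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Zero out a random contiguous span of audio frames and return the mask."""
    masked = frames.clone()
    num_frames = frames.shape[0]
    start = torch.randint(0, num_frames - span + 1, (1,)).item()
    mask = torch.zeros(num_frames, dtype=torch.bool)
    mask[start:start + span] = True
    masked[mask] = 0.0
    return masked, mask

# frames: (num_frames, feature_dim) acoustic features; text_tokens: transcript ids.
frames = torch.randn(500, 80)
text_tokens = torch.randint(0, 100, (120,))
masked_frames, mask = mask_random_span(frames, span=150)

# A model trained this way only has to predict the masked frames,
# conditioned on the transcript and the surrounding audio context:
# prediction = infill_model(masked_frames, text_tokens, mask)
# loss = ((prediction[mask] - frames[mask]) ** 2).mean()
```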
Because it is trained on diverse real-world data, the model can generate speech that more closely reflects how people actually talk in different languages. This also makes it useful for producing synthetic data to train speech assistant models.
Furthermore, the model's in-context learning enables it to edit audio recordings seamlessly. It can resynthesize segments corrupted by short bursts of noise or swap out mispronounced words, all without requiring a complete re-recording of the speech.
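Concretely, an edit like this can be framed as the same infilling operation described above: locate the time span of the noisy or mispronounced word, mask it, and regenerate only that span conditioned on the corrected transcript. The sketch below is purely hypothetical; `align`, `mask_span`, and `voicebox_infill` are illustrative helper names, not a released API.

```python
# Hypothetical word-replacement workflow built on span infilling (illustrative only).
def replace_word(audio, transcript, bad_word, corrected_word,
                 align, mask_span, voicebox_infill):
    """Swap one mispronounced word without re-recording the whole utterance."""
    # 1. Find the time span of the offending word, e.g. via forced alignment.
    start, end = align(audio, transcript, word=bad_word)
    # 2. Mask that span in the audio.
    masked_audio = mask_span(audio, start, end)
    # 3. Regenerate only the masked span, conditioned on the corrected text
    #    and the surrounding audio, which preserves the speaker's voice.
    fixed_transcript = transcript.replace(bad_word, corrected_word)
    return voicebox_infill(masked_audio, fixed_transcript)
```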
The researchers have restricted public access to the Voicebox model and its code, a cautious approach motivated by concerns about potential misuse and the associated risks. In their research paper, they describe building a highly effective classifier that distinguishes genuine speech from audio generated by the model.
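Conceptually, such a detector is a binary audio classifier trained on pairs of genuine and Voicebox-generated clips. The sketch below shows a minimal, hypothetical detector operating on mel-spectrogram features; the architecture and feature choices are assumptions for illustration, not the classifier described in the paper.

```python
# Hypothetical real-vs-synthetic speech detector (illustrative only).
import torch
import torch.nn as nn

class SpeechDetector(nn.Module):
    """Binary classifier: 1 = generated speech, 0 = genuine speech."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool features over time
        )
        self.head = nn.Linear(128, 1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time) -> one logit per clip
        return self.head(self.encoder(mel).squeeze(-1))

detector = SpeechDetector()
mel_batch = torch.randn(4, 80, 300)           # stand-in batch of mel spectrograms
logits = detector(mel_batch)                  # train with nn.BCEWithLogitsLoss
```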
The featured image is from techcut.com