Meta-AI Voicebox Set to Push Boundaries in Generative AI


Researchers at Meta AI recently achieved a major feat in generative AI for speech: the company built and announced Voicebox, an Artificial Intelligence (AI) model that demonstrates cutting-edge performance and adapts easily to a range of speech-generation tasks.

Voicebox sets itself apart from other speech-generation tools by employing a pioneering technique called Flow Matching. It is up to 20 times faster than comparable models, synthesizes speech in six languages, and executes tasks such as content editing, noise removal, and style conversion.
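To give a sense of what Flow Matching means in practice, here is a minimal sketch of the idea on toy 1-D data: a model is trained to predict the velocity of a straight-line path from noise to data, and new samples are generated by integrating that velocity field. This is an illustrative simplification (a linear least-squares model standing in for Voicebox's neural network, and 1-D points standing in for speech features), not Meta's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data" distribution standing in for speech features: points near 2.0.
x1 = rng.normal(loc=2.0, scale=0.1, size=(20000, 1))   # data samples
x0 = rng.normal(size=x1.shape)                          # noise samples
t = rng.uniform(size=x1.shape)                          # random times in [0, 1]

# Flow-matching training pairs: a point on the straight-line path from
# noise to data, and the (constant) velocity of that path.
xt = (1 - t) * x0 + t * x1
v_target = x1 - x0

# Fit a tiny linear velocity model v(x, t) ~ w0*x + w1*t + w2 by least
# squares -- a stand-in for the neural network a real system would train.
features = np.hstack([xt, t, np.ones_like(t)])
w, *_ = np.linalg.lstsq(features, v_target, rcond=None)

# Generation: integrate dx/dt = v(x, t) from t=0 (pure noise) to t=1 with
# simple Euler steps, moving noise samples toward the data distribution.
x = rng.normal(size=(1000, 1))
n_steps = 100
for k in range(n_steps):
    tk = np.full_like(x, k / n_steps)
    v = np.hstack([x, tk, np.ones_like(tk)]) @ w
    x = x + v / n_steps
# x now concentrates near the data mean of 2.0.
```

Because sampling is a single pass of ODE integration rather than the long iterative denoising loop of diffusion models, flow-matching generators can be substantially faster, which is consistent with the speed-up reported for Voicebox.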

Voicebox Features and Uses

In the past, generative AI models for speech required extensive task-specific training on carefully curated data. Voicebox overcomes this limitation by learning directly from raw audio paired with transcriptions. As a result, it can change any segment within a given sample rather than being restricted to altering only the end of an audio clip.

To train Voicebox, researchers used more than 50,000 hours of public-domain audiobook speech and transcripts in several widely spoken languages. The model learned to predict speech segments from the surrounding audio and the transcript: by learning to fill in missing speech, Voicebox can generate portions of an audio recording without having to recreate the entire input.
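The fill-in-the-missing-speech setup described above can be sketched with a toy example: mask out a span of audio frames, and ask the model to reconstruct only that span given the surrounding context. The helper below is hypothetical and uses a 1-D signal as a stand-in for audio frames; it illustrates the training-pair construction, not Meta's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_infilling_example(frames, span_start, span_len):
    """Build one training pair for speech infilling: the model would see the
    frames with a span zeroed out (plus, in the real system, the transcript)
    and must predict the hidden span. Hypothetical illustrative helper."""
    mask = np.zeros(len(frames), dtype=bool)
    mask[span_start:span_start + span_len] = True
    context = np.where(mask, 0.0, frames)  # masked input the model conditions on
    target = frames[mask]                  # frames the model must reconstruct
    return context, mask, target

# A toy 1-D stand-in for a sequence of audio frames.
frames = np.sin(np.linspace(0, 8 * np.pi, 200))
context, mask, target = make_infilling_example(frames, span_start=80, span_len=40)

# Training loss is computed only on the masked span, so the model learns to
# fill gaps given surrounding audio -- the mechanism that lets a system like
# Voicebox regenerate one segment without touching the rest of the recording.
```

Because any span can be masked during training, the trained model can regenerate an arbitrary middle segment, which is exactly the capability the article describes for noise removal and word replacement.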

This AI can generate speech that more accurately reflects genuine human conversation in different languages, thanks to its training on diverse real-world data. That, in turn, lets it produce synthetic data for training speech-assistant models.

Furthermore, the model's in-context learning enables it to edit audio recordings proficiently and seamlessly. It can resynthesize segments corrupted by short bursts of noise or swap out incorrectly pronounced words, all without a complete re-recording of the speech.

The researchers have restricted public access to the Voicebox model and its code, a cautious approach motivated by concerns about potential misuse and associated risks. In their research paper, they describe building a robust classifier that can effectively differentiate genuine speech from audio generated by the model.



Md Asif Rahman

Asif is a freelance writer and journalist who has been writing in the Crypto, FinTech, Metaverse, and Web3.0 spaces since 2019. He holds an M.Sc. in Life Science and an MBA in Finance & Banking. His work has been published in an extensive list of publications. He also has a keen interest in Finance, AI, and Cybersecurity. When not busy writing, he can be found reading books and listening to music.