New Microsoft AI tech can simulate our voice; should we be concerned?
The tech giant has unveiled VALL-E, a new text-to-speech AI that can simulate anyone's voice with a 3-second sample of human speech.
A "neural codec language model", VALL-E uses discrete codes derived from an off-the-shelf neural audio codec model to synthesize high-quality personalised speech with only a 3-second recording of an unseen speaker.
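To give a rough sense of what "discrete codes" means here, the toy sketch below turns a short waveform into a sequence of integer codes and back again. It is an illustration only, not VALL-E's codec: real neural audio codecs learn their codebooks with neural networks, whereas this example assumes a simple, evenly spaced codebook. The names make_codebook, encode and decode are hypothetical.

```python
import math

def make_codebook(levels: int) -> list[float]:
    """Evenly spaced amplitudes in [-1, 1]; an assumed stand-in for a learned codebook."""
    return [-1.0 + 2.0 * i / (levels - 1) for i in range(levels)]

def encode(samples: list[float], codebook: list[float]) -> list[int]:
    """Replace each sample with the index of its nearest codebook entry (a discrete code)."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - s))
        for s in samples
    ]

def decode(codes: list[int], codebook: list[float]) -> list[float]:
    """Turn discrete codes back into approximate amplitudes."""
    return [codebook[c] for c in codes]

# Hypothetical "audio": one cycle of a sine wave sampled at 16 points.
wave = [math.sin(2 * math.pi * n / 16) for n in range(16)]
codebook = make_codebook(8)

codes = encode(wave, codebook)    # a sequence of integers between 0 and 7
approx = decode(codes, codebook)  # a coarse reconstruction of the wave
max_error = max(abs(a - b) for a, b in zip(wave, approx))
```

Once audio is a sequence of integers like this, it can be treated in much the same way as a sequence of words, which is what allows a language model to predict it.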
The AI was trained on 60,000 hours of English speech from more than 7,000 unique speakers. All of this data comes from Libri-Light, an audio library of spoken English assembled by Meta.
It can also imitate the speaker's emotional tone and acoustic environment.
"Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity," say Microsoft researchers in their paper.
For the best results, the three-second voice sample needs to resemble voices in the training data. This is why VALL-E's training data should become more diverse in the future. It will be scaled up to improve the model's prosody, speaking style, and speaker similarity, Microsoft says.
How can we benefit from VALL-E?
For now, VALL-E can only convert text into speech in the chosen voice. It can’t create new content.
Its creators are hopeful that VALL-E can provide various benefits in terms of speech editing and audio content creation.
Stephen Hawking, who used a text-to-speech generator to continue his work while living with amyotrophic lateral sclerosis (ALS), a motor neuron disease, showed the world one of the greatest benefits this technology can offer.
VALL-E could be used for simultaneous translation, or to recreate the voices of loved ones who have passed away.
Creating audiobooks would be a lot easier and faster with VALL-E. One could create a voice for any written piece or text message in a short time.
For all these uses and more, we need to wait for Microsoft to open VALL-E to public use. Microsoft has not said yet when the new AI will be available for public consumption.
Researchers from Microsoft introduce an approach for text to speech (VALL-E). Approach can solve some interesting use cases: pic.twitter.com/gFcOCTQJrI
— Ralph Brooks (@ralphbrooks) January 6, 2023
VALL-E might bring risks
As the question of how to use AI technologies safely and ethically is asked more often than ever, people are voicing ethical concerns over newly launched systems such as ChatGPT, Lensa AI, and VALL-E.
ChatGPT, a new AI chatbot that can handle natural language tasks such as text generation and language translation, has started debates about students committing plagiarism by using it for their homework.
Lensa AI, an app that uses algorithms to turn ordinary photos into artistic renderings, has likewise raised ethical questions about art produced using other artists' works. Many argue that it cannot replace human artists who make digital art.
VALL-E, likewise, carries risks of misuse, some of which could be criminal, such as spoofing voice identification or impersonating a specific speaker.
Impersonating people's voices without their consent might fuel mischief and deception, which could then lead to social harm.
Like Lensa and its perceived threat to human artists, VALL-E raises concerns over the ethics of art. If music production companies were to make the AI sing new songs without the singer's consent, using it would no longer be harmless fun.
6/ And the ethics side of things is not forgotten, when the model is generalised to unseen speakers, a protocol to ensure that the speaker agrees to execute the modification should be implemented, as well as a system to detect the edited speech.
— Adam K Dean (@imdsm) January 10, 2023
Microsoft's response to concerns
Microsoft says it is aware of these concerns and of the risks that AI systems might bring. It has previously apologised for offensive tweets posted by its chatbot Tay.
The researchers who created VALL-E stated in their paper that a detection mechanism could be built to mitigate such risks.
"Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker," they said in the paper.
"To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models," they added.
Whether such a detection model can be built, and how far it would mitigate these risks, will become clear only once the project is open to public use, they said.