New Microsoft AI can accurately mimic a human voice after analyzing a 3-second sample

midian182

Posts: 9,742   +121
Staff member
What just happened? Concerned that artificial intelligence seems to be advancing at a rapid pace, potentially threatening more human jobs? Then here's some news that could add to those concerns. A team of Microsoft researchers has announced a new AI that can accurately mimic a human voice from a mere three-second-long audio sample.

Microsoft's voice AI tool, called Vall-E, is trained on "discrete codes derived from an off-the-shelf neural audio codec model" as well as 60,000 hours of speech—100 times more than existing systems—from more than 7,000 speakers, most of which come from LibriVox public domain audiobooks.

Ars Technica reports that Vall-E builds on a technology called EnCodec that Meta announced in October 2022. It works by analyzing a person's voice, breaking down the information into components, and using its training to synthesize how the voice would sound if it were speaking different phrases. Even after hearing just a three-second sample, Vall-E can replicate a speaker's timbre and emotional tone.

"Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system [AI that recreates voices it's never heard] in terms of speech naturalness and speaker similarity," states the research paper, available at Cornell University. "In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis."

You can hear examples of Vall-E recreating voices on GitHub. Many are genuinely amazing, sounding almost identical to the speaker despite being based on such a short audio sample. There are a few that are a slightly more robotic and sound a bit closer to traditional text-to-voice software, but it's still impressive, and we can expect the AI to improve over time.

Microsoft's researchers believe Vall-E could find use as a text-to-voice tool, a way of editing speech, and an audio creation system by combining it with other generative AIs such as GPT-3.

As with all AIs, there are concerns about the potential misuse of Vall-E. Impersonating public figures like politicians is an example, especially when using it alongside Deepfakes. Or it could trick people into believing they are talking to family, friends, or officials and handing over sensitive data. There's also the fact that some security systems use voice identification. As for its impact on jobs, Vall-E would likely be a cheaper alternative to hiring voice actors.

Addressing the risks of Vall-E being misused, the researchers said these could be mitigated. "It is possible to build a detection model to discriminate whether an audio clip was synthesized by Vall-E. We will also put Microsoft AI Principles into practice when further developing the models."

Permalink to story.

 
We already have MoeTTS (online instance) and from my testing, while it is still apparent as being synthetic, it's also good enough that if you're doing it for funsies it works well enough for the characters that it has model for and you can make more models for it -- though to be fair, visual novels do tend to have a LOT of voices to sample from.

Also, Neuro-sama, while also synthetic-feeling, has gotten to a state where it's cute enough that people don't mind listening to her streaming.

The real breakthrough will be in making it easy enough for laypeople to use it on their 2020s laptops. I really do hope AMD catches up in that regards because a decent amount of the AI stuff I found seems to assume Nvidia GPU and that sucks. Also, kinda hoping for Linux peeps to eventually make a simple GUI to set these things up, but that's a pipe dream and I'd settle for the AUR repos being on par with the Windows installers.
 
In my opinion, AI is advancing unchecked, with no control, no regulation and no care. I wonder how many countries and how many universities impose regulatory principles during development, let alone a common global base. We have to establish core principles during the development phase, before it goes out of hand. With current trend, it's hard to anticipate exactly when the AI will gain self-awareness. Could be tomorrow, could be in 10 years. And I see no regulation concerning who and for what purpose it could be used.
 
You'll just need to
One
Setup your Windows Sound settings
and ... oh ... sorry you unplugged an audio device
Two
Windows just reset the Sound settings
Goto One
 
Back