Nvidia's RAD-TTS generates realistic AI voices that are more expressive

Cal Jeffrey

In context: Synthesized voices have come a long way over the years. Gone are the days of synthetic voices sounding like a robot from a 1960s science fiction movie. Contemporary AI assistants like Alexa and Siri produce a much more realistic human-sounding voice.

As far as synthesized voices and text-to-speech have come, they are still not perfect. However, Nvidia's text-to-speech research department has developed some machine-learning tools for making voice synthesis more realistic in various applications.

Nvidia has developed an AI model called RAD-TTS. Developers can train the model with their own voice, and it will convert text prompts to natural speech using the inflections and tones it has learned. It can also convert one speaker's voice to that of another.

"Another of its features is voice conversion, where one speaker's words (or even singing) is delivered in another speaker's voice," says Nvidia. "Inspired by the idea of the human voice as a musical instrument, the RAD-TTS interface gives users fine-grained, frame-level control over the synthesized voice's pitch, duration, and energy."
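To make "frame-level control" a little more concrete, here is a minimal Python sketch of the general idea. It is not Nvidia's code and does not use the RAD-TTS or NeMo API. The `Frame` class and the `scale_pitch`, `scale_energy`, `stretch`, and `emphasize` helpers are all hypothetical names invented for illustration, and the model is assumed (for this sketch only) to expose speech as a list of short frames, each with a pitch and an energy value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One short slice of speech (hypothetical representation)."""
    pitch_hz: float  # fundamental frequency of the frame
    energy: float    # relative loudness of the frame


def scale_pitch(frames: List[Frame], start: int, end: int, factor: float) -> List[Frame]:
    """Return a copy with pitch multiplied by `factor` on frames[start:end]."""
    return [
        Frame(f.pitch_hz * factor if start <= i < end else f.pitch_hz, f.energy)
        for i, f in enumerate(frames)
    ]


def scale_energy(frames: List[Frame], start: int, end: int, factor: float) -> List[Frame]:
    """Return a copy with energy multiplied by `factor` on frames[start:end]."""
    return [
        Frame(f.pitch_hz, f.energy * factor if start <= i < end else f.energy)
        for i, f in enumerate(frames)
    ]


def stretch(frames: List[Frame], start: int, end: int, repeat: int) -> List[Frame]:
    """Lengthen frames[start:end] by repeating each frame `repeat` times.

    This is a crude stand-in for duration control.
    """
    out: List[Frame] = []
    for i, f in enumerate(frames):
        copies = repeat if start <= i < end else 1
        out.extend(Frame(f.pitch_hz, f.energy) for _ in range(copies))
    return out


def emphasize(frames: List[Frame], start: int, end: int) -> List[Frame]:
    """Emphasize a span: raise its pitch, make it louder, and slow it down."""
    raised = scale_pitch(frames, start, end, 1.25)
    louder = scale_energy(raised, start, end, 1.5)
    return stretch(louder, start, end, 2)


if __name__ == "__main__":
    # Six identical frames; emphasize the two in the middle.
    flat = [Frame(200.0, 1.0) for _ in range(6)]
    for frame in emphasize(flat, 2, 4):
        print(frame)
```

In a real system, edits like these would be made to the values that condition the synthesizer rather than to a plain list, but the principle is the same: emphasis on a word comes down to changing pitch, loudness, and length over the frames that word spans.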

You can see examples of the technology in use in Nvidia's "I AM AI" video series. For these demos, Nvidia's video producer read the script, and the model converted his voice to that of a female narrator. Once the model has a baseline narration, the developer can tweak it to emphasize specific words and modify the pacing to fit the video.

The tech has potential in many areas, including automated customer service, language translation, aids for those with disabilities, and even games. Virtually any application requiring a natural-sounding human voice could benefit from RAD-TTS.

"Several of the models are trained with tens of thousands of hours of audio data on Nvidia DGX systems. Developers can fine tune any model for their use cases, speeding up training using mixed-precision computing on Nvidia Tensor Core GPUs," reads the company's blog post.

The tools are GPU-accelerated and are, of course, optimized for use on computers equipped with Nvidia graphics cards. However, the work is open source and free for any interested developer to use. Nvidia has made it available in the Nvidia NeMo Python toolkit on its NGC hub of containers and software.


Posts: 85   +110
Just need to combine this with the ability to mimic/synthesize a sample of a specific person's voice, slap it on top of a good deep fake video, and, well.......****, nothing is real anymore


Posts: 3,718   +1,787
I think the most interesting tech will be/use AI in some form. Software is advancing faster than ever before. It almost needs to be, because hardware advancements are kind of boring and slow moving to be honest. We wait for more cores, more cache, higher clocks, blah blah. With software we don't know what will come next. We don't always know what company it will come from, what it will do, etc.

The fears are valid, but what's a reasonable alternative? Maybe human intelligence has peaked and we need the help? lol Just thinking out loud there, but come on, a lot of it looks really cool. Maybe the future won't suck and we'll have AI to thank for it. I know I'm not ready to dismiss it until I have a reason to. Thoughts like that can only kill innovation before we get to hear about it.

Mighty Duck

Posts: 202   +145
When it comes to games, this could be used in titles with plenty of dialogue trees. Developers could then change the game's script without the stress of having to contact the voice actor to re-record lines.