Nvidia's RAD-TTS generates realistic AI voices that are more expressive

The AI can be directed like a voice actor

By Cal Jeffrey August 31, 2021, 19:11

In context: Synthesized voices have come a long way over the years. Gone are the days of synthetic voices sounding like a robot from a 1960s science fiction movie. Contemporary AI assistants like Alexa and Siri produce a much more realistic human-sounding voice.

As far as synthesized voices and text-to-speech have come, they are still not perfect. However, Nvidia's text-to-speech research department has developed machine-learning tools for making voice synthesis more realistic in various applications.

Nvidia has developed an AI model called RAD-TTS. Developers can train the model with their own voice, and it will convert text prompts to natural speech using the inflections and tones it has learned. It can also convert one speaker's voice to that of another.

"Another of its features is voice conversion, where one speaker's words (or even singing) is delivered in another speaker's voice," says Nvidia. "Inspired by the idea of the human voice as a musical instrument, the RAD-TTS interface gives users fine-grained, frame-level control over the synthesized voice's pitch, duration, and energy."

You can see examples of the technology in use in Nvidia's "I AM AI" video series. Nvidia's video producer read the script in these demos, and the model converted his voice to that of a female narrator. Once the model has a baseline script, the developer can tweak the narration to emphasize specific words and modify the pacing to fit the video.
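To make the idea of frame-level control concrete, here is a minimal Python sketch of emphasizing one word by raising its pitch and energy and stretching its duration. It is purely illustrative and is not the RAD-TTS or NeMo API: the `Token` class, the `stretch` and `emphasize` helpers, and all the numbers are hypothetical, and a real system would feed such contours into a trained model rather than edit plain lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One word of the script with hypothetical per-frame prosody values."""
    text: str
    pitch_hz: List[float]  # one pitch value per audio frame
    energy: List[float]    # one loudness value per audio frame

    @property
    def duration_frames(self) -> int:
        # Duration is simply how many frames the word occupies.
        return len(self.pitch_hz)


def stretch(values: List[float], factor: float) -> List[float]:
    """Resample a per-frame contour to a new length (nearest-neighbor)."""
    new_len = max(1, round(len(values) * factor))
    last = len(values) - 1
    return [values[min(last, int(i * len(values) / new_len))]
            for i in range(new_len)]


def emphasize(token: Token, pitch_scale: float = 1.2,
              energy_scale: float = 1.3, slow_down: float = 1.5) -> Token:
    """Return a copy of the token that is higher, louder, and longer."""
    pitch = [p * pitch_scale for p in stretch(token.pitch_hz, slow_down)]
    energy = [e * energy_scale for e in stretch(token.energy, slow_down)]
    return Token(token.text, pitch, energy)


# Assumed example values: a four-frame word stretched to six frames.
word = Token("AI", [100.0, 110.0, 120.0, 130.0], [0.5, 0.6, 0.7, 0.8])
stressed = emphasize(word)
print(word.duration_frames, "->", stressed.duration_frames)
```

The sketch keeps pitch, energy, and duration as separate knobs, which mirrors the three controls Nvidia names, but how RAD-TTS represents them internally is not described in the article.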

The tech has potential in many areas, including automated customer service, language translation, aids for those with disabilities, and even games. Virtually any application requiring a natural-sounding human voice could benefit from RAD-TTS.

"Several of the models are trained with tens of thousands of hours of audio data on Nvidia DGX systems. Developers can fine tune any model for their use cases, speeding up training using mixed-precision computing on Nvidia Tensor Core GPUs," reads the company's blog post.

The tools are GPU-accelerated and are, of course, optimized for computers equipped with Nvidia graphics cards. However, the work is open source and free for any interested developer to use. Nvidia has made it available in the Nvidia NeMo Python toolkit on its NGC hub of containers and software.
