Yesterday an artificial intelligence firm in Canada called Lyrebird announced that it had developed algorithms that can mimic anyone’s voice using only 60 seconds of audio. It is not the only company working on human-like voice synthesis. The Verge points out that both Adobe and Google have divisions working on realistic computer-generated speech.
Using machine learning, Google has developed life-like voice synthesis that even incorporates the breath sounds heard in human speech. Adobe has been working on Project VoCo, a Photoshop-like tool for audio editing. Its software can make just about any voice say anything, but it needs a 20-minute speech sample to accomplish this.
The fact that Lyrebird needs only 60 seconds of audio is significant, but the resulting voice still sounds robotic. The software does capture the pitch, accent, and some of the speaking mannerisms, however, as this sample of a conversation between Donald Trump, Barack Obama, and Hillary Clinton shows.
It will be some time before the quality is good enough to fool someone, but that raises the question: What happens when this or any other software can? Judging from some of the samples already out there, a successful Turing test for voice synthesis is not far away. So what happens when a computer can fool a human into thinking they are hearing another human? The potential for abuse is obvious.
According to The Verge, “We already know that synthetic voice generators can trick biometric software used to verify identity.”
Additionally, a combined effort between the University of Erlangen-Nuremberg, the Max Planck Institute for Informatics, and Stanford University has resulted in a system capable of taking video of anyone and transforming it into a controllable puppet, complete with realistic facial gestures and mouth movement. The team demonstrated the technology, which they call Face2Face, by editing YouTube videos of Donald Trump, George W. Bush, Barack Obama, and Vladimir Putin in real time.
Combining something like Face2Face with Lyrebird’s, Google’s, or Adobe’s algorithms could render realities that are completely false. In the wrong hands, technologies like these could cause great damage. Regardless of the potential for misuse, Lyrebird believes the technology’s development is inevitable.
“This technology is going to happen. If it’s not us, it’s going to be someone else,” Alexandre de Brébisson of Lyrebird stated.
The company has indicated that knowledge is power when it comes to technology like this. It plans to release its software “publicly,” reasoning that if everyone knows the technology exists, the dangers of its misuse are lessened.
“The situation is comparable to Photoshop. People are now aware that photos can be faked. I think in the future, audio recordings are going to become less and less reliable [as evidence],” said de Brébisson.
There is no word yet on when the software will be commercially available or how much it will cost. However, de Brébisson told The Verge that more than 6,000 people had expressed interest in early access to the APIs, which the company’s website says “are still in beta.” He also revealed that the company plans to offer versions for other languages.
We have all joked about how our politicians are just puppets. In the near future, that may be the case in a very literal sense. However, it is not all bad. For example, you may be able to program Siri to sound like whomever you wish, so long as you have 60 seconds of their voice to feed the algorithms. And what about those terrible automated telephone systems? Realistic voice synthesis and further advances in voice recognition are going to make talking to customer service robots much more tolerable, and who is not for that?