Cybersecurity experts warn real-time voice deepfakes are here

Skye Jacobs

Posts: 1,972   +58
Staff
A hot potato: Artificial Intelligence has advanced to the point where systems can now clone voices convincingly in real time, letting attackers mimic anyone during a live conversation. The breakthrough removes earlier limits that depended on prerecorded clips or slow processing, raising new cybersecurity and identity verification concerns.

Cybersecurity firm NCC Group has demonstrated that combining open-source AI tools with off-the-shelf hardware can generate real-time voice deepfakes with minimal latency. The technique, dubbed "deepfake vishing," uses AI models trained on samples of a target's voice to produce live impersonations that operators activate via a start button on a tailored web interface.

The process requires only modest computing power, though high-end graphics processing units improve results. The researchers ran the system on a laptop with an Nvidia RTX A1000 GPU – a lower-tier card – and achieved delays of only half a second. Audio samples show the system can produce convincing voice replicas even from poor-quality recordings, suggesting it could function with built-in microphones on everyday laptops and smartphones – making malicious use easier.

Previous voice deepfake services often required several minutes of training data and produced only prerecorded clips, making them less adaptable for live improvisational interactions. The ability to change voice in real time eliminates the natural pauses and hesitations that would otherwise reveal an impersonation attempt.

NCC Group Managing Security Consultant Pablo Alobera shared that during controlled testing with client consent, combining the real-time voice deepfake with caller ID spoofing successfully deceived targets in almost every attempt. The breakthrough greatly improves the speed and realism of voice forgery, exposing new risks even in ordinary phone calls.

While voice deepfakes have made notable progress, real-time video deepfakes have not yet reached the same level of sophistication. Recent viral examples use cutting-edge AI models such as Alibaba's WAN 2.2 Animate and Google's Gemini Flash 2.5 Image (nicknamed Nano Banana), which can digitally transplant virtually anyone into realistic video scenarios.

However, these systems still struggle to produce high-quality video in live settings and often exhibit inconsistencies in facial expressions, emotions, and speech synchronization. Trevor Wiseman, founder of the AI cybersecurity firm the Circuit, told IEEE Spectrum that mismatches between tone and facial cues remain giveaways even to casual observers.

The growing prevalence of these technologies has already led to tangible consequences. Wiseman points to a case where a company was fooled during the hiring process, sending a laptop to a fraudulent address after being duped by a video deepfake. Such cases show that voice and video calls cannot be relied on for authentication.

As AI-driven impersonation becomes more accessible, experts warn that new forms of verification will be essential. Wiseman advocates adopting unique, structured signals or codes – akin to the secret signs used in baseball games – to confirm identity during remote interactions unequivocally. Without such measures, individuals and organizations remain at risk of increasingly sophisticated social engineering attacks powered by AI-generated deepfakes.

Permalink to story:

 
Best to ask your brokerage or bank to deactivate or disallow voice verification over the phone as a security measure. Most companies allow for this over customer service.
My bank uses it, but there are multiple other factors beyond voice that are required to access my accounts (things about me, things about the account itself), its a pain but necessary. They would never allow anyone to access my data solely based on my voice alone.
 
This has been possible for nearly a decade. Those so-called "security experts" need to set their clocks ahead a few years and get with the times.
 
This has been possible for nearly a decade. Those so-called "security experts" need to set their clocks ahead a few years and get with the times.
The "real time" part is what's important. You speak into a microphone and with near zero latency, what your saying sounds like the person you are impersonating. It used to take hours to render but it now be done in milliseconds
 
The "real time" part is what's important.
And?
You speak into a microphone and with near zero latency, what your saying sounds like the person you are impersonating. It used to take hours to render but it now be done in milliseconds
No it didn't. It's been possible real-time with very little latency(measured in under 30ms) since 2013. You just need enough compute power.
 
I'm fairly certain the random dudes in India have only recently obtained that type of compute power
Where it exists is not the point. The only thing about this which is new is the fact that it's wide-spread and easily accessed.
 
Where it exists is not the point. The only thing about this which is new is the fact that it's wide-spread and easily accessed.
You're deflecting. Whatever, I don't feel like arguing with people on my day off. It existed 12 years ago, you're right, have a nice day.
 
Back