Google makes speech-to-text available completely offline in Gboard

Greg S

In brief: Going fully offline for speech recognition opens up new possibilities for building AI features directly into Android and other mobile devices. Google's speech-to-text feature in Gboard is now capable of real-time speech recognition while completely offline.

Digital assistants and voice-controlled gadgets have come a long way in just a few years. While attention has been heavily focused on Amazon Alexa and Google Assistant as the leading voice recognition services, the technology behind them runs hidden on remote servers. With the latest update to Gboard, Google's on-screen Android keyboard, speech recognition goes completely offline.

Traditional speech recognition algorithms break words down into parts called phonemes, typically processing the audio in 10-millisecond segments. The latency of this method is not terrible, but it guarantees that results will not be output in real time, since the recognizer has to search over buffered audio before committing to text.
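
As a rough illustration, here is a minimal Python sketch of that framing step, assuming 16 kHz mono PCM audio (so a 10 ms frame is 160 samples). The constants and helper name are illustrative, not taken from any particular recognizer:

```python
import numpy as np

SAMPLE_RATE = 16_000                          # assumed sample rate (Hz)
FRAME_MS = 10                                 # classic acoustic-model frame length
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 160 samples per 10 ms frame

def frame_audio(signal: np.ndarray) -> np.ndarray:
    """Chop a mono PCM signal into consecutive 10 ms frames.

    Each frame is what a traditional recognizer would score against its
    phoneme models; the utterance is buffered and searched before any
    text comes out.
    """
    n_frames = len(signal) // FRAME_LEN
    return signal[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

# One second of audio -> 100 frames of 160 samples each.
frames = frame_audio(np.zeros(SAMPLE_RATE, dtype=np.float32))
print(frames.shape)  # (100, 160)
```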

Google's new speech recognition software has been in the works since 2014. Instead of looking for pieces of each word, the model recognizes individual characters and outputs them as soon as its neural networks have processed the incoming audio.
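
The article does not detail Google's actual decoder, but a toy greedy loop conveys the idea of character-by-character streaming output: each frame's scores are checked as they arrive, and a character is emitted the moment the network favors it over a "blank" symbol. Everything below (the vocabulary, the blank convention, the fake frames) is a simplified stand-in, not the real RNN-T:

```python
import numpy as np

VOCAB = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz ")

def stream_decode(logit_stream):
    """Greedy streaming decode: emit a character the moment a frame's
    best-scoring label is not blank (and not a repeat of the previous
    frame). Text appears while audio is still arriving -- no lookahead.
    """
    prev = 0  # index of <blank>
    for logits in logit_stream:       # one score vector per audio frame
        idx = int(np.argmax(logits))
        if idx != 0 and idx != prev:
            yield VOCAB[idx]          # emitted immediately
        prev = idx

# Fake per-frame scores spelling "hi": frames favoring h, blank, then i.
def fake_frames():
    for ch in ["h", "<blank>", "i"]:
        v = np.zeros(len(VOCAB))
        v[VOCAB.index(ch)] = 1.0
        yield v

print("".join(stream_decode(fake_frames())))  # -> "hi"
```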

When Google first started development, the model required a 2GB search graph. For many mobile phones, that is not an amount of storage that can be sacrificed just to support speech-to-text on a keyboard. After switching from the more traditional approach to a recurrent neural network transducer (RNN-T), the file size was slashed to 450MB.

Going further still, 4x compression was applied along with custom hybrid kernel techniques that are now available as part of the TensorFlow Lite library. After nearly five years of work, the result is a mere 80MB model that can run in real time on a single core.
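
Google's custom hybrid kernels aren't detailed here, but TensorFlow Lite's standard post-training quantization gives a feel for where a roughly 4x size reduction can come from: float32 weights are stored as 8-bit integers. A hedged sketch, assuming a hypothetical trained model saved at ./speech_model:

```python
import tensorflow as tf

# Hypothetical path to a trained speech model; the real Gboard model
# and its custom kernels are not public.
converter = tf.lite.TFLiteConverter.from_saved_model("./speech_model")

# Post-training quantization: float32 weights are stored as int8,
# giving roughly the 4x size reduction mentioned above.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("speech_model.tflite", "wb") as f:
    f.write(tflite_model)
```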


 
Wow, we finally get the tech we needed, even though phones could actually do this over 10 years ago... That said, I'll stick to my slider keyboard; I only have to use a single point of touch, and it feels as fast as typing on a computer keyboard.
 
This is awesome news... having an offline mode isn't so much a privacy thing for me (though that certainly is a concern); it's more of a responsiveness thing, because I am often in areas with little or no network coverage.

I wonder if this offline system is available for the iOS version of Gboard. Hmm.
 
"Ask and ye shall receive"

The real point is that this does not require a connection to a translation server and is therefore much less subject to being overheard by others.
In the mid-1990s, when I first began working with what I call direct speech recognition (with translation occurring onboard the device rather than being routed through a translation server), the big problem was that the software itself, then usually Dragon NaturallySpeaking, required more RAM (1+ GB) than could be installed on the mainboards of the day. Many of you older techs may remember that Dragon, Microsoft and other companies used a 'swap drive' partition on the hard drive as a fail-over receptacle for data that could not fit into the available RAM. Unfortunately, the operational speed of the standard hard drives then in use was so much slower than RAM that data running from the swap drive 'timed out' the Dragon speech recognition program.
Adding in the spindle latency, the total time it took the average IDE hard drive of that era to start doing anything was 20+ ms. I engineered a fix, substituting a parallel-SCSI hard drive for the IDE "C" drive. The SCSI hard drives of that day had under 8 ms of total latency, much less than IDE drives. The Dragon software then worked as advertised!
There are still other roadblocks to be dealt with, one of the most agonizing being the speech quality supplied to recognition programs by most microphones and computer sound hardware. The microphones bundled with most speech recognition software, and the OEM units in smartphones, are usually trash, incapable of properly reproducing the higher frequencies and finer nuances of human speech. Most can reproduce only the basic vocal range of about 300 to 3,000 Hz. That may be okay for basic speech reproduction, but much of the detail is lost, and that detail is exactly what truly good speech recognition requires for vocabulary comprehension and high accuracy.
I've personally been using direct speech recognition for more than 20 years. I'm currently dictating this post using Nuance NaturallySpeaking direct speech recognition, spoken into a high-quality Sennheiser microphone correctly positioned on an adjustable boom. The output is fed to the computer through a quality USB sound card. It also helps to be in a quiet environment and to speak clearly at a consistent volume. I generally achieve 98% no-mistake accuracy. Even with my Note 8, a quiet environment, and a very good online connection, I rarely achieve much better than 85% accuracy.

Note:
Edited 5.25.19 to change "discrete" SR to "direct" SR per current industry usage. Cheers!
 