Waiting for the Speakularity

8/10/12

(Page 2 of 2)

Dragon Systems, which is now part of Nuance Communications. The problem is that it has never quite met all three of Thompson’s criteria. If it was fast and decent, it wasn’t free, and if it was close to free, it wasn’t decent.

Lately, though, that’s been changing. Today’s mobile devices have both powerful internal processors and broadband connections to external, cloud-based speech transcription engines. Nuance introduced its “Dragon Dictation” app for Apple iOS devices in 2009, giving users the ability to dictate short stretches of text—about a paragraph. Smartphones with Google’s Android operating system have had a built-in Voice Actions feature since 2010. In 2011, Apple came out with the iPhone 4S, which had dictation capabilities, not to mention the speech-driven Siri virtual personal assistant, baked in. And this year, Apple put dictation into both the third-generation iPad and the Mountain Lion update of its Mac OS X operating system.

One of the big constraints in all of these systems, right now, is the length of the passage that can be transcribed. The Google, Nuance, and Apple technology works great for dictating reminders, text messages, short e-mails, and the like, but it can’t handle continuous speech. I’m guessing that’s because all of the heavy lifting (identifying speech sounds and probabilistically assigning text to them) is happening in the cloud, and there’s a limit on the size of the sound files that can be uploaded and deciphered in one go.

Another, bigger hurdle is that today’s commercial speech recognition technology still has a very hard time dealing with multiple voices, especially if they’re talking over one another (as humans routinely do). The Holy Grail would be a service that provided continuous, speaker-independent transcription of conversations between two or more people. The finished transcripts would be fodder not just for search engines but for a new wealth of newspaper, magazine, and blog stories.

Thompson predicted that Google will be the first to bring together all the elements of the vision, and I think that’s a good bet, given the company’s enormous computational resources, its experience with services like Google 411 and automatic YouTube captioning, and the depth of its bench in areas like natural language processing and machine translation. But you can’t count out Nuance, Apple (which uses Nuance’s technology in Siri and the iOS dictation feature), or research institutions such as SRI International, all of which are also thinking hard about this stuff.

I’m ready for the Speakularity now—but realistically, I’ll probably have to keep taking manual notes for the next few years. Just cut me a break if I’m interviewing you, my buffer flows over, and I have to ask you to rewind.

Wade Roush is a contributing editor at Xconomy.
