VoiceSignal could be the way to voice-enable your next application, as its technology rapidly becomes the de facto standard for "no training" voice recognition. Practical voice synthesis is about to follow. Richard Bloor considers talking to his phone.
Whether you are a Star Trek fan or not, there is a good chance you have seen the scene from Star Trek IV: The Voyage Home where, on being offered the use of a computer, Scotty picks up a mouse and starts talking to it.
Computer voice control has long been in the realm of science fiction. The ability to interact with computers, both to talk to them and have them talk back, is something of a holy grail for interface designers. It offers a faster, more convenient interaction with almost any application.
There is a pretty good chance that this technology is not yet on your PC. So it might come as something of a surprise to find that the cutting edge of consumer-focused voice recognition and speech synthesis is happening on the mobile phone and being pioneered, almost single-handedly, by one company: VoiceSignal.
In February VoiceSignal announced that it had shipped the 50 millionth unit of its VSuite speaker-independent voice dialing application and that it has a 90% market share, as assessed by Voice Information Associates. According to Rich Geruson, VoiceSignal CEO, VSuite has helped VoiceSignal experience its third year in which revenues and volume of units shipped have doubled. It has also helped VoiceSignal become, in Rich's estimation, "one of the few profitable private pre-IPO technology companies in the world".
VSuite boasts an impressive range of features: voice dialing, concatenated commands (the ability to say "Send Richard an SMS" and have the phone open the SMS editor with Richard selected as the recipient) and continuous digit dialing. In addition to this command and control application, VoiceSignal has VoiceMode, a dictation system. This currently recognizes discontinuous speech, that is "Hi" (pause) "Matt" (pause), but will soon offer continuous speech recognition. To this will be added a voice synthesis application called VSpeak, which Rich claims uses 300-400Kb but performs better than many desktop applications requiring tens of megabytes of code and data. Rich does admit that the speech is "somewhat robotic" but states it is "pleasant to listen to".
Until recently VoiceSignal has concentrated on supplying its software as an embedded feature on feature phones and smartphones. It offered the third party developer little, as the APIs were buried on phones that often offer only a Java Virtual Machine for aftermarket applications. As these phones offered consumers limited options for upgrading, the embedded sales model was the most appropriate.
This is changing as VoiceSignal's technology becomes available on open operating system devices, such as those using Symbian OS. Consumers now have devices which offer the power to run additional VoiceSignal applications and the API is available to third party developers.
Currently VSuite is shipped either as a built-in option or as a try-and-buy application on a number of Nokia S60 devices and other open OS phones. Rich expects other Symbian OS licensees to offer VSuite or other VoiceSignal products in a similar way in the future.
VoiceSignal has been offering its products as consumer downloads for Palm and Samsung devices already. On the face of it Symbian OS would seem to offer an ideal platform for consumer downloads. "Symbian OS phones have a good API interface for the audio path, ideal for our technology," explains Rich. "However, it has not always been implemented fully and consistently in all Symbian OS devices. The biggest issue has been that the audio gain has varied between devices. This has meant that each device has required a uniquely tweaked application."
The main issue revolves around the quality of the audio. A phone which includes a handsfree speakerphone may amplify the microphone in such a way that key frequencies, used by VSuite in voice recognition, are clipped. Some phones also include noise suppression features, which suppress frequencies in certain ranges. While the human ear can compensate for the reduction in these frequencies dynamically, voice recognition software can only do the same with specific changes to the recognition algorithms.
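The clipping problem can be illustrated with a minimal sketch. It assumes signed 16-bit PCM samples and a microphone path that saturates at full scale; the function names and the test tone are hypothetical and have nothing to do with VoiceSignal's own code. It simply shows how too much gain pins samples at the limit, which is the kind of distortion a recognizer then has to cope with.

```python
import math

FULL_SCALE = 32767  # largest magnitude of a signed 16-bit PCM sample


def apply_gain(samples, gain):
    """Scale samples by gain, saturating at the 16-bit limits (assumed behavior)."""
    return [max(-FULL_SCALE, min(FULL_SCALE, int(round(s * gain)))) for s in samples]


def clipped_fraction(samples):
    """Return the fraction of samples pinned at full scale."""
    if not samples:
        return 0.0
    pinned = sum(1 for s in samples if abs(s) >= FULL_SCALE)
    return pinned / len(samples)


def make_tone(amplitude, freq_hz=440.0, rate_hz=8000, n=800):
    """Generate a sine test tone standing in for a voice signal."""
    return [
        int(round(amplitude * math.sin(2 * math.pi * freq_hz * i / rate_hz)))
        for i in range(n)
    ]


tone = make_tone(10000)
quiet = clipped_fraction(apply_gain(tone, 1.0))  # well below full scale
loud = clipped_fraction(apply_gain(tone, 5.0))   # peaks exceed full scale
print(quiet, loud)
```

A check like this could in principle flag a device whose gain is set too high, although the article does not say how VoiceSignal actually detects or corrects the problem.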
The overhead of creating a unique version of the application for each phone is not the only thing which has held VoiceSignal back from releasing consumer versions of its application for Symbian OS. While there would be additional development work, the complexity involved in ensuring the consumer purchases the correct version for their phone was as much of a concern.
"The implementation of the API has improved, so it is correct on most new phones. We are also able to set the gain in our software, which has simplified creating targeted versions," says Rich. "At the same time the gain level in S60 phones seems to have become uniform. So a single application will work without changes, but to avoid these problems we are adding functionality to automatically set the gain depending on the model," As a result VoiceSignal will soon have one version of its application each for UIQ and S60, to capture what Rich describes as the "vast and growing market for Symbian OS downloads". As a result retail applications will become a larger part of the company's business model.
However, perhaps this move has something to do with the desire of Symbian and its Licensees to drive Symbian handsets into the middle market, where a phones bill of materials (BOM) comes under pressure. "The per unit license cost of our technology to device manufacturers means it is not a significant part of a phone's total cost," says Rich. "Typically it is hardware features, the quality of the display, the camera, or amount of memory that makes up most of the BOM. These are the elements which play the biggest role in creating a phone priced for the midmarket."
So what is driving a move to sell directly to the consumer? "Our software is on 50 million handsets. With 2.8 billion handsets out there, we expect this number to grow significantly," says Rich. "As it does the opportunity to provide upgrades and enhancements between handset replacements becomes significant." These enhancements might include providing a local language variant. If a customer in Russia buys a handset that contains the standard five European languages, VoiceSignal will offer them the opportunity to download support for Russian. "In the future a customer's phone may have our voice synthesis application on it," says Rich. "We could provide voice skins so their phone sounds like Britney Spears or Mr. T".
Rich believes that increasing deployment of VoiceSignal's technology will open up opportunities for third party developers. "Today most of the developers using our VSAPI are within the device manufacturers," says Rich. "However, when VSuite or our other technologies are on a Symbian OS device the VSAPI will be available to any third party developer who wants to use it."
VSAPI gives software developers access to the speech recognition and text-to-speech functions within VSuite, and VoiceSignal claims it is very simple to use. Through VSAPI, a developer can play audio prompts, define vocabularies (lists of words or phrases) to be recognized, and present text strings to be spoken by the engines within VSuite. VoiceSignal claims VSAPI is able to provide developers with a single, consistent interface across all phones because the differences between phone platforms, including audio services, are resolved within VSuite. As a result, a third party developer should be able to create a speech-enabled consumer download application that will run without change on any Symbian OS phone with VSuite installed on it.
"VSuite is itself a speech application already tuned and optimized for speech recognition," says Jack Armstrong, VoiceSignal's VP of Market Development. "It makes it possible for VSAPI to be presented as a high level interface that dramatically simplifies the effort of developing voice user interfaces for other applications. VSuite automatically handles all memory management, I/O, and audio functions related to speech operations on behalf of the application. All the developer needs to do is to present a list of words or phrases to be recognized and prompts to be played that tell the user what they can say and when to say it. Using VSAPI, adding a speech user interface to an application can be accomplished in a matter of hours."
When an application uses VoiceSignal's voice recognition features, the VSAPI returns a text string in exactly the same way the application would receive a text string from the keyboard. To make this text as meaningful as possible to a third party application, VoiceSignal has two modes for recognizing spoken words, known as constrained (grammar-based) recognition and unconstrained recognition (dictation).
Unconstrained recognition is the broadest type of recognition. Here the recognition engine looks for the most probable word represented by the phonetic pattern uttered by the user. This type of recognition is used for dictation in messaging or other open text entry applications. VoiceSignal is making an API for unconstrained recognition available to manufacturers so it can be integrated directly with their messaging clients or text input subsystems. For third party developers on Symbian OS this feature is most likely to be available as a Front End Processor (FEP), a user-selectable option for text entry, and is unlikely to be used directly.
In constrained recognition the developer provides a list of words to be recognized. The lists can be static, compiled into the application for features such as menu choices or fixed command and control sequences. In addition or alternatively, they can be dynamic lists that vary according to the context of the application and may even be downloaded from a server. This provides the flexibility to use VSAPI to present multimodal interfaces to applications (such as games, productivity applications, or search) or to allow forms to be completed by voice input. "Our technology could easily provide a voice interface to a navigation system," says Jack. "This could allow the user, with eyes and hands free, to ask the application for the 'next turn' or 'how much further' or 'read directions' as well as using voice to input a destination address. Then VSpeak technology could provide the answer."
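To make the idea of constrained recognition concrete, here is a toy sketch that combines a static command list with a dynamic, context-dependent one. It is not VSAPI and does not reflect how VoiceSignal's engine works: the class and method names are hypothetical, and fuzzy text matching from Python's standard difflib module stands in for acoustic matching. The point is only that the result is always one of the phrases the application supplied, or nothing at all.

```python
import difflib

# Static phrases, fixed when the application is built (taken from the
# navigation example in the article).
STATIC_COMMANDS = ["next turn", "how much further", "read directions"]


class Vocabulary:
    """Hypothetical holder for the phrases that are currently recognizable."""

    def __init__(self, static_phrases):
        # Phrases are stored in lower case so matching is case-insensitive.
        self._static = [p.lower() for p in static_phrases]
        self._dynamic = []

    def set_dynamic(self, phrases):
        """Replace the context-dependent phrases, e.g. a list fetched from a server."""
        self._dynamic = [p.lower() for p in phrases]

    def active_phrases(self):
        return self._static + self._dynamic

    def match(self, hypothesis, cutoff=0.6):
        """Return the closest active phrase, or None if nothing is close enough."""
        candidates = difflib.get_close_matches(
            hypothesis.lower(), self.active_phrases(), n=1, cutoff=cutoff
        )
        return candidates[0] if candidates else None


vocab = Vocabulary(STATIC_COMMANDS)
vocab.set_dynamic(["Main Street", "Station Road"])  # invented street names

print(vocab.match("nex turn"))
print(vocab.match("Main Street"))
print(vocab.match("xyzzy"))
```

Swapping the dynamic list as the application's context changes keeps the active vocabulary small, which is what lets a constrained recognizer return a definite choice rather than a best guess at open dictation.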
Where the required VoiceSignal technology is not guaranteed to be on the target device, it can be licensed for use in an application. However, despite moves into the retail market, Rich explains that VoiceSignal is still principally looking to get its applications shipped in ROM on devices. Amongst other things, the goal in doing this is to provide developers with a high degree of certainty that the APIs will be available.
Perhaps the hardest job VoiceSignal has in propagating its technology is overcoming the early hype surrounding voice recognition on mobile phones. The early systems required training of every command and contact. In addition, recognition rates in all but ideal conditions were poor. Three or four years ago voice recognition was the feature of a lot of new phones; now it barely rates a mention. Using a Sendo X should have been enough to convince anyone that VoiceSignal had made voice control practical. Hopefully it will not be too long before that practicality is available to any application.
For more information on VoiceSignal's products see www.voicesignal.com.