Voice XML and Voice for Development

Stéphane Boyera · May 20, 2009

For a long time now, I’ve wanted to write down some thoughts about voice applications, and voice as a channel to deliver services.

I attended a series of conferences in April and May (the W3C Workshop on mobile technologies for development, ICTD 2009, and IST-Africa 2009), and it is obvious to me that voice applications are attracting more and more attention.

I see many reasons for that:

  • Availability on all phones (mobile or not), since using a voice application is just like an ordinary phone conversation.
  • Accessibility: voice applications allow information to be disseminated in any language of the world, and to people with low reading skills.
  • Flexibility of the business model: it is easy for the service provider to decide who pays for the call, the user or the service provider, through, e.g., call-back mechanisms or free phone numbers.
  • No length limitation such as the 160 characters of SMS.

So there are plenty of reasons for the number of voice applications to take off.

Of course, this domain also has lots of challenges. Some of them are mentioned in a paper I wrote in January 2008, but that’s not the point I want to discuss in this post.

My concern is about the technology currently used for voice applications. There is no single way of building voice applications, but many different paths one can take: some are proprietary, some are standardized, some are tied to a particular infrastructure, and so on.

Let me start with a very rough overview of voice application infrastructure. In order to deliver content over telephony you need different layers:

  • a physical layer that connects the user with the service. This layer is in charge of receiving phone calls or dialing back the user. It also has basic functionality such as playing an audio stream, recording an audio stream, and detecting and transmitting keypad presses on the user side.
  • an application layer that implements the business logic of the application. This layer can be very simple, allowing the author only to manipulate audio files and capture keypad presses. But it can also be far more advanced, with Text-to-speech (TTS) modules in different languages, Speech Recognition (SR) engines, the ability to transfer the user to other external applications, and so on.

What’s really critical is to separate the two layers. There are many reasons for that:

  • the ability to change the physical layer, for example to scale up a prototype.
  • the ability for software developers to have a standardized platform on which to develop content or toolkits, independently of the infrastructure that will be used to run them.
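
To make the separation concrete, here is a minimal sketch in Python. It is only an illustration: the `TelephonyLayer` interface, the `ScriptedLine` stand-in, the menu and the audio file names are all hypothetical and do not correspond to any particular PBX or toolkit.

```python
from abc import ABC, abstractmethod


class TelephonyLayer(ABC):
    """Hypothetical interface to the physical layer (PBX, gateway, ...)."""

    @abstractmethod
    def play(self, audio: str) -> None:
        """Play an audio file to the caller."""

    @abstractmethod
    def read_key(self) -> str:
        """Return the next key pressed by the caller."""


class ScriptedLine(TelephonyLayer):
    """Stand-in physical layer for testing: replays a fixed list of key presses."""

    def __init__(self, keys):
        self._keys = list(keys)
        self.played = []  # audio files "played" so far, in order

    def play(self, audio: str) -> None:
        self.played.append(audio)

    def read_key(self) -> str:
        return self._keys.pop(0)


def market_prices_menu(line: TelephonyLayer) -> str:
    """Application layer: a one-level menu that knows nothing about the PBX."""
    line.play("welcome.wav")
    choices = {"1": "maize_price.wav", "2": "rice_price.wav"}
    key = line.read_key()
    # Unknown keys fall back to an error message instead of failing.
    audio = choices.get(key, "invalid_choice.wav")
    line.play(audio)
    return audio
```

Because `market_prices_menu` only talks to the interface, the same application code could in principle run on a scripted test line during prototyping and on a real telephony backend later, which is the kind of swap the first bullet describes.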

To meet this need for separation between the physical and application layers, W3C launched the Voice Activity ten years ago (!), in conjunction with major PBX vendors and voice software developers and vendors. This led to the definition of the VoiceXML family of markup languages.

VoiceXML and related languages offer not only a standardized way of developing voice applications, but also a standardized way of addressing TTS, SR and related extensions. They also integrate key features of the Web, such as hyperlinks.
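
As a rough illustration of what such an application looks like, the Python sketch below generates a small VoiceXML 2.0 menu using the standard `xml.etree.ElementTree` module. The `build_menu` helper, the prompt text and the `example.org` URLs are made up for this example; only the `vxml`, `menu`, `prompt` and `choice` elements and the namespace come from the VoiceXML specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_menu(prompt_text, choices):
    """Return a VoiceXML 2.0 document (as a string) with a single keypad menu.

    `choices` maps a keypad digit to a (spoken label, target URI) pair.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    menu = ET.SubElement(vxml, "menu", {"id": "main"})
    # The prompt is rendered by the platform, e.g. through its TTS engine.
    ET.SubElement(menu, "prompt").text = prompt_text
    for digit, (label, target) in choices.items():
        # Each choice is a hyperlink: pressing the digit follows `next`.
        choice = ET.SubElement(menu, "choice", {"dtmf": digit, "next": target})
        choice.text = label
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_menu(
        "Press 1 for the weather, or 2 for market prices.",
        {
            "1": ("weather", "http://example.org/weather.vxml"),
            "2": ("market prices", "http://example.org/prices.vxml"),
        },
    ))
```

Each `choice` points at another VoiceXML document by URI, so a menu hosted by one organisation can link to a service published by another, just as a Web page links to other pages.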

The following graphic shows a simplified view:

(Source: W3C Voice Browser Group.)

Having managed the Voice Activity at W3C a few years ago, it is obvious to me that this is the only way to go. So I was really surprised to see that people developing applications in Africa today, particularly NGOs, are not aware of this work or of the potential of the technology.

What I’ve seen so far is people picking a particular physical layer and developing content and modules specifically for it. I’m talking here about Asterisk, which is without question the leading free and open-source software PBX. But again, it is a PBX, that is, only the physical layer. All physical layers offer minimal application development capabilities. In the case of Asterisk those capabilities are extensive, thanks to modules, but they remain tied to this one physical layer.

Asterisk is an incredibly great tool, but that’s not the level at which the application should be developed. VoiceXML offers the full strength of the Web: the ability to connect to applications built by others through hyperlinks; built-in functionality such as interfaces to TTS and SR, forms, and language support; access through search engines; the ability to use the Web to provide services; and so on.

There are also lots of tools available for validating your application, for authoring and so on. Read the presentation of some of these tools.

Obviously, there are modules that let Asterisk act as a VoiceXML interpreter. Two examples:

(Disclaimer: I haven’t had the chance to test these yet or to check how well they comply with the W3C standard.)

So now, how do we promote the use of VoiceXML in the community, and help people make the right choice when they investigate voice applications?

I believe there is a great need for a large initiative around using voice in development: identifying existing tools, developing best practices and techniques, identifying key usability issues, developing TTS and SR engines in major languages, and so on. But again, it is critical to promote the use of the appropriate underlying technology: VoiceXML, the standardized, vendor-neutral, infrastructure-neutral, Web-based technology.

We are thinking about such an initiative at the Web Foundation, and I hope we will be able to propose it to the community in the very near future.

