From Seo Wiki - Search Engine Optimization and Programming Languages

Jump to: navigation, search

VoiceXML (VXML) is the W3C's standard XML format for specifying interactive voice dialogues between a human and a computer. It allows voice applications to be developed and deployed in an analogous way to HTML for visual applications. Just as HTML documents are interpreted by a visual web browser, VoiceXML documents are interpreted by a voice browser. A common architecture is to deploy banks of voice browsers attached to the Public Switched Telephone Network (PSTN) to allow users to interact with voice applications over the telephone.



Many commercial VoiceXML applications have been deployed, processing millions of telephone calls per day. These applications include: order inquiry, package tracking, driving directions, emergency notification, wake-up, flight tracking, voice access to email, customer relationship management, prescription refilling, audio newsmagazines, voice dialing, real-estate information and national directory assistance applications.

VoiceXML has tags that instruct the voice browser to provide speech synthesis, automatic speech recognition, dialog management, and audio playback. The following is an example of a VoiceXML document:

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
        Hello world!

When interpreted by a VoiceXML interpreter this will output "Hello world" with synthesized speech.

Typically, HTTP is used as the transport protocol for fetching VoiceXML pages. Some applications may use static VoiceXML pages, while others rely on dynamic VoiceXML page generation using an application server like Tomcat, Weblogic, IIS, or WebSphere. In a well-designed web application, the voice interface and the visual interface share the same back-end business logic.

Historically, VoiceXML platform vendors have implemented the standard in different ways, and added proprietary features. But the VoiceXML 2.0 standard, adopted as a W3C Recommendation 16 March 2004, clarified most areas of difference. The VoiceXML Forum, an industry group promoting the use of the standard, provides a conformance testing process that certifies vendors implementations as conformant.


AT&T, IBM, Lucent, and Motorola formed the VoiceXML Forum in March 1999, in order to develop a standard markup language for specifying voice dialogs. By September 1999 the Forum released VoiceXML 0.9 for member comment, and in March 2000 they published VoiceXML 1.0. Soon afterwards, the Forum turned over the control of the standard to the World Wide Web Consortium.[1] The W3C produced several intermediate versions of VoiceXML 2.0, which reached the final "Recommendation" stage in March 2004.[2]

VoiceXML 2.1 added a relatively small set of additional features to VoiceXML 2.0, based on feedback from implementations of the 2.0 standard. It is backward compatible with VoiceXML 2.0 and reached W3C Recommendation status in June 2007.[3]

Future versions of the standard

  • VoiceXML 3.0 will be the next major release of VoiceXML, with new major features. It includes a new XML statechart description language called SCXML.


  • Comprehensive list of VoiceXML Browsers as listed on the W3C website.
  • OpenVXI is a portable open source VoiceXML interpreter available from Carnegie Mellon and developed by SpeechWorks. It may be used free of charge in commercial applications and allows the addition of proprietary modifications if desired. OpenVXI closely follows the VoiceXML 2.0 draft specification.
  • JVoiceXML is an Open Source VoiceXML interpreter for JAVA supporting JAVA APIs such as JSAPI and JTAPI. JVoiceXML is an implementation of VoiceXML 2.0, the Voice Extensible Markup Language specified at http://www.w3.org/TR/voicexml20/. The major goal is to have a platform independent implementation that can be used for free.
  • Vocalocity has implemented the latest specification of VoiceXML 2.0. Vocalocity's platform software is designed specifically for OEM and Channel Partners who provide unique open solutions to their customers. The Vocalocity platform supports multiple telephony, ASR and TTS engines as well as multiple operation systems.
  • Phonologies InterpreXer Server is an abstract implementation of the VoiceXML specification (v2.1), that can be integrated with any telephony platform, messaging suite or communications solution intended to implement the VoiceXML functionality. InterpreXer readily supports speech recognition (using an MRCPv2 based speech recognition Engine) or "touch tone" DTMF and dialog prompting via synthesized speech (using an MRCPv2 text-to-speech engine) or recorded audio playback. InterpreXer is best suited for highly scalable OEM implementations.
  • PublicVoiceXML is an open source implementation of a complete VoiceXML 2.0 browser. It is designed to work on low cost telephony hardware using DTMF navigation with hooks to 3rd party text to speech and speech recognition modules. Support and sample applications for the mobile world available. The source code is available on SourceForge and a Linux version will be available soon.

Related standards

The W3C's Speech Interface Framework also defines these other standards closely associated with VoiceXML


The Speech Recognition Grammar Specification (SRGS) is used to tell the speech recognizer what sentence patterns it should expect to hear: these patterns are called grammars. Once the speech recognizer determines the most likely sentence it heard, it needs to extract the semantic meaning from that sentence and return it to the VoiceXML interpreter. This semantic interpretation is specified via the Semantic Interpretation for Speech Recognition (SISR) standard. SISR is used inside SRGS to specify the semantic results associated with the grammars, i.e., the set of ECMAScript assignments that create the semantic structure returned by the speech recognizer.


The Speech Synthesis Markup Language (SSML) is used to decorate textual prompts with information on how best to render them in synthetic speech, for example which speech synthesizer voice to use, when to speak louder or softer.


The Pronunciation Lexicon Specification (PLS) is used to define how words are pronounced. The generated pronunciation information is meant to be used by both speech recognizers and speech synthesizers in voice browsing applications.


The Call Control eXtensible Markup Language (CCXML) is a complementary W3C standard. A CCXML interpreter is used on some VoiceXML platforms to handle the initial call setup between the caller and the voice browser, and to provide telephony services like call transfer and disconnect to the voice browser. CCXML can also be used in non-VoiceXML contexts such as teleconferencing.


In media server applications, it is often necessary for several call legs to interact with each other, for example in a multi-party conference. Some deficiencies were identified in VoiceXML for this application and so companies designed specific scripting languages to deal with this environment. The Media Server Markup Language was Convedia's solution, and Media Server Control Markup Language was Snowshore's. These languages also contain 'hooks' so that external scripts (like VoiceXML) can run on call legs where IVR functionality is required.

There is an IETF working group called mediactrl ("media control") that is working on a successor for these scripting systems, which it is hoped will progress to an open and widely adopted standard.[4]

See also


  1. VoiceXML Forum Tutorial on VoiceXML 2003
  2. W3C Recommends VoiceXML 2.0 InfoWorld, Ephraim Schwartz, March 17, 2004
  3. http://www.w3.org/TR/voicexml21 Voice Extensible Markup Language (VoiceXML) 2.1
  4. mediactrl charter: Burger, Dawkins

External links


fr:VoiceXML it:VoiceXML ja:VoiceXML pl:VoiceXML pt:VoiceXML sv:VoiceXML th:VoiceXML

Personal tools

Served in 1.123 secs.