Contact Center Solutions Featured Article

Vestec on Contact Center Speech Recognition Trends

March 02, 2011

Automated voice solutions, chiefly speech recognition (speech rec), are hot, and for several good reasons. Steadily improving technology and innovative, often vertically targeted applications have made them easier and friendlier to use on both inbound and outbound calls, resulting in lower costs and higher customer satisfaction. At the same time, the mobile revolution has made speech rec the only practical hands-free means, for safety and legal compliance as well as convenience, of obtaining simple, essential information.


Vestec provides a standards-based speech rec engine for a wide variety of “command-and-control” type deployments. It also offers a sophisticated NLU (Natural Language Understanding) engine for use with third-party speech recognition products for natural language “say anything” call-steering applications.

To assess automated voice/speech rec trends, ContactCenterSolutions.com recently interviewed Fakhri Karray, co-founder and CEO, and Kashif Kahn, co-founder and vice president of business development, Vestec.

ContactCenterSolutions: Outline what is happening with the adoption and use of automated voice solutions in contact centers:

(a) What percentage of calls are being completely handled by voice self-service (DTMF and speech rec) as opposed to live agent now compared to say two to three years ago?

FK: Among our customers, we are seeing nearly 100 percent of call volume being handled by voice self-service, whether DTMF or speech recognition. In fact, to the best of our knowledge, virtually all medium and large enterprises use some form of voice self-service on account of the complexity of their business processes and desire to reduce operating costs.

We are also witnessing an increase in the proportion of calls being handled by voice self-service among small businesses that have historically opted for human agents. However, the overall percentage of calls processed by voice self-service systems among small firms is still well under 100 percent.

(b) What change has there been in the percentage of these voice self-service calls handled by speech rec versus DTMF?

FK: The last two to three years have been a most unusual period on account of the unprecedented financial crisis and the resulting economic slowdown. So, we need to be careful in comparing trends over this time frame.

Before the onset of the financial crisis, there was a clear trend in favor of speech recognition in voice self-service systems. Companies increasingly preferred some form of speech recognition to traditional DTMF in their contact centers, and the proportion of calls being handled by speech rec was growing.

With the onset of the financial crisis, deep cuts were made to IT budgets at every company, while the ROI criteria for speech recognition applications were considerably tightened. As a result, speech recognition lost the momentum that it had built over the several years leading up to the financial crisis.

During the past six to eight months, we have seen a revived interest in the use of speech recognition as companies loosen their purse strings and become more confident of economic recovery. However, IT budgets are still relatively tight and ROI criteria are still a major challenge to work with. Therefore, even though adoption of speech recognition will likely outpace that of DTMF going forward, it will be a while before it grows at its pre-crisis rate.

Eventually - and this could be years away - we expect speech recognition to completely eclipse DTMF.

(c) What changes have you seen in call completion rates, i.e. for calls started in voice self-service? What are the breakdowns between DTMF and speech?

FK: Generally speaking, call-completion rates in voice self-service have been trending up for some time. This is almost entirely due to improvements in completion rates for speech recognition applications; those for DTMF applications already approach 100 percent for most deployments.

The improvement in call-completion rates in speech recognition applications is a consequence of three forces: (a) higher recognition accuracy, (b) better application design, and (c) greater customer adaptation. Recognition accuracy of acoustic models has been trending upward for several years now, resulting in better performance for both native and non-native speakers across both mobile and VoIP channels. Speech application designs have also improved, creating a more intuitive caller experience and greater ease of use.

Finally, with the proliferation of speech applications - especially among major consumer services firms such as banking and telecom, whose product offerings necessitate regular customer contact with IVR systems - callers have become more adept at using speech recognition. That is, speakers have leveraged their prior experience to adjust their interaction styles and keyword vocabulary to what works, thereby improving call completion rates.

(d) Any figures on the changes in the percentages of hybrid automated voice/live agent calls handled by DTMF or speech?

FK: The trends here are mixed and vary by industry sector and application type. It used to be that most firms considering self-service wanted ultimately to replace human agents, and so the proportion of calls being handled by automated systems continued to increase for years.

Then, depending on the industry sector, companies started realizing that they could not entirely eliminate live agents, and that for certain types of services, live agents were not only desirable but necessary. In fact, live agents - by providing a personalized "human touch" - could help some firms gain an edge in service-intensive industries. And so, a number of models have evolved according to industry sector and business processes, ranging from fully automated systems to hybrid automated/live agent systems to live agents only.

That being said, very few firms provide direct access to live agents even when they rely heavily on humans for customer service. It is now generally accepted practice to use some form of automated call-routing via a menu to determine the nature of the call before transferring the caller to an agent.

In recent years, for natural language (i.e. "Speak Freely"/"Say Anything") hybrid speech/live agent applications, we have seen some movement toward reducing the role of speech recognition in favor of DTMF. This behavior, of course, is difficult to generalize, as it is highly dependent on local culture as well as customer feedback. In Canada, for example, some banking firms have switched from hybrid speech recognition/live agent systems to hybrid DTMF/live agent systems on account of customer preference for DTMF menus. As another example, in the Canadian telecom sector, firms have redesigned their hybrid speech recognition/live agent systems in one of two ways. Some companies have eliminated the "natural language" speech component in favor of keywords-based recognition, while others have complemented their existing keywords-based systems with traditional DTMF for greater flexibility.

ContactCenterSolutions: What are the drivers of these changes? Please explore and discuss:

(a) New methods and technologies, and refinements to existing ones

KK: We believe advances in artificial intelligence are creating a paradigm shift in both speech recognition and speech understanding (i.e. semantic interpretation of recognized text). These advances are improving customer experience through more accurate recognition, especially in noisy environments and with accented speakers, and more accurate interpretation, especially with natural language "Speak Freely"/"Say Anything" grammars. Developer time and costs are also being reduced through lower-priced products and more robust tuning tools.

It should be noted that the formation of industry-wide grammar-writing and platform-integration standards, as well as the growing availability of application code that can be repurposed, is also helping reduce solution costs.

(b) Pricing. Have the prices dropped over the past two to three years and will they continue to drop? Please provide rough dollar amounts.

KK: There is no question that speech recognition is becoming more affordable. That being said, there appears to be a bifurcation in the speech market. On the one side, there is a giant company that has grown through acquisitions and has a huge customer base, especially among enterprise firms. It obviously wants to maintain its premium pricing by presenting its products as the ultimate choice.

On the other side are a number of small firms that are leveraging proprietary technologies to offer high-quality products at substantially lower prices relative to Nuance. Vestec is one such firm, and our ASR (automated speech recognition) products have been recognized among the "Top 25 VoIP Advances" precisely for their contribution in making speech recognition truly affordable.

Going forward, there will continue to be downward pressure on speech recognition software prices. Mostly this is a function of advances in technology that are reducing product development time and effort, as well as the growing willingness of young firms to offer speech services based on ASR products from new vendors. Downward pricing pressure on speech-enabling products is also being created by the availability of useful free-of-charge speech services - such as 411-type services - from the likes of Google and Microsoft.

(c)  Installation time. Has this dropped and if so from what to what over the past two to three years?

KK: Software installation is becoming progressively easier. Better software design and product standardization are partly responsible for this. Simplification of the licensing regimen via the introduction of machine-specific licensing, as opposed to remote server authentication, is also reducing software installation complexity.

(d)  The advent of hosting. What difference if any has the hosting model made on self-service speech rec completion rates, costing, installation time and demand?

KK: Hosting is a major force in speech services and is clearly having an impact on speech usage and acceptance. Hosting is an attractive option for firms that do not want to build in-house speech departments, do not have the ability to manage speech infrastructure, or are interested in trying speech recognition for the first time or for a limited time. Speech application time-to-market and deployment costs are reduced by hosting, while customer satisfaction generally increases on account of quality guarantees and support availability.

That being said, the impact of hosting services on long-term ownership costs is not clear. Firms experienced with speech software and telephony infrastructure may find it more convenient - and considerably cheaper - over the long term to do in-house application development and infrastructure maintenance. One can see this in "software as a service" pricing models for renting speech recognition engines; the monthly ASR rental fees from some vendors - when totaled over a few years - typically exceed the one-time perpetual licensing software costs.

ContactCenterSolutions: What changes have you seen, if any, in the payback period?

KK: There are two major trends here. For speech applications that are well understood and can be implemented in a standardized manner, the payback period is slowly decreasing. This is largely a function of experience, not only in terms of avoiding budget overruns and time-to-market delays, but also in terms of better introducing the application to customers for maximum impact. On the other hand, there is still considerable confusion about actual payback periods for novel speech applications. This is especially the case for complex natural language applications that are meant to allow customers to speak in a conversational manner.

ContactCenterSolutions: Is it not so much advancements to the speech engines but the development of specific applications for particular verticals, i.e. government and healthcare contact centers, that is making speech rec more viable?

FK: This is not entirely true. Speech engines are continuing to evolve not only in terms of underlying recognition and interpretation technologies but also with respect to grammar development, grammar tuning and platform integration tools. And judging from the fact that speech engines cannot yet compete with humans in recognition accuracy or semantic understanding, and that high levels of expertise continue to be required for their proper usage, ASR engines have quite some way to go.

That being said, specialized application development for verticals has clearly had an impact on improving and popularizing speech offerings. This is obviously a consequence of specialization: by focusing on market or product niches, developers have been better able to focus their resources on addressing issues as well as on better leveraging their experience and expertise. In addition, niche focus has made product pricing more attractive, as firms find it easier to spread their development costs over multiple sales opportunities in the same sector.

ContactCenterSolutions: Answering services predate modern contact centers by decades, with operators answering calls in 45 seconds or less. Have there been, and are there, automated speech applications in the works to reduce the call volumes they handle?

KK: There is definitely a trend in the speech industry to develop specialized products and applications for niche segments. Witness the availability of customized speech transcription software for the legal and healthcare industries. That being said, the availability of new applications is very much dictated by their development costs, revenue potential, and customer acceptance. As the business processes being handled by answering services become more standardized, one should expect to see turnkey applications for tasks that are structured, common, and take a short time to execute.

ContactCenterSolutions: What effect has customers - both consumers and businesses - going mobile had on the demand for, and features of, automated voice solutions? Is the demand for and usage of them on mobile devices displacing live agents? Or are they more likely displacing web self-service?

KK: Mobile speech services are having a huge impact on consumers and businesses alike. For consumers, mobile services are helping popularize speech applications and opening new vistas in the use of speech. A prime example is the location-based services popularized by Google and offered free of charge. Customers can find all sorts of information about shops, restaurants, movies, parking, weather, subways, etc. by talking to their phones. As another example, consider handset-based name-dialing and voicemail-to-e-mail services. People increasingly find it more convenient to speak the name of their contact when dialing as opposed to using their telephone keypads to input the required number. Similarly, it is more productive for some to skim an e-mail representing the text of a voicemail than to listen to the entire audio message.

For businesses, mobile services are creating new revenue opportunities as well as decimating existing business models. Voicemail-to-e-mail services and location-based services, for example, are both major new revenue opportunities.

On the other hand, it is becoming increasingly difficult for service providers to charge for traditional 411-type services when such information (and more) is being made available free of charge by new entrants. It should be pointed out that the growing sophistication of speech-based mobile services is also exerting pressure on large enterprises to improve the robustness and sophistication of their contact center speech applications.

Mobile speech applications are also influencing live agent usage and traditional web services. If the required information can be obtained from a speech-driven location-based service, there is no need to call a 411 operator or speak to an agent at a service provider. And since speech is the most natural and effective communication medium, it follows that the more useful information customers can find with speech applications, the less need they will have for keypad/keyboard-driven traditional web-based services.

ContactCenterSolutions: Discuss directed dialogue versus natural language speech rec. What would you say are the rough splits between them? Have they changed, and if so why, and if not, why not? What tasks do they each do best at?

FK: Let me begin by saying that Vestec is heavily involved in both natural language and directed dialogue implementations. Our NLU engine works as an add-on to third-party speech recognition engines and significantly reduces the time and costs of developing natural language grammars. Meanwhile, our ASR engine is designed for keywords-based directed dialogue applications and has been recognized as a Top-25 VoIP Advance for making speech recognition truly affordable.

With this background, our view remains that natural language implementations are limited to a small segment of large enterprises that provide consumer services and experience high call volume. Generally speaking, such firms belong to one of the following four major verticals: telecom, banking, utilities, and travel. On account of their deep pockets, they have the financial wherewithal to afford the high development and maintenance costs of natural language solutions, while their large call center traffic puts them in a good position to earn a return on their investment through the savings generated by these solutions. On the other hand, directed dialogue systems are considerably easier and cheaper to deploy than their natural language counterparts. It is no wonder, therefore, that they are not only the most common speech applications in the market but continue to be the first choice for most firms considering speech recognition.

However, we are seeing a trend in favor of natural language on account of the introduction of location-based services by the likes of Google. But then, it is common knowledge that Google can afford to ignore the true cost of developing such natural language services by subsidizing them with revenue from other business areas, a luxury that is not available to most firms considering natural language solutions.

The primary use of natural language technology is in call-routing. This is certainly the case with enterprise deployments, where a natural language interface serves as a "front end" for determining the nature of the call. Following that determination, the call can be handled by either a live agent or a specialized directed dialogue system. By contrast, directed dialogue technology can be utilized for both call-routing and self-service applications. A customer can replace a traditional DTMF menu with a directed dialogue interface for routing purposes, as well as design a directed dialogue based self-service system for executing a multi-step business process.
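The call-steering pattern described above can be sketched in a few lines. The following is a hypothetical illustration, not Vestec's implementation: a real NLU engine performs statistical semantic interpretation of the recognized text, whereas this sketch substitutes a simple keyword scan, and the route names and keywords are invented for the example.

```python
# Hypothetical sketch of a natural-language "front end" for call-steering:
# an open-ended caller utterance is mapped to a destination, which is then
# handled by either a directed-dialogue application or a live agent.
# Route names and keywords below are illustrative assumptions.

ROUTES = {
    "balance": "self_service_balance",   # directed-dialogue self-service
    "bill": "self_service_payment",      # directed-dialogue self-service
    "cancel": "live_agent_retention",    # sensitive task, route to a human
    "agent": "live_agent_general",       # explicit request for a person
}

def route_utterance(utterance: str) -> str:
    """Map a free-form caller utterance to a destination.

    A production NLU engine would do semantic interpretation over a
    natural-language grammar; this keyword scan only illustrates the
    routing step that follows recognition.
    """
    words = utterance.lower().split()
    for keyword, destination in ROUTES.items():
        if keyword in words:
            return destination
    return "live_agent_general"  # no match: hand off to a human

print(route_utterance("I want to check my balance please"))
```

After the front end picks a destination, a specialized directed-dialogue system (or an agent queue) takes over, which is the division of labor the answer describes.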

ContactCenterSolutions: What percentage of calls currently being handled by live agents do you believe automated speech solutions can potentially answer? What about contact centers that have been reluctant to adopt speech rec because of high costs, long lead times and long payback periods? What are the obstacles in the way, how can they be overcome, and what methods are you seeing or working on to remove them?

FK: In theory, virtually all forms of customer interaction can be speech-enabled. In practice, automated speech solutions are generally utilized for business processes that are highly structured, very common, and take a small amount of time. This is obviously a function not only of the effort and expense required to build speech applications, but also of the time and patience callers have for interacting with them. Given this reality, we find the most common usage of speech recognition to be call-routing (as companies replace the traditional DTMF menus of their customer service call centers with speech). This is followed by name-dialing applications, where firms replace an operator or a traditional DTMF-based corporate directory with speech. At a distant third are self-service applications for specific business tasks. These depend upon industry verticals and can range from balance inquiry in banking to bill payment in telecom.

Going forward, we believe vendors need to reduce speech solution ownership costs and speech application time-to-market. This can be accomplished via reductions in software prices as well as the development of easier-to-use engines and more sophisticated tuning tools that reduce professional services needs. At the same time, vendors need to better clarify the benefits of speech recognition by cutting back on hype. Quite often customers have difficulty separating fact from fiction. Worse, nearly everyone has heard horror stories about speech deployments that did not perform as expected or that experienced massive budgetary overruns.

ContactCenterSolutions: Is there still a role in contact center self-service for DTMF and if so what is it?

KK: There is absolutely a role for DTMF in contact center self-service. For starters, most companies utilize DTMF as a "fall back" provision for their speech-based self-service systems. That is, callers are transferred to a DTMF menu after their attempts to be understood by the speech system have failed. And this DTMF "fall back" mechanism is typically utilized irrespective of whether the speech system in question is natural language or directed dialogue based. On the other hand, there are still a large number of firms that have not migrated to speech and utilize DTMF. And judging from various surveys, a number of them are satisfied with the performance of their DTMF systems and do not plan to migrate to speech in the foreseeable future.
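The "fall back" mechanism described above is simple to sketch. The following is an illustrative toy, not any vendor's implementation: the attempt limit and the `recognize` callable standing in for the ASR engine are both assumptions made for the example.

```python
# Illustrative sketch of the DTMF "fall back" pattern: give the caller a
# few attempts at speech recognition, and after repeated no-match results
# drop to a traditional touch-tone menu. The attempt limit is an assumed
# value; real deployments tune it per application.

MAX_SPEECH_ATTEMPTS = 2

def handle_call(recognize, max_attempts: int = MAX_SPEECH_ATTEMPTS) -> str:
    """Try speech first; fall back to DTMF after repeated failures.

    `recognize` is a callable standing in for the ASR engine; it returns
    a recognized intent string, or None on a no-match.
    """
    for _ in range(max_attempts):
        intent = recognize()
        if intent is not None:
            return f"speech:{intent}"   # speech succeeded on this attempt
    return "dtmf_menu"                  # fallback: present the DTMF menu

# Usage: simulate an ASR that fails twice, triggering the fallback.
failures = iter([None, None])
print(handle_call(lambda: next(failures)))  # prints "dtmf_menu"
```

As the answer notes, the same wrapper applies whether the speech layer is a natural language or a directed dialogue system; only the `recognize` step differs.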


Brendan B. Read is ContactCenterSolutions’s Senior Contributing Editor. To read more of Brendan’s articles, please visit his columnist page.

Edited by Tammy Wolf


