Voice User Interfaces - Challenges and Solutions

We all know that a quick call to clarify something is often far more efficient than sending an email and waiting hours for a response. We’ve all been there, and even though many of you may prefer getting things in writing, there is often no better way than speaking to somebody directly.

A clear reason to speak over the phone is the immediacy of the response. The same preference for voice over text appears in many situations where online users engage with a Conversational Interface (CI).

Voice interaction via Conversational Interfaces can and should be as effective as text interactions if we want to design conversational experiences that recreate human conversation properly. As we have explained in a previous article, the differences between text and voice can represent a challenge for the development of a CI.

Voice inputs vary more than text inputs, because users’ characteristics, feelings, language variety and intent cover a wide range. Simply put, you can’t handle voice input with the same methods you use for text input.

In this article, we are going to discuss the most common challenges CI designers face when building natural language interactions, and how they can be faced and solved from a conversational experience perspective.

User expectations

Let’s start by tackling the most difficult of challenges…

A CI is supposed to understand our users’ inputs. But clearly, the way we use language when we speak is not the same as when we write.

When we write, the process of creating our message has a different timing, so normally there is plenty of time to elaborate a sentence or look for the specific words that we would like to use. But this is a completely different process when we talk.

The immediacy of speech produces incomplete sentences, omitted subjects, rephrased parts and assumed contextual information. So, using perfect examples to train the intents of our CI will not be helpful for understanding (all) voice inputs.

Essentially, you need to assume that users will NOT speak in well-formed sentences. Training your solution to recognize these imperfections will allow it to respond well.

You will need to consider some of these specific differences and use a wider variety of examples for the NLU training. Also, using Language conditions as additional match requirements will help you build the natural language understanding you are looking for.

Using this strategy will improve the understanding of our NLU and help it meet users’ expectations.
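As a rough illustration of widening the training examples, the sketch below takes one clean written example and derives rougher, speech-like versions of it (a dropped subject, a leading hesitation). It is not Teneo functionality: the function name, the filler list and the subject pattern are all assumptions made for this example, and real variants are better collected from actual voice transcripts.

```python
import re

# Hypothetical filler words; a real list would come from your own transcripts.
FILLERS = ("um", "uh", "well")
# Hypothetical pattern for a leading subject pronoun followed by a space.
LEADING_SUBJECT = re.compile(r"^(i|we)\s+", re.IGNORECASE)


def spoken_variants(example: str) -> list[str]:
    """Return the clean example plus rougher, speech-like versions of it."""
    base = example.strip().rstrip(".?!")
    variants = [base]

    # Spoken input often drops the subject:
    # "I want to book a flight" becomes "want to book a flight".
    without_subject = LEADING_SUBJECT.sub("", base)
    if without_subject != base:
        variants.append(without_subject)

    # Speakers hesitate, so prefix each filler to the clean sentence.
    variants.extend(f"{filler} {base}" for filler in FILLERS)
    return variants


if __name__ == "__main__":
    for line in spoken_variants("I want to book a flight."):
        print(line)
```

The generated lines can then be reviewed by hand and added to the intent's training examples alongside the original sentence.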

Cognitive load

This concept can be described as:

“[…] short-term or working memory has a limited capacity and can only handle so much information effectively at one time.” (Lori S. Mestre, Designing Effective Library Tutorials, 2012)

In other words, in contrast to a text interface that will have some kind of content display, a pure voice interaction would lack the possibility of being reviewed or navigated through by a user. This means that the way you provide information through a voice-controlled CI should be designed specifically for voice.

When defining the content that will be provided by voice, build your message respecting a sequential order: that is, starting with the general context and then providing the action required from your user at the end. This would be a straightforward way to facilitate the understanding of your sentences.
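The ordering described above can be sketched as a small helper that always places the general context first and the requested action last. This is only an illustration under assumed names (`build_voice_prompt` and its parameters are hypothetical), not part of any platform API.

```python
def build_voice_prompt(context: str, details: list[str], action: str) -> str:
    """Order a spoken message: general context, then details, then the action."""
    parts = [context, *details, action]
    sentences = []
    for part in parts:
        text = part.strip()
        if not text:
            continue  # skip empty parts so no stray punctuation is produced
        # End each part with punctuation so it reads as a separate sentence.
        if text[-1] not in ".?!":
            text += "."
        sentences.append(text)
    return " ".join(sentences)


if __name__ == "__main__":
    print(build_voice_prompt(
        "I found two flights to Barcelona",
        ["One leaves in the morning and one in the evening"],
        "Which one would you like?",
    ))
```

Keeping the question at the very end means the last thing the user hears is what they are expected to answer.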

Making use of contextual information and avoiding redundancy may seem difficult, but in Teneo you can easily use a Global variable to store the key pieces of information that you want to reuse later. Then you will only need to refer to the variable in the answer to reuse this information in different scenarios. This partial repetition of the words in the user input will facilitate the understanding of the message without sounding repetitive.

For instance, consider a conversation in which the user mentions a city, Barcelona, and later changes the subject to booking flights. In this case, we extract the city name and store it in a Global variable. When the user moves on to booking flights, we can use the previously mentioned city to provide a personalized, context-based suggestion. By doing so, our interaction carries over a small amount of contextual information.

This is an example of how using Global variables can help you reduce the cognitive load of the CI content.
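Outside of Teneo, the same idea can be sketched in a few lines of Python. The `Session` class, the city list and the replies below are all hypothetical and only mimic what a Global variable does: a value captured in one turn stays available in later turns.

```python
import re

# Hypothetical list of known cities; a real solution would use entity recognition.
KNOWN_CITIES = ("Barcelona", "Stockholm", "Amsterdam")


class Session:
    """Holds values that survive across turns, similar in spirit to a global variable."""

    def __init__(self) -> None:
        self.city = None

    def handle(self, user_input: str) -> str:
        # Remember the most recently mentioned city.
        for city in KNOWN_CITIES:
            if re.search(rf"\b{city}\b", user_input, re.IGNORECASE):
                self.city = city

        if "flight" in user_input.lower():
            if self.city:
                # Reuse the stored city instead of asking for it again.
                return f"Would you like to fly to {self.city}?"
            return "Where would you like to fly to?"

        if self.city:
            return f"Got it, {self.city}."
        return "How can I help you?"


if __name__ == "__main__":
    session = Session()
    print(session.handle("What can I do in Barcelona?"))
    print(session.handle("I want to book a flight"))
```

The second reply mentions the city only once, which keeps the spoken answer short while still showing the user that the context was understood.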


As mentioned before, voice interactions can differ from text ones. Not all use cases or solutions can be easily transferred from a text-based channel to a voice one. You can try adapting your current Teneo solution to specific VUI channels such as Alexa or Google Assistant. Using the connectors for these channels, you will be able to easily experiment with how your original design works on a VUI channel and which changes you may need to implement.

To enrich your outputs with pauses, emphasis, or volume changes, you will need to adapt each message using Output parameters, making sure you adapt the behavior of the CI to the channel in use, for example Alexa.
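On voice channels, pauses and emphasis are commonly expressed with SSML markup. The sketch below shows one possible way to switch between plain text and SSML depending on the channel. The function, its parameters and the channel names are assumptions for illustration and are not Teneo's Output parameter mechanism; only the SSML tags themselves (`speak`, `emphasis`, `break`) come from the SSML standard.

```python
from xml.sax.saxutils import escape

# Channels assumed to accept SSML in this sketch; adjust to the connectors you use.
SSML_CHANNELS = {"alexa", "google_assistant"}


def render_output(text: str, channel: str, emphasize: str = "", pause_ms: int = 0) -> str:
    """Return SSML for voice channels and the plain text for every other channel."""
    if channel not in SSML_CHANNELS:
        return text

    # Escape characters such as "&" so the result stays valid XML.
    body = escape(text)
    if emphasize:
        word = escape(emphasize)
        # Wrap the first occurrence of the chosen word in an emphasis tag.
        body = body.replace(word, f'<emphasis level="strong">{word}</emphasis>', 1)
    if pause_ms > 0:
        # A trailing pause gives the listener time before the next prompt.
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    print(render_output("Your flight leaves at nine", "alexa", emphasize="nine", pause_ms=300))
    print(render_output("Your flight leaves at nine", "webchat"))
```

Volume can be handled in the same way with the SSML `prosody` element, for example `<prosody volume="loud">`, if the target channel supports it.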

In this article we presented some of the challenges that voice interaction can present to conversational AI developers and specific strategies to implement a solution within Teneo.

Have you ever faced something similar? What was your approach? In case you want to have an open discussion about this, remember we have our forum available :blush: