OpenQuestion - Voice Challenges from the UX perspective

Creating a seamless and intuitive voice user experience can be challenging. While designing and building OpenQuestion, we faced and resolved many of the situations commonly encountered when working with voice user interfaces.

In this article, we’ll be sharing our key learnings from this process in order to support developers and designers in building their own voice experiences.

Natural Language Understanding

When building a conversational interface, one of the main challenges is understanding. In the end, the model's capacity to interpret what the user has said determines the quality of the conversational experience.

Although the main solution to this challenge is optimizing the language model, sometimes this is not enough. In voice interaction, we are not only facing misspellings or typos in user inputs, but a more interesting phenomenon: errors introduced by the speech-to-text (STT) transcription.

This speech to text conversion is based on an additional Machine Learning model trained to convert the sound or audio signal into text. Once this textual transcription is received, our language model starts its work with the intent recognition. The question is, what can we do in case the message was misunderstood?

  • Firstly, the type of error must be analyzed to identify the origin of the mistake: was the message misinterpreted because the transcription was incorrect, or are we facing a classification problem in our language model?

  • Secondly, once we have identified the reason behind the misunderstanding, we need to act on it. If it is a problem with our model's performance, we will need to retrain the model and/or review the different triggers that are present in our solution: TLML, LOs, Hybrid, etc.

  • Lastly, review the ordering applied to all the triggers and their groups. If the problem lies in the Speech Recognition instead, a different set of strategies applies, which we describe below.
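The first step, telling a transcription error apart from a classification error, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it supposes you have a small set of logged turns that were manually annotated with what the user actually said and the intent they expected. The `LoggedTurn` record, the `triage` function and the 0.2 threshold are hypothetical names and values chosen for the example, not part of Teneo or OpenQuestion.

```python
from dataclasses import dataclass


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Classic dynamic-programming edit distance, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


@dataclass
class LoggedTurn:
    """Hypothetical record of one manually annotated user turn."""
    reference_text: str    # what the user actually said (human annotation)
    stt_text: str          # what the STT service returned
    expected_intent: str   # intent the annotator expected
    predicted_intent: str  # intent the language model returned


def triage(turn: LoggedTurn, wer_threshold: float = 0.2) -> str:
    """Label a turn as ok, a transcription error or a classification error."""
    if turn.predicted_intent == turn.expected_intent:
        return "ok"
    # The threshold is an assumption; tune it on your own data.
    if word_error_rate(turn.reference_text, turn.stt_text) > wer_threshold:
        return "transcription_error"
    return "classification_error"
```

Turns labelled as classification errors point to retraining or trigger review, while turns labelled as transcription errors point to the Speech Recognition strategies. A turn can of course suffer from both problems at once, so the labels are a starting point for analysis rather than a verdict.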

Feel free to investigate the following two articles. Here we include some strategies for improving machine learning models in depth:

Speech Recognition

In the case of Speech Recognition, we are dealing with a different technology from the one mentioned previously: speech-to-text. Since STT technology is based on a statistical model, it needs to be reviewed and optimized individually. That can be challenging, as the speech model is often not controlled or managed by the same team. The question is, what can we do in case the STT transcriptions contain mistakes?

Naturally, the first option to consider would be to improve the speech model, but as mentioned, this process may involve additional dependencies, such as the speech services provider or the team involved. Here, we compile two alternative strategies that can mitigate possible transcription mistakes:

  • Global Scripts: The ability to normalize and interact with the received transcription before the model intervenes is crucial. In Teneo, you can make use of the Pre-Processing or the Pre-Matching global script. Both scripts can be used to modify and fine-tune the received transcription, ensuring the consistency needed for a seamless and controlled interpretation of user inputs. If you want to learn more about the Global Scripts, you can explore our article Understand the processing Order of Global Scripts.

  • Homophonic terms: Simply training your model does not ensure 100% correct performance. For example, there are words and phrases whose pronunciation is similar (invoice /ˈɪn.vɔɪs/ and in boys /ɪn bɔɪz/) or even identical (buy, by and bye /baɪ/). In Teneo, you can handle these cases, no matter how particular they may be, by creating your own Language Objects with a specific list of “sound equivalencies”. This lets you easily substitute the term that was transcribed incorrectly and avoid semantic mistakes.
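The idea behind both strategies, normalizing the transcription and mapping "sound equivalencies" before the input is interpreted, can be illustrated with a short sketch. It is written in plain Python purely to show the logic; it is not Teneo script code and does not use any Teneo API. The `SOUND_EQUIVALENCES` table and the `normalize_transcription` function are hypothetical, and lowercasing the input is a simplifying assumption.

```python
import re

# Hypothetical table: what the STT service tends to return -> intended term.
SOUND_EQUIVALENCES = {
    "in boys": "invoice",
    "in voice": "invoice",
}


def normalize_transcription(text: str, equivalences=SOUND_EQUIVALENCES) -> str:
    """Lowercase, collapse whitespace and replace known mis-transcriptions."""
    normalized = " ".join(text.lower().split())
    # Replace longer phrases first so they are not broken up by shorter ones.
    for heard in sorted(equivalences, key=len, reverse=True):
        pattern = r"\b" + re.escape(heard) + r"\b"
        normalized = re.sub(pattern, equivalences[heard], normalized)
    return normalized


print(normalize_transcription("I want to pay my in boys"))
```

A flat substitution table like this only suits near-homophones that are implausible in your domain, such as "in boys" in a billing assistant. True homophones like buy, by and bye cannot be resolved by substitution alone, since each spelling is a valid word; those are better treated as equivalent alternatives at matching time.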

Our OpenQuestion implementation package contains further details on how to face the Speech Recognition challenge among other significant optimization guidelines. In case you do not have access yet, feel free to contact us here.

Context awareness

Users expect the voice assistant to understand the context of their requests and provide relevant responses. If the voice assistant fails to recognize the context of the conversation, it may provide irrelevant or incoherent responses, which can lead to user frustration and a poor experience. In this case, the problem we are facing could be posed as: what can we do if some information is missing?

It is key to implement different conversational strategies so you can mitigate the lack of context. For instance, let’s imagine the user omitted important information, as in “I want to activate this service”. We can understand that the user wants to activate something, but unless the user is logged in and we can look up the last contracted service through an integration, we do not know which service is meant. It is at times like this that disambiguation is necessary. It is always a good idea to include Clarification flows to make this easier for the user.

To obtain the missing information and design our clarification flow, we should consider several aspects. Grice’s maxims offer useful guidance here, although the principles on their own can be somewhat vague:

  • Maxim of relation - Acknowledge what was understood: let the user know that you have understood part of the message. This speeds up the next interaction and lets it focus on the missing point. For example, “I understand you want to activate a service, could you please be a bit more specific?”
  • Maxim of quantity - Cognitive load in voice interactions: bear in mind that humans cannot remember many elements at once, especially during a conversation. In our example, if there were three or fewer services that could be activated, we could include them in our answer: “Do you want to activate X or Y?”
  • Maxim of quality - Consider only business disambiguation: it is not necessary to create a disambiguation flow for every possibility in our model, but only for those cases where the response differs substantially depending on the option.
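The first two considerations can be combined into a small prompt builder, sketched below in Python. It is a minimal illustration, not the logic shipped with OpenQuestion: `build_clarification` and `MAX_SPOKEN_OPTIONS` are hypothetical names, and the limit of three options simply follows the example above.

```python
MAX_SPOKEN_OPTIONS = 3  # assumed limit, taken from the example in the text


def build_clarification(action: str, item_type: str, candidates: list) -> str:
    """Acknowledge what was understood, then ask for the missing detail."""
    acknowledgement = f"I understand you want to {action} a {item_type}."
    if 1 <= len(candidates) <= MAX_SPOKEN_OPTIONS:
        # Few enough options to read aloud without overloading the user.
        *head, last = candidates
        options = last if not head else ", ".join(head) + " or " + last
        return f"{acknowledgement} Do you want to {action} {options}?"
    # No known options, or too many to list: ask an open question instead.
    return f"{acknowledgement} Could you tell me which {item_type} you mean?"


print(build_clarification("activate", "service", ["X", "Y"]))
```

Whether a clarification is asked at all remains a design decision outside this function: following the third consideration, it would only be called for intents where the candidates lead to substantially different responses.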

Pragmatic Phenomena

Although the concept of pragmatics can sound a bit unfamiliar, it applies directly to interactions with conversational interfaces. After all, these are based on the most common interpersonal interaction of all: colloquial conversation. Within this type of interaction, especially in voice, we can encounter multiple situations and elements that affect communication. The question is: what can we do in case a situation affects the conversation?

This does not mean that you must think through and design specific flows for each one. The pre-packaged OpenQuestion solution includes multiple flows for the different situations that might occur during a voice interaction.


In this article, we have shared the most common voice challenges we met while building OpenQuestion, along with strategies to mitigate and overcome them. Creating effective conversational AI solutions that provide an optimal user experience requires a combination of technical expertise, user-centered design, and ongoing iteration and improvement. All of this has been considered within OpenQuestion, which can further enhance the effectiveness of the conversational IVR experience.