Teneo x GPT – Every Second Counts
Intro
The benefits of using Large Language Models within Conversational AI solutions have been, and still are being, heavily explored. You can find inspiration in our first Teneo x GPT article here: Teneo x GPT - Better Together - Knowledge Articles - Teneo Developers Community
While it is obviously a lot of fun to explore the different functionalities that GPT can help with, further important topics should also be on your agenda. In this article, we take a look at one of them: latency. GPT models are made available by Microsoft on Azure via an API that you call with a prompt indicating which task needs to be solved and what the context of the request is. GPT models can then provide really impressive results, but some time goes by before we can work with them: a connection to the API has to be established and a response has to be received. In the best case, the complete API call takes less than a second. But what happens if it does not, if it takes several seconds? And what if we are not talking about a single call to GPT per bot interaction, but about several calls?
The user experience is crucial. In OpenQuestion, our solution for contact center routing, we are talking about a human-to-bot conversation on the phone that has to be conducted in a completely fluent and natural way. As you can imagine, in such a setup every second counts.
Baseline
Typically, one call to the Azure OpenAI API takes between 0.2 and 0.8 seconds for the type of interactions we discuss in this article. User inputs come from the contact center world and look similar to ‘My phone is broken again’. They can then be analyzed for further processing by means of GPT calls. Latency-wise, there are occasional outliers which can take more than a second (or, in the worst case, even several seconds). To better understand the latter, let’s say we use GPT models in our project for several tasks within one bot interaction, e.g. Sentiment Analysis (call 1), Queue Classification (call 2) and Entity Recognition (call 3).
Please note that all these tasks can be natively handled by Teneo features in many different languages, without the need for an additional API call and with (almost) no latency.
The following table shows latency data for five selected interactions with our bot, all using GPT-3.5 from a European Azure subscription for the model and a bot application also deployed on European servers.
The outlier (over 1 second) might become problematic for the user experience we want to create. Furthermore, even regularly performing calls can accumulate to a latency we want to avoid (see Total in the table). Thus, we have to solve two types of challenges: accumulated latency in a sequential setup, and outliers which take too long on their own.
Latency
Let’s zoom in on test #1 of the previous table:
A sequential execution of the three GPT API calls mentioned above would add 1.39 seconds of latency to our bot response, and that is from the GPT calls alone. In the following, we discuss solutions for this sequential latency and for the previously mentioned outliers.
Tip: Our GPT Connector returns the latency of the API call along with the response, e.g. gptHelper.output().requestDuration
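To illustrate, here is a minimal Groovy sketch of the sequential setup. Only output().requestDuration comes from the tip above; the helper instances and the requestWithPrompt method name are assumptions and may differ in your version of the GPT Connector:

    // Hypothetical sketch: sentimentHelper, queueHelper and entityHelper are
    // assumed GptHelper instances and requestWithPrompt is an assumed method
    // name; only output().requestDuration is taken from the tip above.
    def userInput = "My phone is broken again"

    // Three sequential GPT calls: each one blocks until its response arrives
    sentimentHelper.requestWithPrompt("Classify the sentiment of: " + userInput)
    queueHelper.requestWithPrompt("Select the contact center queue for: " + userInput)
    entityHelper.requestWithPrompt("Extract entities from: " + userInput)

    // The connector reports the latency of each call alongside its response
    def totalLatency = sentimentHelper.output().requestDuration +
                       queueHelper.output().requestDuration +
                       entityHelper.output().requestDuration
    println "Accumulated GPT latency (sequential): " + totalLatency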
Multithreading
We have just released an update to our GPT Connector that makes setting up multithreading for GPT calls ‘a piece of cake’. Why is that such a great thing? Let’s look at our previous example:
In a sequential setup, the latencies of even regular GPT API calls (see tests #1 to #4) add up to merely ‘okayish’ response times for your bot (grey in the table), while a simultaneous setup reduces the total latency to that of the longest single call.
It’s as simple as that. By running our GPT calls simultaneously, we reduce the total latency in example #1 by 59.6% (or, seen the other way around, a sequential setup increases the latency by 147.6%).
If you have several GPT calls in your project which can run at the same time, make sure to set them up as multithreaded processes, as sketched below. Our GPT Connector tutorial runs you through all the details and makes this easier than ever.
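As a rough illustration, the same three calls can be started on separate threads and joined before the bot answers. This is a minimal Groovy sketch using plain Java threads, with the same assumed helper instances and method name as above; the actual multithreaded setup provided by the GPT Connector is covered in the tutorial:

    // Hypothetical sketch: start all three GPT calls at the same time
    def userInput = "My phone is broken again"

    def threads = [
        Thread.start { sentimentHelper.requestWithPrompt("Classify the sentiment of: " + userInput) },
        Thread.start { queueHelper.requestWithPrompt("Select the contact center queue for: " + userInput) },
        Thread.start { entityHelper.requestWithPrompt("Extract entities from: " + userInput) }
    ]

    // Wait for all calls to finish: the total wait now equals the slowest
    // single call instead of the sum of all three
    threads*.join()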
The following images exemplify the differences between a sequential and a simultaneous GPT setup:
Outliers
We still have to talk about the outlier we discovered at the beginning of the article. What can we do if a GPT API call takes several seconds? In our example #5, the Sentiment Analysis call to GPT alone took more than one second. In a phone experience, as in OpenQuestion, you want to create a natural experience for the user and avoid long response times. It is therefore important to control the total latency a bot response can have, and you might want to decide how long you can wait for a specific GPT call depending on how important its task is for the current interaction with the bot. We have therefore included the option to customize the timeout settings for each instance of the GPT Helper you use in your project.
Timeouts
First of all, we need to decide how important a specific GPT call is for the current bot interaction and the resulting user experience.
In case our Sentiment Analysis takes longer than one second, do we really need it for the current interaction? Or should we give preference to a low latency?
Tip: Note that Teneo also comes with native sentiment analysis for six languages, with basically no latency for the interaction, see: Sentiment & Intensity Analysis | Reference documentation | Teneo Developers.
So let’s say our Sentiment Analysis GPT call is not crucial; we would therefore only like to include this call if it does not negatively impact the overall experience of the user with our bot.
The GPT call we use for queue recognition, on the other hand, might be crucial for the setup, and we would want to wait a few seconds for it, even if we would recommend leveraging Teneo’s TLML layer as a fallback here to keep latencies low.
We had this challenge in mind when designing the GPT Connector, and made it easy to set up each instance of the GPT Helper (and with that, each task) with a customized timeout, so you can tune your project’s setup for the best performance while keeping a top-notch user experience.
Example of the instantiation of a Sentiment Analyzer by means of the GptHelper:
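As a hedged Groovy sketch, assuming the leading arguments cover the Azure endpoint, API key and deployment name (those, and the millisecond unit, are assumptions; that the last two arguments are the connection and response timeouts matches the description below):

    // Sketch only: the arguments before the timeouts are assumptions and may
    // differ in your version of the GPT Connector.
    def gptSentimentAnalyzer = new GptHelper(
        "https://my-resource.openai.azure.com/", // assumed: Azure OpenAI endpoint
        azureApiKey,                             // assumed: your API key
        "gpt-35-turbo",                          // assumed: model deployment name
        1000,                                    // connection timeout (assumed: milliseconds)
        1000                                     // response timeout (assumed: milliseconds)
    )

With a response timeout of around one second, a slow Sentiment Analysis call would simply be dropped instead of delaying the bot response, which matches the preference we discussed above.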
As you can see, the last two arguments of the GptHelper initialization are the connection timeout and the response timeout; they can easily be customized per task you run via GPT in your solution.
Select the right region and model
Make sure to select an Azure region for the OpenAI APIs that fits the deployment of your project. If you are based in Europe, for example, and your bot is also deployed in that region, it makes sense to use a European-based Azure service to avoid additional cross-region latency. This has been a topic in the past, since GPT-4 was initially available only in the US. At the time of writing this article, June 2023, availability has also been announced for France Central as a European region. You can find the current model availability here.
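For illustration, a sketch of pointing helpers at regional Azure OpenAI resources; the resource names are placeholders and the constructor signature follows the same assumptions as in the timeout example above:

    // Placeholder resource names: choose the Azure region closest to your bot
    def euGptHelper = new GptHelper("https://my-resource-weu.openai.azure.com/",
                                    azureApiKey, "gpt-35-turbo", 1000, 2000)
    // Calling a US-based resource from a European bot adds cross-region latency
    def usGptHelper = new GptHelper("https://my-resource-useast.openai.azure.com/",
                                    azureApiKey, "gpt-35-turbo", 1000, 2000)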
The following test series has been run from a bot environment hosted in Europe against a US-based Azure OpenAI endpoint.
As you can see, the latencies have increased overall, and we have an outlier with even more than 4 seconds of latency.
Tip: GPT-3.5 tends to have lower latencies than GPT-4 and is also much cheaper, see Azure GPT Pricing. Make sure to select the model that best fits your requirements!
Conclusion
Using GPT models adds many new possibilities to Teneo projects. It is crucial, though, to know the advantages and disadvantages that come with them in order to add real value to your solution.
It is worth noting that this article is a snapshot taken in June 2023 and that these services evolve day by day. We will see new models, new accuracies and new latency times in the future. The important point is that we have the tools at hand to react to this evolution accordingly and guarantee the quality of our projects.
A shoutout also to my coworkers Alexander Perekrestenko (@Alexander) and Chun-Lin Wang (@chunlin.wang) for implementing these nice functionalities in the GPT Connector, which make it basically ‘plug’n’play’ for everybody who wants to use GPT in Teneo, because every second counts if you want to provide a brilliant user experience in the contact center.