Machine Learning – Tips and Tricks: Part II
Part I of our Machine Learning series provided insights on managing datasets. Now that we know how to train a model and fine-tune a balanced dataset, we can take a look at evaluating the model’s performance, along with some further ideas that can help boost your intent recognition to state-of-the-art levels.
Evaluate your model
If you are using Machine Learning (ML) inside your intent recognition setup, the evaluation of the ML model should be part of your performance reports. Teneo uses Advanced NLU with a Hybrid Approach for Intent Recognition and goes beyond standard approaches by making use of Dialog Management features such as Trigger Ordering. Please note that the performance of your Machine Learning model is therefore not equivalent to the performance of your solution’s intent recognition setup. Use Teneo’s Auto-test to see how your triggers perform in the complete setup.
1. Some Basic Concepts
Before we get started with the evaluation of an ML model, here are some concepts related to this topic that you will encounter.
To make this more intuitive, let’s say our classifier is trained to identify the class BOOK_A_FLIGHT, with user input examples such as ‘I want to fly to London’, ‘Need to book a flight to Barcelona’ etc.
What is Precision?
Precision = (True Positives) / (True Positives + False Positives)
This is a per-class metric. To calculate the precision of the class BOOK_A_FLIGHT, it basically answers the question: ‘if my classifier predicts class BOOK_A_FLIGHT, how often is this prediction correct?’
For example, if a new user input such as ‘can you help me to book a flight to San Diego?’ is classified by our intent classifier as BOOK_A_FLIGHT, we have a true positive. If the classifier falsely classifies a user input such as ‘Need to cancel my flight to San Diego’ as BOOK_A_FLIGHT, then this would be a false positive.
What is Recall?
Recall = (True Positives) / (True Positives + False Negatives)
This is a per-class metric. To calculate the recall of the class BOOK_A_FLIGHT, it basically answers the question: ‘if my classifier has to find all utterances that should be labelled with class BOOK_A_FLIGHT, what share of them does it actually find?’
For example, a new user input such as ‘can you help me to book a flight to San Diego?’ being classified as BOOK_A_FLIGHT results in a true positive, while ‘can you sell me a flight ticket to Los Angeles’ being classified as PURCHASE_MOVIE_TICKET would result in a false negative.
What is an F1 Score?
F1 = 2 x ((Precision x Recall) / (Precision + Recall))
This is a per-class metric. It takes both Precision and Recall into account and combines them into a single indicator by using their harmonic mean.
The F1 Score makes it possible to talk about the performance of your machine learning model in your project by stating a single value, which looks nicer on a report and may be easier to explain to people in your company without a machine learning background. For that you need the value aggregated across all your classes, usually the micro or macro average of the per-class F1 scores. Now that we know what is being measured, let’s see how it is measured.
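If you want to double-check these formulas against your own data outside of any tool, here is a minimal sketch using scikit-learn (the labels and predictions are made-up illustration data):

```python
from sklearn.metrics import precision_recall_fscore_support

# made-up gold labels and classifier predictions
y_true = ["BOOK_A_FLIGHT", "BOOK_A_FLIGHT", "CANCEL_FLIGHT",
          "BOOK_A_FLIGHT", "CANCEL_FLIGHT", "PURCHASE_MOVIE_TICKET"]
y_pred = ["BOOK_A_FLIGHT", "BOOK_A_FLIGHT", "BOOK_A_FLIGHT",
          "PURCHASE_MOVIE_TICKET", "CANCEL_FLIGHT", "PURCHASE_MOVIE_TICKET"]
classes = ["BOOK_A_FLIGHT", "CANCEL_FLIGHT", "PURCHASE_MOVIE_TICKET"]

# per-class Precision, Recall and F1
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, zero_division=0
)
for c, p, r, f in zip(classes, precision, recall, f1):
    print(f"{c}: precision {p:.2f}, recall {r:.2f}, F1 {f:.2f}")

# aggregated F1: 'micro' pools all decisions, 'macro' averages the per-class scores
for avg in ("micro", "macro"):
    _, _, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg} F1: {f:.2f}")
```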
2. Evaluation Methods
In Machine Learning, you will find different ways to evaluate your model. Some make use of a dedicated testing set that you might need to create, while others use techniques such as cross-validation. In the following, you will find an overview of your options when using the Teneo Classifier and Microsoft’s Conversational Language Understanding.
2.1. Teneo Cross-validation
The great news is that you can evaluate your Teneo ML model with just a few clicks! In Teneo Studio, go to SOLUTION → Optimization → Class Performance → Class Performance Table
And then click Run
Now the Cross-validation feature does its ‘magic’. Behind the scenes, the data is split into k folds. Each fold then acts as the test set once, while the remaining k-1 folds are used as training data. This is repeated k times so that every fold serves as the test set exactly once, and the results of all k runs are averaged.
Therefore, this approach is also referred to as k-fold evaluation. Teneo uses 5-fold cross-validation, so every class in your dataset (see Class Manager) needs to have at least 5 examples; otherwise you will not be able to start the cross-validation.
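For illustration, here is a minimal sketch of the same 5-fold procedure built with scikit-learn. It only mirrors the mechanics described above; the stand-in model is not Teneo’s actual classifier or pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# made-up training data: every class has at least 5 examples
texts = [
    "I want to fly to London", "Need to book a flight to Barcelona",
    "book me a flight to Rome", "can I get a flight to Paris",
    "help me book a flight to Oslo",
    "cancel my flight to Madrid", "please cancel my booking",
    "I need to cancel my flight", "cancel the flight to Berlin",
    "drop my flight reservation",
]
labels = ["BOOK_A_FLIGHT"] * 5 + ["CANCEL_FLIGHT"] * 5

# a stand-in text classifier (Teneo uses its own model)
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# each of the 5 folds is held out as the test set exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print(f"macro F1 per fold: {scores}, mean: {scores.mean():.2f}")
```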
2.2. CLU Split or Testing Set
In Conversational Language Understanding (CLU), you can either add a dedicated testing set to your project yourself or use the Data Splitting functionality to let CLU separate a certain amount of data from your training set and use it for testing instead. You can control how much data goes into testing by selecting the percentage of the split.
You can find more details on the testing process in CLU here.
In Language Studio you will then also find the option Model Performance, where you can see the F1 score in the overview and access the class-specific values in detail.
3. Interpretation of the Results
Both Teneo and CLU offer you interesting information around the results so that you can draw the right conclusions and improve your model.
3.1. Teneo Classifier Results
Once you have run the Cross-validation, you will see a table which presents the results for all classes of your solution in detail, in terms of Precision, Recall and F1 score. Besides that, on the right you also have the Conflicting Classes, which indicate overlaps with other classes of your dataset by showing the classes that led to False Positives and False Negatives.
Please note that at the end of the table you will also find the averaged scores across all classes.
You would normally take the Average F1 Score as the indicator of your model’s performance. If you would like to improve your model, you can sort the class results by their individual F1 scores and focus first on the classes with the lowest performance. You can do this by simply clicking on F1 at the top of the results table.
Now check the Conflicting Classes of the lowest-scoring classes; they give you a good indication of which classes might need to be reviewed. Maybe your dataset contains poor (or even wrong) examples for those classes, or the classes are by design somewhat overlapping.
3.2. CLU Results
Similar to what we saw in the last section for the Teneo Classifier results, CLU also gives you detailed information on the Precision, Recall and F1 score for all classes of your dataset. You can find this under Model Performance, together with the average F1 score for your model.
If you click on the model name here, you get the detailed view.
Additionally, you have access to a Confusion Matrix (click on Test set confusion matrix), which shows you which intents your current model confuses. The dark blue fields show correct matches, while light blue indicates confusion.
By analyzing the performance overview and the confusion matrix, you can gain insights into how and what you need to improve.
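If you export your test utterances and your model’s predictions, you can reproduce such a confusion matrix yourself. A minimal sketch with scikit-learn, using made-up data:

```python
from sklearn.metrics import confusion_matrix

labels = ["BOOK_A_FLIGHT", "CANCEL_FLIGHT", "PURCHASE_MOVIE_TICKET"]
# made-up gold labels and model predictions for a small test set
y_true = ["BOOK_A_FLIGHT", "BOOK_A_FLIGHT", "CANCEL_FLIGHT",
          "CANCEL_FLIGHT", "PURCHASE_MOVIE_TICKET", "PURCHASE_MOVIE_TICKET"]
y_pred = ["BOOK_A_FLIGHT", "PURCHASE_MOVIE_TICKET", "CANCEL_FLIGHT",
          "BOOK_A_FLIGHT", "PURCHASE_MOVIE_TICKET", "PURCHASE_MOVIE_TICKET"]

# rows = true class, columns = predicted class;
# the diagonal holds correct matches, off-diagonal cells are confusions
cm = confusion_matrix(y_true, y_pred, labels=labels)
for i, true_label in enumerate(labels):
    for j, pred_label in enumerate(labels):
        if i != j and cm[i, j] > 0:
            print(f"{cm[i, j]}x {true_label} confused with {pred_label}")
```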
4. Format Converter
In case you have a dataset available in TSV format and would like to add it as training or testing data to your CLU project, make sure to check out this little converter here. If you would like to see converters for further formats, feel free to let us know in the Community Forum.
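If you prefer to script such a conversion yourself, a rough sketch could look like the following. Note that the JSON layout is only modeled on a CLU project export and is an assumption here; field names such as assets, utterances and dataset can differ between API versions, so compare against an export of your own project:

```python
import csv
import json

def tsv_to_clu_utterances(tsv_path, language="en-us", dataset="Train"):
    """Read 'text<TAB>intent' rows and build CLU-style utterance entries."""
    utterances = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for text, intent in csv.reader(f, delimiter="\t"):
            utterances.append({
                "text": text,
                "intent": intent,
                "language": language,
                "dataset": dataset,  # "Train" or "Test"
            })
    return utterances

# build the assets part of an import file from a (hypothetical) training_data.tsv
utterances = tsv_to_clu_utterances("training_data.tsv")
intents = [{"category": c} for c in sorted({u["intent"] for u in utterances})]
print(json.dumps({"assets": {"intents": intents, "utterances": utterances}}, indent=2))
```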
5. Confidence Scores
Since the Intent Classifiers calculate the most likely intent label based on a statistical approach, we also want to know how sure the classifier is about its prediction. The intent classifiers return a so-called confidence score together with the identified intent label. We can then decide whether we want to act on the predicted intent in our project, or whether we consider the confidence too low and prefer to ignore the prediction.
5.1. Teneo Classifier Confidence Scores
The Teneo Classifier distributes the class likelihoods in such a way that the total sum of all predictions is 1.0. The Advanced Tryout displays a percentage view of this score (e.g. 0.7 will be displayed as 70.00%). Here you can see one example for the input “I would like to order some coffees here” from my demo solution.
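Conceptually, a distribution like this can be produced by a softmax-style normalization over the raw class scores. The following is only a sketch of that idea, not Teneo’s internal implementation:

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1.0."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

confidences = softmax([2.1, 0.3, -1.0])  # raw scores for three classes
print(confidences)                        # ≈ [0.83, 0.14, 0.04]
print(sum(confidences))                   # 1.0 (up to floating-point rounding)
print(f"{confidences[0]:.2%}")            # percentage view as in the Tryout: ≈ 82.62%
```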
5.2. CLU Confidence Scores
Let’s take a look at how confidence scores are distributed in Microsoft’s CLU:
“I would like to order some coffees here”
In CLU, the confidence scores also range from 0 to 1.0, but only the Advanced model has the restriction that the scores of all intents need to sum up to 1.0 for each input. You can check this yourself by going to Testing Deployments in your Language Studio project and inspecting the JSON result, which shows you the detailed distribution for all classes. Below you can find an example for an Advanced (multilingual) CLU model, which sums up to 1.0.
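You can also retrieve the same prediction JSON programmatically. Here is a sketch using the azure-ai-language-conversations Python package; the endpoint, key, project and deployment names are placeholders you need to fill in:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.language.conversations import ConversationAnalysisClient

client = ConversationAnalysisClient(
    "https://<your-resource>.cognitiveservices.azure.com",
    AzureKeyCredential("<your-key>"),
)

result = client.analyze_conversation(
    task={
        "kind": "Conversation",
        "analysisInput": {
            "conversationItem": {
                "id": "1",
                "participantId": "1",
                "text": "I would like to order some coffees here",
            }
        },
        "parameters": {
            "projectName": "<your-project>",
            "deploymentName": "<your-deployment>",
            "verbose": True,
        },
    }
)

# every intent is returned together with its confidence score
for intent in result["result"]["prediction"]["intents"]:
    print(intent["category"], intent["confidenceScore"])
```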
5.3. Confidence Threshold
In Teneo, you can set the default minimum confidence score as a general setting for all triggers. You can find this option in your Solution Properties under Confidence Threshold. A Class Match Requirement for a certain class then only evaluates as true if that class is the most likely prediction and its confidence score is above the minimum confidence threshold.
You also have access to the Confidence Threshold Graph, which can help you find the best threshold for your project. The Graph displays how the selected threshold impacts your Precision, Recall and F1 Score. Generally speaking, if you increase the threshold you will also increase the Precision of your Classifier’s predictions, since you only accept predictions with a higher confidence. This will negatively affect your Recall though, since correct predictions with a confidence score below the threshold will no longer trigger a match. The Graph can help you make the right decisions for your data and triggering setup.
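To make this trade-off concrete, here is a small sketch that sweeps a threshold over made-up confidence scores for one class and recomputes Precision and Recall at each step (Teneo generates this graph for you; the sketch only illustrates the underlying mechanics):

```python
from sklearn.metrics import precision_score, recall_score

# confidences the model assigned to BOOK_A_FLIGHT predictions,
# and whether each prediction was actually correct (made-up data)
confidences = [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30]
is_correct  = [1,    1,    1,    0,    1,    0,    0]

for threshold in (0.3, 0.5, 0.7, 0.9):
    # a prediction only counts as a match if it clears the threshold
    y_pred = [1 if c >= threshold else 0 for c in confidences]
    p = precision_score(is_correct, y_pred, zero_division=0)
    r = recall_score(is_correct, y_pred)
    print(f"threshold {threshold:.1f}: precision {p:.2f}, recall {r:.2f}")
```

Running this shows the pattern described above: raising the threshold pushes Precision up while Recall drops.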
Please note that you can also set a confidence threshold for individual triggers. This can be done by means of a Global Scripted Context or by using a Predicate Script. In this way, you can create triggers whose Class Match Requirements work at different precision levels. The Trigger Ordering then lets you sort them into adequate groups and “fire” the triggers according to their overall precision.
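Conceptually, such per-trigger thresholds boil down to gating each class on its own minimum confidence before the trigger may fire. The following plain-Python illustration shows the idea only; it is not Teneo’s actual scripting API (Predicate Scripts and Global Scripted Contexts are written inside Teneo Studio):

```python
# conceptual illustration only, not Teneo's scripting API
DEFAULT_THRESHOLD = 0.45
TRIGGER_THRESHOLDS = {
    "BOOK_A_FLIGHT": 0.70,  # strict, high-precision trigger, ordered first
    "SMALLTALK": 0.30,      # lenient safety-net trigger, ordered last
}

def trigger_may_fire(top_class, confidence):
    """Fire only if the class is the top prediction and its confidence
    clears that trigger's own threshold (or the default)."""
    return confidence >= TRIGGER_THRESHOLDS.get(top_class, DEFAULT_THRESHOLD)

print(trigger_may_fire("BOOK_A_FLIGHT", 0.65))  # False: below the stricter 0.70
print(trigger_may_fire("SMALLTALK", 0.35))      # True: clears the lenient 0.30
```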