Fuzzy Search in Teneo

chunlin.wang · 8 March 2022 10:21

Fuzzy Search, which means finding matching items in a certain list based on a search pattern provided by the user, is very widely used among the Conversational AI projects created on the Teneo Platform. Imagine a project scenario in which users could choose via the chatbot between 10,000 different products in a database, and many of them may have very similar names. Often the user will not even be aware of all the similar naming possibilities and might also not refer to the product by its full and complete name. In this case, a Fuzzy Search algorithm can extract the keywords from the user input and compare those keywords with all products in the database to find out the most relevant products.

Teneo.ai provides you Groovy code which allows you to implement Fuzzy Search easily in your solution. This article is a detailed guide for you.

1. Environment Setup

In order to add Fuzzy Search functionality to your chatbot, first you should download the file FuzzySearch.groovy from this page, and import it in your solution as a resource file. Check this page if you want to see how you can add resource files.

After adding this Groovy file to your solution, you can start building your own fuzzy search flow. The fuzzy search consists of the following steps: Pre-processing, Searching, Post-processing and (possibly) a Multi-result Follow-up. You can also download a demo solution here, which contains example flows to do a Fuzzy Search on Airports in United States.

2. Pre-processing

In the Pre-processing phase you should prepare both the search pattern and the item list. The search pattern is usually taken from the user input. As best practice, we strongly suggest you remove stop words from the user input. Suppose that you are building a fuzzy search flow for Airports in United States. When a user says “Search for airport in New York”, it’s better to remove the words “search”, “for” and “in” from the user input than using the whole 6 words as search pattern, because these three words are commonly used in any kind of intents related to searching and have nothing to do with locating the airport the user is asking for. A common stop words list can usually be found in the Lexical Resources. Furthermore, as all airports contain the word “Airport” in their names, you can also remove this word from the search pattern in the airport searching context. This kind of stop words are called project specific stop words, which should be stored in a list created by yourself.

As for the item list, usually you need to prepare two lists. One list contains the (normalized) names which are displayed back to the user . In case of an airport search, you should put the official names of the airports in this list, for example, John F. Kennedy International Airport, Seattle-Tacoma International Airport, etc. The second list contains the search strings of each item, which should include all the possible keywords user may use to search for this item. For example, the search string for John F. Kennedy International Airport should contain not only the keywords in its official name, but also the city name “New York” and the abbreviation “JF Kennedy” or “JFK”, because it is very possible that the user uses these keywords instead of the official name to refer to this airport. Finally, you need to create a mapping between the name list and the search string list . When one search string has been matched in the searching phase, this mapping will help you to find the official name of the item by this search string-name map.

In some case you can also directly use the name list as the search string list . If you do this, you will not need to create name list-search string list mapping.

The data source of the item lists could be an Excel sheet stored as a resource file of your solution. Check this article and the comments below it which can help you with reading files from Resources. If the amount of your data is not large (like hundreds of airport names), you can also store them directly in a global variable. In this article a small database of airports in United States stored as global variable is used:

Each airport has a key called name which corresponds to the official name of the airport, and a key called search which is a combination of name, city, state, and IATA airport code. The “name” key will be used to create the name list in fuzzy search, while the “search” key will be used to create the search string list .

After preparing the search pattern and the item lists, you will need to apply another very useful pre-processing technique which is called normalization . Normalization usually includes several steps, for example remove punctuations, convert all letters to lower case, remove repeated words and replace special words. Please be aware that you should apply the same normalization rules to both search patterns and all the items in the search string list (but not to the items in the name list).

In most of the cases the punctuation marks should not be a part of the search pattern and the search strings. You can use the following code to remove punctuation efficiently:

stringName = stringName.replaceAll("\\p{Punct}", " ")

Use the following code to convert all words to lower case:

stringName = stringName.toLowerCase();

And use the following code to remove repeated words:

stringName = stringName.split(" ").collect{it as String}.unique().join(" ");

Finally, sometimes you need to replace some words in the search pattern. For example, when the user says NY or The big apple in the input, although we know that the user refers to New York, but neither the abbreviation nor the nickname can match any word in the search string (suppose that the search string is “jfk new York john f kennedy international airport” ). What you need to do is replace the abbreviation NY and nickname The big apple with New York in the search pattern.

You can add more normalization rules when needed. We recommend you create a class in Globals > Scripts > Solution loaded and put all normalization codes there. Then you can call it in the script nodes or listener scripts in your fuzzy search flow.

After the Pre-processing is done, you can use the search pattern and the search string list to run the Fuzzy Search.

3. Searching

In the Searching phase you just need to fill in the search pattern and the search string list (as searching candidates ) to the searching method defined in the Groovy file FuzzySearch.groovy. Three fuzzy search algorithms are included in this code: Word Count, Cosine Similarity and Edit Distance.

Option A: Word Count Search

Word Count search is based on the number of shared words between the search pattern and each candidate. It collects the candidates which have most shared words with the search pattern. For example, if the user input is new york jfk, it has three shared words with the candidate “jfk new York john f kennedy international airport” and two shared words with the candidate “lga new york laguardia airport ”, so only the former one will be returned as searching result.

To do a Word Count search, you need to fill in the search pattern, the searching candidates, and a threshold (optional). This threshold represents the minimum amount of the shared words that allow the candidate to be counted as a match. For example, if you set the threshold = 2, those candidates which has only one shared word with the search pattern will not be considered a match. The default threshold is 1, and usually does not need to be changed. Below is an example of the code to run the Word Count search:

resultList = FuzzySearch.mostSimilarByWordCount(String searchingPattern, List candidates, int threshold)

Option B: Cosine Similarity Search

Cosine Similarity Search is based on the character level n-gram frequency of the search pattern and each candidate, and calculates the cosine similarity between them relying on the n-gram frequency. The default n-gram degree is 2. Don’t worry if you have not heard about n-gram and cosine similarity before. You can easily use the following code to run the search:

resultList = FuzzySearch. mostSimilarByCosineSimilarity(String searchingPattern, List candidates, double threshold, int degree)

You will also need to provide a threshold (should be a float between 0 and 1 ). This method will calculate the similarity score between the search pattern and each candidate in the search string list. After all the scores are calculated, the result list will be returned containing all candidates whose similarity score is greater than the threshold, order by the highest to the lowest. A threshold around 0.40 is usually reasonable. If the search strings in the candidate list are too long, you can set up a lower threshold.

You can also try to customize the n-gram degree according to your use case. If you raise the degree, the score will be lower which can help filter out more irrelevant results. Please be aware that the length of search pattern CANNOT be lower than the n-gram degree. If in your use case it is very common to have a very short search pattern, you should use a low degree or use word count search. You are allowed to use 1-gram (degree = 1) but it is strongly unrecommended, because 1-gram means that it only matters how many letters in common between the search pattern and the candidate, without considering the order or letters. For example, “live” and “evil” will have similarity score 1.0 when n-gram degree is 1.

Option C: Edit Distance Search

Edit Distance is a common metric to quantify the similarity between two strings by calculating the steps needed to convert one string to the other. We provide two types of edit distance in FuzzySearch.groovy:

Levenshtein distance, which allows insertion, deletion, and substitution.
Longest common subsequence(LCS) distance, which only allows insertion and deletion, not substitution.

For example, the Levenshtein distance between the word “kitten” and “sitting” Is 3, including these three steps:

The first letter ‘k’ to be substituted by ‘s’
The fifth letter ‘e’ to be substituted by ‘i’
Insert the last letter ‘g’

While the LCS distance between them is 5 because LCS distance does not allow substitution. When we need to substitute a letter, we need first delete the old letter than insert the new letter, which is counted as two steps. So, the total distance between “kitten” and “sitting” is 2+2+1=5.

You can use the following code to run edit distance search:

resultList = FuzzySearch.mostSimilarByEditDistance(String searchingPattern, List candidates, int threshold, boolean allowSubstitution);

You need to fill in the search pattern, candidates, a threshold, and a boolean value which indicates that the substitution is allowed or not. If it is true , Levenshtein distance will be used, otherwise LCS distance will be used. You also need to provide a threshold in integer to filter out all candidates with edit distance higher than the threshold. The results will be ordered by the edit distance lowest to highest.

Comparison among three searching algorithms

The Cosine Similarity Search and the Edit Distance Search have the following weak points:

The words which have similar spelling will have higher similarity score / lower edit distance despite that they have irrelevant meanings, such as “kitten” and “sitting”.
If the candidates are significantly longer than the search pattern, they will have lower similarity score / higher edit distance despite that they may have several words in common.
Sometimes it might be hard to find out the appropriate threshold, especially when the lengths of the candidates are in a very wide range.

While the Word Count Search has the following weak points:

Two words can be considered as a match only when they are identical. This means that no part-of-word match is allowed, and normalization is essential for the languages in which the words have plenty of inflections.
Only the candidates with most words in common with the search pattern will be included in the result. All other candidates will be discarded no matter how similar they are with the search pattern.

Usually if the candidates are long strings and the search patterns extracted by user input are expected to be short, search by Word Count will be a safer choice. If you need to take into account part-of-word matches, you can choose between Cosine Similarity and Edit Distance. Anyway, we encourage you to do experiments on different searching algorithms to find out the best fit for your use case.

Exact Match Search

Besides the three Fuzzy Search algorithms mentioned about, the Groovy code also provides an exact match search method. This search method only returns the candidates which are identical (case insensitive) to the search pattern. You can use the follow code to run it:

resultList = FuzzySearch.exactMatches(String searchingPattern, List candidates)

4. Post-processing

After getting the search results, you need to pass them to Post-Processing. In this phase, you need to retrieve the item names in the result list by using the name-search string mappi ng created in the Pre-processing phase. For example, if the result list contains this candidate: jfk new york john f kennedy international airport, you need to find the corresponding item name John F. Kennedy International Airport . If you use the name list as the search string list , you do not need this step.

Furthermore, you can create some specific rules to filter out some of the results. For example, the user has mentioned “Airport in Los Angeles” and the fuzzy search result list contains two airports: Los Angeles International Airport and San Angelo Regional Airport. The first one is the main airport of Los Angeles, while the last one is a regional airport in Texas. If you have recognized the city name Los Angeles before (usually by a listener) and use it as a filter, you can filter out this irrelevant result from the result list and only keep the first one.

5. Multi-result Follow-up

Now you have a result list ready to be used in the reply to the user input. When the result list contains multiple items, you may want the user to choose one of them. If you are using Teneo Web Chat, it is very easy to set up a clickable list by the flow with the following code, in which the variable resultList represents the result list returned from the Fuzzy Search:

def clickablelist=[:]
clickablelist.type="clickablelist"
clickablelist.list_items=[]
for (item in resultList){
	def button=[:]
	button.title=item
	button.postback=item
	button.parameters=["item":item]
	clickablelist.list_items<<button
}
JSON = new groovy.json.JsonOutput().toJson(clickablelist)

Copy the code, create a script node, paste the code, and create a variable name JSON in the flow. After that you need to put an output parameter with the name teneowebclient and value ${JSON} in the output node. The output text usually is something like “I’ve found n results according to your input. Please choose one from the list below”. Here is an example of a clickable list in Teneo Web Chat:
clickable list in TWC

And use the following code to capture the information about which item the user has clicked:

itemClicked = engineEnvironment.getParameter("item")

You can read this article for more information about using input parameters.

6. Conclusion

In this article you have learned how to create the whole process of Fuzzy Search. According to your use case, you can choose between Word Count search, Cosine Similarity search and Edit Distance search, or even create your own searching algorithm if you find that these algorithms do not work well with your use case. We strongly encourage you to share with other developers in the Teneo Developer Community your use case on Fuzzy Search in Teneo to make it improve.

Thank you for reading this article. If you have more questions about Fuzzy Search, raise your hand in the Teneo Developer Community to get help from other developers!

Topic	Replies	Views
Localize your Teneo bot with multilingual Conversational Language Understanding Announcements	624	26 August 2022
New content on Language Understanding Announcements	715	19 August 2021
OpenAI - Teneo GPT Connector Announcements	602	17 February 2023
New Generative QnA Template solution Announcements announcements	227	26 March 2024
Handling Disambiguation with Teneo Knowledge Articles teneotuesday , ux , dialog-management , disambiguation , ambiguity	597	30 August 2022