Convert tsv file to json required by Microsoft CLU

In Teneo Studio, you can load training data into the Class Manager with the Import Classes option. It expects the data in tsv format, where each line contains the intent class name and an example sentence, separated by a tab. However, if you use Microsoft Conversational Language Understanding (CLU) and decide to follow the CLUxTeneo approach, you may need to convert these tsv files into the json format that Microsoft Azure Language Studio requires.
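To make the input format concrete, here is a minimal Groovy sketch. It assumes a hypothetical three-line sample with made-up intent names (greeting, goodbye) and only shows how each tab-separated line breaks down into a class name and an example sentence.

```groovy
// Hypothetical sample data in the Class Manager tsv layout:
// <intent class name><TAB><example sentence>
List<String> sampleLines = [
    "greeting\thello there",
    "greeting\tgood morning",
    "goodbye\tsee you later"
]

// Split each line on the first tab into [intent, text]
List<List<String>> pairs = sampleLines.collect { String line ->
    line.split("\t", 2) as List<String>
}

pairs.each { pair ->
    println "intent=${pair[0]} | text=${pair[1]}"
}
```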

You can use the following Groovy code to convert a tsv file for the Teneo Class Manager into either a json file ready to be imported as a new CLU project, or an utterance file that adds training or testing data to an existing CLU project:

import groovy.json.JsonBuilder

public static void main(def args){
    
	String path = "" // Input file path
	String language = "en-us" // Language code, by default "en-us"
	String mode = "project" // Generate json for new project or utterance file, by default "project"
	String projectName = "my_clu_project" // Project name, also used as the base name of the output json file
	String separator = "\t" // Separator between intent and example in input file; for tsv please use "\t"

	for (int i=0;i<args.size()-1;i++){
		if(args[i]=="-f") path = args[i+1]
		if(args[i]=="-l") language = args[i+1]
		if(args[i]=="-m") mode = args[i+1]
		if(args[i]=="-n") projectName = args[i+1]
		if(args[i]=="-s") separator = args[i+1]
	}
		
	if (!path) {
		throw new Exception("Please provide the name of the input file as argument -f")
	} else if(mode!="project"&&mode!="utterance"){
		throw new Exception("Invalid mode. Please put -m project for CLU project file and -m utterance for utterance file")
	} else {	
		
		BufferedReader reader = new BufferedReader(new FileReader(path))
		String fileLine
		List<Map> examples = new ArrayList()
		List<String> trainingData = new ArrayList()		
	
		try {
			while ((fileLine = reader.readLine()) != null) {
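				// Split on the first separator only, so the example text may itself contain the separator
				String[] tt = fileLine.split(separator, 2)
				// Skip blank or malformed lines instead of failing on tt[1]
				if (tt.length < 2 || !tt[1].trim()) continue
				String intent = tt[0], text = tt[1]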
				if (!trainingData.contains(text)){
					trainingData << text
					Map m = [:]
					m.intent = intent
					m.text = text
					m.language = language
					m.entities = []
					examples << m
				}
			}
		} finally {
			try {
				reader.close();
			} catch (err) {}
		}
		
		Map cluApp = [:]	
		cluApp.projectFileVersion = "2022-05-01"
		cluApp.stringIndexType = "Utf16CodeUnit"
		cluApp.metadata = [:]
		cluApp.metadata.projectKind = "Conversation"
		cluApp.metadata.projectName = projectName
		cluApp.metadata.multilingual = false
		cluApp.metadata.language = language
		cluApp.assets = [:]
		cluApp.assets.projectKind = "Conversation"
		cluApp.assets.intents = []
		List<String> intentNames = new ArrayList() 
		for (example in examples){
			if (!intentNames.contains(example.intent)) intentNames << example.intent
		}
		for (intentName in intentNames){
			cluApp.assets.intents << ["category":intentName]
		}
		cluApp.assets.entities = []
		cluApp.assets.utterances = examples
		if (mode == "project"){
			String cluJson = new JsonBuilder(cluApp).toString()
			File outputFile = new File(projectName+".json")
			outputFile.write(cluJson)
		} else {
			String cluJson = new JsonBuilder(examples).toString()
			File outputFile = new File(projectName+"_utterance.json")
			outputFile.write(cluJson)
		}	
	}
	
}
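For reference, this is roughly what the project-mode output looks like, assuming a hypothetical three-line input (two greeting examples and one goodbye example, both made-up intent names) and the default arguments. The script writes the json on a single line; it is pretty-printed here for readability.

```json
{
  "projectFileVersion": "2022-05-01",
  "stringIndexType": "Utf16CodeUnit",
  "metadata": {
    "projectKind": "Conversation",
    "projectName": "my_clu_project",
    "multilingual": false,
    "language": "en-us"
  },
  "assets": {
    "projectKind": "Conversation",
    "intents": [
      { "category": "greeting" },
      { "category": "goodbye" }
    ],
    "entities": [],
    "utterances": [
      { "intent": "greeting", "text": "hello there", "language": "en-us", "entities": [] },
      { "intent": "greeting", "text": "good morning", "language": "en-us", "entities": [] },
      { "intent": "goodbye", "text": "see you later", "language": "en-us", "entities": [] }
    ]
  }
}
```

In utterance mode, only the array shown under utterances is written to the output file.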

Please note that the code above is not designed to be used inside Teneo Studio. To run it, you need Groovy installed on your computer. You can install Groovy by following the guide here.

After setting up Groovy, copy the code into a new text file, save it as tsvToClu.groovy (or any other name, as long as the extension is .groovy), and put it in the same folder as the tsv file you want to convert. Then open the Windows Command Prompt, change the current working directory to the folder containing this Groovy file, and run the following command:

groovy tsvToClu.groovy -f input.tsv -l en-us -m project -n my_clu_project -s \t

The script takes one required argument and four optional ones:

  • -f : Required. The input file name. The script throws an error if this argument is missing.
  • -l : Optional. The language code. Defaults to en-us, so you can omit it if your data is in American English. Click here for the full list of supported languages.
  • -m : Optional. Must be either project (json for a new project) or utterance (json for an utterance file); other values are rejected. Defaults to project, so you can omit it when generating a json file for a new CLU project.
  • -n : Optional. The name of the project. Defaults to my_clu_project. The output file is named [project name].json (for importing a project) or [project name]_utterance.json (for importing utterance data).
  • -s : Optional. The separator. Defaults to \t, so you can omit it if your input is a tsv file. For a csv file, use , as the separator.
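One detail worth knowing about -s: the script passes the separator to String.split, which treats it as a regular expression. That is why the two literal characters \t typed in the Windows Command Prompt still match a tab, and also why a separator such as | would need escaping. The sketch below illustrates this with made-up sample lines; Pattern.quote is shown as one possible way to handle special characters and is not part of the original script.

```groovy
import java.util.regex.Pattern

// The Command Prompt hands over "\t" as two characters: a backslash and a "t".
String fromCommandLine = "\\t"
assert fromCommandLine.length() == 2

// split() interprets that as the regex \t, which matches a real tab character.
String[] tsvFields = "greeting\thello there".split(fromCommandLine)
println tsvFields.toList()

// A comma has no special regex meaning, so it works as-is for csv input.
String[] csvFields = "greeting,hello there".split(",")
println csvFields.toList()

// A pipe is a regex metacharacter; quoting makes split() treat it literally.
String[] pipeFields = "greeting|hello there".split(Pattern.quote("|"))
println pipeFields.toList()
```

If you run the script from a shell such as bash instead of the Command Prompt, the separator may need quoting (for example '\t'), because an unquoted backslash is consumed by the shell.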

The code in this post is an example of how you can convert a tsv file (or a file with another separator, such as csv) prepared for class import in Teneo Studio into a json file for Microsoft CLU. Since it runs outside Teneo Studio, you could write similar code in any other programming language. Hope this post helps with your CLUxTeneo project!
