Inquire2Finetuning

Today we want to share with you a script that leverages Teneo Inquire to convert a selection of end-user conversations with your virtual assistant into data you can use to finetune a Large Language Model to the needs of your company.

While off-the-shelf versions of Large Language Models, so-called Foundation Models, can generate very good responses to generic questions, they lack the ability to answer correctly on company-related knowledge, since this data is normally not public and therefore not part of their training data.
We recommend adding the required company knowledge at runtime via Teneo RAG, which has the advantage that the company knowledge can be updated easily and does not require retraining a finetuned model for each update to your company database. You might, however, want to combine both approaches in the future and have a RAG setup that uses a finetuned model to generate the answer based on your company data. This could be the case because you want to finetune the way the model responds, or because you want to use a smaller language model and increase its performance on your use case by finetuning it with your data.

In the following we will see how to prepare the data for the finetuning of a GPT Foundation Model. A tutorial for the finetuning process itself can be found here. At the time of writing, GPT 3 and GPT 3.5 Turbo are available for finetuning on Azure OpenAI. You can find Azure's pricing for the finetuning process here. Be aware that finetuning can serve two purposes: training the foundation model on a new task (e.g. classification) or training it to answer relevant user questions in a specific way. We are looking into the latter use case today. You can find a relevant article on when to finetune here.

Teneo Inquire is key to avoiding a possible data bottleneck, as it allows you to query logs on specific criteria and export the matching logs to CSV or JSON. Today's Groovy code snippet converts the mentioned JSON export into the formats required for GPT 3 and GPT 3.5 on Azure OpenAI.
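For reference, each line of the resulting JSONL files takes one of two shapes, the chat format used by GPT 3.5 / GPT 4 and the prompt-completion format used by GPT 3 (the contents shown are purely illustrative):

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are your opening hours?"}, {"role": "assistant", "content": "We are open Monday to Friday, 9 am to 5 pm."}]}

{"prompt": "What are your opening hours?", "completion": "We are open Monday to Friday, 9 am to 5 pm."}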

import groovy.json.JsonBuilder
import groovy.json.JsonSlurper

class Inquire2Finetuning{

	// GPT3.5, GPT4
	public static String prepare4GptChat(input,output,systemMessage){
		def data = [:]
		data.messages = []
		if (systemMessage) data.messages << ["role":"system","content":systemMessage]
		data.messages << ["role":"user","content":input]
		data.messages << ["role":"assistant","content":output]
		return new JsonBuilder(data).toString()
	}
	// GPT 3
	public static String prepare4GptCompletion(input,output){
		def data = ["prompt":input,"completion":output]
		return new JsonBuilder(data).toString()
	}
	public static String prepareLine(model,input,output,systemMessage=""){
		if("gpt3".equalsIgnoreCase(model.replace(" ",""))){ // if GPT3 selected
			return prepare4GptCompletion(input,output)
		}
		else { // default GPT3.5 / GPT 4 Chat format 
			return prepare4GptChat(input,output,systemMessage)
		}
	}
	public static void writeOutputFile(data, fileName){
		def outputFile = new File(fileName)
		outputFile.withWriter { writer ->
			data.each { element ->
				writer.writeLine(element)
			}
		}	
	}

	public static void main(def args){
		
		println "Running main."
		String inputFile = "" // Input file
		String model = "gpt3.5" // LLM model; for now only GPT models are supported. Default: GPT 3.5 (Chat API)
		String systemMessage = "" // System message, only for Chat API
		int validation = 10 // Split of validation set
		String skipEntries = "empty" // by default skip empty inputs from the logs (e.g. Welcome Messages)
		
		for (int i=0;i<args.size()-1;i++){	
			if(args[i]=="-i") inputFile = args[i+1]
			if(args[i]=="-m") model = args[i+1]
			if(args[i]=="-s") systemMessage = args[i+1]
			if(args[i]=="-v") validation = Integer.valueOf(args[i+1])
			if(args[i]=="-z") skipEntries = args[i+1] 
		}
		
		if (!inputFile) {
			throw new Exception("Please provide the name of the input file as argument -i")
		}
		else {
			
			// Get data
			def jsonSlurper = new JsonSlurper()
			def jsonData = jsonSlurper.parse(new File(inputFile))
			Collections.shuffle(jsonData) // randomize order before the training/validation split
			// Training - validation split
			def splitPoint = (int) Math.round(jsonData.size()*(validation/100))
			def trainingData = jsonData.subList(splitPoint,jsonData.size())
			def validationData = jsonData.subList(0,splitPoint)
			
			// Create Json Lines
			List<String> trainingJson = new ArrayList()
			List<String> validationJson = new ArrayList()
			
			for (dialog in trainingData){
				// Ignore empty input and output
				if ((skipEntries!="empty")||(dialog["Input"]&&dialog["Output"])){
					trainingJson << prepareLine(model,dialog["Input"],dialog["Output"],systemMessage)
				}
			}
			for (dialog in validationData){
				if ((skipEntries!="empty")||(dialog["Input"]&&dialog["Output"])){
					validationJson << prepareLine(model,dialog["Input"],dialog["Output"],systemMessage)
				}
			}
			
			// Create Jsonl file
			writeOutputFile(trainingJson,"training.jsonl")
			if (validationData) writeOutputFile(validationJson,"validation.jsonl")
		
			println "Done."
		}
		
	}
	
	
}

println "Starting Inquire2Finetuning."
Inquire2Finetuning.main(args)
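Assuming you saved the script as Inquire2Finetuning.groovy and your Inquire export as export.json (both file names are just examples), a run could look like this:

groovy Inquire2Finetuning.groovy -i export.json -m gpt3.5 -s "You are a helpful company assistant." -v 10

This writes training.jsonl and validation.jsonl to the working directory, putting 10% of the shuffled entries into the validation set.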

Here is also a TQL query to get you started with your logs:

la s.id, t.e1.userInput as Input, t.e2.answerText as Output, t.id, t.time, s.beginTime:
exists t.e1.userInput, t.e1-{type=="response"}>t.e2,
s.beginTime == in {"2024-01-01T00:00".."2024-01-11T23:59"}
order by s.beginTime asc, s.id, t.time asc

This example query gathers logs from January 1st to 11th, 2024, which you can export to JSON within your Log Data Source.
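The converter only relies on the Input and Output fields of that export; a minimal entry could therefore look like this (the example values are illustrative, and the remaining selected columns will appear alongside them):

[
  {"Input": "What are your opening hours?", "Output": "We are open Monday to Friday, 9 am to 5 pm."}
]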

Please make sure to anonymize user data if required for your project, and let us know your thoughts on finetuning in the comments below!