Data Augmentation - Methods and Use Cases

If you are using Machine Learning in Teneo Studio, you need enough training examples to obtain a high-quality model. We suggest creating at least 10 training examples for each class (also called an intent), and ideally 20 examples per class. However, the raw data you have may well fall short of this number. If that is the case, data augmentation can help you generate more training examples. This article presents several data augmentation techniques.

Create examples manually

When you lack training examples, writing them manually, sentence by sentence, is always an option. The quality of these examples is good, but the process is extremely time-consuming. It is recommended only when your solution has just a few classes, or when you need only a few more examples (say, fewer than five) per class. In most cases, an automatic data augmentation technique is recommended.

Back translation

Back translation is a simple and effective data augmentation method for text data. You translate the text into another language, then translate it back into the original language. Usually, the returned text is slightly different from the original text while preserving all the key information.

If you find the returned text too close to the original text, you can add another language to make a translation cycle. For example, you can first translate the original text from language A to language B, then from language B to language C, and finally from language C back to language A. Below is an example of back translation (using Google Translate):

  • Original: Usually, the text returned is slightly different with the original text while preserving all the key information.
  • English → Spanish → English: The returned text is usually slightly different from the original text and retains all the key information.
  • English → Spanish → Chinese → English: The returned text is usually slightly different from the original and retains all key information.
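
The translation cycle described above can be sketched in a few lines of Python. This is only an illustration: the `translate` callable and its `(text, source_lang, target_lang)` signature are hypothetical, and you would wrap whichever translation service you actually use behind it.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: translate(text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, languages: Sequence[str], translate: Translator) -> str:
    """Translate text along a chain of languages and back to the first one.

    languages=["en", "es"] gives en -> es -> en;
    languages=["en", "es", "zh"] gives en -> es -> zh -> en.
    """
    if len(languages) < 2:
        raise ValueError("need the original language and at least one pivot language")
    path = list(languages) + [languages[0]]  # close the cycle
    for source, target in zip(path, path[1:]):
        text = translate(text, source, target)
    return text


def augment_by_back_translation(
    text: str,
    pivot_chains: Sequence[Sequence[str]],
    translate: Translator,
    original_lang: str = "en",
) -> List[str]:
    """Return the distinct paraphrases that differ from the original text."""
    results: List[str] = []
    for chain in pivot_chains:
        candidate = back_translate(text, [original_lang, *chain], translate)
        if candidate != text and candidate not in results:
            results.append(candidate)
    return results
```

Calling `augment_by_back_translation(text, [["es"], ["es", "zh"]], translate)` would reproduce the two cycles shown in the example above. Results that come back identical to the original are dropped, since they add nothing to the training data.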

Easy data augmentation

Besides back translation, there are many other simple data augmentation methods which can be easily automated.

  • Replace by synonym
    Randomly choose n words which are not stop words from the original sentence and replace them with synonyms. The quality of the new texts is usually better than the texts generated by other easy data augmentation techniques. However, you need a word database containing a synonym dictionary in the corresponding language, such as PPDB or WordNet. It might be difficult to find a high-quality word database for non-English languages.

  • Random insertion
    Randomly choose a word which is not a stop word from the original sentence, search for its synonym, and insert its synonym into a random position in the sentence. Repeat this process n times. This method relies on a synonym dictionary as well, and you may get a syntactically wrong result.

  • Random deletion
    For each word in the original sentence, remove it with a probability p that you choose. This method does not require any word database, so you can apply it to texts in any language. However, there is some risk of losing key information and generating bad examples.

  • Random swap
    Randomly choose two words from the original sentence and swap their positions. Repeat this process n times. This method does not require a word database either, and you will not lose information because no word is removed. However, there is some risk of generating a sentence with invalid syntax. Furthermore, if your model does not consider word order (for example, a bag-of-words model), random swap has no effect at all.
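
The four operations above can be sketched in plain Python as follows. This is a minimal illustration, not the code from any particular repository: the stop-word set and synonym dictionary are tiny invented stand-ins for a real resource such as WordNet, and the fallback in `random_deletion` (keep one word if everything was deleted) is an assumption added here to avoid empty examples.

```python
import random

# Toy resources for illustration only; a real setup would load stop words
# and synonyms from a word database such as WordNet.
STOP_WORDS = {"a", "an", "the", "is", "to", "my", "i", "do", "how"}
SYNONYMS = {"change": ["modify", "update"], "password": ["passcode"], "forgot": ["lost"]}


def _candidates(words):
    """Indices of words that are not stop words and have a known synonym."""
    return [i for i, w in enumerate(words)
            if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]


def synonym_replacement(words, n, rng=random):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    words = list(words)
    candidates = _candidates(words)
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return words


def random_insertion(words, n, rng=random):
    """n times: pick a non-stop word and insert one of its synonyms anywhere."""
    original = list(words)
    result = list(original)
    candidates = _candidates(original)
    if not candidates:
        return result
    for _ in range(n):
        word = original[rng.choice(candidates)]
        synonym = rng.choice(SYNONYMS[word.lower()])
        result.insert(rng.randint(0, len(result)), synonym)
    return result


def random_deletion(words, p, rng=random):
    """Drop each word independently with probability p."""
    words = list(words)
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Assumption: never return an empty sentence, keep one random word instead.
    return kept if kept else [rng.choice(words)]


def random_swap(words, n, rng=random):
    """n times: swap the positions of two randomly chosen words."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words
```

Each function takes a tokenized sentence, for example `"I forgot my password".split()`, and returns a new list of words. Passing a seeded `random.Random` instance as `rng` makes the output reproducible. Note that `random_swap` always returns the same multiset of words, which is why it adds nothing for a bag-of-words model.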

To apply these easy data augmentation techniques, you can use the Python code in this repository. You need to install the nltk package before you use it. We recommend Anaconda, a Python distribution that comes with the essential machine learning packages and the commonly used Jupyter Notebook console pre-installed.

Advanced word replacement technique

Compared with the traditional synonym replacement method, which relies on a synonym dictionary, the more popular technique today is deep-learning-driven word replacement via word embeddings. Instead of replacing the words in the original text with their synonyms from a dictionary, a pre-trained language model is used to find the closest words in the vector space, and these replace the original words.

  • Non-contextual embeddings
    These models, for example Word2Vec and GloVe, assign each word a single fixed vector, regardless of the sentence in which it appears. They were quite popular before the Transformer was published in 2017.

  • Contextual embeddings
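    These models produce a vector for each word that depends on the surrounding sentence. They are among the most advanced NLP techniques and build on popular pre-trained Transformer-based models, such as BERT and RoBERTa.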
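
To make "closest words in the vector space" concrete, here is a minimal sketch of the non-contextual case using cosine similarity. The three-dimensional vectors are invented for illustration; real Word2Vec or GloVe vectors have hundreds of dimensions learned from large corpora. A contextual model would instead predict the replacement from the whole sentence, which this sketch does not attempt.

```python
import math

# Toy vectors invented for this example, not taken from any trained model.
EMBEDDINGS = {
    "cheap":       [0.90, 0.10, 0.00],
    "inexpensive": [0.85, 0.15, 0.05],
    "affordable":  [0.80, 0.20, 0.10],
    "flight":      [0.00, 0.90, 0.30],
    "plane":       [0.05, 0.85, 0.35],
    "book":        [0.10, 0.20, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(word, k=2):
    """Return the k vocabulary words most similar to `word`."""
    target = EMBEDDINGS[word]
    scored = [(cosine(target, vec), other)
              for other, vec in EMBEDDINGS.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


def replace_word(sentence, word):
    """Replace every occurrence of `word` with its closest neighbour."""
    substitute = nearest_words(word, k=1)[0]
    return " ".join(substitute if w == word else w for w in sentence.split())


print(nearest_words("cheap"))                          # ['inexpensive', 'affordable']
print(replace_word("cheap flight to Madrid", "cheap")) # inexpensive flight to Madrid
```

Because the nearest neighbour is chosen by vector distance alone, the replacement is not guaranteed to be a true synonym; with real embeddings, antonyms and merely related words can also rank highly, so it is worth reviewing a sample of the generated examples.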

Do not worry if you are not familiar with deep learning: open-source Python packages can handle all of this for you.

Automatic Text Data Augmentation by NLPAug

NLPAug is a Python package that covers all these data augmentation techniques for text data and acoustic data. You can install it by running the pip command pip install nlpaug in the Windows command line or the Anaconda prompt (if you have installed Anaconda), and you can find the source code on GitHub.

After the pip install process has finished, open your Python console and import the part of the package you need. We recommend a Python notebook such as Jupyter Notebook, which is already included if you have installed Anaconda. For example, the following line imports the word-level text augmenter:

import nlpaug.augmenter.word as naw

Then create an augmenter object corresponding to a specific data augmentation method. For example, the following code creates a BERT-based augmenter named aug that substitutes words in the text:

aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="substitute")

Please note that, depending on the augmentation method you choose, you may need to install extra Python packages, for example gensim for non-contextual embeddings and transformers for contextual embeddings. When the augmenter runs, certain pre-trained models may need to be downloaded, and these can be hundreds of megabytes or more. Make sure that your Internet connection works well and that you have enough space to store these model files temporarily.

After creating the augmenter, you can generate training examples from your original data like this:

augmented_text = aug.augment(text)
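
In practice you will usually want to run the aug.augment call above over every class that is short of examples. The helper below is a minimal sketch of that loop, not part of NLPAug: the names `top_up` and `as_list` are hypothetical, and it assumes only that the augmenter has an `augment(text)` method returning either a single string or a list of strings (check which one your installed version returns).

```python
def as_list(result):
    """Normalize an augmenter result to a list of strings."""
    return [result] if isinstance(result, str) else list(result)


def top_up(examples_by_class, augmenter, target=20, max_attempts=100):
    """Add augmented examples to each class until it holds `target` examples.

    examples_by_class maps a class (intent) name to its list of training
    examples. Classes that already reach `target` are returned unchanged.
    """
    augmented = {}
    for label, examples in examples_by_class.items():
        pool = list(examples)
        attempts = 0
        # max_attempts guards against augmenters that keep returning duplicates.
        while examples and len(pool) < target and attempts < max_attempts:
            source = examples[attempts % len(examples)]  # cycle through originals
            for candidate in as_list(augmenter.augment(source)):
                if candidate not in pool and len(pool) < target:
                    pool.append(candidate)
            attempts += 1
        augmented[label] = pool
    return augmented
```

With the augmenter created earlier, a call such as `top_up({"greeting": ["hello", "hi there"]}, aug, target=20)` would return the original examples followed by new, distinct variants. The default target of 20 follows the per-class recommendation at the start of this article.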

Below is an example of doing data augmentation with Contextual embeddings (RoBERTa):

Besides these advanced data augmentation techniques, the NLPAug package also provides you with a wide range of methods for easy data augmentation and back translation. Open this Python notebook for a quick start.

Conclusion

This article introduced several techniques for text data augmentation, which is an important step in improving the quality of your machine learning model. We encourage you to share your data augmentation use case in the Forum to help other Teneo developers. We hope you found this article useful, and feel free to ask any questions you have on this topic here.
