The Thai character confounding NLP engines

Bangkok, Thailand

If you’ve ever attempted to learn Thai, you can probably guess that this Southeast Asian language is extremely difficult, if not the most difficult, for machines to understand as well.

Thai is a character-based language with numerous quirks that disrupt natural language processing algorithms. Because of these quirks, leading NLP engines fail to understand Thai beyond the surface level, which results in an underwhelming customer experience. Why?

Thai is a notoriously difficult language for natural language processing engines to understand.

First, a single Thai sentence can contain several types of interjection words. Many of these words do not carry any meaning relevant to the sentence’s intent; they are most often used to express emotion or politeness. At this point, you might be thinking: easy, just remove these words. Proto’s NLP team would have loved that too!

However, there is presently no accurate word separation technique for the Thai language. 

In English, we label such interjection words with a simple Part-of-Speech (POS) Tagging technique: a dictionary defines the problematic words and the NLP engine removes them. For example:

“Oh! I can fly now.”

We would remove 'Oh' without disrupting the sentence’s intent. It is not so straightforward in Thai.
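The English case can be sketched in a few lines of Python. This is a minimal illustration, not a real POS tagger: the `INTERJECTIONS` set and the `strip_interjections` helper are hypothetical names chosen for this example, and a plain dictionary lookup stands in for the tagging step.

```python
import string

# Hypothetical, hand-picked dictionary of English interjections (an assumption).
INTERJECTIONS = {"oh", "wow", "ah", "hmm"}

def strip_interjections(sentence):
    """Drop whitespace-separated tokens that match the interjection dictionary."""
    kept = []
    for token in sentence.split():
        # Compare without surrounding punctuation, so "Oh!" matches "oh".
        core = token.strip(string.punctuation).lower()
        if core not in INTERJECTIONS:
            kept.append(token)
    return " ".join(kept)

print(strip_interjections("Oh! I can fly now."))
```

The lookup works only because English puts spaces between words, so `split()` yields clean tokens before the dictionary is consulted.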

Let’s examine ค่ะ and คะ, interjection words used frequently by female speakers to indicate politeness. The POS Tagging technique is unfeasible here because the Thai dictionary cannot accurately separate words that contain ค่ะ or คะ across all of their various uses. For example:

“อยากรู้มั้ยคะ (Do you want to know?)”
“อยากรู้มั้ยคะแนนเท่าไหร่ (Do you want to know the score?)” 

In the first sentence, คะ is an interjection word indicating the politeness with which the speaker asks อยากรู้มั้ย.

In the second sentence, คะ is a character within the word คะแนน (score). So you see, in order to identify ค่ะ or คะ in all their various forms, an NLP engine that actually understands Thai would require a significantly more advanced algorithm with additional pre-processing steps.
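To make the failure concrete, here is a minimal sketch of what happens when the two example sentences are cleaned by plain substring removal. The `naive_strip` helper is a hypothetical name for this illustration; it is the shortcut the article warns against, not a technique Proto uses.

```python
# The two polite particles discussed above.
PARTICLES = ("ค่ะ", "คะ")

def naive_strip(text):
    """Remove every occurrence of a particle, with no word segmentation at all."""
    for particle in PARTICLES:
        text = text.replace(particle, "")
    return text

question = "อยากรู้มั้ยคะ"
score_question = "อยากรู้มั้ยคะแนนเท่าไหร่"

# The particle is removed cleanly from the first sentence.
print(naive_strip(question))

# In the second sentence the same rule eats the start of คะแนน (score).
damaged = naive_strip(score_question)
print(damaged)
print("คะแนน" in score_question, "คะแนน" in damaged)
```

The second sentence loses the word for "score" entirely, which is why the characters cannot be removed without first deciding where the word boundaries fall.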

With the POS Tagging technique rendered useless, we could look to another method called Word Embedding.

This method converts each word in a language into a vector of numbers that represents some aspect of its meaning. Embedding is a common technique, nearly universal across a wide range of NLP tasks; however, in the case of our คะ conundrum, embedding also has disqualifying limitations.

Popular word embedding libraries with Thai capability, such as 'Word2vec' and 'fastText', are trained on shallow language-modelling tasks that lose context, which in turn leads to a misunderstanding of intent. For example:

“I stole money from the bank”
“The bank of the river overflowed”

The word bank has different meanings according to the context of each sentence. The word embedding technique assigns a single vector to each word, which is forced to represent this wide range of possible intents. In Thai, relying solely on these vectors without context is a recipe for (shall we say politely) 'intellectually-challenged' chatbots.
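The single-vector limitation can be shown with a toy lookup table. This is a sketch only: the `STATIC_VECTORS` values are made up for illustration and do not come from Word2vec, fastText, or any trained model.

```python
# Toy static embedding table; the numbers are invented for this example.
STATIC_VECTORS = {
    "bank": (0.21, -0.48, 0.77),
    "money": (0.90, 0.10, -0.30),
    "river": (-0.60, 0.80, 0.05),
}
UNKNOWN = (0.0, 0.0, 0.0)  # fallback for words missing from the table

def embed(sentence):
    """Look up one fixed vector per token, ignoring the rest of the sentence."""
    tokens = sentence.lower().split()
    return {token: STATIC_VECTORS.get(token, UNKNOWN) for token in tokens}

finance = embed("I stole money from the bank")
nature = embed("The bank of the river overflowed")

# The lookup never sees the neighbouring words, so both senses share one vector.
print(finance["bank"] == nature["bank"])
```

Whatever sits downstream of this lookup receives the same numbers for the financial "bank" and the river "bank", so it has no signal with which to tell them apart.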

These limitations across various techniques have led to demand from Proto's clients for what they call “real Thai NLP”. Specifically, they want chatbots that understand the language’s quirks.

So, without revealing Proto’s secret sauce, our NLP team developed a deep-learning model that trains a neural network to map word vectors according to (1) the entire sentence and (2) a word’s surrounding words. As a result, this deep-learning method delivers Thai chatbots powered by a more contextual word embedding algorithm.
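The general idea of letting the sentence and the neighbouring words shape a word's vector can be illustrated with a deliberately crude sketch. This is not Proto's model, which the article does not disclose, and it is not a neural network: it simply blends a made-up static vector with averages taken over the whole sentence and over a small window. The `STATIC` table, `contextual_vector`, `window`, and `alpha` are all hypothetical names and values.

```python
# Made-up 3-dimensional static vectors; unknown words map to zeros.
STATIC = {
    "bank": (0.2, 0.2, 0.2),
    "money": (0.9, 0.1, 0.0),
    "stole": (0.7, 0.0, 0.1),
    "river": (0.0, 0.9, 0.1),
    "overflowed": (0.1, 0.8, 0.0),
}
ZERO = (0.0, 0.0, 0.0)

def mean(vectors):
    """Component-wise average of a non-empty list of 3-dimensional vectors."""
    return tuple(sum(v[d] for v in vectors) / len(vectors) for d in range(3))

def contextual_vector(tokens, index, window=2, alpha=0.5):
    """Blend a word's static vector with (1) the sentence and (2) nearby words."""
    vectors = [STATIC.get(token, ZERO) for token in tokens]
    lo, hi = max(0, index - window), min(len(tokens), index + window + 1)
    neighbours = [vectors[i] for i in range(lo, hi) if i != index]
    parts = [mean(vectors)]              # (1) the entire sentence
    if neighbours:
        parts.append(mean(neighbours))   # (2) the surrounding words
    context = mean(parts)
    return tuple(alpha * w + (1 - alpha) * c for w, c in zip(vectors[index], context))

finance = "i stole money from the bank".split()
nature = "the bank of the river overflowed".split()
finance_vec = contextual_vector(finance, finance.index("bank"))
nature_vec = contextual_vector(nature, nature.index("bank"))

# The same word now gets a different vector in each sentence.
print(finance_vec != nature_vec)
```

Even this simple averaging pulls "bank" toward "money" in one sentence and toward "river" in the other; a trained deep-learning model learns a far richer version of that dependence on context.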

Thus far, the algorithm has proven robust across various NLP tasks such as sentiment analysis and intent classification. In the example below, a Thai job application chatbot understands human intent with and without ค่ะ.

Proto's deep-learning technique maps Thai word vectors with a contextual embedding algorithm.

The commercial applications of this deep-learning technique from Proto are far-reaching: enterprises can now deploy chatbots that deliver not only a savings advantage, but also a more humanized Thai customer experience than the competition.

Stay tuned for more innovations and insights from the NLP team at Proto!

Dr. Natapon Pantuwong is an NLP engineer at Proto, specialized in deep-learning techniques for the Thai language. He served on the faculty of the computer science department at King Mongkut's Institute of Technology Ladkrabang for eleven years. To reach Dr. Pantuwong, please write to him at