My primary research area is syntactic parsing and part-of-speech tagging of under-resourced Bantu languages. This is important because these languages are understudied by computational linguists and are typologically very different from the languages that dominate the field. Working with Bantu languages gives us a better picture of the advantages and disadvantages of different tagging and parsing systems.
One of the first things I wrote in this area was a paper on low-resource part-of-speech tagging for the Luyia language Wanga using cross-language tagging. For this, I used a collection of texts gathered by Dr. Michael Marlo of the University of Missouri – Columbia. A standard approach to part-of-speech tagging in English uses Hidden Markov Models (HMMs). These models keep track of the likelihood of one part-of-speech tag following another (transition probabilities) as well as the likelihood of a given word having a particular part of speech (emission probabilities). HMMs work reasonably well for English when there is enough data to estimate those probabilities. For Wanga, however, they did very poorly (~40% accuracy), and mixing Swahili data in with the Wanga data when estimating the probabilities resulted in an even worse tagger. An unconventional tagger, however, does the trick.
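To make the baseline concrete, here is a minimal sketch of an HMM tagger of the kind described above, using NLTK's implementation rather than the exact setup from the paper; the Wanga-style words and tags are invented placeholders, not examples from the actual corpus.

```python
# Minimal HMM baseline sketch (NLTK). Words and tags are invented
# placeholders, not drawn from the real Wanga corpus.
from nltk.tag.hmm import HiddenMarkovModelTagger

# Training data: a list of sentences, each a list of (word, tag) pairs.
train_sents = [
    [("omukhasi", "N"), ("alima", "V"), ("omukunda", "N")],
    [("abaana", "N"), ("balia", "V"), ("obusuma", "N")],
]

# train() estimates transition probabilities P(tag_i | tag_{i-1}) and
# emission probabilities P(word | tag) from the labeled sentences.
tagger = HiddenMarkovModelTagger.train(train_sents)

# Tag a new sentence. With so little data, both probability tables are
# estimated unreliably, which is where the HMM breaks down for Wanga.
print(tagger.tag(["abaana", "balia", "omukunda"]))
```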
A support vector machine (SVM) model that was less focused on the sequence of tags and instead used the characters in the words themselves to determine part of speech did much better (~90% accuracy). The tagger's character inputs were picking up on morphemes associated with particular parts of speech, and it still did very well on words it had never seen. When working with very small amounts of data, for example training a tagger on only 60 sentences, adding data from Swahili, a higher-resource Bantu language, was beneficial. But as you use more data in the target language, the added Swahili data has almost no effect on tagging accuracy.
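The sketch below shows the general idea, assuming a scikit-learn pipeline rather than the exact model from the paper: each word is classified from its character n-grams by a linear SVM, so the affixes that mark part of speech become usable features. The words and tags are again invented placeholders.

```python
# Character-based word classifier sketch: a linear SVM over character
# n-grams. Words and tags are invented placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

words = ["omukhasi", "abaana", "alima", "balia", "omukunda"]
tags  = ["N", "N", "V", "V", "N"]

model = make_pipeline(
    # Character n-grams (length 1-4) within word boundaries capture
    # prefixes and suffixes that correlate with part of speech.
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LinearSVC(),
)
model.fit(words, tags)

# Because the features are sub-word, the model can still make a sensible
# guess for a word form it has never seen in training.
print(model.predict(["omulimi"]))
```

Sub-word features are what allow generalization to unseen words: a new form that shares a prefix or suffix with the training examples still gets informative features even though the whole word never appeared.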
So models that can leverage character-level information are important when working with Bantu languages, and adding data from other Bantu languages can be beneficial when working with very small datasets. But how does language relatedness play into this? The high-resource language used, Swahili, is a Bantu language but is still quite distant from Wanga. Another paper examined the same SVM model, this time augmenting it with data from several different languages: Swahili, Tiriki, and Bukusu. Tiriki and Bukusu are both Luyia languages like Wanga, but Wanga is more closely related to Bukusu. While both Tiriki and Bukusu augmented the Wanga tagger better than Swahili did, the more distantly related Tiriki was the more useful of the two, which was pretty surprising. The Tiriki data did notably better on loanwords from Swahili, which may have provided just enough of a boost.
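As a rough illustration of how these augmentation experiments can be set up, the sketch below simply concatenates annotated data from a related language onto the small Wanga training set before training the same character-based pipeline; the words, tags, and dataset contents are invented placeholders, and the real experiments compare Swahili, Tiriki, and Bukusu as the added language.

```python
# Augmentation sketch: train on Wanga alone versus Wanga plus a related
# language's annotated words. All data here is an invented placeholder.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_char_svm(words, tags):
    """Character n-gram SVM tagger, as in the previous sketch."""
    model = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
        LinearSVC(),
    )
    return model.fit(words, tags)

wanga_words,  wanga_tags  = ["omukhasi", "alima", "abaana"], ["N", "V", "N"]
bukusu_words, bukusu_tags = ["kumukunda", "balima", "babaana"], ["N", "V", "N"]

# Baseline: Wanga only. Augmented: Wanga plus the related language's data.
wanga_only = train_char_svm(wanga_words, wanga_tags)
augmented  = train_char_svm(wanga_words + bukusu_words, wanga_tags + bukusu_tags)
```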
When we as computational linguists include more languages, we get a more realistic view of which models work better in which linguistic situations.