Indiana University Bloomington

The College of Arts & Sciences

Department of Linguistics

Natural Language Processing for Bantu Languages

By: Kenneth Steimel

Monday, February 10, 2020

A student holding a model of a ship.
Kenneth Steimel is an advanced student in the computational linguistics doctoral program who has worked extensively on applying advanced computational methods to language technologies for African languages. He has been on staff at LINGUIST List, here on IU's campus, and is a recipient of the 2019 Householder Best Paper Award, an award based on faculty and alumni support, for his paper on labeling word classes in Bantu languages.

My primary research area is syntactic parsing and part-of-speech tagging of under-resourced Bantu languages. This is important because these languages are understudied by computational linguists and are typologically very different from the mainstream languages of computational linguistics. Working with Bantu languages gives us a better picture of the advantages and disadvantages of different systems.

One of the first things I wrote in this area was a paper on low-resource part-of-speech tagging for the Luyia language Wanga, using cross-language tagging. For this, I used a collection of texts gathered by Dr. Michael Marlo of the University of Missouri – Columbia. A standard approach to part-of-speech tagging in English uses Hidden Markov Models (HMMs). These models keep track of the likelihood of one part-of-speech tag following another (transition probabilities) as well as the likelihood of a given part-of-speech tag producing a particular word (emission probabilities). HMMs work reasonably well for English when they have enough data to estimate those probabilities from. For Wanga, however, they did very poorly (~40% accuracy). Mixing Swahili data in with the Wanga data when estimating these probabilities results in an even worse tagger. Using an unconventional tagger, however, does the trick.
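
To make the two kinds of probabilities concrete, here is a minimal sketch of an HMM tagger in Python. It is not the tagger from the paper: the tiny English corpus, the `floor` value used for unseen events, and the function names `train_hmm` and `viterbi` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical, for illustration only).
CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
START = "<s>"  # pseudo-tag marking the start of a sentence


def train_hmm(corpus):
    """Estimate transition and emission probabilities by relative frequency."""
    trans_counts = defaultdict(Counter)  # previous tag -> counts of next tag
    emit_counts = defaultdict(Counter)   # tag -> counts of words
    for sentence in corpus:
        prev = START
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(counts):
        return {
            key: {k: v / sum(c.values()) for k, v in c.items()}
            for key, c in counts.items()
        }

    return normalize(trans_counts), normalize(emit_counts)


def viterbi(words, trans, emit, floor=1e-6):
    """Return the most probable tag sequence; `floor` stands in for unseen events."""
    tags = list(emit)
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{}]
    for tag in tags:
        p = trans.get(START, {}).get(tag, floor) * emit[tag].get(words[0], floor)
        best[0][tag] = (p, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            e = emit[tag].get(words[i], floor)
            best[i][tag] = max(
                (best[i - 1][p][0] * trans.get(p, {}).get(tag, floor) * e, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


trans, emit = train_hmm(CORPUS)
print(trans["DET"])   # transition probabilities out of DET
print(emit["NOUN"])   # emission probabilities for NOUN
print(viterbi(["a", "cat", "barks"], trans, emit))  # ['DET', 'NOUN', 'VERB']
```

Note that a word the model has never seen receives the same `floor` emission probability under every tag, so only the transition probabilities can help decide its tag. With very little training data most words are unseen, which is one plausible reason a model of this kind struggles in a low-resource setting.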

A support vector machine (SVM) model that was less focused on the sequence of tags, and instead used the characters present in the words themselves to determine part of speech, did much better (~90% accuracy). The tagger's character inputs were picking up on morphemes associated with particular parts of speech. On previously unseen words, the tagger still did very well. When working with very small amounts of data, for example training a tagger on 60 sentences, adding data from Swahili, a higher-resource Bantu language, was beneficial. But as you use more data in the target language, the addition of Swahili data has almost no effect on tagging accuracy.
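
The idea that character inputs can pick up on morphemes can be sketched with a small feature extractor. This is an illustration only: the n-gram lengths, the `^` and `$` boundary markers, and the function names are assumptions, since the exact features of the paper's tagger are not described here. The example words are common Swahili verb forms chosen as a toy example, not data from the Wanga corpus, and the classifier itself is omitted.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Count the character n-grams of a word, with ^ and $ marking its edges."""
    padded = f"^{word}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def shared_ngrams(a, b, **kwargs):
    """Return the sorted n-grams that two words have in common."""
    return sorted(set(char_ngrams(a, **kwargs)) & set(char_ngrams(b, **kwargs)))


# Two Swahili verbs with different stems but the same prefixes
# (subject marker a- followed by present-tense -na-).
print(shared_ngrams("anasoma", "anapika", n_min=3, n_max=3))  # ['^an', 'ana']
```

The two verbs share no stem, yet they share the word-initial trigrams that cover their prefixes. Feature counts like these could be passed to a linear classifier such as an SVM, which can then associate such n-grams with a part of speech even for a word it has never seen as a whole.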

So models that can leverage information about characters are important when working with Bantu languages, and adding data from other Bantu languages can be beneficial when working with very small datasets. But how does language relatedness play into this? The high-resource language used, Swahili, is a Bantu language but is still quite distant from Wanga. Another paper examined the same SVM model augmented with data from several different languages: Swahili, Tiriki and Bukusu. Tiriki and Bukusu are both Luyia languages like Wanga, though Wanga is more closely related to Bukusu. While both Tiriki and Bukusu augmented the Wanga tagger better than Swahili did, the more distantly related Tiriki was more useful. This was pretty surprising. The tagger augmented with Tiriki data did notably better on loanwords from Swahili, which may have provided just enough of a boost.

When we as computational linguists include more languages, we get a more realistic view of what models work better in certain linguistic situations.

Copyright © 2025 The Trustees of Indiana University