
Language and Artificial Intelligence
Advancements in AI models shrink the digital language gap.
Despite impressive technological progress, only about 600 of the world's 7,151 languages* are meaningfully supported for online use. In fact, estimates suggest that 99% of the world’s online content is written in only 40 languages. This leaves millions of people, many of whom are aspiring students, educators, future innovators and community leaders of all kinds, limited in their ability to access the huge store of information and resources available in the digital world. More tragic still, this gap in digital language access denies communities the opportunity to contribute their language, and the traditional knowledge encoded within it, to the wider global discourse.
SIL’s work alongside these communities includes developing software and digital tools that help expand the number of languages supported digitally. Research and development in Artificial Intelligence (AI) and Natural Language Processing (NLP), the programming of computers to process human language data, is a notable area of this work, and the SIL Innovation Team (IDX) recently contributed to two major developments in this field. These exciting advancements are an important step forward in supporting digital language vitality, since both AI and NLP play an increasingly large part in powering features related to email, language learning, search engines, education, ecommerce, news and entertainment.
Working with researchers from the University of Dayton, SIL researchers found a way to create a new kind of multimodal "language model" that trains AI systems to flexibly use multiple types of data (i.e. audio and text) available for languages in low-resource settings. This work resulted in a publication at the ACL 2022 conference, where there was a special theme track on “Language Diversity: from Low-Resource to Endangered Languages.” Elsewhere, SIL used audio Bible data in collaboration with African researchers to produce high-quality Text-to-Speech programs for six African languages, which are available to the public.
According to SIL Data Scientist Dan Whitenack, “SIL is intentionally working to make sure that the benefits of AI and NLP extend to local language communities and to engage with researchers from a wider set of language communities in academic research on the topic. We are beginning to use the audio, text, video, and other data we have accumulated over 100 years to train new AI models that better represent the world's languages and that unlock new language possibilities.”
In addition to these areas, SIL is currently involved in multiple streams of AI development. Some examples include:
- Language identification
- Multilingual speech systems
- Local language chat
- Translation quality checking
- Multimodal language models
- Human-in-the-loop machine translation
To learn more about SIL’s work in the field of AI, visit ai.sil.org.
The possibilities for future AI advancement are equal parts astounding and exciting. With digital technology becoming such a mainstay of the world’s commerce, education and entertainment, local communities experience greater opportunities to flourish when their languages are supported for online use. In view of this, SIL’s AI and NLP technology is built to represent language in the ways it is actually used and communicated, whether spoken, written or signed. This includes a focus on multilingual community contexts. With technology continuing to advance at an accelerating rate, the need for innovation and research into digital language support only increases.
*Reported in the 25th edition of Ethnologue, 2022.