AI Won’t Protect Endangered Languages
Combinations of characters on a screen mean nothing without agency and intention.
NOVEMBER 14, 2024
In November 2022, a new chatbot named ChatGPT seemed to start speaking English, producing reasonably coherent and grammatically correct sentences that many human English speakers found at least superficially convincing. Since then, ChatGPT, Google’s Gemini and Meta’s Llama have given the impression of knowing about 100 major languages, to varying degrees. The makers of these bots, perhaps conscious that their supposedly all-purpose tools cover only a fraction of human communication systems, are now setting their sights on the world’s 7,000 other languages — primarily oral, minority and Indigenous.
Overheated headlines like “Harnessing AI to Preserve the World’s Endangered Languages,” themselves now often generated by artificial intelligence, promise that this push will help the cause of linguistic diversity. By learning to “speak” these languages, the bots will supposedly ensure their survival, in one form or another, at a time when more and more languages are endangered. Some people, including many language activists, hope that the bots will serve as conversation partners, teachers, translators and even creators in these languages. For others, what matters is the symbolism of digital inclusion: the consolation that a language at least “lives” on a server somewhere.
But are these bots really “speakers” in the first place? And what is lost when we grant them that status, for the sake of convenience or out of desperation?
The large language models, or LLMs, that power tools like ChatGPT are so called not because they have a knack for languages but because they train on text — particularly the enormous, flawed, cocreated text known as the internet. In other words, they first take in the digitally available things that humans have previously written. They then remix and regurgitate this material in seemingly novel and sometimes useful ways.
When you ask an LLM a question, it runs calculations to predict which characters, words or sentences should go where to produce the semblance of a response. Call it auto-complete on steroids. A language model, as the linguist Emily M. Bender and her co-authors write, “is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.”
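To make the auto-complete comparison concrete, here is a deliberately tiny sketch in Python. It is not how ChatGPT or any real LLM is built: those systems use neural networks trained on enormous amounts of text, whereas this toy only counts which word followed which in a single made-up sentence. The function names, the sample sentence and the fixed random seed are all assumptions introduced for illustration. What the sketch shares with the description above is the basic move of choosing the next word from observed frequencies, with no reference to meaning.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words were observed to follow it."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def autocomplete(follows, start, length=8, seed=0):
    """Extend `start` by repeatedly sampling a next word in proportion
    to how often it followed the previous word in the training text."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:  # this word was never followed by anything
            break
        candidates, counts = zip(*options.items())
        out.append(rng.choices(candidates, weights=counts, k=1)[0])
    return " ".join(out)


# Hypothetical miniature "training set"
corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigrams(corpus)
print(autocomplete(model, "the"))
```

Every pair of adjacent words in the output was already present in the input sentence. The program can recombine what it has seen, but it has no notion of cats, mats or rugs.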
To invoke language when talking about LLMs is to misunderstand the nature of language and miss its fundamentally lived and embodied character. LLMs may get better and better at sourcing certain kinds of information or completing certain kinds of tasks, but they are finders, not creators; they are mimics, not conversation partners; they are machines, not people. It is striking to see bots recognized as “speakers” when we still fail to recognize as legitimate so many living, sophisticated communication practices. Many Indigenous languages are still ignorantly dismissed as not being “proper” languages; most hearing people are unaware that the world’s approximately 200 sign languages are fully independent linguistic systems; most animal communication practices are overlooked and misunderstood.
Combinations of characters on a screen mean nothing without agency and intention, which bots lack but are designed to give the illusion of having. This illusion is the problem because it hollows out meaning. (“Agentic” AI, currently in the works, seeks only to deepen the illusion.) Meaning is neither “in” things nor in the mind, as the linguist Alton L. Becker wrote, but emerges “in the interaction of a living thing with its context.” Bots break the very idea of meaning by giving us text without context.
✺
All too slowly and haphazardly, I write these words (and you read them) in a contemporary, standardized form of American English, a language from the Germanic branch of the Indo-European family. From English to Russian to Hindi, hundreds of members of this far-flung family, for all their differences, share historical connections and similarities that demonstrably reach back several thousand years. This in turn is just what is recoverable — given the lack of written records and the current state of historical linguistics — of a continuous history of human language that stretches back at least 100,000 and possibly up to 1 million years. The traces of this unbroken linguistic inheritance, if you know where to look, are everywhere.
Languages are not born overnight but for the most part grow seamlessly out of other languages over generations and centuries. In fact, many linguists now doubt whether countable, neatly defined “languages” exist at all, at least without our forcing them into existence. There is no universal definition of what constitutes a language (versus a dialect, most famously). Nor can the fluid and variable linguistic practices of any community, or even the complex repertoire of any individual, be boiled down to a single fixed code of words and grammar.
Instead we all use multiple interacting codes paired with social meaning, as the linguist Jeff Good puts it, and it would be better to focus on documenting “the linguistic behavior and knowledge of individuals.” Writing with Michael Cysouw, he proposes that linguists abandon “language” for terms like “languoid” and “doculect” that can be more narrowly defined. At best a name like “English” (or “Estonian” or “Ewe”) “suffices as an informal communicative designation,” in the words of Cysouw and Good. We can use these designations, but should remember how much they oversimplify the fluid, layered, and multimodal nature of human communication.
For centuries, the practice of “speaking English” existed among a limited group of people bound by ties of place and kinship. Apart from neighboring peoples, few others had ever heard it, let alone learned it as a second language. Old English (as later scholars periodized it) evolved into Middle English and eventually Modern English, changing gradually across time and space beyond what any early speaker would have recognized. Forms of Norse, French, Latin and Greek used by more powerful peoples exerted a massive influence. Then English speakers began conquering Celtic, Native American, African, Australian and other peoples, pressuring or forcing them to give up their languages.
For the most part, I learned English by interacting with the small set of people around me during the first three or four years of my life. My family had entered the language about a century earlier, after emigrating from Eastern Europe to New York City. Likewise, most of the world’s approximately 400 million native English speakers had other mother tongues in their families until relatively recently. Of course there is a hierarchy of dialects, with the U.K., the U.S., Canada, Australia and New Zealand often seen as an “inner circle,” crosscut by privilege and prejudice along racial, ethnic, class, regional and social lines. Then there are all the gradations and variations among the more than 1 billion people who have learned English as a second (or third or fourth) language very recently indeed.
The more it becomes a global lingua franca, the less English “belongs” to anyone in particular. Though still present, its historic layers and contemporary variants have been and are still being ruthlessly flattened, especially in writing. No single entity is in charge, but governments, companies, schools, publishers, media outlets and others have all been generating, spreading and standardizing English. The result is now a massive, interconnected and ever more predictable blob — the perfect training set for an LLM.
✺
Now consider Seke, an endangered, traditionally oral language with approximately 700 speakers from five villages in northern Nepal. Its dialects may differ as much as American and British English do. Most speakers today have moved to a handful of cities, especially New York, reflecting the worldwide urbanization of minority-language speakers.
In all these respects, and especially in its fundamental connection to a specific group of people and their history, culture and way of life, Seke is much more representative of the world’s languages than English. Likewise, Seke currently has no presence online, apart from an archive of materials I have been recording, transcribing and translating together with speakers in Nepal and New York. Forget LLMs — there is no Seke dictionary, nor any agreed-upon writing system, nor any books in or about the language.
Where AI promises magic, the most pressing need is for basic research, driven by communities. In-depth language documentation is difficult and costly, entailing years of work spent finding, getting to know and recording a range of speakers who can showcase as naturally as possible all the things a language can do. Properly probing a single, subtle element of grammar, like the use of tone or the way clauses are chained together, can be a serious accomplishment, not to mention the unsung arts of lexicography, transcription and archiving. When it comes to developing a language for modern life — beyond the daily oral use of its speakers — such steps cannot be skipped.
In no linguistically meaningful way is Seke deficient, however, nor is any “low-resource language.” Indeed, such languages often preserve the kinds of complex features that are wiped away or leveled off in a lingua franca like English, not to mention the natural variation patterns of embodied human communication. To know Seke is to have spent time with Seke speakers. Usually that means being born into, marrying into or living with a Seke family. Its local, oral, flexible character has served its speakers well for generations, maintaining an identity by indexing connection and belonging.
What happens when companies like Meta, Google and IBM target languages like Seke? To the extent that connection and belonging are the point — not to mention the sense of language ownership felt by many groups — stochastic parroting can feel like a violation. The processes of flattening and standardizing that became normalized over centuries in English could unfold overnight in Seke, with LLMs calling the shots — freezing and even defining what it means to know the language, especially as native speakers grow fewer. Garbage translations multiply online like fake news. Native speakers of the languages in question are bypassed as being “too hard to find,” compared with automated methods of vetting that are completely disconnected from real-life communication. While larger and more powerful language communities may be able to hold the bots to account and even make strategic use of them, it is all too easy to imagine languages like Seke being overwhelmed.
✺
For a bot, the first step is to identify any language it encounters (something Meta claimed last year that its technology had learned to do for 4,017 languages). Beyond that, an ever-expanding universe of language technologies comes into view: automatic translation, speech-to-text transcription, text-to-speech, query responses and all the little tasks from spell-checking to mathematical reasoning that a bot is supposed to be good at. In fact, none of these tasks is neutral and objective: all must be mediated through particular languages and specific historical and cultural traditions.
The bots “learned” their first 100 languages because there was already enough online to gorge on, at least in the languages that mattered most commercially and politically. Substantial human tweaking has also been happening behind the scenes because the stakes are high. Any tech company efforts beyond those first 100 languages have generally had the character of stingy, half-hearted philanthropy. If there is now a new push to reach the next 1,000 languages or more, this is partly a result of activists and communities volunteering their time and labor to help their languages cross the linguistic digital divide. It’s also a matter of larger and larger LLMs, deployed with new methods and gobbling up enormous quantities of energy. Where once you needed a parallel corpus — the same text in two languages, enabling one-to-one comparison — the latest models are making inferences directly from the language in question, without translation. Some are now just “fine-tuning” existing LLMs with tiny amounts of data in the target language.
To an extent, prior work by linguists is now being broken down and fed to LLMs via new methods. A project called “Machine Translation from One Book,” for example, tried training an LLM on a linguist’s description of Kalamang, a primarily oral language with fewer than 200 speakers on an island in Indonesian Papua. While the results were reasonably promising in terms of stochastic parroting, the researchers noted that such work did not necessarily align with the goals of the Kalamang-speaking community or do anything to counteract the endangerment of the language. Errors introduced at any stage, they cautioned, would easily be amplified by the LLM. Likewise a group of IBM researchers in Brazil, in their work with speakers of Guarani Mbya and Nheengatu, noted the peril of “rogue memorization,” where a model keeps pulling from the original training data and not performing the translation (or other task) requested.
The deeper question is about control. Calls for Indigenous data sovereignty — “nothing for us without us,” as the 2020 Los Pinos Declaration puts it — apply forcefully to questions of language. For much of its existence as a field, linguistics itself has fallen into the trap of extracting languages from those who speak and sign them, and it has started to change course only in recent years under pressure. Languages like Seke are not inherently “open source” but may be felt in profound and complex ways to belong to particular communities, which may apply a range of taboos, strictures and norms to their use. Linguists have gradually been learning to work, through relational accountability, with and for communities rather than on them.
Some technologists, like the IBM researchers, are at least trying to do the same. The code may be programmed by outsiders, but these researchers recognize that any language data and associated tools should be under the control of the local community, and they warn that LLMs are becoming opaque jumbles of data and code. In the race to cover as many languages as possible, any available data of whatever provenance is being slipped into the mix and regurgitated for commercial gain.
Bots mimicking this data are no substitute for fluid, embodied human language in all its natural variation, situated in actual human contexts. The danger is that increasingly we won’t even be able to recognize the real thing. It’s one thing to document a language, as linguists try to do, but it’s up to the speakers themselves whether and how to maintain, transmit or revitalize their languages. Against the odds, hundreds of communities around the world are now launching organized efforts to do precisely this, with every word learned and every new speaker a hard-won triumph. These movements, not the bots, should have our attention and our support.