Linguist: In seven years AI will understand sarcasm in Livonian

The Estonian Center of Excellence in Artificial Intelligence is developing a large language model in Estonian. Mark Fišel, professor of linguistics at the University of Tartu, explains that researchers are also working on developing a language model for smaller Finno-Ugric languages through speech synthesis and recognition.
Large language models such as ChatGPT and GPT4 are a hot topic in science today, according to Fišel. "One of their drawbacks is that they work well for languages with many speakers, texts, and data. This means that even the best of today's language models and products, including GPT4, are not as intelligent in Estonian as they are in English," he said.
Supporting small languages with a high-quality language model will be one of the directions the new center of excellence will pursue, he said. One way to do this, Fišel says, is simply to collect more data on each small language. Another option is for researchers to automatically translate texts into small languages, giving the machine more material to learn from. "There are four billion words in the combined corpus of the Estonian Language Institute, but that is not enough. So, we have translated 20 times more texts from other languages into Estonian. They are not a substitute for human-generated texts, but at least they give us a way to teach the models, even if only roughly," the professor said.
The third and most exciting way, according to Fišel, is to change the way language models are taught. A child's language acquisition could serve as an example. In the first five years of life, a human being hears five million words. This is enough to develop an incomparably better understanding of language, and incomparably greater intelligence, than any artificial intelligence has. "So it's not impossible; it's just that our methods are not perfect. Maybe we can develop better methods that don't need billions of words but can get by with less," Fišel said.
According to the professor, this is where the University of Tartu's neural speech synthesis and the Tallinn University of Technology's automatic transcription could come together. "Let's see if one can support the other. For example, can speech synthesis generate data to identify a language? Can we do this multilingually?" he said. Since collecting a large amount of data on the small Finno-Ugric languages is unlikely, one option is to build a language model based on a language family, where one language can support another.
Estonia's own chatbot?
According to Fišel, a large Estonian language model should meet three criteria: "The model must be open and free for everyone to use. It must be competent and modern. And it must be able to speak Estonian."
At the moment, Estonians have models that meet two out of three criteria. "For example, ChatGPT and GPT4 are modern and support Estonian, but they are not free," the professor said. The freeware models that handle Estonian are outdated and lack the self-explanatory features of the new major language models. On the other hand, Facebook's parent company, Meta, is developing new freeware models that do not support Estonian.
Two Ph.D. students supervised by Fišel have already made a first attempt to teach Estonian to Meta's Llama 2 language model without the model forgetting English. "We called it Llammas in Estonian," the professor said.
But this was just a scientific experiment, and Fišel, language technologist Kairit Sirts, and automatic transcription developer Tanel Alumäe are currently seeking funding, together with their research groups, to create a strong freeware language model for Estonian. "We are striving to develop a robust freeware language model for Estonian, one that is suitable for both government and business use. Estonian would then have its own Llama 2, Mistral, Claude, or ChatGPT," the professor said.
Work still to be done
At the inaugural event of the Estonian Center of Excellence in Artificial Intelligence, Mark Fišel gave a talk entitled "No, we are not done yet!", where he tried to dispel the misleading impression that the big language models are already very good and that there is no point in spending money on developing an Estonian language model. "If you start to systematically assess how good the model is, even in English, there are still a lot of holes or gaps to fill," he said.
The models are effective at routine and well-rehearsed tasks, but struggle with higher-level logic. Fišel's team gave both humans and a model Sherlock Holmes-style detective stories to read, then asked them to predict the identity of the murderer. "While humans do this with an average accuracy of 47 percent, the best language models do it pretty randomly. GPT4 got about 28 percent," Fišel said. Since there are typically four characters to choose from, the professor notes, a purely random guess would be right about 25 percent of the time, so the model's score is barely above chance.
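The comparison with chance can be made explicit with a little arithmetic. The sketch below is only an illustration: it assumes exactly four equally likely suspects per story (the article says "typically four"), uses the two accuracy figures quoted above, and the function name chance_accuracy is hypothetical.

```python
def chance_accuracy(num_suspects: int) -> float:
    """Expected accuracy of a uniformly random guess among the suspects."""
    return 1.0 / num_suspects


# Assumption: four equally likely suspects per story.
baseline = chance_accuracy(4)

# Accuracy figures as quoted in the article.
reported = {"humans": 0.47, "GPT4": 0.28}

# Margin over random guessing, as a fraction (0.03 means 3 percentage points).
margins = {name: round(acc - baseline, 2) for name, acc in reported.items()}

for name, margin in margins.items():
    print(f"{name}: {margin * 100:.0f} percentage points above chance")
```

Under these assumptions, the model's margin over guessing is a small fraction of the human margin, which is the point of the professor's remark.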
On the other hand, well-trained language models can recognize sarcasm and irony. "When unexpected texts come into play that contain dialect, jargon or very specific vocabulary, the quality of the model immediately drops. It may not even be able to handle English," Fišel said.
Figuratively speaking, he says, by the time the new center of excellence is seven years old, there will be enough textual resources in Livonian, Votic, and Izhorian to create a model that understands sarcasm, among other things, in those languages. "At the very least, you could ask the language model: 'Here's a whole collection of folklore in Livonian. What are the most common themes?' Or you could ask about the origin of the Livonian phrase tēriņtš," he said. Without having to be trained on millions of words, the model would be able to answer questions about both language content and history.
According to Fišel, the Center of Excellence in Artificial Intelligence will be working on more than just languages over the next seven years. "This includes more fundamental work to ensure that artificial intelligence is trustworthy – that it does not fool others or fool us," he said. It will also focus on practical applications of artificial intelligence in e-government, e-learning, health data, and business analytics. "All these methods will become more effective in the context of the Estonian language, not even in seven years, but before that," Fišel said.
Editor: Kristina Kersa