Finland's ChatGPT equivalent begins to think in Estonian as well
The University of Turku in Finland is developing a large language model proficient in all European languages, including Estonian, to preserve minor languages in the post-ChatGPT era. The Institute of the Estonian Language (EKI) supports the initiative but warns that an operational language model requires digitizing substantially more Estonian texts than are currently available.
"The English ChatGPT stunned the world with its ability to comprehend and respond to a person in a manner resembling natural speech. It wasn't a miracle technology; it was given an unprecedented amount of text to identify patterns in and learn to mimic human communication," Eleri Aedmaa, a natural language processing engineer at the Institute of the Estonian Language, said.
"In the new era of language technologies, the quantity of texts matters. To reach this critical mass with Estonian, as many texts as possible must be digitized and made available: the entirety of the National Library, all of its archives, and as much current and historical news as possible, including online communication. The more Estonian is readily accessible on the Internet, the more secure the language's future will be," the linguist said.
The University of Turku and language technology company SiloGen are spearheading the creation of the largest open language model in the world, covering all official European languages. The model is being built on LUMI, one of the pan-European supercomputers, which is located in Kajaani, Finland. It is the third largest supercomputer in the world and the largest in Europe. According to Aedmaa, a fundamental issue for training the model in Estonian is the quantity of distinct and original digital Estonian texts that can be made available for this and future language models.
ChatGPT only thinks in English
Aedmaa said that one of the shortcomings of the increasingly prevalent large language models is that they are trained almost exclusively on English. This implies that although GPT-4 appears to comprehend Estonian, it is still limited to translation. The machine, so to speak, thinks in English and translates the conversation into Estonian at the last moment. "This is really dangerous for the Estonian language in the long run," Aedmaa said.
"The value of these new tools lies in the fact that they comprehend not only individual words and sentences, but also the cultural context as a whole. If a language model is trained solely on the basis of English content, it will inevitably lack cultural knowledge of Estonia," she said.
"The situation is comparable to when the printing press was invented: what would have become of the Estonian language if books had been printed only in the major languages, but not in Estonian?" she asked. According to Aedmaa, the problem now affects nearly all of the world's languages.
The language model being developed by the Finns is a GPT-like digital machine that has been trained on a wide range of languages from the ground up. "The objectives are European linguistic sovereignty and the democratization of language technology. Unlike the majority of its predecessors, the new language model will be open source; its logic is transparent and can be utilized by anyone developing new language technology applications," Aedmaa said.
The Finns' project is supported by Business Finland, a body similar to the Estonian Business and Innovation Agency (EISA). It has also been supported by the EU Horizon program. LUMI provided a number of free training hours to the developers so that they could test the model.
There are too few sources in Estonian to train a large language model
Kadri Vare, head of the EKI's language and speech technology department, said the institute is presently exploring additional ways to help the Finns. "We intend to work with them and have taken preliminary steps in that regard. Then we will be able to specify more precisely what and how much we can contribute to this endeavor. We have already contributed by making all of our legally permissible language data available to them," Vare told ERR.
In particular, Vare said, even more data could be digitized and made public to contribute to the success of the initiative. "The large language models use the entirety of the internet and every written word. Right now, we do not know exactly what they have taken and from where; we do not know if they have access to potentially more sensitive data. It would be important to us to find out," she said.
At present, however, there is a shortage of available Estonian content for a large language model. "There are approximately three billion words of Estonian public data in the main language corpora, according to our knowledge. In contrast, English has over 800 billion words. Three billion may seem like a lot, but in reality it is still insufficient for training an artificial intelligence to comprehend Estonian language and culture. It is simply too little," Vare said.
At the moment, the EKI is in the process of compiling a large Estonian language corpus. "These datasets are public, and we are pleased to share them. Participating in large open language models and collecting data for them is, in my opinion, one of the most important goals for Estonian language preservation," she concluded.
Editor: Kristina Kersa