Team of researchers to teach Estonian language, culture to language models
In a project led by the Institute of Computer Science at the University of Tartu (TÜ), open-source language models will be trained to speak Estonian more fluently and to better understand Estonian culture. The aim is to preserve and protect the Estonian language in the face of the rapid development of artificial intelligence (AI) and to create convenient applications for Estonians to use.
To work well, chatbots, text summarizers, content aggregators, question answering systems and similar applications built on language models need those models to have a good command of Estonian. According to a press release, Kairit Sirts, associate professor in natural language processing at the University of Tartu, says that AI-generated Estonian often comes across as artificial and wooden.
"Some open source language models already speak Estonian to a certain extent, but what we want to do is make the language models speak the way people actually speak," Sirts said. "Instead of eloquence, Estonians are more typically straightforward and laconic. We can train the model to take the Estonian cultural context into account, as well as improve its grammar."
Language models created by big tech companies are aimed at the masses, and no one in Estonia has control over them. OpenAI's ChatGPT, for example, can't be used in areas requiring confidentiality, such as national defense or healthcare. Now, Estonian researchers are continuing to train existing open-source language models with more Estonian texts so that it will be possible in the future to develop high-quality AI applications that speak Estonian and understand Estonian context.
According to Sirts, it's important to maintain and develop competence in large language models (LLMs) within Estonia's research community.
"For tech companies, the Estonian language situation and cultural background aren't important, so we have to stand up for these things ourselves," Sirts said. "Thanks to a new project, we can improve people's skills and knowledge as well, so that we don't just end up on the sidelines of technological development."
Language learning on LUMI
Launched this year, the project "Estonian language support in open source large generative language models" brings together the field's top expertise in Estonia.
Sirts is joined from TÜ by natural language processing professor Mark Fišel together with his students, and from Tallinn University of Technology (TalTech) by speech processing associate professor Tanel Alumäe together with his students. Leading the work is Eleri Aedmaa, natural language processing engineer at the Institute of the Estonian Language (EKI).
Language models will be trained on LUMI (Large Unified Modern Infrastructure), the fastest supercomputer in Northern Europe, located in Kajaani, Finland.
The project is being funded by the national program "Estonian Language Technology 2018-2027."
--
Editor: Aili Vahtla