EKI: Language model developers not interested in Estonian language corpus

Although concerns have arisen in Estonia about providing Estonian language data to AI developers, there is no reason to worry: AI companies have absolutely no interest in the Estonian language corpus, said Arvi Tavast, director of the Institute of the Estonian Language (EKI).
Tavast said one of the problems when discussing language model training is the lack of regulation. For example, it is unclear whether training a language model qualifies as scientific research, which would determine whether it falls under the text and data mining exemption for research. If so, this would allow the corpus to be used for free.
"Since these issues are not regulated, even lawyers cannot provide legally binding answers. The final answer can only come from court practice, which is very limited. In Europe, there is one court ruling stating that this qualifies as scientific research, meaning data can be used for training without the original author's permission. However, more case law is needed, preferably from the European Court of Justice," Tavast said.
The director explained that the Estonian language corpus database was compiled by EKI and is licensed by the institute with a citation requirement. However, the works contained within the corpus still belong to their original authors, and their inclusion has not changed their copyright status.
"If someone wanted to use the works in the database, for example, to republish them, EKI has no influence over those rights. This is a matter between the copyright holders and the data users," Tavast said.
"There is no difference whether a user accesses a work from the EKI database or from the public web. Unauthorized use is not allowed in either case," he added.
AI developers are not interested in the Estonian language
Tavast said big international AI developers have not taken data from EKI's language corpora so far.
"Since 2020, the Estonian state has been working at both the official and political levels to improve the representation of the Estonian language in large language models, including attempts to persuade major developers to use our corpus data. So far, without success. Even Meta's official response to our data offer has been: 'Thank you, we appreciate your offer, but using these data is not currently among our priorities,'" Tavast said.
One reason is that the Estonian language market is too small.
"For example, Mistral AI responded that they are focusing first on languages with higher demand. Secondly, AI developers find it easier to collect data themselves from the internet, or 'crawl' them. We extract data from public sources, and they do the same, but their capability for doing so is significantly greater. It is much easier for them to take the entire internet, regardless of which languages it contains or who owns the data, and train their models on that," the director explained.
In reality, Tavast said, language model developers are interested in something else entirely: they need expert-compiled data on how the world works. Unfortunately, EKI does not have such data to offer them.
Even if AI developers did show interest in the Estonian language, Tavast said none of them would be willing to pay for it.
"Hoping that someone would pay for the texts in the corpus database is completely unrealistic. They do not even need anything from us for free. Besides, the same training material is already available for free on the internet. They have no motivation to pay for it. In fact, we should be paying them to train models for us," he explained.
What if AI developers do become interested in Estonian language corpus data?
Mati Kaalep, head of the Estonian Authors' Union, said agreements can be made regarding any kind of content, but it is important to identify the rights holders and negotiate terms with them.
"Since language corpus developers often need a large volume of text, it would make sense to start negotiations with those who hold the largest collections," he said.
Kaalep believes the most logical approach would be for market participants to make direct agreements with each other.
"If, for example, Meta or OpenAI wants to use text data, such as texts from media companies, they should be able to make agreements directly with those media houses," he said.
Since copyright issues fall under the jurisdiction of the Ministry of Justice in Estonia, Kaalep believes the ministry should bring both sides together and coordinate their communication if necessary. He noted that while the ministry has attempted to develop the digital sector, copyright holders have not received equal attention.
On February 6, Minister of Justice and Digital Affairs Liisa Pakosta (Eesti 200) said that she supports providing Estonian-language data, including Estonian-language media content owned by ERR, to major AI companies for free. She argued that this would contribute to the constitutional goal of ensuring the survival of the Estonian language. A few days earlier, Pakosta had met with a representative from Meta.
--
Editor: Mari Peegel, Helen Wright