Estonia gives Meta access to 4 billion words for large language model development

Estonia will make almost 4 billion words available for technology giant Meta – which owns Facebook and Instagram – to train large language models, the Ministry of Justice said on Thursday. Minister of Justice and Digital Affairs Liisa Pakosta (Eesti 200) said the survival of the Estonian language relies on such deals and giving away data for free.
The ministry said Meta is the second company developing large language models to be given access to the Estonian language corpus' open data, which contains nearly 4 billion words.
"Sharing Estonian-language data creates the precondition for large language models to understand the cultural context of Estonia and become more proficient in using the Estonian language," the press release said.
It also creates better services for Estonian-speaking users in various AI-based applications – such as chatbots, translation systems, and other language technology solutions, it added.
Pakosta said it is "crucial" that large language models take the Estonian language and culture into account.
Estonia is open to cooperation and ready to share its language data with other large language model developers, the ministry said.
It urges both the public and private sectors to publish data in order to increase the volume of high-quality Estonian-language data.
Pakosta met with Meta's representatives in Central and Eastern Europe on Wednesday.
Pakosta: We are protecting the Estonian language
On Thursday morning, the minister discussed the issue with ERR's Vikerraadio.
The Estonian language corpus was developed by the Institute of the Estonian Language and is a collection of Estonian words specifically created for use by digital platforms.
"Its sole purpose is ensuring that Estonians can use digital platforms in their native language and that the Estonian language used by these platforms is correct, refined, and preserved over time," Pakosta said.
"Our overarching national interest in Estonia is to ensure the survival of the Estonian language. This will only be possible if artificial intelligence can use and understand Estonian and grasp Estonian culture. If we ourselves have actively contributed to Estonian being as actively used in the artificial world as English or French, for example. This means that we have to look at what is a fair agreement," the minister said.
She denied the corpus is a data set or that Estonians' data is being handed over to American companies.
Pakosta also denied theft of media materials will occur – as has been reported in the U.S. – and said this is not the same thing as allowing the company to use the language corpus.
However, she said discussions are taking place about the use of ERR's back catalog, the content of which is taxpayer funded.
She said this deal is government-backed to protect the Estonian language. It is a "constitutional objective" to make sure the language continues to exist and stays relevant, the minister said.
"We are, in essence, at the threshold of a major societal shift, and in this transformation, we must find ways to ensure that the Estonian language survives through time and remains in active use. Naturally, the rights of all parties must be taken into account," she said.
When asked how much Estonia is being paid for the use of the language corpus, she said that is not the point of the deal.
"In reality, our agreement is structured in such a way that our interest lies in having them use these Estonian words and sentences and integrate the Estonian language into all their applications. This requires certain developments on their part. Meanwhile, we are continuing to develop digital usage opportunities for the Estonian language so that all service providers can access them free of charge. Our goal is for private companies worldwide to use Estonian, and to achieve that, we need to work quite actively," she told Vikerraadio.
Asked if she would meet with a representative of China's AI language model DeepSeek if requested, Pakosta said she certainly would.
"However, we primarily consider which countries share our values, which are built on freedoms, and where our legal framework is similar. We view such countries in one way. Meanwhile, countries where our legal and value systems are not as aligned, we view differently. That is quite clear," she told the show.
Editor's note: On Thursday evening, the ministry retitled its press release "Meta is interested in using the open data of the Estonian language corpus." The original title suggested a deal had already been done.
The ministry also told newspaper Postimees a separate deal had not been concluded with Meta.
Editor: Aleksander Krjukov, Helen Wright