EKI: Meta does not have privileged access to Estonia's language database

Social media giant Meta does not have a privileged position when using Estonia's linguistic corpus data as the conditions are the same for everyone, said Director of the Estonian Language Institute (EKI) Arvi Tavast. However, he noted it is difficult to determine who actually owns the data.
Last week, Minister of Justice and Digital Affairs Liisa Pakosta (Eesti 200) met with representatives of the social media platform Meta to discuss, among other topics, the potential provision of Estonian language corpus data to the company. This data set is extensive and is used for language research, description, and language technology.
Tavast said the corpus collects all digitally available texts, but far more are needed than are currently accessible. Therefore, efforts are being made to acquire additional texts. However, not everything found on the internet can be included.
"Availability means that first, the text must exist in digital form, then it must be technically accessible, and finally, its licensing terms must be appropriate," he told Vikerraadio's "Uudis+."
Tavast said the question of who owns linguistic corpus data is a complex one. As he is not a lawyer, he said, he cannot provide a definitive answer.
"This is a complicated situation that has arisen due to technological developments occurring after legal regulations were put in place. This means that the regulations have not been able to account for the current reality," the director said.
EKI does not own the corpus
Tavast said a linguistic corpus is a database that is not simply the sum of its parts.
This means that if a database contains a text restricted by copyright, personal data protection, or confidentiality, those restrictions apply to that specific text. The corpus as a whole is subject to additional restrictions.
EKI, however, is not the owner of the linguistic corpus.
"Management is perhaps a more accurate term. We coordinate the collection. We have not always gathered all of it ourselves — this work began at the University of Tartu in 1998, and currently, language technology in Estonia is organized so that the Estonian Language Institute oversees it. Among other things, we allocate funding for it, and corpus collection is one of the tasks carried out under Estonia's national language technology program," he told the show.
National programs and development plans set the conditions for use and they are the same for everyone, Tavast explained.
"Meta is not in a privileged position — quite the opposite," he said. "There are two main issues. First, whether the data is being sold — the simple answer is no. There is no mechanism by which the state could sell this data. Second, the phrase 'handing over' sounds as if nothing remains with the provider, as if it has been given away. Nothing of the sort has been done. The data is available to anyone developing relevant services."
The EKI director said the institute's preference would be to support local developers. However, he considers it wrong to assume that Meta gains significant value from EKI's linguistic corpus. In his view, it is Estonia that stands to benefit.
"For years, together with the Ministry of Justice and Digital Affairs, we have been trying to convince Meta to accept this data to improve its Estonian language model. Selling has never been on the table — rather, we are the ones who gain value from this," he said.
Tavast pointed out that Meta's artificial intelligence budget this year is $60 billion, almost three times the size of Estonia's state budget. EKI could never afford to carry out similar work on its own.
"We cannot allocate three state budgets solely for training a language model. If Meta does this for us in a way that allows us to use the results, it is extremely beneficial for the preservation of the Estonian language and culture," he said.
Over half of the corpus comes from media sources
Tavast said Meta serves as a good example of developing an open model, while many others work on closed models. This means that if Meta uses linguistic corpus data, Estonia can benefit from the model in return, whereas if a closed-model developer used the data, there would be no access to the resulting model.
He said approximately 60 percent of the linguistic corpus content comes from the media. Other content comes from texts not subject to copyright, such as EU legislation.
"For example, daily news is also not considered copyrighted content. The remaining portion consists of various materials — forum comments, academic papers, fiction. Essentially, all available material on the Estonian language," the institute's director outlined.
What differentiates a linguistic corpus from freely available internet data is that the corpus is annotated. Each word is tagged with information about its type and syntactic behavior, meaning it has undergone significant research and development.
"Previously, annotation was essential for language research or language technology applications. However, recent technological advancements, especially in the last few years, have allowed large language models to learn directly from raw text data available online without the need for such preliminary linguistic research," Tavast explained.
He believes that the developers of major models, such as ChatGPT, have likely processed much more Estonian-language text than what is contained in EKI's linguistic corpus.
Tavast stressed that this area requires societal consensus, and said professionals in the field are pleased that the topic is finally receiving broader attention.
He added that the goal is to continue discussions to ensure that, on the one hand, Estonia develops artificial intelligence that speaks Estonian and aligns with Estonian values, while on the other hand, no one's rights are infringed upon.
Editor: Karin Koppel, Helen Wright
Source: Uudis+, interview by Lauri Varik