Large Language Model (LLM) Error Analysis for Languages of the Baltic States

We asked LLMs to generate responses to 10 separate questions. The questions were chosen so that an LLM would have to generate roughly a page of text (some models generated more, some less). We then asked professional linguists to mark errors in the generated texts.
We wanted to understand how well European LLMs (Mistral, EuroLLM, Teuken) compare with a state-of-the-art open-weights model (Gemma3) in terms of linguistic quality and reliability of the generated text.

Instructions given to annotators

The following instructions were given to annotators:
Your task: We need you to use the online tool to mark two categories of errors: moderate and severe.

Severe errors: punctuation errors, grammar errors, lexical choice errors (inappropriate choice of words or made-up words, non-existing word forms), or anything else that you deem a gross linguistic mistake.
Moderate errors: capitalization errors, style errors, phrasing borrowed from English, or anything else that you deem a moderate linguistic mistake.

Additionally, we would like you to provide a short comment on the overall quality of each text: is the question answered as an educated human would, or does it contain made-up things?

Questions for which we generated answers

All questions require creative writing from the LLM. Most questions (8 out of 10) are deliberately specific to regional contexts (e.g., local food, local politics, famous local people, etc.).
lv = ["Sagatavo maizes zupas recepti, kas uzskaita visas sastāvdaļas un darbības.",
"Sagatavo iesniegumu par bērna pabalsta piešķiršanu Rīgas pašvaldībai.",
"Uzraksti eseju par to, kāpēc jābūt labam cilvēkam.",
"Uzraksti argumentētu diskusiju, kas paskaidro, vai Baltijas valstis anektēja vai okupēja Padomju savienība.",
"Sagatavo Raiņa autobiogrāfiju (ieskaitot visus viņa darbus).",
"Sagatavo bloga rakstu par politiskām norisēm Rīgā 2020. gadā.",
"Precīzi (detaļās) apraksti Latvijas nacionālo tautas tērpu.",
"Sagatavo pārspriedumu par aktuāliem notikumiem Latvijā 2022. gadā.",
"Kas ir 10 populārākie latvieši, un ar ko viņi ir slaveni?",
"Kas ir populārākais Latvijas eposs? Uzraksti izvērstu sižeta aprakstu."]
lt = ["Parašykite cepelinų receptą, aiškiai nurodydami visus reikalingus ingredientus ir gaminimo etapus.",
"Parenkite prašymą dėl vaiko išmokos skyrimo Vilniaus miesto savivaldybei.",
"Parašykite esė, kurioje paaiškintumėte, kodėl svarbu būti geru žmogumi.",
"Parenkite argumentuotą diskusiją, nagrinėjančią, ar Baltijos šalys buvo aneksuotos, ar okupuotos Sovietų Sąjungos.",
"Parenkite Juozo Grušo autobiografiją, apimančią visą jo kūrybinį palikimą.",
"Paruoškite tinklaraščio straipsnį apie politinius įvykius Vilniuje 2020 metais.",
"Išsamiai aprašykite lietuvių tautinį kostiumą, jo istoriją, detales ir regioninius skirtumus.",
"Parašykite diskusiją apie 2022 metų aktualias temas Lietuvoje.",
"Išvardinkite dešimt populiariausių lietuvių ir parašyk kuo jie garsūs.",
"Koks yra populiariausias lietuviškas epas? Pateikite išsamų jo siužeto aprašymą."]
et = ["Koosta mulgipudru retsept, mis sisaldab kõiki koostisosi ja valmistamisjuhiseid.",
"Koosta Tallinna linnale lapsetoetuse taotlus.",
"Kirjuta essee teemal, miks peaks olema hea inimene.",
"Kirjuta argumenteeritud arutelu, mis annaks selgituse selle kohta, kas Nõukogude Liit annekteeris või okupeeris Balti riigid.",
"Koosta Eno Raua autobiograafia (mis sisaldab kõiki tema teoseid).",
"Koosta blogipostitus poliitilistest arengutest Tallinnas aastal 2020.",
"Kirjelda üksikasjalikult Eesti rahvariideid.",
"Koosta väitlus Eesti 2022. aasta aktuaalsetel teemadel.",
"Kes on 10 kõige populaarsemat eestlast ja mille poolest on nad kuulsad?",
"Mis on Eesti populaarseim eepos? Kirjelda üksikasjalikult selle süžeed."]
These are the Estonian questions translated into English:
en = ["Write a mulgipuder (Estonian barley and potato porridge) recipe that includes all ingredients and preparation instructions.",
"Write an application to the city of Tallinn for a child benefit.",
"Write an essay on why one should be a good person.",
"Write a reasoned discussion explaining whether the Soviet Union annexed or occupied the Baltic states.",
"Write an autobiography of Eno Raud (including all of his works).",
"Write a blog post about political developments in Tallinn in 2020.",
"Describe Estonian folk costumes in detail.",
"Write a debate on topical issues in Estonia in 2022.",
"Who are the 10 most popular Estonians, and what are they famous for?",
"What is the most popular Estonian epic? Describe its plot in detail."]

Text generation setup

To generate text with gemma3:27b and mistral-large, we used Ollama and queried it with the default chat method parameters. To generate text with eurollm:9b and teuken:7b, we used the Hugging Face transformers Python library. Unlike the first two models, eurollm:9b and teuken:7b required beam search decoding (with 3 beams). With the default parameters, or with only the temperature set to 0, these models hallucinated excessively (half of the texts were completely useless).
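Our exact generation code is not shown here; the following is a minimal sketch of how the per-model decoding choice described above could be encoded. The function name and the `max_new_tokens` value are illustrative assumptions, not our actual configuration; only the 3-beam deterministic decoding for eurollm:9b and teuken:7b comes from the setup above.

```python
def generation_kwargs(model_name: str) -> dict:
    """Return keyword arguments for Hugging Face `model.generate()`.

    Sketch only: eurollm:9b and teuken:7b needed beam search (3 beams)
    to avoid rampant hallucination; other models were queried through
    Ollama with its default chat parameters, so no overrides apply.
    """
    if any(m in model_name for m in ("eurollm", "teuken")):
        # Deterministic beam search; sampling must be disabled.
        return {"num_beams": 3, "do_sample": False, "max_new_tokens": 1024}
    # Placeholder defaults for the Ollama-served models.
    return {"max_new_tokens": 1024}

print(generation_kwargs("eurollm:9b"))
print(generation_kwargs("gemma3:27b"))
```

These kwargs would be passed straight to `model.generate(**generation_kwargs(name), ...)` in a transformers-based loop.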

Error analysis results for Latvian

Error analysis results for Lithuanian

Error analysis results for Estonian

Hallucinations in LLM output

For some documents, eurollm:9b and teuken:7b generated incoherent, extremely repetitive, completely unusable text (even after we switched decoding to beam search). The following chart summarises how many responses (out of the 10 generated responses for each language) were complete failures and were discarded from further analysis. From a production-ready LLM on this type of task (simple text generation), we would expect zero hallucinations.

Overall count of tokens analyzed

The following chart summarises the number of tokens per model and language that the annotators processed (excluding the documents identified as hallucinations, since annotating those would have been a waste of time).
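The tally behind such a chart is straightforward; here is a small sketch with made-up numbers (the document tuples are illustrative, not our real data) showing how annotated tokens per (model, language) pair can be summed while skipping hallucinated documents.

```python
from collections import defaultdict

# Illustrative records: (model, language, token_count, is_hallucination).
docs = [
    ("eurollm:9b", "lv", 412, False),
    ("eurollm:9b", "lv", 0, True),      # discarded, contributes nothing
    ("gemma3:27b", "lv", 530, False),
]

def annotated_tokens(records):
    """Sum token counts per (model, language), excluding hallucinations."""
    totals = defaultdict(int)
    for model, lang, tokens, hallucinated in records:
        if not hallucinated:
            totals[(model, lang)] += tokens
    return dict(totals)

print(annotated_tokens(docs))
# {('eurollm:9b', 'lv'): 412, ('gemma3:27b', 'lv'): 530}
```

Normalising error counts by these totals (e.g., errors per 1,000 annotated tokens) makes models with different output lengths comparable.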

How usable and factually correct are answers?

We asked the linguists to comment on how usable and factually correct each generated text is. If a linguist indicated that a document contains false facts or is not usable, we marked that document as an error.
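The rule above can be sketched as a simple predicate over the linguists' comments. The field names (`usable`, `contains_false_facts`) are hypothetical stand-ins for however the comments were recorded; only the either/or rule itself comes from the text above.

```python
def is_usability_error(comment: dict) -> bool:
    """A document counts as an error if it states false facts or is unusable."""
    return comment.get("contains_false_facts", False) or not comment.get("usable", True)

# Illustrative linguist comments for three documents.
reviews = [
    {"doc": 1, "usable": True,  "contains_false_facts": False},
    {"doc": 2, "usable": True,  "contains_false_facts": True},   # made-up facts
    {"doc": 3, "usable": False, "contains_false_facts": False},  # unusable text
]
print(sum(is_usability_error(r) for r in reviews))  # 2
```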