Task: We test how well various large language models (LLMs) and commercial MT services translate documents (text spanning up to 20 paragraphs per request).
Why only up to 20 paragraphs? Even the best-performing large language models cannot yet reliably output the same number of paragraphs as found in the input. The longer the text, the more likely the models are to start hallucinating (merging content, dropping content, etc.) when generating translations. Our tests showed that at 20 paragraphs the models were still more or less reliable.
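One practical consequence is that every translation needs a sanity check that the paragraph count survived. A minimal sketch of such a check (the helper name is ours, not part of the benchmark code):

```python
def paragraph_counts_match(source: str, translation: str) -> bool:
    """Return True if the translation has the same number of
    non-empty paragraphs (lines) as the source document."""
    def count(text: str) -> int:
        return sum(1 for line in text.splitlines() if line.strip())
    return count(source) == count(translation)

src = "First paragraph.\nSecond paragraph.\nThird paragraph."
ok = "Pirmā rindkopa.\nOtrā rindkopa.\nTrešā rindkopa."
bad = "Pirmā rindkopa.\nOtrā un trešā rindkopa apvienotas."  # two paragraphs merged

print(paragraph_counts_match(src, ok))   # True
print(paragraph_counts_match(src, bad))  # False
```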
Datasets:
Prompt example:
{"messages":
[{"role": "system", "content": "You are a professional translator that translates documents. Translate the document between the backticks (```) from English into Latvian. Enclose the translation also with backticks (```...```). Follow these requirements when translating: 1) do not add extra words, 2) preserve the exact meaning of the source text in the translation, 3) preserve the style of the source text in the translation, 4) output the translation using the same format (each line must start with the paragraph index), 5) do not apply formatting to the translation."},
{"role": "user", "content": "English:\n```\n<1> This is the first paragraph.\n<2> This is the second paragraph.\n```\n"}]
}
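The prompt above can be assembled programmatically; a sketch in Python (the helper name and structure are illustrative, not the benchmark's actual code):

```python
def build_translation_messages(src_lang, tgt_lang, paragraphs):
    """Build the system/user message pair for document translation.
    Illustrative helper, not part of the benchmark code."""
    ticks = "`" * 3  # literal backtick fence used in the prompt
    system = (
        "You are a professional translator that translates documents. "
        f"Translate the document between the backticks ({ticks}) "
        f"from {src_lang} into {tgt_lang}. "
        f"Enclose the translation also with backticks ({ticks}...{ticks}). "
        "Follow these requirements when translating: "
        "1) do not add extra words, "
        "2) preserve the exact meaning of the source text in the translation, "
        "3) preserve the style of the source text in the translation, "
        "4) output the translation using the same format "
        "(each line must start with the paragraph index), "
        "5) do not apply formatting to the translation."
    )
    numbered = "\n".join(f"<{i}> {p}" for i, p in enumerate(paragraphs, start=1))
    user = f"{src_lang}:\n{ticks}\n{numbered}\n{ticks}\n"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_translation_messages(
    "English", "Latvian",
    ["This is the first paragraph.", "This is the second paragraph."])
print(messages[1]["content"])
```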
Note 1: This is the typical message structure passed to LLM inference clients (Ollama, OpenAI, Claude, etc.) and to the apply_chat_template() method for models stored on Hugging Face and run directly in Python with the transformers library. Note, however, that the messages are formatted differently for each LLM, depending on what that model's chat template defines.
Note 2: The TildeOpen MT model(s) support one specific translation instruction (e.g., "Translate the following text from X to Y: {text}"). The model is prompted using a single user message.
Note 3: The GPT-5.2 model is prompted with reasoning_effort: none.
Note 4: TranslateGemma models are prompted using the prompt defined in the model's documentation in the Ollama repository.
Metrics: ChrF (calculated using sacrebleu) and COMET (using the Unbabel/wmt22-comet-da model). Although translation is performed at the document level, there are no good document-level metrics, so we evaluate translation quality at the segment level using segment-level metrics.
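For intuition, chrF is an F-score over character n-grams (by default orders 1 through 6, with β = 2 weighting recall over precision). The evaluation itself uses sacrebleu; the sketch below is a deliberately simplified reimplementation (single reference, no word n-grams, whitespace stripped) to show what the metric computes:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF: character n-gram F-beta scores
    averaged over n-gram orders. Illustrative only; use sacrebleu for
    actual scoring."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # text too short for this n-gram order
        matches = sum((hyp_ngrams & ref_ngrams).values())  # clipped overlap
        precision = matches / sum(hyp_ngrams.values())
        recall = matches / sum(ref_ngrams.values())
        if precision + recall == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta ** 2) * precision * recall
                            / (beta ** 2 * precision + recall))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

print(round(chrf("the cat sat on the mat", "the cat sat on the mat"), 1))  # 100.0
```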
Note 5: Most Ollama models are 4-bit quantized, whereas the TildeOpen MT model(s) are not quantized. We have analyzed the impact of quantization on MT quality (see this paper) and found that it is mostly negligible for larger models (e.g., for the Gemma 2 27B model, the quality drop was 0.001 COMET points). However, we want to be transparent about this, which is why we include URLs to the exact models we benchmarked.
| Type | Name or family | Model ID | Size (in billions of parameters) | What did we use for inference? | Comment |
|---|---|---|---|---|---|
| Decoder-only LLM | Google Gemma 2 | gemma2:27b | 27 | Ollama | |
| | Google Gemma 3 | gemma3:27b | 27 | Ollama | |
| | TranslateGemma | translategemma:27b | 27 | Ollama | |
| | GPT-4.1 | gpt-4.1-2025-04-14 | Unknown | OpenAI API | |
| | Llama 3.3 | llama3.3 | 70 | Ollama | |
| | TildeOpen | tildeopen-mt-v10 | | vLLM | This is TildeOpen LLM fine-tuned for translation tasks. |