Task: We test how well various large language models (LLMs) and commercial MT services translate documents (text spanning up to 20 paragraphs per request).
Why only up to 20 paragraphs? Even the best-performing large language models cannot yet reliably output the same number of paragraphs as found in the input. The longer the text, the more likely the models are to start hallucinating (merging content, dropping content, etc.) when generating translations. Our tests showed that at 20 paragraphs the models were still more or less reliable.
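One practical consequence is that every translation needs a sanity check that the paragraph count survived. A minimal sketch of such a check (the helper name is ours, not part of the benchmark code):

```python
def paragraph_counts_match(source: str, translation: str) -> bool:
    """Return True if the translation has the same number of
    non-empty paragraphs (lines) as the source document."""
    def count(text: str) -> int:
        return sum(1 for line in text.splitlines() if line.strip())
    return count(source) == count(translation)

src = "First paragraph.\nSecond paragraph.\nThird paragraph."
ok = "Pirmā rindkopa.\nOtrā rindkopa.\nTrešā rindkopa."
bad = "Pirmā rindkopa.\nOtrā un trešā rindkopa apvienotas."  # two paragraphs merged

print(paragraph_counts_match(src, ok))   # True
print(paragraph_counts_match(src, bad))  # False
```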
Datasets:
Prompt example:
{"messages":
[{"role": "system", "content": "You are a professional translator that translates documents. Translate the document between the backticks (```) from English into Latvian. Enclose the translation also with backticks (```...```). Follow these requirements when translating: 1) do not add extra words, 2) preserve the exact meaning of the source text in the translation, 3) preserve the style of the source text in the translation, 4) output the translation using the same format (each line must start with the paragraph index), 5) do not apply formatting to the translation."},
{"role": "user", "content": "English:\n```\n<1> This is the first paragraph.\n<2> This is the second paragraph.\n```\n"}]
}
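The prompt above can be assembled programmatically; a sketch in Python (the helper name and structure are illustrative, not the benchmark's actual code):

```python
def build_translation_messages(src_lang, tgt_lang, paragraphs):
    """Build the system/user message pair for document translation.
    Illustrative helper, not part of the benchmark code."""
    ticks = "`" * 3  # literal backtick fence used in the prompt
    system = (
        "You are a professional translator that translates documents. "
        f"Translate the document between the backticks ({ticks}) "
        f"from {src_lang} into {tgt_lang}. "
        f"Enclose the translation also with backticks ({ticks}...{ticks}). "
        "Follow these requirements when translating: "
        "1) do not add extra words, "
        "2) preserve the exact meaning of the source text in the translation, "
        "3) preserve the style of the source text in the translation, "
        "4) output the translation using the same format "
        "(each line must start with the paragraph index), "
        "5) do not apply formatting to the translation."
    )
    numbered = "\n".join(f"<{i}> {p}" for i, p in enumerate(paragraphs, start=1))
    user = f"{src_lang}:\n{ticks}\n{numbered}\n{ticks}\n"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_translation_messages(
    "English", "Latvian",
    ["This is the first paragraph.", "This is the second paragraph."])
print(messages[1]["content"])
```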
Note 1: This is the typical message structure passed to LLM inference clients (Ollama, OpenAI, Claude, etc.) and to the apply_chat_template() method for models stored on Hugging Face and run directly in Python with the transformers library. Note, however, that the messages are formatted differently for each LLM, depending on what that model's chat template defines.
Note 2: The TildeOpen MT model(s) support one specific translation instruction (e.g., "Translate the following text from X to Y: {text}"). The model is prompted using a single user message.
Note 3: The GPT-5.2 model is prompted with reasoning_effort: none.
Note 4: TranslateGemma models are prompted using the prompt defined in the model's documentation in the Ollama repository.
Metrics: ChrF (calculated using sacrebleu) and COMET (using the Unbabel/wmt22-comet-da model). Although translation is performed at the document level, there are no good document-level metrics, so we evaluate translation quality at the segment level using segment-level metrics.
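For intuition, chrF is an F-score over character n-grams (by default orders 1 through 6, with β = 2 weighting recall over precision). The evaluation itself uses sacrebleu; the sketch below is a deliberately simplified reimplementation (single reference, no word n-grams, whitespace stripped) to show what the metric computes:

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF: character n-gram F-beta scores
    averaged over n-gram orders. Illustrative only; use sacrebleu for
    actual scoring."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # text too short for this n-gram order
        matches = sum((hyp_ngrams & ref_ngrams).values())  # clipped overlap
        precision = matches / sum(hyp_ngrams.values())
        recall = matches / sum(ref_ngrams.values())
        if precision + recall == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta ** 2) * precision * recall
                            / (beta ** 2 * precision + recall))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

print(round(chrf("the cat sat on the mat", "the cat sat on the mat"), 1))  # 100.0
```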
Note 5: Most Ollama models are 4-bit quantized, whereas the TildeOpen MT model(s) are not quantized. We have analyzed the impact of quantization on MT quality (see this paper) and found that it is mostly negligible for larger models (e.g., for the Gemma 2 27B model, the quality drop was 0.001 COMET points). However, we want to be transparent about this, which is why we include URLs to the exact models we benchmarked.
| Type | Name or family | Model ID | Size (in billions of parameters) | What did we use for inference? | Comment |
|---|---|---|---|---|---|
| Decoder-only LLM | Google Gemma 2 | gemma2:27b | 27 | Ollama | |
| | Google Gemma 3 | gemma3:27b | 27 | Ollama | |
| | TranslateGemma | translategemma:27b | 27 | Ollama | |
| | GPT-4.1 | gpt-4.1-2025-04-14 | Unknown | OpenAI API | |
| | Llama 3.3 | llama3.3 | 70 | Ollama | |
| | TildeOpen | tildeopen-mt-v10 | | vLLM | This is TildeOpen LLM fine-tuned for translation tasks. |