Zero-Shot Document-Level Machine Translation Benchmark

Task: We evaluate how well various large language models (LLMs) and commercial MT services translate documents (text spanning up to 20 paragraphs per request).

Datasets:

Prompt example:

{"messages":
    [{"role": "system", "content": "You are a professional translator that translates documents. Translate the document between the backticks (```) from English into Latvian. Enclose the translation also with backticks (```...```). Follow these requirements when translating: 1) do not add extra words, 2) preserve the exact meaning of the source text in the translation, 3) preserve the style of the source text in the translation, 4) output the translation using the same format (each line must start with the paragraph index), 5) do not apply formatting to the translation."},
    {"role": "user", "content": "English:\n```\n<1> This is the first paragraph.\n<2> This is the second paragraph.\n```\n"}]
}
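Because requirement 4 asks the model to prefix each output line with its paragraph index, the reply has to be parsed back into individual segments before segment-level scoring. A minimal sketch of how such a reply could be parsed (the function name and regular expressions are illustrative, not part of any benchmark harness):

```python
import re

def parse_translation(reply: str) -> dict[int, str]:
    """Extract <index> paragraph pairs from a model reply.

    Expects the translation enclosed in triple backticks, with each
    line starting with a paragraph index such as '<1> '.
    """
    # Take the text between the first pair of ``` fences, if present.
    fenced = re.search(r"```\n?(.*?)```", reply, flags=re.DOTALL)
    body = fenced.group(1) if fenced else reply

    segments: dict[int, str] = {}
    for line in body.splitlines():
        match = re.match(r"<(\d+)>\s*(.*)", line.strip())
        if match:
            segments[int(match.group(1))] = match.group(2)
    return segments

reply = "Latvian:\n```\n<1> Šī ir pirmā rindkopa.\n<2> Šī ir otrā rindkopa.\n```\n"
print(parse_translation(reply))
# → {1: 'Šī ir pirmā rindkopa.', 2: 'Šī ir otrā rindkopa.'}
```

Keying segments by the model-reported index (rather than line position) makes it easy to detect dropped or duplicated paragraphs before computing metrics.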

Metrics: ChrF (computed with sacrebleu) and COMET (using the Unbabel/wmt22-comet-da model). Although translation is performed at the document level, there are no well-established document-level metrics, so we evaluate translation quality with segment-level metrics applied at the segment level.
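ChrF scores a hypothesis by comparing its character n-grams against the reference. The sketch below is a simplified illustration of the core idea only (character n-grams up to order 6, F-score with β = 2, whitespace ignored); sacrebleu's implementation handles more details, so its scores will not match this sketch exactly. COMET, being a learned neural metric, has no comparable short-form sketch.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Counts of character n-grams, ignoring spaces as chrF does by default."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average char n-gram precision/recall, F-beta in [0, 100]."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0.0 and r == 0.0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

With β = 2, recall is weighted more heavily than precision, which is the standard chrF setting.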

List of Benchmarked Models

| Type | Name or family | Model ID | Size (in billions of parameters) | What did we use for inference? | Comment |
|---|---|---|---|---|---|
| Decoder-only LLM | Google Gemma 2 | gemma2:27b | 27 | Ollama | |
| Decoder-only LLM | Google Gemma 3 | gemma3:27b | 27 | Ollama | |
| Decoder-only LLM | TranslateGemma | translategemma:27b | 27 | Ollama | |
| Decoder-only LLM | GPT-4.1 | gpt-4.1-2025-04-14 | Unknown | OpenAI API | |
| Decoder-only LLM | Llama 3.3 | llama3.3 | 70 | Ollama | |
| Decoder-only LLM | TildeOpen | tildeopen-mt-v10 | 30 | vLLM | This is TildeOpen LLM fine-tuned for translation tasks. |

Results

Test sets

wmt24pp

Evaluated models: 3  |  Translation directions: 14

COMET scores

ChrF scores

wmt25_genmt

Evaluated models: 6  |  Translation directions: 7

COMET scores

ChrF scores