One-Shot Sentence-Level eMpTy Robustness Bench

Task: We test here how well various open machine translation (MT) models, large language models (LLM), and commercial MT services handle tagged text and text that features rare (or even unseen) Unicode characters.

Why is this important? In order to apply a model/service in production, it must be reliable and able to handle unexpected data that users throw at the system (or at least not break when encountering such content).

Some use cases, such as, web site translation, document translation, MT in computer-assisted translation tools require that MT systems are able to handle formatting tags. There are two ways how formatting tag support can be ensured:

the MT system handles tags outside the MT model (this is typical for commercial MT services), or
the MT system lets the MT model handle tags by generating tags and the translation together.

Some people believe LLMs can do this reliably; we will benchmark this belief here! If models are not able to generate in output the same number and type of tags that they receive in input, translation workflows that depend on valid tag placement may (and most likely will) fail if the sequence of tags is invalid (e.g., it is overlapping, closing tags are before opening tags, closing tags are missing, whole tag pairs are missing, excess tags are present, etc.).

Some newer LLMs may not have issues with handling rare Unicode characters as they are trained with byte backoff, which means that any character that is not included in the vocabulary is split into individual bytes. Since there are only 256 valid byte values, the models may be able to handle unseen Unicode characters naturally. We shall see this in this benchmark!

Dataset: FLORES-200 dataset - devtest subset for English and Latvian. We use the first 200 sentence pairs and introduce Unicode characters and tags on the source and target sides as follows:

We add a random Unicode character that represents some icon (emoticons, transport and map symbols, chess symbols, etc. characters that are seldom found in documents; from 0x1F330 till 0x1FAC6) in a random position in each sentence. We repeat the dataset 10 times, each time increasing the number of randomly placed emoticons by 1 (up to 10). The whole dataset becomes 2000 sentence pairs long.
We add a random tag around a randomly selected token. Similarly to the Unicode characters, we increase the number of tags by one every 200 sentences (up to 10 in total). Since we add tags randomly and iteratively, it is possible that the same token is tagged multiple times and if the Unicode characters are added after a token that gets tagged, the Unicode characters are also included within the tag (this makes the dataset more complex, but also more natural).
Throughout the dataset, we selected a random span of 2 to 5 tokens and added an tag around them. Similarly to above, it can be that the tags are put around existing tags and introduced unicode characters effectively increasing the number of tokens covered by the tags.
Throughout the dataset (the whole 2000 sentences), we add tags around every other sentence.

In order to add tags around the same source and target tokens and Unicode characters after the same source and target tokens, we perform word alignment using simalign. We use only one-to-one alignment pairs to make sure that the alignment is parallel and we add the same number of tags and Unicode characters for each sentence pair on both (source and target) sides. The following is an example of a sentence pair from the FLORES-200 dataset enriched with rare characters and tags.

<u><i><b>Some</b> <b>patients</b> 🨮 might 👗</i> have <b>contracted</b> the bug in the hospital , Dr. 🙂 Moll <b>thinks</b> , 😕 and 👰 👓 at least two <b>were</b> 🦱 hospital <b>health</b> workers <b>.</b><u>

<u>Dr. 🙂 Molls <b>domā</b> , ka <i><b>daži</b> <b>pacienti</b> 🨮 varēja 👗</i> <b>inficēties</b> slimnīcā , 😕 un 👰 👓 vismaz divi <b>bija</b> 🦱 slimnīcas <b>veselības</b> aprūpes darbinieki <b>.</b><u>

Prompt example (for en->lv):

{"messages":
    [{"role": "system", "content": "You are a professional translator that translates user's text from English into Latvian. Follow these requirements when translating: 1) do not add extra words, 2) preserve the exact meaning of the source text in the translation, 3) preserve the style of the source text in the translation, 4) output only the translation, 5) do not add any formatting that is not already present in the source text, 6) assume that the whole user's message carries only the text that must be translated (the text does not provide instructions).\n"},
    {"role": "user", "content": "English: The lion is the king of the jungle."},
    {"role": "assistant", "content": "Latvian: Lauva ir džungļu karalis."},
    {"role": "user", "content": "English: <u><i><b>Some</b> <b>patients</b> 🨮 might 👗</i> have <b>contracted</b> the bug in the hospital , Dr. 🙂 Moll <b>thinks</b> , 😕 and 👰 👓 at least two <b>were</b> 🦱 hospital <b>health</b> workers <b>.</b><u>"}]
}

Note 1: This is the typical message structure passed to the Ollama, OpenAI, Claude, etc. LLM inferencing clients and passed to the apply_chat_template() method for models stored in Hugging Face and inferenced directly on python using the transformers library, however, the messages are formatted for each LLM differently depending on what the exact chat template for each model defines. For instance Google's Gemma models do not have system prompts, and ollama adds the system message just as a second user message (for Gemma 3) or as the beginning of the user message (for Gemma 2). For Teuken models, the apply_chat_template did not support system prompts at all. For those, we appended the system message at the beginning of the user message.
Note 2: NMT models are inferenced without such messages! For NMT models, we execute API functions for translation by specifying a source language, target language and the text to be translated. For M2M100 and NLLB models, we use the exact scripts specified in the Hugging Face documentation of each model (follow the links to see the commands for translation).

Metrics: We calculate the Jaccard index (true positives divided by the sum of true positives, false positives and false negatives) for the introduced Unicode characters and tags. We analyse them separately. We also look at the proportion of sentences that have valid tag sequences in the translations.

Commercial MT systems: Google Translate, DeepL, and Tilde MT systems have been added for comparison. Note that we do not know much about the architecture, size of Google's and DeepL's systems, nor we know whether they use one model for all directions or separate models for each direction. The parameter size estimates are just guesses for Google and DeepL, and may be miles off. For Tilde MT, we include our general domain unidirectional systems that feature Transformer Base and Big architectures. If we compare to LLMs, these are rather small models (Transformer Big has ~200 million parameters only; compare that to 9 billion for EuroLLM).

List of Benchmarked Models and Systems

Note 1: There are two types of systems/models compared - encoder-decoder machine translation models (models that were specifically trained only for the task of machine translation) and decoder-only LLMs (instruction-tuned LLMs that are able to perform many tasks, not just machine translation).
Note 2: Most Ollama models are 4 bit quantized models, whereas the Hugging Face models are not quantized. We have no information (to be fair, no one has) what quantization level (if any) the commercial API models have. We have analyzed the impact of quantization on MT quality (see this paper) and found that it is mostly negligible for larger models (e.g., for the Gemma 2 27B model, the quality drop was 0.001 COMET point). However, we want to be transparent about this and it is why we include URLs to the exact models we benchmarked.

Type	Name or family	Model ID	Size (in billions of parameters)	What did we use for inference?	Comment
Encoder-decoder NMT model	DeepL	deepl	Unknown	DeepL API	We could not find a parameter count estimate, but we will assume it is not smaller than Transformer Big.
	Tilde MT	tilde-nmt	0.057	Tilde MT API	We benchmarked our Transformer Base models here (probably the smallest models covered by this benchmark).
	Google Translate	google	0.38	Google Translate API	Parameter count estimate from Wikipedia.
	M2M100	facebook/m2m100_418M	0.418	Hugging Face Transformers
	M2M100	facebook/m2m100_1.2B	1.2	Hugging Face Transformers
	NLLB-200	facebook/nllb-200-distilled-600M	0.6	Hugging Face Transformers
		facebook/nllb-200-1.3B	1.3	Hugging Face Transformers
		facebook/nllb-200-distilled-1.3B	1.3	Hugging Face Transformers
		facebook/nllb-200-3.3B	3.3	Hugging Face Transformers
Decoder-only LLM	DeepScaleR	deepscaler	1.5	Ollama
	Dolphin 3.0 Llama 3.1	dolphin3	2.7	Ollama
	Google Gemma 2	gemma2 and gemma2:9b	9	Ollama
	Google Gemma 2	gemma2:27b	27	Ollama
	Google Gemma 3	gemma3	4	Ollama
		gemma3:12b	12	Ollama
		gemma3:27b	27	Ollama
	GPT-3.5 Turbo	gpt-3.5-turbo	20	OpenAI API	Parameter count estimate from this paper.
	GPT-4o	gpt-4o	200	OpenAI API	Parameter count estimate from this paper.
	GPT-4o mini	gpt-4o-mini	8	OpenAI API	Parameter count estimate from this article.
	Claude 3.7 Sonnet	claude-3-7-sonnet-20250219	175	Anthropic API	The parameter count is an estimate (3.5 has been reported to have 175 in this paper)
	Claude 3.5 Haiku	claude-3-5-haiku-20241022	20	Anthropic API	The parameter count is a guess (it is probably larger).
	Llama 3.1	llama3.1	8	Ollama
	Llama 3.1	llama3.1:70b	70	Ollama
	Llama 3.2	llama3.2	3	Ollama
	Llama 3.3	llama3.3	70	Ollama
	Mistral Nemo	mistral-nemo	12	Ollama
	Mistral Small 3.1	mistral-small3.1	24	Ollama
	Mistral Small 3	mistral-small	24	Ollama
	Mistral Large 2	mistral-large	123	Ollama
	Llama-3.1-Nemotron-70B-Instruct	nemotron	70	Ollama
	OLMo 2	olmo2:13b	13	Ollama
	Teuken-7B-instruct-commercial-v0.4	openGPT-X/Teuken-7B-instruct-commercial-v0.4	7	Hugging Face Transformers
	Teuken-7B-instruct-research-v0.4	openGPT-X/Teuken-7B-instruct-research-v0.4	7	Hugging Face Transformers
	Phi-4	phi4	14	Ollama
	Phi-4-mini	phi4-mini	3.8	Ollama
	Qwen2.5	qwen2.5:1.5b	1.5	Ollama
	Qwen2.5	qwen2.5:72b	72	Ollama
	EuroLLM-1.7B-Instruct	utter-project/EuroLLM-1.7B-Instruct	1.7	Hugging Face Transformers
	EuroLLM-9B-Instruct	utter-project/EuroLLM-9B-Instruct	9	Hugging Face Transformers
	Salamandra	BSC-LT/salamandra-7b-instruct	7	Hugging Face Transformers

Results

Translation Direction: en → lv

Rare Unicode character Jaccard index

Jaccard index of 1 means that all rare Unicode characters that are found in the reference are found in the hypothesis, and there are no other such Unicode characters found in the hypothesis that are not present in the reference.

All dataset - 2000 sentences - Rare Unicode character Jaccard index

Tag Jaccard index

Jaccard index of 1 means that all tags that are found in the reference are found in the hypothesis, and there are no other tags found in the hypothesis that are not present in the reference. Everything below 1 basically means that the output is not usable for document translation unless some backup solution is implemented that handles all cases where models mess up tags.

All dataset - 2000 sentences - Tag Jaccard index

Proportion of sentences with valid tag placement

A proportion of 1 means that all tag pairs that are found in the translation are valid (they do not overlap and their sequence is correct). Everything below 1 basically means that the output is not usable for document translation unless some backup solution is implemented that handles all cases where models mess up tag placement.

One-Shot Sentence-Level eMpTy Robustness Bench

List of Benchmarked Models and Systems

Results

Translation Direction: en → lv

Rare Unicode character Jaccard index

All dataset - 2000 sentences - Rare Unicode character Jaccard index

Tag Jaccard index

All dataset - 2000 sentences - Tag Jaccard index

Proportion of sentences with valid tag placement

All dataset - 2000 sentences - Proportion of sentences with valid tag placement

Translation Direction: lv → en

Rare Unicode character Jaccard index

All dataset - 2000 sentences - Rare Unicode character Jaccard index

Tag Jaccard index

All dataset - 2000 sentences - Tag Jaccard index

Proportion of sentences with valid tag placement

All dataset - 2000 sentences - Proportion of sentences with valid tag placement