Zero-Shot In-Context Multi-Choice Question-Answering Bench

Task: In-context question answering (QA) is the task where a model answers questions based on a provided text passage, using only the given context without relying on external knowledge. In-context question answering lies at the core of retrieval-augmented generation (RAG) solutions, where a generative LLM is tasked to provide an answer to a user's question using documents (or document fragments) retrieved from a semantic database.
Dataset: Belebele. The dataset contains 900 questions for each language. Each question comes with a text passage (context) and four possible answers. Since selecting the best answer requires understanding which option best answers the question given the text fragment, the dataset is useful for benchmarking in-context QA, albeit in a very simple setting where the LLM's task is just to output the correct answer's ID. This benchmark shows which models could be useful for in-context question answering (and RAG solutions) in practice, but as it covers only this simple scenario, make sure you test each model on your language and the specifics of your task.
Prompt example:
{"messages":
        [{"role": "system", "content": "Answer the question by choosing the answer that can be inferred from the given text. Report only the correct answer's number using the following JSON format: {\"answer_id\":\"\"}."},
        {"role": "user", "content": "Text:\nOne of the most common problems when trying to convert a movie to DVD format is the overscan. Most televisions are made in a way to please the general public. For that reason, everything you see on the TV had the borders cut, top, bottom and sides. This is made to ensure that the image covers the whole screen. That is called overscan. Unfortunately, when you make a DVD, it's borders will most likely be cut too, and if the video had subtitles too close to the bottom, they won't be fully shown.\nQuestion: According to the passage, which of the following problems might one encounter when converting a movie to DVD format?\nPossible answers:\n1) An image that doesn't fill the entire screen\n2) Partially cut subtitles\n3) An image that fills the entire screen\n4) Cut borders\n"}]
}
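For reference, below is a minimal sketch of how a Belebele record can be turned into the prompt above. The field names (flores_passage, question, mc_answer1..mc_answer4) follow the public Hugging Face release of the dataset and are an assumption here, not something taken from this page; adapt them if your copy differs.

```python
# Minimal sketch: building the chat prompt from a Belebele record.
# Assumes the Hugging Face field names (flores_passage, question,
# mc_answer1..mc_answer4); adjust if your copy of the dataset differs.
SYSTEM_PROMPT = (
    "Answer the question by choosing the answer that can be inferred from "
    "the given text. Report only the correct answer's number using the "
    'following JSON format: {"answer_id":""}.'
)

def build_messages(record: dict) -> list[dict]:
    # Number the four candidate answers as "1) ... 2) ..." etc.
    answers = "\n".join(
        f"{i}) {record[f'mc_answer{i}']}" for i in range(1, 5)
    )
    user_prompt = (
        f"Text:\n{record['flores_passage']}\n"
        f"Question: {record['question']}\n"
        f"Possible answers:\n{answers}\n"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
```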
Metrics: Accuracy. Below we are also interested in seeing whether the LLMs output data in the required format.
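Accuracy here is simply the share of questions for which the answer ID extracted from a model's output matches the gold answer. A minimal sketch, assuming the extraction step has already been done (with unparsable outputs mapped to None):

```python
def accuracy(predicted_ids: list[str | None], gold_ids: list[str]) -> float:
    """Share of questions where the extracted answer ID matches the gold one.

    Unparsable model outputs should be passed as None, so they count as wrong.
    """
    assert len(predicted_ids) == len(gold_ids)
    hits = sum(p == g for p, g in zip(predicted_ids, gold_ids))
    return hits / len(gold_ids)
```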

Accuracy Bar Chart

[Interactive bar chart: top model accuracy scores for each language.]

Accuracy Heatmap

[Interactive heatmap: accuracy per model and language.]

Output Format Analysis

Here we analyse whether the LLMs are able to follow the instruction to return only the requested JSON object:
{"answer_id":""}
We distinguish four levels of obedience (a classifier sketch implementing them follows the list):
  1. Output is exactly the requested JSON object, e.g.:
    {"answer_id":"1"}
  2. Output contains the requested JSON object, e.g.:
    ```json
    {"answer_id":"1"}
    ```
  3. Output contains a number (this can also happen when there is a JSON object, but its structure is wrong), e.g.:
    The correct answer is 1
  4. Complete and utter failure (any kind of hallucination without even a number in it), e.g.:
    The sky is blue, isn't it?
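As an illustration, here is a minimal sketch of how raw outputs could be sorted into these four levels. It approximates the rules above and is not necessarily the exact classifier used to produce the table below:

```python
import json
import re

def classify_output(output: str) -> int:
    """Classify a raw model response into one of the four obedience levels."""
    text = output.strip()
    # Level 1: the response is exactly the requested JSON object.
    try:
        obj = json.loads(text)
        if isinstance(obj, dict) and set(obj) == {"answer_id"}:
            return 1
    except json.JSONDecodeError:
        pass
    # Level 2: the requested JSON object appears somewhere in the response,
    # e.g. wrapped in a Markdown code fence.
    if re.search(r'\{\s*"answer_id"\s*:\s*"?\d+"?\s*\}', text):
        return 2
    # Level 3: no well-formed JSON object, but the response contains a number.
    if re.search(r"\d", text):
        return 3
    # Level 4: complete failure, not even a number in the output.
    return 4
```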
| Model | Output is the required JSON object | Output contains the required JSON object | Output contains a number | Output is a complete failure |
|---|---|---|---|---|
| BSC-LT-salamandra-7b-instruct | 0.000 | 0.000 | 0.981 | 0.019 |
| claude-3-5-haiku-20241022 | 0.000 | 0.999 | 0.001 | 0.000 |
| claude-3-7-sonnet-20250219 | 0.000 | 0.999 | 0.000 | 0.001 |
| dolphin3 | 0.000 | 0.998 | 0.002 | 0.000 |
| gemma2:27b | 0.000 | 0.994 | 0.005 | 0.000 |
| gemma2:9b | 0.000 | 0.994 | 0.006 | 0.000 |
| gemma3 | 0.000 | 0.997 | 0.003 | 0.000 |
| gemma3:12b | 0.000 | 1.000 | 0.000 | 0.000 |
| gemma3:27b | 0.000 | 1.000 | 0.000 | 0.000 |
| gpt-3.5-turbo | 0.000 | 1.000 | 0.000 | 0.000 |
| gpt-4o | 0.000 | 1.000 | 0.000 | 0.000 |
| gpt-4o-mini | 0.000 | 1.000 | 0.000 | 0.000 |
| llama3.1 | 0.000 | 0.999 | 0.001 | 0.000 |
| llama3.1:70b | 0.000 | 0.999 | 0.001 | 0.000 |
| llama3.2 | 0.000 | 1.000 | 0.000 | 0.000 |
| llama3.3 | 0.000 | 0.999 | 0.001 | 0.000 |
| mistral-large | 0.000 | 0.999 | 0.000 | 0.001 |
| mistral-nemo | 0.000 | 0.999 | 0.000 | 0.001 |
| mistral-small | 0.000 | 0.999 | 0.001 | 0.000 |
| mistral-small3.1 | 0.000 | 1.000 | 0.000 | 0.000 |
| nemotron | 0.000 | 0.990 | 0.010 | 0.000 |
| olmo2:13b | 0.000 | 0.983 | 0.017 | 0.000 |
| openGPT-X-Teuken-7B-instruct-commercial-v0.4 | 0.000 | 0.544 | 0.452 | 0.004 |
| openGPT-X-Teuken-7B-instruct-research-v0.4 | 0.000 | 0.683 | 0.307 | 0.010 |
| phi4 | 0.000 | 0.977 | 0.012 | 0.011 |
| phi4-mini | 0.000 | 1.000 | 0.000 | 0.000 |
| qwen2.5:1.5b | 0.000 | 0.658 | 0.339 | 0.003 |
| qwen2.5:72b | 0.000 | 0.997 | 0.002 | 0.001 |
| utter-project-EuroLLM-1.7B-Instruct | 0.000 | 0.000 | 1.000 | 0.000 |
| utter-project-EuroLLM-9B-Instruct | 0.000 | 0.050 | 0.940 | 0.010 |