TildeBench - Large Language Model Leaderboard

Zero-shot Multi-Choice QA Benchmark
Zero-shot In-Context Multi-Choice Question-Answering Benchmark

Tests the performance of LLMs on zero-shot multiple-choice question answering and analyses how well they output answers in the required format.

Languages covered: Czech, English, Estonian, Finnish, French, German, Hungarian, Latvian, Lithuanian, Polish, Russian, Ukrainian
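Checking whether a model answered in the required format can be done with a strict parser. The exact answer format TildeBench requires is not specified here, so the sketch below assumes the common convention of a lone option letter (optionally followed by punctuation); the function name `parse_choice` is hypothetical.

```python
import re

def parse_choice(output, n_options=4):
    """Extract a multiple-choice answer letter from model output.

    Returns the letter if the output follows the assumed format
    (a single option letter, optionally followed by punctuation),
    and None if the model failed to comply.
    """
    valid = "ABCDEFGH"[:n_options]
    m = re.fullmatch(rf"\s*([{valid}])[.):]?\s*", output)
    return m.group(1) if m else None
```

A compliant output such as "B." parses cleanly, while free-form text such as "The answer is B" counts as a format failure even though the answer is recoverable by a human.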

One-shot Machine Translation Benchmark
One-Shot Sentence-Level Machine Translation Benchmark

Tests the one-shot translation capabilities of LLMs. Sentences are translated without document context (document-level benchmark coming soon...).

Languages covered: English, Estonian, French, German, Latvian, Lithuanian, Polish, Russian
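One-shot translation means the prompt contains exactly one demonstration pair before the sentence to translate. TildeBench's actual prompt template is not shown here; this is a generic sketch of the technique, and the function name and wording are assumptions.

```python
def one_shot_prompt(src_lang, tgt_lang, example_src, example_tgt, sentence):
    """Build a one-shot translation prompt: a single demonstration
    pair, then the sentence to translate, ending at the target-language
    cue so the model completes with the translation."""
    return (
        f"Translate the following sentence from {src_lang} to {tgt_lang}.\n"
        f"{src_lang}: {example_src}\n"
        f"{tgt_lang}: {example_tgt}\n"
        f"{src_lang}: {sentence}\n"
        f"{tgt_lang}:"
    )

prompt = one_shot_prompt("English", "Latvian",
                         "Good morning.", "Labrīt.",
                         "Thank you.")
```

The single example both demonstrates the task and pins down the expected output language and format.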

One-Shot Sentence-Level Machine Translation Robustness Benchmark
One-Shot Sentence-Level Machine Translation Robustness Benchmark

Tests the robustness of machine translation models, LLMs, and commercial MT services when encountering unseen data and tagged content, conditions that production systems must routinely handle.

Languages covered: English, Latvian
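A typical robustness check for tagged content is whether inline markup from the source survives translation intact. The benchmark's own evaluation procedure is not described here; the sketch below is one plausible check, comparing tag multisets between source and translation.

```python
import re

# Matches inline markup such as <b>, </b>, or <ph id="1"/>.
TAG_RE = re.compile(r"</?\w+[^>]*>")

def tags_preserved(source, translation):
    """Return True if every inline tag in the source appears in the
    translation with the same multiplicity (order is not enforced)."""
    return sorted(TAG_RE.findall(source)) == sorted(TAG_RE.findall(translation))

ok = tags_preserved("Click <b>here</b> to continue.",
                    "Nospiediet <b>šeit</b>, lai turpinātu.")
bad = tags_preserved("Click <b>here</b>.",
                     "Nospiediet šeit.")
```

Systems that drop, duplicate, or mangle tags fail this check even when the plain-text translation is otherwise acceptable.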

LLM Tokenizer Benchmark
Large Language Model Tokenizer Benchmark

Compares LLM tokenizers (total number of tokens generated, token-per-word ratio, vocabulary size).

Languages covered: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Ukrainian
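The tokenizer metrics listed above can be computed generically from any tokenizer. The sketch below uses a naive regex tokenizer as a stand-in (a real run would plug in each model's own tokenizer); words are approximated by whitespace splitting, which is itself an assumption.

```python
import re

def tokenizer_metrics(tokenize, texts):
    """Compute total token count and token-per-word ratio over a corpus.

    `tokenize` is any callable mapping a string to a list of tokens;
    words are approximated by splitting on whitespace.
    """
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return {
        "total_tokens": total_tokens,
        "tokens_per_word": total_tokens / total_words,
    }

# Stand-in tokenizer: splits into word and punctuation tokens.
def naive_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text, flags=re.UNICODE)

metrics = tokenizer_metrics(naive_tokenize, ["Labdien, pasaule!", "Hello world."])
```

A lower token-per-word ratio generally means the tokenizer represents that language more compactly, which matters for inference cost and effective context length.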

LLM Error Analysis
LLM Error Analysis for Languages of the Baltic States

Manual error analysis of LLM output for Latvian, Lithuanian, and Estonian, with a focus on European LLMs.

Languages covered: Estonian, Latvian, Lithuanian

More benchmarks coming soon...