Tokenizer Bench
Token count comparison
Note that the comparison does not take into account that tokenizers may have different vocabulary sizes, which affect the size of the neural network (it may be harder to train neural networks with larger vocabularies). Smaller token counts are therefore not necessarily better. However, comparing overall token counts across languages does show which languages are better or worse supported by a given tokenizer.
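A minimal sketch of how such a per-language token count can be computed; the sample sentences and the model name are only illustrative placeholders, not the corpus used in the benchmark.

```python
from transformers import AutoTokenizer

# Illustrative parallel samples (same sentence in each language).
SAMPLES = {
    "en": "The weather will be sunny tomorrow.",
    "fi": "Huomenna sää on aurinkoinen.",
    "lv": "Rīt laiks būs saulains.",
}

# Any Hugging Face tokenizer can be plugged in here.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

for lang, text in SAMPLES.items():
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    print(f"{lang}: {len(token_ids)} tokens")
```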
Token-per-word ratio comparison
The token-per-word ratio metric shows the average number of tokens into which words are split for each language. Since languages differ, it would be misguided to aim for this metric to be equal across languages. For instance, Finnish and German use a lot of compounding, so we do want their words to be split into smaller parts. English has much less compounding and fewer inflections, so we want English words to be split less. Nevertheless, it is important to check that there is no excessive splitting (the average should stay below 2).
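A sketch of the ratio itself, assuming words are simply whitespace-separated; the tokenizer and the example sentence are placeholders.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokens_per_word(text: str) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    tokens = tokenizer.encode(text, add_special_tokens=False)
    return len(tokens) / len(words)

# A compound-heavy Finnish sentence will typically score higher than
# a plain English one; both should stay well below 2 on average.
print(tokens_per_word("Tietokoneohjelmointi on hauskaa."))
print(tokens_per_word("Computer programming is fun."))
```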
Tokenizer vocabulary size comparison
For Mistral, the scores are the ones published by Mistral. For the other models, the score is the value returned by AutoTokenizer.vocab_size.
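For reference, this is how the vocabulary size is read via AutoTokenizer; the model names below are illustrative, not the exact benchmark list.

```python
from transformers import AutoTokenizer

for name in ["xlm-roberta-base", "bert-base-multilingual-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # vocab_size reports the size of the base vocabulary
    # (added special tokens are not included).
    print(name, tokenizer.vocab_size)
```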
Token count comparison for TildeLM focus languages
TildeLM focuses on Baltic, Finnic, and Slavic languages, where we aim for language equity. For the tokenizer, this means that it should split the same text (regardless of the language) into roughly equal numbers of tokens.
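One way to check this is to normalise per-language token counts on a parallel text against a pivot language; the sketch below uses English as the pivot and illustrative sentences, so values close to 1.0 would indicate equity. This is an assumed setup, not the benchmark's exact procedure.

```python
from transformers import AutoTokenizer

# Illustrative parallel sentences keyed by language code.
PARALLEL = {
    "en": "The committee approved the proposal yesterday.",
    "lt": "Komitetas vakar patvirtino pasiūlymą.",
    "et": "Komitee kiitis ettepaneku eile heaks.",
    "pl": "Komisja zatwierdziła wczoraj propozycję.",
}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

counts = {
    lang: len(tokenizer.encode(text, add_special_tokens=False))
    for lang, text in PARALLEL.items()
}

for lang, count in counts.items():
    print(f"{lang}: {count} tokens, ratio vs. en = {count / counts['en']:.2f}")
```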