Understanding Generative AI Tokenization: A Deep Dive into Its Complexities
Generative AI models process text differently than humans. By exploring their “token”-based internal mechanisms, we can shed light on their peculiar behaviors and persistent limitations.
From small on-device models like Gemma to leading systems like OpenAI's GPT-4, most of today's generative AI models are built on an architecture called the transformer. Because of the way transformers form associations between text and other kinds of data, they cannot take in or put out raw text directly, at least not without an enormous amount of compute.
To navigate these challenges, contemporary transformer models operate with text broken down into smaller, manageable units known as tokens — a technique referred to as tokenization.
Tokens may consist of full words, such as “fantastic,” or they can be broken down even further into syllables like “fan,” “tas,” and “tic.” Depending on the tokenizer’s design — the model responsible for this breakdown — tokens might also represent individual characters (for example, “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).
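To make this concrete, here is a minimal sketch using OpenAI's open-source tiktoken library as a stand-in for a model's tokenizer; the splits it produces for “fantastic” depend on its learned vocabulary and may not match the syllable example above.

```python
# Sketch: inspect how one real tokenizer (tiktoken's cl100k_base, used by
# GPT-4) splits a word into tokens. Exact splits vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "fantastic"
token_ids = enc.encode(word)

# Decode each token id individually to see the text piece it stands for.
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)   # the integer ids the model actually receives
print(pieces)      # the subword pieces those ids represent
```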
Tokenization lets transformers take in more information, in the semantic sense, before they hit the upper limit known as the context window. But it can also introduce biases.
For instance, some tokens carry unusual spacing that can trip up a transformer. A tokenizer might encode the phrase “once upon a time” as “once,” “upon,” “a,” “time,” yet encode “once upon a ” (with a trailing space) as “once,” “upon,” “a,” “ ” (a standalone space token). As a result, a model prompted with “once upon a” may produce completely different output than one prompted with “once upon a ” (ending in the space), because the model does not grasp, as a human would, that the meaning is the same.
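The same tiktoken-based probe shows the whitespace quirk directly; the exact token splits are tokenizer-specific.

```python
# Sketch: the same phrase with and without a trailing space can tokenize
# differently, a distinction invisible to humans but not to the model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

with_space = enc.encode("once upon a ")
without_space = enc.encode("once upon a")

print(with_space)
print(without_space)
print(with_space == without_space)  # often False: the trailing space changes the sequence
```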
Tokenization also treats letter casing differently. To a model, “Hello” might not equate to “HELLO”; typically, “hello” is one token, whereas “HELLO” can be broken into three tokens (“HE,” “L,” “LO”). This variance often leads transformers to struggle with capital letter variations.
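The same kind of probe, with the assumed tiktoken tokenizer, makes the casing gap visible; the pieces it produces for “HELLO” come from its own learned vocabulary and may not match the three listed above.

```python
# Sketch: compare how many tokens lowercase vs. uppercase text uses.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ("hello", "Hello", "HELLO"):
    ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in ids]
    print(f"{text!r}: {len(ids)} token(s) -> {pieces}")
```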
“It’s challenging to pinpoint exactly what constitutes a ‘word’ for a language model. Even if experts could agree on the perfect token vocabulary, models likely benefit from further breaking down words into smaller segments,” says Sheridan Feucht, a PhD student focused on large language model interpretability at Northeastern University. “I believe there’s no perfect tokenizer due to this ambiguity.”
This “ambiguity” creates additional hurdles for languages outside English. Many tokenization techniques incorrectly assume that spaces in sentences denote new words, a design flaw stemming from their English-centric origins. This limitation becomes evident in languages like Chinese and Japanese, which do not use spaces to separate words.
A 2023 Oxford study revealed that due to differences in tokenization across languages, transformers can take twice as long to complete tasks phrased in non-English languages compared to their English counterparts. Other research highlights that users of less “token-efficient” languages may experience poorer performance from models while incurring higher costs since many AI companies charge based on token usage.
Tokenizers often treat each character in logographic writing systems—where symbols represent entire words—as separate tokens, leading to inflated token counts. Similarly, agglutinative languages like Turkish, which combine small meaningful elements known as morphemes to create words, also result in an increased number of tokens. For example, the Thai word for “hello,” สวัสดี, can count as six tokens.
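A rough way to compare this token “cost” across languages yourself, again assuming the tiktoken tokenizer as a stand-in; counts will differ by tokenizer and may not match the figures above.

```python
# Sketch: count how many tokens the same greeting costs in different languages.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

greetings = {
    "English": "hello",
    "Thai": "สวัสดี",
    "Japanese": "こんにちは",
    "Turkish": "merhaba",
}

for language, text in greetings.items():
    print(f"{language}: {len(enc.encode(text))} token(s)")
```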
In 2023, Google DeepMind AI researcher Yennie Jun analyzed how tokenization differs across languages and what the downstream effects are. Using a dataset of parallel texts in 52 languages, Jun found that some languages needed up to ten times as many tokens to capture the same meaning as English.
Beyond language disparities, inconsistent digit tokenization may help explain why today's models struggle with math. Digits are often tokenized unevenly; a model might see “380” as a single token but split “381” into two (“38” and “1”). This inconsistency breaks the digit-level relationships a model would need to reason about quantities. A recent study also found that models fail to grasp repeating numerical patterns and context, particularly temporal data; GPT-4, for example, has judged 7,735 to be greater than 7,926.
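The same inspection shows the uneven digit splits; which numbers land on one token versus two is tokenizer-specific, so treat the examples below as a probe rather than a guarantee.

```python
# Sketch: nearby numbers can be split into different numbers of tokens,
# hiding the digit-level structure a model would need for arithmetic.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for number in ("380", "381", "7735", "7926"):
    ids = enc.encode(number)
    pieces = [enc.decode([tid]) for tid in ids]
    print(f"{number}: {len(ids)} token(s) -> {pieces}")
```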
Clumsy tokenization likewise hurts tasks that hinge on individual characters, such as solving anagrams or reversing words: the model sees multi-character tokens rather than letters, as the sketch below illustrates.
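A small sketch of that mismatch, again with tiktoken.

```python
# Sketch: reversing a word is a character-level operation, but the model
# receives subword tokens, so the letter boundaries it needs are hidden.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "fantastic"
pieces = [enc.decode([tid]) for tid in enc.encode(word)]

print("characters the task needs:", list(word))
print("tokens the model sees:    ", pieces)
print("correct reversal:         ", word[::-1])
```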
So, can these tokenization issues be resolved? There may be hope.
Feucht highlights promising advancements like “byte-level” state space models, such as MambaByte, which can assimilate far more data than transformers without the need for tokenization. MambaByte processes raw bytes representing text and exhibits competitive performance on language tasks while effectively managing “noise,” like swapped characters and inconsistent spacing.
However, models like MambaByte are still in experimental stages.
“It’s probably ideal for models to directly analyze characters without tokenization, though that remains computationally impractical for transformers today,” Feucht explains. “In the case of transformer models, computation scales quadratically with sequence length, thus prompting the need for shorter text representations.”
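A back-of-the-envelope sketch of that quadratic cost, using assumed averages (about five characters and roughly 1.3 tokens per English word) purely for illustration.

```python
# Sketch: rough comparison of self-attention cost (proportional to n^2)
# for character-level vs. token-level input. The per-word figures are
# illustrative assumptions, not measurements.
words = 1_000
chars_per_word = 5.0    # assumed average, ignoring spaces and punctuation
tokens_per_word = 1.3   # assumed average for an English subword tokenizer

n_chars = words * chars_per_word
n_tokens = words * tokens_per_word

# Attention compares every position with every other one: ~n^2 pairs.
ratio = (n_chars ** 2) / (n_tokens ** 2)
print(f"character-level attention is roughly {ratio:.0f}x more expensive")
```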
Unless breakthroughs in tokenization occur, new model architectures are likely to be pivotal in advancing generative AI capabilities.