
Tokenizing the Sandwich Debate: How NLP Models Weigh In on Hot Dogs

A corgi eating a hot dog

By now, many of us have interacted with a chat-based AI agent and perhaps been astonished by its ability to comprehend our statements and respond intelligently. One important thing to keep in mind when interacting with artificial intelligence is that it isn’t sentient and doesn’t process information the way a human brain does; when it comes to natural language processing (NLP), it’s a ton of linear algebra behind the scenes.

If you want to understand more about NLP models and how they process your request to explain the key differences between, say, llamas and alpacas, join me on a journey through natural language processing and tokenization.


NLP 101

It’s late on a Friday night and you’re in a heated discussion with your roommate about whether or not a hot dog is a sandwich. You plan on resolving this conflict and assuredly shutting down her stance by pulling out your intangible friend: your favorite artificial intelligence assistant. You prompt the assistant to provide insight into the core of the argument, and within moments you receive a response. I’m not here to tell you where I stand on this argument, but I am here to tell you how the assistant interpreted your request and processed it.


The field of study at the intersection of linguistics and artificial intelligence is known as natural language processing (NLP). In a nutshell, research in this field specializes in how computers can interpret and act upon human language. Sub-fields include natural language understanding (NLU) and natural language generation (NLG), which respectively deal with the ingestion and processing of human language input and the generation of semantically coherent human language output. Adjacent areas include speech recognition and sentiment analysis, among others.


Before the rise of Large Language Models (LLMs), NLP models were typically trained for specific tasks. Originally, these models underwent extensive supervised training on vast amounts of labeled language data to “learn” how to communicate with a user, a method that was inefficient because it demanded enormous amounts of data labeling and time. Pre-trained models (PTMs) like Google’s Bidirectional Encoder Representations from Transformers (BERT) and OpenAI’s Generative Pre-trained Transformer (GPT) brought dramatic efficiency and processing improvements to the field, easing the reliance on supervised learning and labeled datasets and instead evolving NLP with the use of transformers.


Transformers were introduced in 2017 in the landmark paper “Attention Is All You Need,” where researchers showed that a model relying entirely on attention mechanisms could process input in parallel (as opposed to earlier architectures, which processed it sequentially). Attention mechanisms are a method of prioritizing the most relevant parts of the input data: they assign different “weights” to parts of the input, with those weights learned through supervised or self-supervised training. The “self-attention” aspect of transformers is one of their key innovations; it lets each part of the input interact with every other part to determine their semantic relationship to each other, removing the bottleneck of recurrent neural networks (RNNs), which had to process data through many sequential steps before arriving at predictions. Built upon transformers, BERT and GPT offered new functionality for tasks such as answering questions and summarizing text.


Tokenization

In order to assess the input and its semantic relationships, the input first needs to be tokenized. Tokenization refers to splitting the input into individual units, often whole words (modern models typically use sub-word pieces), so that each token’s weight, and therefore its relationship to the other tokens in the sentence or prompt, can be calculated. For example, if you open your favorite AI assistant and enter the prompt “determine if a hot dog is a sandwich,” the model could tokenize the sentence into individual words:


["determine", "if", "a", "hot", "dog", "is", "a", "sandwich"]
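
As a rough illustration (and only an illustration: simple_tokenize is a made-up helper, and production models actually use learned sub-word tokenizers such as WordPiece or byte-pair encoding rather than whitespace splitting), word-level tokenization can be sketched in a few lines of Python:

# Word-level tokenization via whitespace splitting (illustrative only;
# real models use learned sub-word tokenizers like WordPiece or BPE).
import string

def simple_tokenize(prompt: str) -> list[str]:
    # Split on whitespace and strip surrounding punctuation from each piece.
    return [word.strip(string.punctuation) for word in prompt.split()]

tokens = simple_tokenize("determine if a hot dog is a sandwich")
print(tokens)  # ['determine', 'if', 'a', 'hot', 'dog', 'is', 'a', 'sandwich']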

Each of these words is then assigned a “token embedding weight.” Token embedding weights are “statically pre-assigned” in the model and are multidimensional vectors quantifying the relationship of one word to another. For example, take the words “hamster” and “cat.” They’d be close to each other along the “animal” and “mammal” dimensions, but further apart along an “animal classification” dimension because they belong to different genera. As another example, “ice cream” and “steak” would be close along a “consumable” dimension but further apart along a “food group” dimension, since they belong to different food groups. So, for our overall example, each of the words in our prompt is given a token embedding weight in the form of a multidimensional vector that accounts for all of these relationships.
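
To give a feel for how these vectors encode relationships, here is a toy sketch with invented four-dimensional embeddings; real token embeddings are learned during training and typically have hundreds of dimensions, so the numbers and dimension labels below are purely illustrative:

# Toy embedding table with invented dimensions and values. Real token
# embeddings are learned and typically have hundreds of dimensions.
import numpy as np

embeddings = {
    # invented dimensions: [animal, mammal, consumable, dessert]
    "hamster":   np.array([0.9, 0.8, 0.1, 0.0]),
    "cat":       np.array([0.9, 0.9, 0.1, 0.0]),
    "ice cream": np.array([0.0, 0.0, 0.9, 0.9]),
    "steak":     np.array([0.1, 0.0, 0.9, 0.0]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["hamster"], embeddings["cat"]))        # very high
print(cosine_similarity(embeddings["ice cream"], embeddings["steak"]))    # close, but not identical
print(cosine_similarity(embeddings["hamster"], embeddings["ice cream"]))  # low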


An additional weight is then computed: the self-attention weight. Its primary purpose is to evaluate the relationship between tokens in context. In simple terms, the self-attention weight is computed using three matrices: query, key, and value. Geetansh Kalra offers a great analogy for these matrices and how they relate to one another in his blog post “Attention Networks: A simple way to understand Self-Attention.” Think of the query as a YouTube search, and the keys as the results that come up. The relationship between the query and each key is then computed to determine which video would best meet your needs (its value). These three matrices are used to compute similarity scores which, with word position taken into account via positional embeddings, eventually tell the model what is semantically important. The resulting attention scores are converted into a probability distribution to scale and distribute attention appropriately across the input. Finally, the value vectors are aggregated to show how the model relates concepts within the input to each other.
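
To make this pipeline concrete, here is a minimal sketch of scaled dot-product self-attention in Python (a hedged illustration: the embeddings and projection matrices below are random placeholders rather than learned parameters, and a real transformer adds positional encodings, multiple attention heads, and many stacked layers on top of this):

# Scaled dot-product self-attention on 8 tokens. Embeddings and
# projection matrices are random placeholders, not learned parameters.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 8, 16                      # 8 tokens in our prompt, toy model width

X = rng.normal(size=(n_tokens, d_model))       # token embeddings (+ positional info)
W_q = rng.normal(size=(d_model, d_model))      # learned query projection in a real model
W_k = rng.normal(size=(d_model, d_model))      # learned key projection
W_v = rng.normal(size=(d_model, d_model))      # learned value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # query, key, value vectors per token

scores = Q @ K.T / np.sqrt(d_model)            # 8x8 matrix of scaled dot products
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax rows
output = weights @ V                           # each token becomes a weighted mix of values

print(weights.shape, weights.sum(axis=-1).round(3))  # (8, 8), every row sums to 1.0

Each row of weights records how much one token attends to every other token, which is the quantity the rest of this walkthrough builds up by hand.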


So how does this apply to our example? When the prompt is first ingested, the model applies a token embedding weight to each of the words based on their pre-determined relationships to each other (recall our “cat” and “hamster” example). Next, the self-attention weight is computed. Starting with the word “determine,” query, key, and value vectors are generated for it; this is repeated for each word in the input. Using these vectors, the attention scores are generated. For the word “determine,” the model will compute:

  • Score(1,1) = Q1 · K1 (how much “determine” attends to itself)

  • Score(1,2) = Q1 · K2 (how much “determine” attends to “if”)

  • Score(1,3) = Q1 · K3 (how much “determine” attends to “a”), etc.


Since we have 8 tokens, this will generate an 8x8 matrix of attention scores with each cell containing the scalar dot product of the query and key vectors. The higher the value, the more "attention" one token should give to another.



8x8 dot product result of query and key vectors
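
To make the bullet points above concrete, the sketch below builds that 8x8 score matrix for our prompt. The query and key vectors are random stand-ins for what a trained model would produce, so only the shapes and mechanics, not the actual numbers, are meaningful:

# Build the 8x8 attention score matrix for our 8-token prompt using
# random query/key vectors (stand-ins for what a trained model produces).
import numpy as np

tokens = ["determine", "if", "a", "hot", "dog", "is", "a", "sandwich"]
rng = np.random.default_rng(42)
d_k = 4                                    # tiny query/key dimension for readability

Q = rng.normal(size=(len(tokens), d_k))    # one query vector per token
K = rng.normal(size=(len(tokens), d_k))    # one key vector per token

scores = Q @ K.T                           # scores[i, j] = Q_i . K_j
print(scores.shape)                        # (8, 8)

# The first row mirrors the bullet list above: how "determine" attends
# to itself, to "if", to "a", and so on.
for j, token in enumerate(tokens):
    print(f'"determine" -> "{token}": {scores[0, j]:.2f}')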

These attention scores are then scaled to keep later computations numerically stable and to distribute attention appropriately across the tokens. The conversion into a probability distribution is performed using the softmax function, which maps each score to a value between 0 and 1 such that the values sum to 1. Lastly, the representation of each token is updated through value aggregation, which ultimately shows how the model might relate concepts like “hot dog” and “sandwich” to each other. Note that this correlation, i.e. the attention weights, does not determine whether the model believes a hot dog is a sandwich; it only offers insight into how the model connects the concepts “hot dog” and “sandwich” within the sentence. Whether your favorite AI assistant ultimately thinks a hot dog is a sandwich is a product of how these word representations flow through the entire model network, not of this attention computation alone.
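
Continuing the sketch with placeholder numbers, the snippet below applies the scaling and softmax steps to one row of raw scores (a hypothetical “determine” row) and then aggregates invented value vectors, just to show that the resulting weights form a probability distribution that sums to 1:

# Scale one row of raw attention scores, convert it to a probability
# distribution with softmax, then aggregate placeholder value vectors.
import numpy as np

d_k = 4
raw_scores = np.array([2.3, 0.1, -0.4, 1.7, 1.9, 0.2, -0.3, 2.8])  # invented "determine" row
scaled = raw_scores / np.sqrt(d_k)                 # scaling step described above

weights = np.exp(scaled) / np.exp(scaled).sum()    # softmax: each weight is between 0 and 1
print(weights.round(3), weights.sum())             # the eight weights sum to 1.0

V = np.random.default_rng(7).normal(size=(8, d_k))  # placeholder value vectors, one per token
updated_determine = weights @ V                     # "determine" becomes a weighted mix of values
print(updated_determine)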


So, in the end, while your debate with your roommate may be intense, emotional, and potentially grounds for early termination of your lease, the computation behind your inquiry to your AI assistant of choice is entirely devoid of human sentiment and very heavy on linear algebra. Every word in your prompt is converted into quantitative values so that the model can better understand your question and (hopefully) not hallucinate a poor response that deepens the rift in your living situation. I hope this blog post was insightful for folks breaking into AI, and especially NLP! Please feel free to contact me on BlueSky at @atomicchonk.bsky.social.


References

Vaswani, A., et al. “Attention Is All You Need.” 2017.

Kalra, Geetansh. “Attention Networks: A simple way to understand Self-Attention.”
