The mathematics and concepts in LLMs

Today, state-of-the-art LLMs such as Gemini, Claude, DeepSeek and GPT exhibit extraordinary capabilities that resemble aspects of human cognition. They can write poems, produce powerful codebases, summarize text and interact with users in a multimodal way. Their performance depends heavily on their architecture. An LLM architecture is a parametrized function that applies a sequence of mathematical operations to the embedded input tokens in order to predict a probability distribution over the next token. While several LLM architectures exist, one very popular architecture is the transformer. It was introduced in 2017 by researchers at Google in a paper called “Attention Is All You Need.” Mathematics is the invisible backbone behind these powerful systems.

Let’s dive into some of the mathematical concepts present in LLMs.

Tokenization

LLMs are trained on massive amounts of text, some of which is extracted from HTML web pages. Tokenization, in the context of LLMs, is the process of breaking text (often its raw byte representation) into fundamental units, called tokens, that the model can work with. Splitting words into pieces lets the model handle not just whole words but also their building blocks, so a rare or unseen word can still be represented from familiar sub-words. One popular tokenization algorithm is Byte Pair Encoding (BPE). It starts from individual characters or bytes and merges the most frequent adjacent pair into a new, larger unit, which increases the vocabulary size while decreasing the number of tokens needed to represent the text. This step is repeated, typically many thousands of times, until the vocabulary reaches a target size. Production-grade LLMs can have vocabularies on the order of a hundred thousand tokens or more. There is an important trade-off here: a larger vocabulary shortens the token sequences the training data turns into, but it also enlarges the model’s embedding table, and both affect the cost of training and serving the model.
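The merge loop described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not how any production tokenizer is implemented: it works on characters rather than bytes, splits the corpus on whitespace, and the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair (ties go to the pair seen first)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def count_tokens(words):
    """Total number of tokens needed to write the whole corpus."""
    return sum(len(symbols) * freq for symbols, freq in words.items())


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated toy corpus."""
    # Each word starts out as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


corpus = "low low low lower lowest"
tokens_before = count_tokens(dict(Counter(tuple(w) for w in corpus.split())))
merges, words = train_bpe(corpus, num_merges=2)
tokens_after = count_tokens(words)
print(merges)                       # [('l', 'o'), ('lo', 'w')]
print(tokens_before, tokens_after)  # 20 10
```

Each merge adds one entry to the vocabulary and shortens the tokenized corpus, which is the trade-off between vocabulary size and sequence length in miniature.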

Embedding

Embedding is the process of representing tokens as vectors in a numerical space that captures their semantic meaning. The technique was popularized in 2013 by Tomas Mikolov and colleagues with word2vec, which transformed words into vectors in a mathematical space where the relationships between words could be visualized and computed. In this high-dimensional space, words with similar meanings cluster together, and their meanings can be compared and contrasted. The word “bat”, for example, can sit close to flying mammals but also close to sports equipment. Similarity between embeddings is typically measured with cosine similarity or the dot product, among other methods.
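As a rough illustration of how similarity is computed in such a space, here is a minimal sketch of cosine similarity. The three-dimensional vectors are invented for the example (real embeddings are learned and have hundreds or thousands of dimensions), and the `embeddings` table is purely hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical toy embeddings; the axes loosely stand for "animal", "sport", "music".
embeddings = {
    "bat": [0.7, 0.6, 0.1],
    "owl": [0.9, 0.1, 0.0],
    "racket": [0.1, 0.9, 0.1],
    "piano": [0.0, 0.1, 0.9],
}

for other in ("owl", "racket", "piano"):
    score = cosine_similarity(embeddings["bat"], embeddings[other])
    print(f"bat vs {other}: {score:.2f}")
```

In this made-up space, "bat" scores high against both "owl" and "racket" and low against "piano", mirroring the idea that one word can sit near several different neighborhoods of meaning.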

Attention Mechanism / Multi-head Attention

The attention mechanism enables LLMs to understand sentences and sequences as a whole. Take a sentence with an ambiguous pronoun: “Sarah told Aisha that she won the award.” Who won the award, Sarah or Aisha? The attention mechanism lets the model interpret every word in relation to all the other words in the sentence. At its core, each word evaluates how relevant every other word in the sentence is to it. The mathematics of attention borrows queries, keys and values from information retrieval: the query represents the word in focus, the keys help identify the relevant words, and the values hold the associated information. The dot product of a query with each key gives the attention scores, which determine the influence words have over each other. Dividing the attention scores by the square root of the key dimension keeps the computation stable, and a softmax then turns the scores into weights that sum to 1 and are used to average the values. The transformer architecture runs multiple attention operations (heads) in parallel, each focusing on different relationships within the same sentence. This provides a richer understanding of context.
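The computation for a single attention head is commonly written as softmax(Q K^T / sqrt(d_k)) V. The sketch below spells that out with plain Python lists so each step is visible. It is a teaching sketch under simple assumptions (one head, no masking, no learned projection matrices), and the toy query, key and value vectors are invented.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors; returns (outputs, weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score each key against the query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# One query that matches the first key more closely than the second.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
print(weights[0])  # the first weight is the larger one
print(outputs[0])  # the output is pulled towards the first value vector
```

Multi-head attention repeats this computation several times with different learned projections of the same input and concatenates the results.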

Position Encoding

Just as the order of words in a sentence conveys meaning, the order of tokens in a sequence carries information that needs to be encoded. Attention on its own treats its input as an unordered set, so position encoding is used to inject information about token order. Each token is assigned a positional representation, which is combined with its embedding. This gives the model an understanding of the structure of the sentence and of how the words relate to each other based on their positions. It is like assigning a mathematical signature to each position. The signature can be built from fixed sinusoidal functions or learned as parameters during training.
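For the sinusoidal variant, the original transformer paper uses sine for even dimensions and cosine for odd dimensions, with wavelengths that grow geometrically: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch of that formula follows; the function name and the tiny `d_model` are choices made for this example.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Return the sinusoidal position vector for one position."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


for pos in range(3):
    print(pos, [round(x, 3) for x in sinusoidal_encoding(pos, d_model=4)])
```

Every position gets a distinct vector with values between -1 and 1, which is then added to the token embedding at that position.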

Gradient Descent

This is the optimization process by which the LLM learns the patterns in the training data and improves. It is the method used to adjust the model’s parameters to minimize the cost (or loss) function, which measures the difference between the model’s predictions and the target values. Using calculus, the algorithm computes the gradient (slope) of the cost function with respect to each parameter. The gradient indicates how the error changes as each parameter changes. The model then updates its parameters in the direction opposite to the gradient, by a step whose size is set by the learning rate, gradually reducing the error and converging towards a (usually local) minimum of the loss function.
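To make the update rule concrete, here is a small sketch that fits a single parameter w in the model y = w * x by gradient descent on the mean squared error. The data, learning rate and step count are arbitrary choices for illustration; real LLM training applies the same idea to billions of parameters using mini-batches and more elaborate optimizers.

```python
def mse_gradient(w, xs, ys):
    """Derivative of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    """Repeatedly step against the gradient, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w


# Toy data generated by y = 2x, so the best w is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = gradient_descent(xs, ys)
print(round(w, 4))  # 2.0
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, convergence is slow.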

Entropy

At its core, entropy is a measure of uncertainty. For an LLM, it describes how spread out the predicted probability distribution over the next token is, and so how the model balances certainty against exploration when generating text. For example, I asked ChatGPT to finish the sentence “The DRC has the best …”. After some back and forth, in which the model advised me to change the wording to avoid vague claims, it finally finished the sentence with “The DRC has the best untapped economic potential in Africa.” This is a low-entropy response. When I asked ChatGPT for a high-entropy response, it replied: “The DRC has the best thunder – rolling like ancient drums across a cobalt sky.” In this case the response was more creative. Entropy sits between randomness and determinism, and it is controlled by the temperature: high temperatures flatten the distribution and let the model make surprising leaps, while low temperatures concentrate it on the most likely tokens and give more predictable, coherent answers. In practical applications this lets LLMs be more or less creative with their answers. A chatbot is more likely to give low-entropy responses, while a story generator will use higher entropy to produce unexpected twists.
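The link between temperature and entropy can be sketched numerically. In the example below, the logits (raw next-token scores) are invented, temperature is applied by dividing the logits before the softmax, and entropy is the Shannon entropy in bits, H = -sum(p * log2(p)). The function names are chosen for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)

print("low temperature :", [round(p, 3) for p in cold], round(entropy_bits(cold), 3))
print("high temperature:", [round(p, 3) for p in hot], round(entropy_bits(hot), 3))
```

At the lower temperature most of the probability mass sits on the top token, so sampling is nearly deterministic. At the higher temperature the distribution is flatter, its entropy is higher, and less likely tokens are sampled more often.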

Scaling

Scaling laws for LLMs show that model performance improves as compute, dataset size and parameter count increase. Large models like Claude and GPT, whose parameter counts are undisclosed but speculated to reach into the trillions, demonstrate the power of scaling. Think of a parameter, a value within the neural network, as a dial on a massive control panel that helps the model make decisions. The more parameters, the more precise and nuanced the model’s responses can be. The ability to capture finer details is the reason performance increases as the model scales. Scaling does have its challenges, however. Beyond a certain point there are diminishing returns. While scaling from 1 million to 10 million parameters might yield significant gains, scaling from 100 billion to 200 billion parameters shows only marginal improvements, while the cost of training grows steeply and requires enormous computational resources and energy. The question now is: how big is too big? As of 2026, GPT-5 is speculated to have well over 1.5 trillion parameters and is the top model for agentic coding, scoring first on Terminal-Bench 2.0.
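Published scaling-law studies typically model loss as a power law in model size, roughly L(N) = (N_c / N)^alpha. The sketch below uses that form to illustrate diminishing returns. The constants `scale` and `exponent` are made-up illustrative values, not fitted ones, so only the shape of the curve is meaningful, not the absolute numbers.

```python
def power_law_loss(num_params, scale=1e13, exponent=0.1):
    """Illustrative power-law loss curve; the constants are assumptions, not measurements."""
    return (scale / num_params) ** exponent


# Loss reduction from a 10x jump at small scale vs a 2x jump at large scale.
gain_small = power_law_loss(1e6) - power_law_loss(1e7)
gain_large = power_law_loss(1e11) - power_law_loss(2e11)

print(f"1M -> 10M parameters:    loss drops by {gain_small:.3f}")
print(f"100B -> 200B parameters: loss drops by {gain_large:.3f}")
```

Under this assumed curve, adding 100 billion parameters at the large end buys a much smaller improvement than adding 9 million at the small end, even though it costs vastly more compute.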

Overfitting

Scaling models presents some risks, one of which is overfitting. This is when the model memorizes the training data rather than learning from it, and loses its ability to generalize to new input, becoming rigid and unreliable. A solution to this is regularization. Regularization pushes the LLM to learn more generalizable representations, i.e. the underlying concepts and not just the specifics of the training data. This lets it apply its knowledge in diverse situations and perform better at inference. One common regularization technique is dropout. During training, dropout temporarily disables randomly chosen neurons. This forces the model to learn more robust patterns by preventing it from relying too heavily on any specific neurons. Another essential regularization technique is weight decay. It keeps the model balanced by penalizing large weights, which avoids overconfidence in any single pattern.
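Both techniques are simple to state in code. The sketch below shows "inverted" dropout (surviving activations are rescaled so their expected value is unchanged) and a plain SGD step with an L2-style weight decay term. Both are common formulations, assumed here for illustration, and the function names are hypothetical.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def sgd_step_with_weight_decay(weights, grads, lr, decay):
    """One SGD update where each weight is also pulled towards zero."""
    return [w - lr * (g + decay * w) for w, g in zip(weights, grads)]


rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0] * 8, rate=0.5, rng=rng)
print(dropped)  # a mix of 0.0 (disabled) and 2.0 (kept and rescaled)

# With zero gradients, weight decay alone shrinks every weight a little.
shrunk = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, decay=0.5)
print(shrunk)
```

Dropout is only applied during training. At inference time all neurons are active, which is why the survivors are rescaled during training.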

Persistent Memory

Memory is a critical component of understanding. Imagine a model that maintains multi-turn conversations and remembers the entirety of your conversations over years. Memory enables models to grasp long-term dependencies, track narratives and maintain coherence in dialogues. Early model architectures limited memory to a single context window, often just a few thousand tokens. Beyond this, earlier information was lost. Innovations like sliding windows aimed to solve this problem. Unlike sliding windows, persistent memory enables models to retain critical information across tasks or sessions.
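One common reading of "sliding window" is sliding-window attention, where each token may only attend to itself and a fixed number of the most recent tokens before it. Assuming that interpretation, the mask can be sketched as follows. The function name and window size are chosen for this example.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so the cost per token stays constant as the sequence grows, but anything that falls outside the window is no longer directly visible. That limitation is what persistent memory tries to address.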

Multimodality

Multimodality bridges the gap between language and perception, creating versatile systems that can understand images, text, audio and video, i.e. seeing, hearing and understanding. This is possible thanks to the concept of shared embeddings. These embeddings serve as a common language for diverse data types, enabling the model to connect videos, text and images in a unified vector space where similar concepts cluster together. For example, the word “piano” might sit next to an image of a piano and the sound of a piano playing. This alignment allows the model to understand that these are all facets of the same idea. When presented with a multimodal prompt, the model uses the shared embeddings to integrate these elements seamlessly. Early models like GPT-3 were text-only, designed to process written language in isolation.