How modern language models really function: from tokens to agents
An LLM only predicts the next token. Yet it writes code, summarises contracts, answers questions, and often sounds disturbingly smart.
How is that possible?
AI research has been chasing one big goal since the 1950s: a machine that understands and reasons like a human, has a persistent consciousness, and pursues its own goals. That is the strong AI of pop culture.
Despite decades of research, the goal is still unsolved. Every phase of euphoria has so far been followed by a long winter.
An LLM doesn’t think. It computes which sequence of words is statistically most likely to follow your input. The result often feels like understanding. It’s a simulation, though, not consciousness.
Precisely because the goal is smaller, things suddenly moved fast.
Both get casually called “AI”. If you want to understand LLMs, leave the strong-AI promise explicitly aside.
Instead of a thousand small models for a thousand tasks: one large model as a universal language tool.
What actually happens when you type a question into the input field and hit “Send”? Four steps, each one standard engineering today.
Only after this translation can the model actually “do” anything. Let’s look at each step in turn.
A token is usually a word fragment, not a whole word, not a single letter. That makes processing efficient: frequent words are a single token, rare ones get split into pieces.
Note: The split shown here is a simplified demonstration. Real models use methods like Byte-Pair Encoding and learn from billions of sentences which fragments frequently co-occur.
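To make the Byte-Pair Encoding idea concrete, here is a minimal sketch of the merge-learning step on a four-word toy corpus. The corpus, the word counts and the function names are assumptions for illustration; real tokenizers work on bytes, handle whitespace and special tokens, and learn tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Start from single characters and repeatedly merge the most frequent pair."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Toy corpus (hypothetical): word -> how often it occurs
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, final_words = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(final_words)
```

After a few merges, the frequent ending "est" has become a single symbol, while rarer words are still split into smaller pieces. That is the "frequent fragments become tokens" effect described above.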
A vector is a long list of numbers. Think of them as coordinates for a point in a space. In modern models that space typically has roughly 4,000 to 12,000 dimensions, which means that many numbers per token.
These numbers aren’t random. They’re tuned during training so that related words end up in the same region of the space.
Meaning becomes geometry. Words with similar meaning sit close together; opposites sit far apart.
Relationships between words (gender, plural, capital city…) become directions in the space. This classic example comes from Word2Vec (2013); in modern transformers the geometry is context-dependent, but the principle is the same. That’s how the model generalises to sentences that never appeared verbatim in the training data.
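The "meaning becomes geometry" idea can be sketched with a handful of hand-picked vectors. Everything here is an assumption for illustration: the four dimensions, the numbers and the helper names are made up, and real embeddings are learned and have thousands of dimensions.

```python
import math

# Hand-picked toy vectors, NOT learned embeddings.
# Dimensions, loosely: [royalty, male, female, food]
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: close to 1 means same direction, close to 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target, exclude=()):
    """Return the word whose vector points most nearly in the direction of `target`."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))

def analogy(a, b, c):
    """a is to b as c is to ?  Computed as vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    return nearest(target, exclude={a, b, c})

print(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]))  # related words: high
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["apple"]))  # unrelated words: low
print(analogy("man", "king", "woman"))
```

The analogy works because "man to king" and "woman to queen" are the same direction in this toy space. In a trained model nobody chooses the dimensions by hand; they emerge from training.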
At every generation step, the LLM computes a probability for every possible token in its vocabulary, and then picks one.
That’s how whole answers come about, token by token. The pick is sampled rather than fixed, so the same prompt intentionally doesn’t produce the same result every time.
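A minimal sketch of that "compute probabilities, then pick one" step. The four-word vocabulary and the raw scores are made up, and the `temperature` parameter is an assumption that the text above doesn't mention: it is a common knob for how random the pick is.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities. Lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical scores for the token after "The sky is"
vocab = ["blue", "clear", "falling", "banana"]
logits = [3.0, 2.0, 0.5, -2.0]

print([round(p, 3) for p in softmax(logits)])
print([sample_next_token(vocab, logits) for _ in range(5)])
print([sample_next_token(vocab, logits, temperature=0.1) for _ in range(5)])
```

With the default temperature the samples vary from run to run. With a very low temperature the most likely token wins almost every time.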
The Transformer is the architecture used to build essentially every major LLM since 2017. Simplified: a layered structure that progressively transforms a stream of tokens (as vectors) until the probabilities come out the other end.
The core is a stack of attention layers, typically 30 to 100 of them. With each layer the model’s understanding of the text gets a little deeper.
At each layer, for each token, the model decides: which of the previous tokens matter to me right now? That’s attention. (Encoder models like BERT look in both directions; modern chat LLMs are decoder-only and only look back.)
That’s how the model can grasp a whole paragraph at once and doesn’t forget what was at the beginning.
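The "which previous tokens matter to me" decision can be sketched as a weighted average. This is a heavily simplified version of a single attention head: it assumes the token vectors themselves serve as queries, keys and values, whereas a real Transformer learns separate projections for each and runs many heads in parallel. The function names and toy vectors are illustrative.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(vectors):
    """For each token, mix the vectors of itself and all EARLIER tokens.

    Simplification: no learned query/key/value projections, one head only.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # Decoder-only: token i may only look at positions 0..i, never ahead.
        visible = vectors[: i + 1]
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        weights = softmax(scores)  # how much each earlier token matters
        mixed = [sum(w * vec[d] for w, vec in zip(weights, visible))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (made up)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens)
for i, w in enumerate(weights):
    print(f"token {i} attends to earlier tokens with weights {[round(x, 2) for x in w]}")
```

The first token can only attend to itself. The last one spreads its attention over everything before it, which is how information from the start of a paragraph stays reachable.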
Each layer is made up of numbers, billions of them in total. These numbers are called parameters or weights. They encode everything the model “knows”: grammar, facts, style, code idioms.
Every one of these parameters gets gradually tuned during training. How? That’s the next act.
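To get a feel for how layer count and vector size add up to "billions", here is a back-of-the-envelope estimate. The formula (about 12 × layers × dimension²) is a common rule of thumb for standard Transformer blocks, not something taken from this text. It ignores the embedding table and architectural variations, so treat the results as orders of magnitude only.

```python
def estimate_parameters(num_layers: int, model_dim: int) -> int:
    """Rough rule of thumb: each layer holds about 12 * model_dim^2 weights.

    Roughly 4 * d^2 for the attention projections plus 8 * d^2 for the
    feed-forward part. Embeddings and biases are ignored.
    """
    return 12 * num_layers * model_dim ** 2

# Two configurations at the ends of the ranges mentioned above
small = estimate_parameters(num_layers=32, model_dim=4096)
large = estimate_parameters(num_layers=96, model_dim=12288)

print(f"{small / 1e9:.1f} billion parameters")
print(f"{large / 1e9:.1f} billion parameters")
```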
So far we’ve only seen the architecture, an empty machine. Now we fill those billions of parameters with knowledge.
In pretraining the model gets a vast text corpus (webpages, books, Wikipedia, code, forums, papers) and a single task: predict the next token.
Billions of times. Trillions of tokens. Side effect: to make the next token fit well, the model has to pick up grammar, facts, and logical reasoning along the way, not because anyone told it to, but because without those skills it couldn’t guess well enough.
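The training task itself is simple enough to sketch in a few lines. This sketch swaps the neural network for a plain counting table (a bigram model), which is an assumption made for illustration: a real LLM adjusts billions of weights by gradient descent instead of counting, but the training signal has the same shape, "given what came before, what comes next?".

```python
import math
from collections import Counter, defaultdict

def training_pairs(tokens):
    """Every position yields one example: (previous token, token to predict)."""
    return list(zip(tokens, tokens[1:]))

def train_bigram(tokens):
    """'Training' here is just counting which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in training_pairs(tokens):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def average_loss(counts, tokens):
    """Average negative log-probability of the true next token (cross-entropy)."""
    pairs = training_pairs(tokens)
    return -sum(math.log(next_token_probs(counts, p)[n]) for p, n in pairs) / len(pairs)

# Tiny made-up corpus, split on whitespace instead of real tokenization
corpus = "the cat sat on the mat . the cat ate .".split()
model = train_bigram(corpus)

print(training_pairs(corpus)[:3])
print(next_token_probs(model, "the"))
print(round(average_loss(model, corpus), 3))
```

Even this table has "picked up" that "cat" is the likelier word after "the". Scale the same objective up to trillions of tokens and a model with far more context, and grammar and facts come along as a side effect.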
After pretraining, the model has learned to continue Internet text, not to helpfully answer a specific question. A simple question to the raw model makes that clear:
The Internet is full of lists of similar questions, so the question gets continued as a list. Logical from a next-token perspective, useless as an answer.
In fine-tuning, the raw model gets further trained on a curated dataset of question-answer pairs. The pairs show the model the desired shape: question in, helpful answer out.
Today, millions of examples: human-written seed data and guidelines, but the bulk is synthetic, generated by stronger models (distillation).
The model learns the format, not new facts. Knowledge comes from pretraining; fine-tuning only shifts behaviour.
RLHF stands for Reinforcement Learning from Human Feedback. Sounds complicated; isn’t: the model writes several answers, someone says which is better, the model gets nudged in that direction.
The model answers the same question several times, in different variants.
Humans define the criteria. Increasingly, AI models handle the bulk of the comparisons (RLAIF, Constitutional AI).
Answers that often won become more likely; the others, less so.
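The "winners become more likely" mechanic can be sketched with one score per answer variant. This is a deliberate simplification and not how production RLHF is implemented: real systems typically train a separate reward model on the comparisons and then update the network's weights with a policy-optimisation method. The pairwise update below follows the Bradley-Terry preference model; the step size and the answer labels are made up.

```python
import math

def softmax(scores):
    peak = max(scores.values())
    exps = {k: math.exp(v - peak) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def apply_preferences(scores, comparisons, step=0.5):
    """Nudge scores so that preferred answers become more likely.

    comparisons: list of (winner, loser) pairs from a human or AI judge.
    """
    scores = dict(scores)  # leave the caller's dict untouched
    for winner, loser in comparisons:
        # Probability the current scores already assign to this outcome
        p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
        # The more surprising the outcome, the bigger the nudge
        nudge = step * (1.0 - p_win)
        scores[winner] += nudge
        scores[loser] -= nudge
    return scores

# Three answer variants to the same question, initially equally likely
start = {"polite and clear": 0.0, "correct but curt": 0.0, "rambling": 0.0}
judgements = [
    ("polite and clear", "correct but curt"),
    ("polite and clear", "rambling"),
    ("correct but curt", "rambling"),
]

after = apply_preferences(start, judgements)
print({k: round(p, 3) for k, p in softmax(start).items()})
print({k: round(p, 3) for k, p in softmax(after).items()})
```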
RLHF turns a well-read model into a useful tool. Politeness, clarity, safety guardrails: all of that gets tuned here.
The trained model is hosted on a server. When you send a request, your text moves through these stages, token by token, in real time.
The loop from steps 3–5 repeats for every token of the answer. That’s why you see answers pop up token by token.
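A minimal sketch of such a generation loop, assuming the repeated part is roughly "run the model on everything so far, pick the next token, append it". The lookup table stands in for a real model and is purely hypothetical, as are the `<start>` and `<end>` markers.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for the model: looks only at the last token."""
    table = {"<start>": "Hello", "Hello": ",", ",": "world", "world": "<end>"}
    return table.get(tokens[-1], "<end>")

def generate(prompt_tokens, next_token_fn, max_new_tokens=20, stop_token="<end>"):
    """Autoregressive loop: each new token is fed back in as input for the next one."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)   # model sees the prompt plus its own output so far
        if nxt == stop_token:
            break
        tokens.append(nxt)
        yield nxt                     # hand each token to the caller immediately

# Tokens arrive one at a time, which is what makes streaming output possible
for token in generate(["<start>"], toy_next_token):
    print(token, end=" ", flush=True)
print()
```

Writing `generate` as a generator mirrors what happens on screen: the caller can display each token as soon as it exists instead of waiting for the whole answer.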
What you experience today as “ChatGPT” or “Claude” is no longer just a model. It’s a model with memory, tools, a plan, and a deliberately crafted personality. How Claude comes across, how GPT “feels”, isn’t a training accident; it’s agent design by the provider.
Takes text, returns text. No memory of the previous request. A pure function.
Model + conversation memory + system instructions. Keeps the conversation in context.
Chat + role & behaviour brief + tools (web, files, code, APIs). Plans multiple steps, checks results, self-corrects.
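The difference between the bare model and a chat can be sketched in a few lines. The `model` function below is a hypothetical stand-in that just reports how many user messages it can see; the `Chat` class and its prompt layout are assumptions for illustration, not any provider's actual format.

```python
def model(text: str) -> str:
    """Stand-in for the bare model: a pure function of its input, no memory."""
    return f"I can see {text.count('user:')} user message(s)"

class Chat:
    """Model + conversation memory + system instructions."""

    def __init__(self, model_fn, system_instructions: str):
        self.model_fn = model_fn
        self.system_instructions = system_instructions
        self.history = []  # the memory lives here, outside the model

    def send(self, user_message: str) -> str:
        self.history.append(("user", user_message))
        # The whole conversation is re-sent on every turn
        transcript = "\n".join(f"{role}: {msg}" for role, msg in self.history)
        reply = self.model_fn(self.system_instructions + "\n" + transcript)
        self.history.append(("assistant", reply))
        return reply

# Bare model: same input, same output, nothing remembered between calls
print(model("user: hello"))
print(model("user: hello"))

# Chat: the wrapper keeps the history and feeds it back in
chat = Chat(model, "You are a helpful assistant.")
print(chat.send("hello"))
print(chat.send("what did I just say?"))
```

The model itself never changes between calls. What looks like memory is the wrapper replaying the transcript each time.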
During training the model learns when to call a function (e.g. web.search("…") or db.query("…")) and how to read the result.
An agent runs in a loop: plan → tool → read result → new plan. That’s how multi-step tasks like “book me the flight” happen.
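The plan → tool → read result → new plan loop can be sketched as follows. Everything here is hypothetical: the two tools return canned data, and `toy_planner` is a hand-written rule that stands in for the LLM's decision about which function to call next.

```python
def search_flights(destination):
    """Hypothetical tool: returns canned search results."""
    return [
        {"id": "F1", "to": destination, "price": 120},
        {"id": "F2", "to": destination, "price": 95},
    ]

def book_flight(flight_id):
    """Hypothetical tool: pretends to book."""
    return {"status": "booked", "flight": flight_id}

TOOLS = {"search_flights": search_flights, "book_flight": book_flight}

def toy_planner(goal, observations):
    """Stand-in for the LLM: picks the next action from what it has seen so far."""
    if not observations:
        return "search_flights", {"destination": goal["destination"]}
    last = observations[-1]
    if isinstance(last, list):  # search results came back: choose the cheapest
        cheapest = min(last, key=lambda flight: flight["price"])
        return "book_flight", {"flight_id": cheapest["id"]}
    return "done", last         # booking confirmation came back: finished

def run_agent(goal, planner, tools, max_steps=5):
    observations = []
    for _ in range(max_steps):                    # hard cap so the loop cannot run forever
        action, args = planner(goal, observations)  # plan
        if action == "done":
            return args
        result = tools[action](**args)              # call the tool
        observations.append(result)                 # read the result, then plan again
    raise RuntimeError("agent did not finish within max_steps")

print(run_agent({"destination": "Lisbon"}, toy_planner, TOOLS))
```

The step limit and the explicit tool registry are the kind of guardrails an agent loop usually needs, since the planner is not guaranteed to terminate or to ask only for tools that exist.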
The idea of training a neural network on the next word is old. Three things had to come together to turn the idea into a useful system.
The Transformer (2017) is parallelisable. Earlier recurrent architectures had to process text word by word, which was a training bottleneck.
The Internet as a corpus. For the first time, enough text in machine-readable form to meaningfully train billions of parameters.
GPUs, originally built for graphics, turn out to be well suited to matrix math, which is exactly what Transformers need.
Scale makes the difference. Many of today’s model capabilities (multi-step reasoning, writing code, multilingual answers) need a certain threshold of model and data size. That threshold drops with better data and training methods: what only 175B models could do two years ago, an 8B model can sometimes already manage today.
An LLM guesses the most likely next token. The result is often correct, but the model picks it because it sounds plausible, not because it is true. For facts: always verify.
Its knowledge is frozen at the moment of training. It only sees current data through tools: web search, database access, RAG.
In long proof and computation chains (complex maths, formal logic), each step can introduce a small error, and the errors compound.
What looks like understanding is an extremely well-learned statistical pattern: impressively useful, but not consciousness.
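The compounding of small errors in long chains, mentioned above, is easy to put into numbers. This sketch assumes every step succeeds independently with the same probability, which is a simplification; the 99% figure is made up for illustration.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step in a chain is correct, assuming independent steps."""
    return per_step_accuracy ** steps

for steps in (1, 10, 50, 100):
    chance = chain_success_probability(0.99, steps)
    print(f"{steps:>3} steps at 99% each: {chance:.1%} chance the whole chain is right")
```

Even a step that is right 99 times out of 100 leaves a 100-step derivation correct only about a third of the time, which is why long computations are better handed to a tool.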
Summarising, classifying, translating: capability that used to need specialist teams is now an API call away.
The base model is the same for everyone, including your competitors. What matters is the connection to your data, processes and tools.
Tool use, permissions, error handling: classical software engineering, just with a non-deterministic actor.
byte5 is a Frankfurt-based software company with over 20 years of experience, specialised in AI solutions that hold up at SMBs and enterprises when accuracy, security and integration are what matter.