byte5
DEMYSTIFIED

How LLMs
actually work

How modern language models really function: from tokens to agents

Made by byte5.ai · enterprise AI expertise
→ Arrow keys or click
byte5
A strange encounter

You’re chatting with a machine
that doesn’t understand a single word

Yet it writes code, summarises contracts, answers questions, and often sounds disturbingly smart.

How is that possible?

byte5
Let’s clear something up first

“AI” and LLMs are not the same thing

Classical AI

The old promise

AI research has been chasing one big goal since the 1950s: a machine that, like a human, understands and reasons, has a persistent consciousness, and pursues its own goals: the strong AI of pop culture.

Despite decades of research, the goal is still unsolved. Every phase of euphoria has so far been followed by a long winter.

LLMs

Something new

An LLM doesn’t think. It computes which sequence of words is statistically most likely to follow your input. The result often feels like understanding. It’s a simulation, though, not consciousness.

Precisely because the goal is smaller, things suddenly moved fast.

Both get casually called “AI”. If you want to understand LLMs, leave the strong-AI promise explicitly aside.

03 / 22
byte5
The crucial difference

Classical AI needs specialists
LLMs are generalists

Classical approach

  • Define the problem → design a matching architecture
  • Collect and label training data by hand
  • Train a model that does only this one thing
  • New problem? Start over.

LLM approach

  • Train one huge model on next-token prediction
  • Done. The same model handles anything expressible as text
  • Translate, code, reason, same underlying engine
  • New task? Just ask differently, or feed it more context.

Instead of a thousand small models for a thousand tasks: one large model as a universal language tool.

04 / 22

01

Part 1 of 4

The building blocks

What actually happens when you type a question into the input field and hit “Send”? Four steps, each one standard engineering today.

byte5
The basic idea

Computers work with numbers
So language becomes numbers

“Hello World”
Text, as we read it
[15496, 200, 25368]
Token IDs
[0.21, −0.04, 0.88,
… 4096 numbers …]
Vector (embedding)

Only after this translation can the model actually “do” anything. Let’s look at each step in turn.

05 / 22
byte5
Step 1: Tokenization

Text gets broken into building blocks

A token is usually a word fragment, not a whole word, not a single letter. That makes processing efficient: frequent words are a single token, rare ones get split into pieces.

0 tokens Rule of thumb: 1 English word ≈ 1.3 tokens (German needs ~1.5–2 because compound words get split)

Note: The split shown here is a simplified demonstration. Real models use methods like Byte-Pair Encoding and learn from billions of sentences which fragments frequently co-occur.

06 / 22
byte5
Step 2: Embedding

Each token becomes a vector

A vector is a long list of numbers. Think of them as coordinates for a point in a space. In modern models that space has 4,000 to 12,000 dimensions, that many numbers per token.

These numbers aren’t random. They’re tuned during training so that related words end up in the same region of the space.

Meaning becomes geometry. Words with similar meaning sit close together; opposites sit far apart.

Simplified 2D projection of a vector space
Dog Cat Horse ANIMALS Car Truck Train VEHICLES Joy Sadness Anger EMOTIONS
07 / 22
byte5
Why this is so powerful

Doing math with meaning

King Man + Woman Queen

Relationships between words (gender, plural, capital city…) become directions in the space. This classic example comes from Word2Vec (2013); in modern transformers the geometry is context-dependent, but the principle is the same. That’s how the model generalises to sentences that never appeared verbatim in the training data.

08 / 22
byte5
Step 3: The central task

The model does one thing: predict the next token

For every request, the LLM computes a probability for every possible token in its vocabulary, and then picks one.

Input
“The coffee is hot and I like to drink it most often in the ___
Probabilities for the next token
Morning
0%
Afternoon
0%
Evening
0%
Kitchen
0%
Garden
0%

That’s how whole answers come about, token by token. Click multiple times; the result is intentionally not the same every time.

09 / 22
byte5
Step 4: Architecture

The machine behind this is called the Transformer

The Transformer is the architecture used to build essentially every major LLM since 2017. Simplified: a layered structure that progressively transforms a stream of tokens (as vectors) until the probabilities come out the other end.

Tokens
Input
Embeddings
Vectors
N× Attention
Layers
Logits
Raw values
%
Proba­bilities
Token pick

The core piece is the attention layers, typically 30 to 100 of them, stacked. With each layer the model’s understanding of the text gets a little deeper.

10 / 22
byte5
The one trick that changed everything

Attention: Every token looks at all previous ones

At each layer, for each token, the model decides: which of the previous tokens matter to me right now? That’s attention. (Encoder models like BERT look in both directions; modern chat LLMs are decoder-only and only look back.)

Example: Which word does “she” refer to?
The cat chases the mouse , because she is hungry.
“she” can refer back to either “cat” or “mouse”. The model weighs both.

That’s how the model can grasp a whole paragraph at once and doesn’t forget what was at the beginning.

11 / 22
byte5
Where we are

An LLM is, at the end of the day, a huge table of numbers

These numbers are called parameters or weights. They encode everything the model “knows”: grammar, facts, style, code idioms.

1–10 B
Small / local
70–100 B
Mid-tier
1–3 T
Top-tier (estimated)

Every one of these parameters gets gradually tuned during training. How? That’s the next act.

12 / 22

02

Part 2 of 4

How the model learns

So far we’ve only seen the architecture, an empty machine. Now we fill those billions of parameters with knowledge.

byte5
Phase 1: Pretraining

Half the Internet as a teacher

In pretraining the model gets a vast text corpus (webpages, books, Wikipedia, code, forums, papers) and a single task.

“Predict the next token for me.”

Billions of times. Trillions of tokens. Side effect: to make the next token fit well, the model has to pick up grammar, facts, and logical reasoning along the way, not because anyone told it to, but because without those skills it couldn’t guess well enough.

Order of magnitude
~15 trillion
training tokens
10,000 to 100,000+
specialised GPUs in parallel
weeks to months
of continuous compute
~$100M to $1B
per top-model run
13 / 22
byte5
But careful

The raw model isn’t an assistant
It’s auto­complete on steroids

After pretraining, the model has learned to continue Internet text, not to helpfully answer a specific question. A simple question to the raw model makes that clear:

Question
“What’s the capital of France?”
Raw model replies
“What’s the capital of Italy? What’s the capital of Spain?…”
Question
“What is 2+2?”
Raw model replies
“A simple math question, often asked by second-grade teachers.”

The Internet is full of lists of similar questions, so the question gets continued as a list. Logical from a next-token perspective, useless as an answer.

14 / 22
byte5
Phase 2: Fine-tuning

Teaching it some manners

In fine-tuning, the raw model gets further trained on a curated dataset of question-answer pairs. The pairs show the model the desired shape: question in, helpful answer out.

Q
What’s the capital of France?
A
The capital of France is Paris.

Data

Today, millions of examples: human-written seed data and guidelines, but the bulk is synthetic, generated by stronger models (distillation).

Effect

The model learns the format, not new facts. Knowledge comes from pretraining; fine-tuning only shifts behaviour.

15 / 22
byte5
Phase 3: Learning from feedback

RLHF humans choose, the model adapts

Reinforcement Learning from Human Feedback. Sounds complicated; isn’t: the model writes several answers, someone says which is better, the model gets nudged in that direction.

1

Generate answers

The model answers the same question several times, in different variants.

2

Rate the answers

Humans define the criteria. Increasingly, AI models handle the bulk of the comparisons (RLAIF, Constitutional AI).

3

Adjust the model

Answers that often won become more likely; the others, less so.

RLHF turns a well-read model into a useful tool. Politeness, clarity, safety guardrails: all of that gets tuned here.

16 / 22
byte5
In production

Inference: What happens when you type

The trained model is hosted on a server. When you send a request, your text moves through these stages, token by token, in real time.

1Your text gets tokenised
2Tokens become embeddings
3Attention layers process the context
4Probabilities for the next token are computed
5Pick a token → emit → loop back to step 3

The loop from steps 3–5 repeats for every word of the answer. That’s why you see answers pop up token by token.

17 / 22

03

Part 3 of 4

From model to agent

What you experience today as “ChatGPT” or “Claude” is no longer just a model. It’s a model with memory, tools, a plan, and a deliberately crafted personality. How Claude comes across, how GPT “feels”, isn’t a training accident; it’s agent design by the provider.

byte5
Three stages, one trend

Model → Chat assistant → Agent

1

Language model

Takes text, returns text. No memory of the previous request. A pure function.

2

Chat assistant

Model + conversation memory + system instructions. Keeps the conversation in context.

3

Agent

Chat + role & behaviour brief + tools (web, files, code, APIs). Plans multiple steps, checks results, self-corrects.

Tools are functions

During training the model learns when to call a function (e.g. web.search("…") or db.query("…")) and how to read the result.

Loop instead of single answer

An agent runs in a loop: plan → tool → read result → new plan. That’s how multi-step tasks like “book me the flight” happen.

18 / 22
byte5
The reveal

Why did LLMs suddenly appear?

The idea of training a neural network on the next word is old. Three things had to come together to turn the idea into a useful system.

A

Architecture

The Transformer (2017) is parallelisable. Earlier architectures had to compute word by word, a training bottleneck.

B

Data

The Internet as a corpus. For the first time, enough text in machine-readable form to meaningfully train billions of parameters.

C

Hardware

GPUs, originally built for graphics, turn out to be well suited to matrix math, which is exactly what Transformers need.

Scale makes the difference. Many of today’s model capabilities (multi-step reasoning, writing code, multilingual answers) need a certain threshold of model and data size. That threshold drops with better data and training methods: what only 175B models could do two years ago, an 8B model can sometimes already manage today.

19 / 22
byte5
Being honest

What LLMs cannot do

Guarantee truth

An LLM guesses the most likely next token. That’s often right, but not because it’s right; just because it sounds plausible. For facts: always verify.

Learn in real time

Its knowledge is frozen at the moment of training. It only sees current data through tools: web search, database access, RAG.

Long logic chains

In long proof and computation chains (complex maths, formal logic), each step makes a small mistake, and the mistakes compound.

Understand like a human

What looks like understanding is an extremely well-learned statistical pattern: impressively useful, but not consciousness.

20 / 22
byte5
In practice

What this means for your business

Language tasks are now a commodity.

Summarising, classifying, translating: capability that used to need specialist teams is now an API call away.

The value lives in the context.

The base model is the same everywhere. What matters is the connection to your data, processes and tools.

Agents need guardrails.

Tool use, permissions, error handling: classical software engineering, just with a non-deterministic actor.

The right questions to ask yourself

Where in our processes is language the bottleneck?
Which of our data may an LLM see, and under what conditions?
What happens when the model gets it wrong?
What business value justifies what effort?
21 / 22
byte5
Who we are

We build Agentic AI Solutions
for business-critical scenarios

byte5 is a Frankfurt-based software company with over 20 years of experience, specialised in AI solutions that hold up at SMBs and enterprises when accuracy, security and integration are what matter.

LLM integration
From prototype to production-grade, verified application in your infrastructure.
Agent systems
Tool integration, permissions, guardrails, observability. Engineering, not demos.
Strategy & roadmap
Use-case assessment, data protection, architecture. We help you tell where it pays off, and where it doesn’t.
Schedule a call Learn more about byte5.ai
22 / 22