byte5

A strange encounter

You’re chatting with a machine
that doesn’t understand a single word

Yet it writes code, summarises contracts, answers questions, and often sounds disturbingly smart.

How is that possible?

byte5

Let’s clear something up first

“AI” and LLMs are not the same thing

Classical AI

The old promise

AI research has been chasing one big goal since the 1950s: a machine that, like a human, understands and reasons, has a persistent consciousness, and pursues its own goals: the strong AI of pop culture.

Despite decades of research, the goal is still unsolved. Every phase of euphoria has so far been followed by a long winter.

LLMs

Something new

An LLM doesn’t think. It computes which sequence of words is statistically most likely to follow your input. The result often feels like understanding. It’s a simulation, though, not consciousness.

Precisely because the goal is smaller, things suddenly moved fast.

Both get casually called “AI”. If you want to understand LLMs, leave the strong-AI promise explicitly aside.

03 / 22

byte5

The crucial difference

Classical AI needs specialists
LLMs are generalists

Classical approach

Define the problem → design a matching architecture
Collect and label training data by hand
Train a model that does only this one thing
New problem? Start over.

LLM approach

Train one huge model on next-token prediction
Done. The same model handles anything expressible as text
Translate, code, reason, same underlying engine
New task? Just ask differently, or feed it more context.

Instead of a thousand small models for a thousand tasks: one large model as a universal language tool.

04 / 22

byte5

The basic idea

Computers work with numbers
So language becomes numbers

“Hello World”

Text, as we read it

→

[15496, 200, 25368]

Token IDs

→

[0.21, −0.04, 0.88,
… 4096 numbers …]

Vector (embedding)

Only after this translation can the model actually “do” anything. Let’s look at each step in turn.

05 / 22

byte5

Step 1: Tokenization

Text gets broken into building blocks

A token is usually a word fragment, not a whole word, not a single letter. That makes processing efficient: frequent words are a single token, rare ones get split into pieces.

Input text:

0 tokens Rule of thumb: 1 English word ≈ 1.3 tokens (German needs ~1.5–2 because compound words get split)

Note: The split shown here is a simplified demonstration. Real models use methods like Byte-Pair Encoding and learn from billions of sentences which fragments frequently co-occur.

06 / 22

byte5

Step 2: Embedding

Each token becomes a vector

A vector is a long list of numbers. Think of them as coordinates for a point in a space. In modern models that space has 4,000 to 12,000 dimensions, that many numbers per token.

These numbers aren’t random. They’re tuned during training so that related words end up in the same region of the space.

Meaning becomes geometry. Words with similar meaning sit close together; opposites sit far apart.

Simplified 2D projection of a vector space

07 / 22

byte5

Why this is so powerful

Doing math with meaning

King − Man + Woman ≈ Queen

Relationships between words (gender, plural, capital city…) become directions in the space. This classic example comes from Word2Vec (2013); in modern transformers the geometry is context-dependent, but the principle is the same. That’s how the model generalises to sentences that never appeared verbatim in the training data.

08 / 22

byte5

Step 3: The central task

The model does one thing: predict the next token

For every request, the LLM computes a probability for every possible token in its vocabulary, and then picks one.

Input

“The coffee is hot and I like to drink it most often in the ___”

Probabilities for the next token

Morning

Afternoon

Evening

Kitchen

Garden

That’s how whole answers come about, token by token. Click multiple times; the result is intentionally not the same every time.

09 / 22

byte5

Step 4: Architecture

The machine behind this is called the Transformer

The Transformer is the architecture used to build essentially every major LLM since 2017. Simplified: a layered structure that progressively transforms a stream of tokens (as vectors) until the probabilities come out the other end.

→

Tokens

Input

⬡

Embeddings

Vectors

⌬

N× Attention

Layers

∑

Logits

Raw values

Probabilities

Token pick

The core piece is the attention layers, typically 30 to 100 of them, stacked. With each layer the model’s understanding of the text gets a little deeper.

10 / 22

byte5

The one trick that changed everything

Attention: Every token looks at all previous ones

At each layer, for each token, the model decides: which of the previous tokens matter to me right now? That’s attention. (Encoder models like BERT look in both directions; modern chat LLMs are decoder-only and only look back.)

Example: Which word does “she” refer to?

The cat chases the mouse , because she is hungry.

“she” can refer back to either “cat” or “mouse”. The model weighs both.

That’s how the model can grasp a whole paragraph at once and doesn’t forget what was at the beginning.

11 / 22

byte5

Where we are

An LLM is, at the end of the day, a huge table of numbers

These numbers are called parameters or weights. They encode everything the model “knows”: grammar, facts, style, code idioms.

1–10 B

Small / local

70–100 B

Mid-tier

1–3 T

Top-tier (estimated)

Every one of these parameters gets gradually tuned during training. How? That’s the next act.

12 / 22

byte5

Phase 1: Pretraining

Half the Internet as a teacher

In pretraining the model gets a vast text corpus (webpages, books, Wikipedia, code, forums, papers) and a single task.

“Predict the next token for me.”

Billions of times. Trillions of tokens. Side effect: to make the next token fit well, the model has to pick up grammar, facts, and logical reasoning along the way, not because anyone told it to, but because without those skills it couldn’t guess well enough.

Order of magnitude

~15 trillion

training tokens

10,000 to 100,000+

specialised GPUs in parallel

weeks to months

of continuous compute

~$100M to $1B

per top-model run

13 / 22

byte5

But careful

The raw model isn’t an assistant
It’s autocomplete on steroids

After pretraining, the model has learned to continue Internet text, not to helpfully answer a specific question. A simple question to the raw model makes that clear:

Question

“What’s the capital of France?”

Raw model replies

“What’s the capital of Italy? What’s the capital of Spain?…”

Question

“What is 2+2?”

Raw model replies

“A simple math question, often asked by second-grade teachers.”

The Internet is full of lists of similar questions, so the question gets continued as a list. Logical from a next-token perspective, useless as an answer.

14 / 22

byte5

Phase 2: Fine-tuning

Teaching it some manners

In fine-tuning, the raw model gets further trained on a curated dataset of question-answer pairs. The pairs show the model the desired shape: question in, helpful answer out.

What’s the capital of France?

The capital of France is Paris.

Data

Today, millions of examples: human-written seed data and guidelines, but the bulk is synthetic, generated by stronger models (distillation).

Effect

The model learns the format, not new facts. Knowledge comes from pretraining; fine-tuning only shifts behaviour.

15 / 22

byte5

Phase 3: Learning from feedback

RLHF — humans choose, the model adapts

Reinforcement Learning from Human Feedback. Sounds complicated; isn’t: the model writes several answers, someone says which is better, the model gets nudged in that direction.

Generate answers

The model answers the same question several times, in different variants.

Rate the answers

Humans define the criteria. Increasingly, AI models handle the bulk of the comparisons (RLAIF, Constitutional AI).

Adjust the model

Answers that often won become more likely; the others, less so.

RLHF turns a well-read model into a useful tool. Politeness, clarity, safety guardrails: all of that gets tuned here.

16 / 22

byte5

In production

Inference: What happens when you type

The trained model is hosted on a server. When you send a request, your text moves through these stages, token by token, in real time.

1Your text gets tokenised—

2Tokens become embeddings—

3Attention layers process the context—

4Probabilities for the next token are computed—

5Pick a token → emit → loop back to step 3—

The loop from steps 3–5 repeats for every word of the answer. That’s why you see answers pop up token by token.

17 / 22

03

Part 3 of 4

From model to agent

What you experience today as “ChatGPT” or “Claude” is no longer just a model. It’s a model with memory, tools, a plan, and a deliberately crafted personality. How Claude comes across, how GPT “feels”, isn’t a training accident; it’s agent design by the provider.

byte5

Three stages, one trend

Model → Chat assistant → Agent

Language model

Takes text, returns text. No memory of the previous request. A pure function.

Chat assistant

Model + conversation memory + system instructions. Keeps the conversation in context.

Agent

Chat + role & behaviour brief + tools (web, files, code, APIs). Plans multiple steps, checks results, self-corrects.

Tools are functions

During training the model learns when to call a function (e.g. web.search("…") or db.query("…")) and how to read the result.

Loop instead of single answer

An agent runs in a loop: plan → tool → read result → new plan. That’s how multi-step tasks like “book me the flight” happen.

18 / 22

byte5

The reveal

Why did LLMs suddenly appear?

The idea of training a neural network on the next word is old. Three things had to come together to turn the idea into a useful system.

Architecture

The Transformer (2017) is parallelisable. Earlier architectures had to compute word by word, a training bottleneck.

Data

The Internet as a corpus. For the first time, enough text in machine-readable form to meaningfully train billions of parameters.

Hardware

GPUs, originally built for graphics, turn out to be well suited to matrix math, which is exactly what Transformers need.

Scale makes the difference. Many of today’s model capabilities (multi-step reasoning, writing code, multilingual answers) need a certain threshold of model and data size. That threshold drops with better data and training methods: what only 175B models could do two years ago, an 8B model can sometimes already manage today.

19 / 22

byte5

Being honest

What LLMs cannot do

Guarantee truth

An LLM guesses the most likely next token. That’s often right, but not because it’s right; just because it sounds plausible. For facts: always verify.

Learn in real time

Its knowledge is frozen at the moment of training. It only sees current data through tools: web search, database access, RAG.

Long logic chains

In long proof and computation chains (complex maths, formal logic), each step makes a small mistake, and the mistakes compound.

Understand like a human

What looks like understanding is an extremely well-learned statistical pattern: impressively useful, but not consciousness.

20 / 22

byte5

In practice

What this means for your business

Language tasks are now a commodity.

Summarising, classifying, translating: capability that used to need specialist teams is now an API call away.

The value lives in the context.

The base model is the same everywhere. What matters is the connection to your data, processes and tools.

Agents need guardrails.

Tool use, permissions, error handling: classical software engineering, just with a non-deterministic actor.

The right questions to ask yourself

Where in our processes is language the bottleneck?

Which of our data may an LLM see, and under what conditions?

What happens when the model gets it wrong?

What business value justifies what effort?

21 / 22

byte5

Who we are

We build Agentic AI Solutions
for business-critical scenarios

byte5 is a Frankfurt-based software company with over 20 years of experience, specialised in AI solutions that hold up at SMBs and enterprises when accuracy, security and integration are what matter.

LLM integration

From prototype to production-grade, verified application in your infrastructure.

Agent systems

Tool integration, permissions, guardrails, observability. Engineering, not demos.

Strategy & roadmap

Use-case assessment, data protection, architecture. We help you tell where it pays off, and where it doesn’t.

Schedule a call Learn more about byte5.ai

22 / 22

How LLMsactually work

You’re chatting with a machinethat doesn’t understand a single word

“AI” and LLMs are not the same thing

The old promise

Something new

Classical AI needs specialistsLLMs are generalists

Classical approach

LLM approach

01

The building blocks

Computers work with numbersSo language becomes numbers

Text gets broken into building blocks

Each token becomes a vector

Doing math with meaning

The model does one thing: predict the next token

The machine behind this is called the Transformer

Attention: Every token looks at all previous ones

An LLM is, at the end of the day, a huge table of numbers

02

How the model learns

Half the Internet as a teacher

The raw model isn’t an assistantIt’s auto­complete on steroids

Teaching it some manners

Data

Effect

RLHF — humans choose, the model adapts

Generate answers

Rate the answers

Adjust the model

Inference: What happens when you type

03

From model to agent

Model → Chat assistant → Agent

Language model

Chat assistant

Agent

Tools are functions

Loop instead of single answer

Why did LLMs suddenly appear?

Architecture

Data

Hardware

What LLMs cannot do

Guarantee truth

Learn in real time

Long logic chains

Understand like a human

What this means for your business

The right questions to ask yourself

We build Agentic AI Solutionsfor business-critical scenarios

How LLMs
actually work

You’re chatting with a machine
that doesn’t understand a single word

Classical AI needs specialists
LLMs are generalists

Computers work with numbers
So language becomes numbers

The raw model isn’t an assistant
It’s autocomplete on steroids

We build Agentic AI Solutions
for business-critical scenarios