A0002 · AI — FUNDAMENTALS

How LLMs actually think — and why it changes how you should use them

By Diego Diniz · May 13, 2026 · 6 min read

Everyone treats LLMs like magic boxes. Understanding how the mechanism works under the hood (attention, tokens, probability) radically changes how you should use them.

Key Takeaways

01 LLMs do not think in the human sense: they predict which token is likely to come next. Understanding this changes how you structure every interaction.
02 Context is measured in tokens, not words. 128K tokens sounds like a lot, but quality tends to degrade as you approach that limit.
03 Prompt structure matters more than word choice. Order and organization beat eloquence every single time.
04 Temperature is the control almost nobody calibrates. Getting it wrong is like driving in the wrong gear.
05 The practical test: before sending any prompt, ask whether the context is sufficient and in the right order.

Why understanding the mechanism matters

Most people treat LLMs like oracles: throw in a question, wait for an answer. Does it work? Sometimes. But it is like using a Formula 1 car just to grab coffee — you are not tapping into even 10% of its potential.

The problem is not the tool. It is the mental model of whoever is using it. And mental models only shift when you understand the mechanism.

What an LLM actually does (no marketing)

An LLM is a conditional probability machine. Given a sequence of tokens, it assigns a probability to every possible next token, and the next token is picked from that distribution. Period. It does not think, reason, or understand, at least not in the human sense.
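To make "conditional probability machine" concrete, here is a minimal sketch of the last step of a prediction: turning raw scores into probabilities over candidate next tokens. The candidate tokens and their scores are invented for illustration. A real model scores its whole vocabulary, typically tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores a model might assign after the context "The cat sat on the".
# These numbers are made up; they do not come from any real model.
toy_logits = {" mat": 4.0, " sofa": 3.0, " roof": 2.0, " equation": -1.0}

next_token_probs = softmax(toy_logits)
most_likely = max(next_token_probs, key=next_token_probs.get)

for token, prob in sorted(next_token_probs.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {prob:.3f}")
print("most likely next token:", repr(most_likely))
```

Everything you put in the prompt shifts these scores. That is the whole lever you have.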

But here is the point almost nobody talks about: not thinking like a human does not mean it is dumb. It means it works differently. And whoever understands that difference gets results that look like magic to those who do not.

Tokens: the unit that changes everything

You do not talk to an LLM in words. You talk in tokens: pieces of words, sometimes a full word, sometimes half. Depending on the tokenizer, "Processing" might become 2 tokens while "AI" is 1.
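The splitting idea can be sketched with a toy tokenizer. This is a simplified greedy longest-match scheme over a hypothetical vocabulary, chosen only to reproduce the example above. Production tokenizers (byte-pair encoding and similar) learn their vocabularies from data and may split the same words differently.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of text into pieces found in vocab."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No known piece starts here: emit the single character as a token.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary, not taken from any real model.
TOY_VOCAB = {"Process", "ing", "AI", "token", "s"}

print(tokenize("Processing", TOY_VOCAB))
print(tokenize("AI", TOY_VOCAB))
```

The practical takeaway: rare words, names, and unusual formatting tend to break into more pieces, so they consume more of the context than their word count suggests.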

Why does this matter? Because the model's context is measured in tokens, not words. When someone says "128K token context", they mean the model can hold roughly 100 thousand words of English text in working memory. But careful: holding in memory does not mean processing with equal quality throughout.
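The words-to-tokens arithmetic can be sketched as a quick budget check. The 0.75 words-per-token ratio is an assumed rule of thumb for English prose, not a property of any particular model, and the function names and the reply reserve are hypothetical. For anything that matters, count with the model's actual tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # assumed rule of thumb for English prose, not a constant

def estimated_tokens(text):
    """Rough token estimate from a whitespace word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_context(text, context_tokens=128_000, reserved_for_reply=4_000):
    """Check whether text likely fits while leaving room for the model's answer."""
    return estimated_tokens(text) <= context_tokens - reserved_for_reply

# Words that fit in a 128K-token window under the assumed ratio.
approx_words = int(128_000 * WORDS_PER_TOKEN)
print(approx_words)

document = "word " * 50_000
print(estimated_tokens(document), fits_in_context(document))
```

Code, tables, and languages other than English usually need more tokens per word, so treat the estimate as optimistic.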

Attention: where the magic happens (and where it breaks)

The attention mechanism is the heart of the transformer. It lets each token "look at" the other tokens in the context (in GPT-style models, the ones that came before it) and decide which ones are relevant.
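Here is a minimal sketch of what "looking at" other tokens means numerically: scaled dot-product attention written in plain Python. The three input vectors are made-up numbers, and the same vectors are used as queries, keys, and values to keep it short. A real model uses learned projections, many attention heads, and far larger dimensions. The causal mask is an assumption that matches GPT-style decoders.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # With a causal mask, position i only sees positions 0..i.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in visible
        ]
        weights = softmax(scores)  # how much position i attends to each visible position
        mixed = [
            sum(w * values[j][d] for w, j in zip(weights, visible))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)

for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Each output is a weighted blend of the tokens the model could see, which is why what you include and where you put it changes what every later token is built from.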

In practice, this means the order and structure of your prompt matter far more than the words you pick. A well-structured prompt with simple words beats an elaborate but poorly organized prompt every single time.

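And here is the bottleneck nobody mentions: standard attention has quadratic cost. Doubling the context does not double the attention computation, it roughly quadruples it. That is one reason models with massive context windows get slower as the context grows. Accuracy can also drop on very long inputs, but that is a separate effect from the compute cost.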
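The quadratic growth is easy to check with arithmetic. This sketch counts only the query-key pairs in one full self-attention pass, ignoring everything else a model does. Real systems soften the cost with caching and optimized kernels, so read the numbers as a trend rather than a benchmark.

```python
def attention_pairs(context_tokens):
    """Number of query-key scores in one full self-attention pass."""
    return context_tokens * context_tokens

baseline = attention_pairs(8_000)
for n in (8_000, 16_000, 32_000, 128_000):
    ratio = attention_pairs(n) / baseline
    print(f"{n:>7} tokens -> {attention_pairs(n):>17,} pairs ({ratio:.0f}x the 8K cost)")
```

Going from 8K to 128K tokens is 16 times more context but 256 times more pairs to score.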

Temperature and sampling: the control you are ignoring

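When you adjust an LLM's "temperature", you are controlling how random vs. deterministic its token choices are. Temperature 0 means always picking the most probable token. Temperature 1 means sampling from the model's probability distribution as-is. Values above 1 flatten the distribution, so unlikely tokens get picked more often.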
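A minimal sketch of how temperature reshapes the distribution before a token is sampled. The scores are invented, and treating temperature 0 as plain greedy selection is a common convention assumed here rather than a guarantee about any specific API.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Return next-token probabilities after temperature scaling."""
    if temperature == 0:
        # Greedy decoding: all probability goes to the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical scores for three candidate tokens.
toy_logits = {" mat": 4.0, " sofa": 3.0, " roof": 2.0}
rng = random.Random(0)  # fixed seed so reruns give the same samples

for t in (0, 0.3, 1.0, 1.5):
    probs = apply_temperature(toy_logits, t)
    picks = [sample(probs, rng) for _ in range(5)]
    summary = ", ".join(f"{tok!r}={p:.2f}" for tok, p in probs.items())
    print(f"T={t}: {summary} | samples: {picks}")
```

Low temperature concentrates probability on the top token, and high temperature spreads it out. That is the entire mechanism behind "precise" versus "creative" settings.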

The most common mistake: using high temperature for tasks that need precision (data extraction, code) and low temperature for creative tasks (brainstorming, writing). It is the equivalent of being in the wrong gear.

What changes in practice once you get this

Three things change immediately:

Prompt structure becomes the priority. You stop optimizing words and start optimizing the information architecture going into the model.
You stop asking and start guiding. Instead of "give me the answer", you set up the context so the most probable output is already what you want.
You calibrate expectations. You know when the model will be good (tasks with clear patterns) and when it will struggle (multi-step logical reasoning, exact math).
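The first two shifts can be sketched as a small prompt builder that fixes the order of information instead of polishing the wording. The function name, the section labels, and the order itself (role, context, task, output format) are assumptions for illustration. The point is only that the structure is decided on purpose.

```python
def build_prompt(role, context_blocks, task, output_format):
    """Assemble a prompt in a fixed order: role, context, task, output format."""
    sections = [f"ROLE\n{role.strip()}"]
    for title, body in context_blocks:
        sections.append(f"CONTEXT: {title}\n{body.strip()}")
    sections.append(f"TASK\n{task.strip()}")
    sections.append(f"OUTPUT FORMAT\n{output_format.strip()}")
    return "\n\n".join(sections)

# Hypothetical example content.
prompt = build_prompt(
    role="You are a support analyst.",
    context_blocks=[
        ("Refund policy", "Refunds are available within 30 days of purchase."),
        ("Customer message", "I bought the plan 12 days ago and want my money back."),
    ],
    task="Decide whether this request qualifies for a refund.",
    output_format="One line: APPROVE or DENY, then a one-sentence reason.",
)
print(prompt)
```

By the time the model reaches the task, everything it needs is already in context, so the most probable continuation is the answer you were after.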

The Monday test

Next time you open ChatGPT or Claude, before typing anything, ask yourself: "Am I giving enough context, in the right order, so that the most probable token is what I want?" If the answer is no, restructure before hitting send.

This is not about prompt engineering. It is about understanding the machine on the other side.

#llm #transformers #how-it-works #tokens #attention #applied-ai

FAQ

Do LLMs actually think or just repeat patterns?

Neither. LLMs compute conditional probabilities — given the context, what is the most likely next token. It is not brute repetition (the model generalizes patterns), but it is not reasoning in the human sense either.

Why does prompt structure matter more than word choice?

The attention mechanism processes relationships between every token in context. Well-organized information creates clearer patterns for the model, resulting in more precise responses — regardless of how fancy the vocabulary is.

What is temperature and when should I adjust it?

Temperature controls the randomness of token selection. Use low (0-0.3) for precise tasks like data extraction or code. Use high (0.7-1.0) for brainstorming and creative writing. The most common mistake is getting this backwards.

About the author

Diego Diniz

Nexialist & Writer

A nexialist who connects unlikely disciplines to create new things. He believes AI is the biggest lever in the world, but it only multiplies something if you know what to put on the other side.
