From Mazes to Minds: How Trees, Keys, Queries, and Attention Shape Modern AI
Sigmoiding a "city with uncertain walls": from decision trees to neural networks
🧠 Introduction — From Mazes to Living Cities
Imagine entering a city where walls and streets aren’t fixed. Some corridors may appear or vanish, doors open partially or fully, and each decision about which street to take is uncertain.
This is like modern AI models: given a huge input space, the system must decide which paths to follow, which experts to consult, and how much attention to pay, often without knowing the full structure in advance.
🧠 Introduction — The Brain vs the Maze
Alternate intro (usually there are alternate endings, but…)
Imagine wandering a giant maze. At each fork, you decide which way to go based on clues you’ve learned. This is like a decision tree — each intersection is a test, and every path leads to a conclusion.
Modern AI models, like large language models (LLMs), face a similar challenge: given a huge input space, how do they decide which paths to follow, which experts to consult, and how much attention to pay?
This article ties together:
Decision trees and their neural network analogues
Sigmoid scaling and soft/hard decisions
Attention (query/key/value)
Ensembles and mixture-of-experts
Routing in LLMs
All explained through the metaphor of a maze, or a city with uncertain walls, and illustrated with intuition hinges.
🌆 Section 1 — Streets, Buildings, and Knowledge
In a normal maze, paths are fixed. In our city:
Streets may shift, walls may appear or vanish
Each decision point is uncertain
You have to learn which streets are worth exploring
or
🌳 Section 1 — Trees: The Maze of Decisions
A decision tree looks like a maze where each decision test splits the world:
At each node, we ask “Is this true?”
If yes → go one way; if no → another
Only one path is active for any input
Maze Analogy:
Each fork is a test, and once you pick a corridor, you exclude all others — this is why trees are sparse.
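To make "only one path is active" concrete, here is a minimal sketch of a hard decision tree as nested if/else tests. The features (humidity, wind_speed) and thresholds are invented purely for illustration.

```python
# A minimal hard decision tree: each test opens exactly one corridor.
# Features and thresholds are invented for illustration.

def decide(humidity: float, wind_speed: float) -> str:
    if humidity > 0.7:           # fork 1: is the air humid?
        if wind_speed > 20.0:    # fork 2: is it also windy?
            return "storm"
        return "rain"
    return "clear"               # every other corridor stays closed

print(decide(humidity=0.8, wind_speed=25.0))  # storm
print(decide(humidity=0.3, wind_speed=5.0))   # clear
```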
🧩 Weights & Biases: The Hidden Knowledge
A tree’s structure is just the skeleton. To navigate it, the model needs weights and biases — the hidden knowledge of which doors to push and when.
Weights (w) → determine which features matter at this decision fork (the "key" to open the door)
Biases (b) → determine the threshold or starting point where the decision flips (the "door handle height")
Mathematically:
h(x)=σ(α(w⋅x+b))
City with Uncertain Walls Analogy:
Weights (w) → which streets matter (features that guide you)
Biases (b) → where a decision flips (threshold for opening a street)
Scaling (α) → how decisively you trust a predicted street (soft/hard doors)
Without learning these, the city is just a fog — you don’t know which way leads anywhere.
Maze Analogy:
Weights = levers that determine which corridor matters
Bias = the handle height: how far you must push the lever before the door opens
Scaling α = how sharply the door swings open
Without weights and bias, the maze is just walls. Together, they encode the actual knowledge of which paths to take.
This also explains why, if the tree is unknown, the network must learn weights and biases. Layers alone give capacity, but weights and biases encode the actual decision knowledge.
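As a sketch of what "learning the knowledge" looks like, here is a tiny gradient-descent loop that fits a single soft door h(x) = σ(α(w⋅x + b)) to a hidden 1-D threshold rule. The data, learning rate, and fixed α are all made up for the demo.

```python
import numpy as np

# Toy data: the "true" door opens when x > 0.5 (a hidden threshold).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = (x > 0.5).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.1, 0.0   # start with no knowledge of the door
alpha = 1.0       # door sharpness, kept fixed for this demo
lr = 1.0          # learning rate, chosen arbitrarily

for _ in range(2000):
    h = sigmoid(alpha * (w * x + b))   # how far each door swings open, in [0, 1]
    grad = (h - y) * alpha             # gradient of cross-entropy loss w.r.t. w*x + b
    w -= lr * np.mean(grad * x)        # nudge which feature matters
    b -= lr * np.mean(grad)            # nudge where the door flips

print(f"learned flip point -b/w ≈ {-b / w:.2f}")  # should land near 0.5
```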
🔁 Section 2 — Scaling: Doors That Swing Soft or Hard
Once we understand weights and biases as knowledge, we can discuss how decisively a door swings.
A sigmoid activation models a soft switch:
σ(z) = 1 / (1 + e^(−z))
Multiplying the pre-activation by α controls the steepness:
σ(α(w⋅x+b))
Small α → soft door: multiple paths partially open
Large α → hard door: path is almost fully open or fully closed, mimicking a real tree branch
Together, weights + biases + scaling = the learned guidance system for navigating the maze.
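A quick numeric sketch of soft vs. hard doors: the same pre-activations z = w⋅x + b pushed through σ(αz) for a few values of α (the values are arbitrary demo choices).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # pre-activations w.x + b

for alpha in (0.5, 2.0, 10.0):             # arbitrary demo values
    print(f"alpha = {alpha:4.1f} ->", np.round(sigmoid(alpha * z), 3))

# alpha =  0.5: doors barely lean open or shut (soft)
# alpha =  2.0: clearer preferences
# alpha = 10.0: doors snap to ~0 or ~1 (hard, tree-like)
```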
💡 Section 3 — Enter Attention: Spotlight in the Maze
Now imagine instead of picking one path, you can shine a spotlight on multiple corridors at once. This is attention in Transformers.
Query / Key / Value analogy:
Query (Q) = the input vector asking “where should I go?”
Key (K) = each node’s weights + bias (what path this node represents)
Value (V) = the leaf output if the path is chosen
Softmax weights = fraction of focus each path receives
Sigmoid + scaling = soft/hard path activation in a tree-like network
Intuition:
Sigmoid = probability of a single door opening
Softmax = distributing attention across multiple doors
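Here is a minimal numpy sketch of scaled dot-product attention, the mechanism behind the spotlight: one query scored against several keys, with softmax spreading the focus. Shapes and random values are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4                           # key/query dimension (arbitrary)

Q = rng.normal(size=(d,))       # one query: "where should I go?"
K = rng.normal(size=(5, d))     # five corridors, each advertising a key
V = rng.normal(size=(5, 2))     # each corridor's value (its leaf output)

scores = K @ Q / np.sqrt(d)     # how well each key matches the query
weights = softmax(scores)       # the spotlight: fractions summing to 1
output = weights @ V            # a blend of values, not a single path

print("attention weights:", np.round(weights, 3))
print("blended output:  ", np.round(output, 3))
```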
🎯 Section 4 — Ensembles & Experts: Many Minds, One Output
A random forest is like a panel of maze solvers, each solving slightly differently and voting on the best exit.
Modern LLMs sometimes use Mixture-of-Experts (MoE):
The output of an MoE layer is a gated sum over experts:
y(x) = Σ_e g_e(x) · f_e(x), for e = 1 … E, where:
E = number of experts
f_e(x) = an expert network (a specialized solver), producing that expert's output
g_e(x) = the router's gating weight, deciding how much each expert contributes
Soft/hard gating = weighting all experts vs. choosing only the top few
Ensembles and MoE are just many specialized “guides” collaborating, similar to how multiple paths or experts in a maze may contribute to the final decision.
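A toy sketch of the gated sum above: E linear "experts" and a softmax router, mixing every expert's answer. All shapes and weights are invented for the demo.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
E, d_in, d_out = 4, 3, 2                       # expert count and sizes (arbitrary)

W_experts = rng.normal(size=(E, d_out, d_in))  # each expert f_e: a linear map
W_router = rng.normal(size=(E, d_in))          # router producing gate logits

x = rng.normal(size=(d_in,))                   # one token's input vector

gates = softmax(W_router @ x)                  # g_e(x): trust in each guide
expert_outs = W_experts @ x                    # f_e(x) for all experts, shape (E, d_out)
y = gates @ expert_outs                        # y(x) = sum_e g_e(x) * f_e(x)

print("gates:", np.round(gates, 3))
print("mixed output:", np.round(y, 3))
```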
🧠 Section 5 — Routing in LLMs: Dynamic Decision Paths
Large models use routing at different granularities:
Concept → Analogy in the maze
Soft attention → shining a spotlight on many paths simultaneously
Sparse MoE → sending each token to only a few specialized guides
Soft tree networks → learning structured soft decisions along corridors
All these are ways to answer the same question:
“Given this input, which paths or experts should we consult to get the right answer?”
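And a sketch of the sparse answer to that question: keep only the top-k router scores, renormalize them, and run only those experts, so the unlit corridors cost no compute. The value of k and the example logits are arbitrary.

```python
import numpy as np

def top_k_gates(logits: np.ndarray, k: int) -> np.ndarray:
    """Softmax over only the k largest logits; all other experts get gate 0."""
    top = np.argsort(logits)[-k:]       # indices of the k best-scoring experts
    gates = np.zeros_like(logits)
    e = np.exp(logits[top] - logits[top].max())
    gates[top] = e / e.sum()            # renormalize over the chosen few
    return gates

logits = np.array([1.2, -0.3, 2.5, 0.1])      # router scores for 4 experts
print(np.round(top_k_gates(logits, k=2), 3))  # only two nonzero gates

# Experts with zero gates are never evaluated: that skipped work is
# exactly the compute saving of sparse MoE routing.
```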
🏁 Conclusion — From Trees to Minds
Trees = single, exclusive corridors
Attention = spotlight blending across multiple paths
Sigmoid scaling = how hard or soft doors swing
Weights + Biases = the hidden knowledge guiding your choices
Query / Key / Value = the mechanism to retrieve relevant info
Ensemble / MoE = multiple solvers collaborating
This maze metaphor ties together the structure, knowledge, and routing decisions of modern AI, showing how weights + biases + scaling + attention + experts all interact in large models.
Disclaimer: Posts reflect personal curiosity and are co-written with AI tools (including ChatGPT) to build intuition, not to present formal research or advice. I spent about two hours on this topic (technical writing in the age of LLMs) while reading the ACM paper "A Domain-Specific Architecture for Deep Neural Networks"; a question about decision trees popped up, then a few more, and I used them as hinges for the post above. What do you think of the book The City and Its Uncertain Walls by Haruki Murakami?

Thinking in terms of a city with uncertain walls can make Mixture-of-Experts (MoE) models much easier to understand.
In the Deepseek-V3 architecture (summary https://lunar-joke-35b.notion.site/Deepseek-v3-101-169ba4b6a3fa8090a7aacaee1a1cefaa), only a small subset of specialists (“experts”) are consulted for each token, similar to how a traveller in our uncertain city might visit only a few local guides instead of everyone in town.
In concrete terms, Deepseek-V3’s MoE design activates just a fraction of its parameters per token (for example, 37 B out of 671 B, roughly 5.5 %) based on dynamic routing decisions. This is like having many expert routes in the city, but the traveller only lights the paths that are most relevant to their current question.
This bridges beautifully to the city-walls analogy: instead of blindly exploring every possible corridor, the model uses a router (like a guide) to decide which experts (paths) to trust, reducing computational waste while still navigating efficiently. The Deepseek-V3 primer linked above shows how the architecture mixes dynamic routing and latent attention to accomplish this.