Discussion about this post

adhyayan:

Thinking in terms of a city with uncertain walls can make Mixture-of-Experts (MoE) models much easier to understand.

In the DeepSeek-V3 architecture (summary: https://lunar-joke-35b.notion.site/Deepseek-v3-101-169ba4b6a3fa8090a7aacaee1a1cefaa), only a small subset of specialists (“experts”) is consulted for each token, similar to how a traveller in our uncertain city might visit only a few local guides instead of asking everyone in town.

In concrete terms, DeepSeek-V3’s MoE design activates only about 37B of its 671B total parameters for each token, based on dynamic routing decisions made per token. It is like having many expert routes through the city, but the traveller only lights up the paths most relevant to their current question.

This bridges beautifully to the city walls analogy: instead of blindly exploring every possible corridor, the model uses a router (like a guide) to decide which experts (paths) to trust, reducing computational waste while still navigating efficiently. The DeepSeek-V3 primer linked above shows how the architecture combines this dynamic routing with Multi-head Latent Attention (MLA) to accomplish it.
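
To make the routing step concrete, here is a minimal sketch of a generic top-k MoE layer in PyTorch. This is not DeepSeek-V3's actual implementation (which adds shared experts and its own gating scheme); the model width, expert count, and top_k value are illustrative assumptions. A tiny router scores every expert for each token, only the top-scoring experts run, and their outputs are mixed using the normalized router weights.

```python
# Minimal sketch of per-token top-k expert routing (illustrative, not DeepSeek-V3's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The "guide": a linear router that scores every expert for each token.
        self.router = nn.Linear(d_model, n_experts)
        # The "local specialists": small feed-forward experts.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                               # (n_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)   # keep only the best paths
        weights = F.softmax(top_vals, dim=-1)                 # normalize over chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; the rest stay idle.
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot+1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)   # 10 tokens, model width 64
moe = TinyMoE()
print(moe(tokens).shape)       # torch.Size([10, 64])
```

Even with 8 experts defined, each token only touches 2 of them, which is the same idea (at toy scale) as activating 37B of 671B parameters per token.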
