Architecture Overview

GPT, LLaMA, and DeepSeek are not different architectures. They are different configurations of the same four building blocks. This page describes how lmxlab encodes that observation.

Configs, not subclasses

Most ML codebases define one class per architecture: GPTModel, LlamaModel, DeepSeekModel. Each duplicates the transformer skeleton (embed, blocks, norm, head) with minor variations in the block internals.

lmxlab takes a different approach. There is one LanguageModel class and one ConfigurableBlock class. Architecture variants are expressed as config factories -- plain functions that return a ModelConfig:

# These three calls produce the same type: ModelConfig
from lmxlab.models.llama import llama_config
from lmxlab.models.deepseek import deepseek_config
from lmxlab.core.config import BlockConfig, ModelConfig

gpt_config = ModelConfig(
    block=BlockConfig(attention='mha', ffn='standard', norm='layer_norm', position='sinusoidal'),
    vocab_size=50257,
    n_layers=12,
)

llama = llama_config(d_model=4096, n_heads=32, n_kv_heads=8, n_layers=32)
deepseek = deepseek_config(d_model=5120, n_heads=128, n_layers=60, kv_lora_rank=512)

All three are ModelConfig values. All three build via LanguageModel(config). The differences are in the string names ('mha' vs 'gqa' vs 'mla') and numeric parameters.

The four component registries

A BlockConfig names its components as strings. Those strings are resolved at construction time by four typed registries:

Registry            Key           Class              Used by
attention_registry  'mha'         MHA                GPT
                    'gqa'         GQA                LLaMA, Mistral
                    'mla'         MLA                DeepSeek V2/V3
ffn_registry        'standard'    StandardFFN        GPT
                    'gated'       GatedFFN (SwiGLU)  LLaMA, DeepSeek
norm_registry       'layer_norm'  LayerNorm          GPT
                    'rms_norm'    RMSNorm            LLaMA, DeepSeek
position_registry   'rope'        RoPE               LLaMA, DeepSeek
                    'sinusoidal'  Sinusoidal         GPT

Each registry is a Registry[T] instance -- a typed dictionary with a decorator-based registration API:

from lmxlab.core.registry import Registry
from lmxlab.core.attention import attention_registry

@attention_registry.register('gqa')
class GQA(AttentionBase):
    ...

# Later, at construction time:
attn_cls = attention_registry.get('gqa')  # Returns the GQA class

Requesting a nonexistent key produces an error listing the available options.
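The registry behavior can be sketched with a minimal reimplementation (hypothetical; the real class lives in lmxlab.core.registry and may differ in detail):

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar('T')

class Registry(Generic[T]):
    """Typed name -> class mapping with a decorator-based registration API."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._entries: Dict[str, T] = {}

    def register(self, key: str) -> Callable[[T], T]:
        # Used as @registry.register('key'); returns the class unchanged
        # so registration does not alter the decorated class.
        def decorator(cls: T) -> T:
            self._entries[key] = cls
            return cls
        return decorator

    def get(self, key: str) -> T:
        # A missing key raises an error that lists the available options.
        if key not in self._entries:
            raise KeyError(
                f"Unknown {self.name} '{key}'. "
                f"Available: {sorted(self._entries)}"
            )
        return self._entries[key]

attention_registry = Registry('attention')

@attention_registry.register('gqa')
class GQA:
    ...
```

Construction code then resolves names with `attention_registry.get('gqa')`, which returns the GQA class itself, not an instance.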

How a model is built

Here is the full construction chain, from config to runnable model:

ModelConfig
  |
  v
LanguageModel.__init__()
  |-- nn.Embedding(vocab_size, d_model)
  |
  |-- for each layer:
  |     ConfigurableBlock(block_config)
  |       |-- attention_registry.get(config.attention)(config)
  |       |-- ffn_registry.get(config.ffn)(config)
  |       |-- norm_registry.get(config.norm)(config)  x2
  |       |-- position_registry.get(config.position)(config)
  |
  |-- norm_registry.get(config.norm)(config)   # final norm
  |-- nn.Linear (or tied embedding)            # output head

Every component's constructor takes a BlockConfig. This uniform interface is what makes the registry pattern work: any component can be swapped without changing the wiring code.
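That wiring can be illustrated with a simplified sketch. The registries are plain dicts here and the components are stubs; in lmxlab they are typed Registry instances and real MLX modules:

```python
from dataclasses import dataclass

# Stub components: each constructor takes the BlockConfig, nothing else.
class MHA:
    def __init__(self, config): self.config = config
class StandardFFN:
    def __init__(self, config): self.config = config
class LayerNorm:
    def __init__(self, config): self.config = config
class Sinusoidal:
    def __init__(self, config): self.config = config

attention_registry = {'mha': MHA}
ffn_registry = {'standard': StandardFFN}
norm_registry = {'layer_norm': LayerNorm}
position_registry = {'sinusoidal': Sinusoidal}

@dataclass
class BlockConfig:
    attention: str
    ffn: str
    norm: str
    position: str

class ConfigurableBlock:
    def __init__(self, config: BlockConfig):
        # Every component is resolved by name and constructed from the same
        # BlockConfig, so swapping a component never touches this wiring.
        self.attn = attention_registry[config.attention](config)
        self.ffn = ffn_registry[config.ffn](config)
        self.norm1 = norm_registry[config.norm](config)
        self.norm2 = norm_registry[config.norm](config)
        self.pos = position_registry[config.position](config)

block = ConfigurableBlock(
    BlockConfig('mha', 'standard', 'layer_norm', 'sinusoidal')
)
```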

Config factories as architecture specifications

The factory functions in lmxlab.models are deliberately simple. Here is llama_config in its entirety:

def llama_config(
    vocab_size=32000, d_model=4096, n_heads=32,
    n_kv_heads=8, n_layers=32, d_ff=11008,
    max_seq_len=4096, rope_theta=10000.0,
    tie_embeddings=False,
) -> ModelConfig:
    block = BlockConfig(
        attention='gqa', ffn='gated', norm='rms_norm',
        position='rope', d_model=d_model, n_heads=n_heads,
        n_kv_heads=n_kv_heads, d_ff=d_ff, bias=False,
        rope_theta=rope_theta, max_seq_len=max_seq_len,
        pre_norm=True,
    )
    return ModelConfig(
        block=block, vocab_size=vocab_size,
        n_layers=n_layers, tie_embeddings=tie_embeddings,
    )

There is no class to subclass and no abstract methods to override. A new architecture is defined by writing a function that returns a ModelConfig. If the architecture needs a new attention mechanism, it is registered and referenced by name.
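For instance, a BLOOM-style factory could look like this. Everything below is a hypothetical sketch: bloom_config and its default sizes are invented for illustration, and the stub dataclasses carry only a subset of the fields lmxlab's real BlockConfig and ModelConfig have:

```python
from dataclasses import dataclass

@dataclass
class BlockConfig:
    attention: str
    ffn: str
    norm: str
    position: str
    d_model: int
    n_heads: int

@dataclass
class ModelConfig:
    block: BlockConfig
    vocab_size: int
    n_layers: int

def bloom_config(vocab_size=32000, d_model=4096,
                 n_heads=32, n_layers=30) -> ModelConfig:
    # BLOOM per the registry table: MHA attention, standard FFN,
    # LayerNorm, ALiBi positions.
    block = BlockConfig(
        attention='mha', ffn='standard', norm='layer_norm',
        position='alibi', d_model=d_model, n_heads=n_heads,
    )
    return ModelConfig(block=block, vocab_size=vocab_size,
                       n_layers=n_layers)
```

The new architecture is the function: four component names plus numeric parameters.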

Per-layer block overrides

Some architectures use different block configurations at different layers (for example, interleaving dense and MoE layers). ModelConfig supports this via the block_configs field:

dense_block = BlockConfig(attention='gqa', ffn='gated', ...)
moe_block = BlockConfig(attention='gqa', ffn='moe', ...)

config = ModelConfig(
    block=dense_block,  # default (used if block_configs is None)
    n_layers=4,
    block_configs=(dense_block, moe_block, dense_block, moe_block),
)

The get_block_config(layer_idx) method returns the per-layer config if provided, otherwise falls back to the shared block config.
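A plausible implementation of that fallback (hypothetical; the actual method lives on lmxlab's ModelConfig, which is sketched here with a string standing in for BlockConfig):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ModelConfig:
    block: str                              # stand-in for BlockConfig
    n_layers: int
    block_configs: Optional[Tuple[str, ...]] = None

    def get_block_config(self, layer_idx: int) -> str:
        # Per-layer override if provided, else the shared block config.
        if self.block_configs is not None:
            return self.block_configs[layer_idx]
        return self.block

cfg = ModelConfig(block='dense', n_layers=4,
                  block_configs=('dense', 'moe', 'dense', 'moe'))
assert cfg.get_block_config(1) == 'moe'
assert ModelConfig(block='dense', n_layers=2).get_block_config(0) == 'dense'
```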

Consequences

  1. A full LLaMA architecture specification fits in a single ModelConfig, with no class hierarchy to trace.
  2. Components compose freely: GQA attention with LayerNorm, or MHA with RMSNorm. The registries impose no coupling.
  3. Adding a new attention variant requires implementing AttentionBase, registering it, and referencing it by name. LanguageModel and ConfigurableBlock remain unchanged.
  4. Comparing LLaMA and DeepSeek reduces to comparing two ModelConfig values. Structural similarities and differences are visible at a glance.

Next steps

  • Configurable Block -- How ConfigurableBlock assembles components and handles pre-norm vs post-norm.
  • MLX Idioms -- How the training loop and model internals use MLX-specific patterns.