# Architecture Overview
GPT, LLaMA, and DeepSeek are not different architectures. They are different configurations of the same four building blocks. This page describes how lmxlab encodes that observation.
## Configs, not subclasses
Most ML codebases define one class per architecture: GPTModel, LlamaModel,
DeepSeekModel. Each duplicates the transformer skeleton (embed, blocks, norm,
head) with minor variations in the block internals.
lmxlab takes a different approach. There is one LanguageModel class and
one ConfigurableBlock class. Architecture variants are expressed as
config factories -- plain functions that return a ModelConfig:
```python
# These three calls produce the same type: ModelConfig
from lmxlab.models.llama import llama_config
from lmxlab.models.deepseek import deepseek_config
from lmxlab.core.config import BlockConfig, ModelConfig

gpt_config = ModelConfig(
    block=BlockConfig(attention='mha', ffn='standard', norm='layer_norm', position='sinusoidal'),
    vocab_size=50257,
    n_layers=12,
)
llama = llama_config(d_model=4096, n_heads=32, n_kv_heads=8, n_layers=32)
deepseek = deepseek_config(d_model=5120, n_heads=128, n_layers=60, kv_lora_rank=512)
```
All three are ModelConfig values. All three build via LanguageModel(config).
The differences are in the string names ('mha' vs 'gqa' vs 'mla') and
numeric parameters.
## The four component registries
A BlockConfig names its components as strings. Those strings are resolved at
construction time by four typed registries:
| Registry | Key | Class | Used by |
|---|---|---|---|
| `attention_registry` | `'mha'` | `MHA` | GPT |
| | `'gqa'` | `GQA` | LLaMA, Mistral |
| | `'mla'` | `MLA` | DeepSeek V2/V3 |
| `ffn_registry` | `'standard'` | `StandardFFN` | GPT |
| | `'gated'` | `GatedFFN` (SwiGLU) | LLaMA, DeepSeek |
| `norm_registry` | `'layer_norm'` | `LayerNorm` | GPT |
| | `'rms_norm'` | `RMSNorm` | LLaMA, DeepSeek |
| `position_registry` | `'rope'` | `RoPE` | LLaMA, DeepSeek |
| | `'sinusoidal'` | `Sinusoidal` | GPT |
| | `'alibi'` | `ALiBi` | BLOOM |
Each registry is a Registry[T] instance -- a typed dictionary with a
decorator-based registration API:
```python
from lmxlab.core.registry import Registry
from lmxlab.core.attention import attention_registry

@attention_registry.register('gqa')
class GQA(AttentionBase):
    ...

# Later, at construction time:
attn_cls = attention_registry.get('gqa')  # Returns the GQA class
```
Requesting a nonexistent key produces an error listing the available options.
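The registry itself fits in a few lines. The sketch below is an illustrative stand-in, not the actual lmxlab implementation; details such as the exception type and internal storage are assumptions.

```python
# Minimal sketch of a typed registry with decorator-based registration.
# Mirrors the behavior described above; the real lmxlab class may differ.
from typing import Callable, Dict, Generic, Type, TypeVar

T = TypeVar('T')

class Registry(Generic[T]):
    def __init__(self, name: str) -> None:
        self.name = name
        self._entries: Dict[str, Type[T]] = {}

    def register(self, key: str) -> Callable[[Type[T]], Type[T]]:
        # Used as a decorator: @registry.register('gqa')
        def decorator(cls: Type[T]) -> Type[T]:
            self._entries[key] = cls
            return cls
        return decorator

    def get(self, key: str) -> Type[T]:
        # Unknown keys fail with a message listing the available options.
        try:
            return self._entries[key]
        except KeyError:
            available = ', '.join(sorted(self._entries))
            raise KeyError(
                f"Unknown {self.name} '{key}'. Available: {available}"
            ) from None
```

Because `Registry` is generic in the component base class, a type checker can verify that everything registered in `attention_registry` is in fact an attention class.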
## How a model is built
Here is the full construction chain, from config to runnable model:
```
ModelConfig
    |
    v
LanguageModel.__init__()
    |-- nn.Embedding(vocab_size, d_model)
    |
    |-- for each layer:
    |     ConfigurableBlock(block_config)
    |       |-- attention_registry.get(config.attention)(config)
    |       |-- ffn_registry.get(config.ffn)(config)
    |       |-- norm_registry.get(config.norm)(config)  x2
    |       |-- position_registry.get(config.position)(config)
    |
    |-- norm_registry.get(config.norm)(config)   # final norm
    |-- nn.Linear (or tied embedding)            # output head
```
Every component's constructor takes a BlockConfig. This uniform interface
is what makes the registry pattern work: any component can be swapped
without changing the wiring code.
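The wiring can be sketched in plain Python. Everything below (the dataclass fields, the dict registries, the component classes) is a simplified stand-in for the lmxlab types, kept framework-free so the pattern is visible on its own:

```python
# Simplified sketch of the construction chain: string -> class lookup,
# then a uniform BlockConfig-taking constructor for every component.
# All names are illustrative stand-ins, not the real lmxlab code.
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockConfig:
    attention: str = 'mha'
    ffn: str = 'standard'
    norm: str = 'layer_norm'
    d_model: int = 64

class MHA:
    def __init__(self, config: BlockConfig) -> None:
        self.d_model = config.d_model

class StandardFFN:
    def __init__(self, config: BlockConfig) -> None:
        self.d_model = config.d_model

class LayerNorm:
    def __init__(self, config: BlockConfig) -> None:
        self.d_model = config.d_model

attention_registry = {'mha': MHA}
ffn_registry = {'standard': StandardFFN}
norm_registry = {'layer_norm': LayerNorm}

class ConfigurableBlock:
    # The wiring never names a concrete component class, so swapping
    # 'mha' for 'gqa' (once registered) requires no change here.
    def __init__(self, config: BlockConfig) -> None:
        self.attn = attention_registry[config.attention](config)
        self.ffn = ffn_registry[config.ffn](config)
        self.norm1 = norm_registry[config.norm](config)
        self.norm2 = norm_registry[config.norm](config)

block = ConfigurableBlock(BlockConfig())
```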
## Config factories as architecture specifications
The factory functions in lmxlab.models are deliberately simple. Here is
llama_config in its entirety:
```python
def llama_config(
    vocab_size=32000, d_model=4096, n_heads=32,
    n_kv_heads=8, n_layers=32, d_ff=11008,
    max_seq_len=4096, rope_theta=10000.0,
    tie_embeddings=False,
) -> ModelConfig:
    block = BlockConfig(
        attention='gqa', ffn='gated', norm='rms_norm',
        position='rope', d_model=d_model, n_heads=n_heads,
        n_kv_heads=n_kv_heads, d_ff=d_ff, bias=False,
        rope_theta=rope_theta, max_seq_len=max_seq_len,
        pre_norm=True,
    )
    return ModelConfig(
        block=block, vocab_size=vocab_size,
        n_layers=n_layers, tie_embeddings=tie_embeddings,
    )
```
There is no class to subclass and no abstract methods to override. A new
architecture is defined by writing a function that returns a ModelConfig.
If the architecture needs a new attention mechanism, it is registered and
referenced by name.
## Per-layer block overrides
Some architectures use different block configurations at different layers
(for example, interleaving dense and MoE layers). ModelConfig supports
this via the block_configs field:
```python
dense_block = BlockConfig(attention='gqa', ffn='gated', ...)
moe_block = BlockConfig(attention='gqa', ffn='moe', ...)

config = ModelConfig(
    block=dense_block,  # default (used if block_configs is None)
    n_layers=4,
    block_configs=(dense_block, moe_block, dense_block, moe_block),
)
```
The get_block_config(layer_idx) method returns the per-layer config if
provided, otherwise falls back to the shared block config.
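The fallback logic amounts to a few lines. This sketch uses simplified stand-in dataclasses rather than the real lmxlab types:

```python
# Illustrative sketch of the per-layer fallback described above.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class BlockConfig:
    ffn: str = 'gated'

@dataclass(frozen=True)
class ModelConfig:
    block: BlockConfig
    n_layers: int
    block_configs: Optional[Tuple[BlockConfig, ...]] = None

    def get_block_config(self, layer_idx: int) -> BlockConfig:
        # A per-layer override wins; otherwise every layer shares `block`.
        if self.block_configs is not None:
            return self.block_configs[layer_idx]
        return self.block

dense = BlockConfig(ffn='gated')
moe = BlockConfig(ffn='moe')
config = ModelConfig(block=dense, n_layers=4,
                     block_configs=(dense, moe, dense, moe))
```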
## Consequences
- A full LLaMA architecture specification fits in a single config dict, with no class hierarchy to trace.
- Components compose freely: GQA attention with LayerNorm, or MHA with RMSNorm. The registries impose no coupling.
- Adding a new attention variant requires implementing `AttentionBase`, registering it, and referencing it by name. `LanguageModel` and `ConfigurableBlock` remain unchanged.
- Comparing LLaMA and DeepSeek reduces to comparing two config dicts. Structural similarities and differences are visible at a glance.
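Config-to-config comparison can be made concrete with a field-by-field diff. The dataclasses below are simplified stand-ins for the lmxlab types, reduced to the component-name fields:

```python
# Illustrative: diffing two architecture configs field by field.
# Simplified stand-in types; not the real lmxlab BlockConfig.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BlockConfig:
    attention: str
    norm: str
    ffn: str

llama_block = BlockConfig(attention='gqa', norm='rms_norm', ffn='gated')
deepseek_block = BlockConfig(attention='mla', norm='rms_norm', ffn='gated')

a, b = asdict(llama_block), asdict(deepseek_block)
diff = {k: (a[k], b[k]) for k in a if a[k] != b[k]}
# In this reduced comparison, only the attention mechanism differs:
# diff == {'attention': ('gqa', 'mla')}
```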
## Next steps
- Configurable Block -- How `ConfigurableBlock` assembles components and handles pre-norm vs post-norm.
- MLX Idioms -- How the training loop and model internals use MLX-specific patterns.