Model Architectures

lmxlab implements 24 architectures as config factories (functions that return a ModelConfig). The same LanguageModel class handles all of them through ConfigurableBlock.

Architecture Comparison

| Architecture | Attention | FFN | Norm | Position | Bias | KV Heads | Special |
|---|---|---|---|---|---|---|---|
| GPT | MHA | Standard | LayerNorm | Sinusoidal | Yes | = n_heads | Baseline |
| LLaMA | GQA | Gated (SwiGLU) | RMSNorm | RoPE | No | < n_heads | - |
| Gemma | GQA (MQA) | Gated | RMSNorm | RoPE | No | 1 | Tied embeddings |
| Gemma 3 | SlidingWindowGQA + GQA | Gated | RMSNorm | RoPE | No | < n_heads | Sliding window |
| Qwen | GQA | Gated | RMSNorm | RoPE (θ=1M) | Yes | < n_heads | High RoPE theta |
| Qwen 3 MoE | GQA | SharedExpertMoE | RMSNorm | RoPE | No | < n_heads | 64 experts, top-8 |
| Qwen 3.5 | DeltaNet + GQA | Gated | RMSNorm | Conv + RoPE | No | < n_heads | 3:1 hybrid |
| Qwen-Next | GatedGQA | Gated | RMSNorm | RoPE | No | < n_heads | Sigmoid output gate |
| Mixtral | GQA | Gated (MoE) | RMSNorm | RoPE (θ=1M) | No | < n_heads | 8 experts, top-2 |
| DeepSeek V2 | MLA | Gated | RMSNorm | Decoupled RoPE | No | Latent | KV compression |
| DeepSeek V3 | MLA | SharedExpertMoE | RMSNorm | Decoupled RoPE | No | Latent | MLA + MoE |
| Nemotron | Mamba-2 + GQA | ReLU² / MoE | RMSNorm | RoPE | No | < n_heads | M/E/* hybrid pattern |
| Llama 4 Scout | ChunkedGQA | SharedExpertMoE | RMSNorm | iRoPE | No | < n_heads | Chunked attn + NoPE |
| Llama 4 Maverick | ChunkedGQA | SharedExpertMoE | RMSNorm | iRoPE | No | < n_heads | 128 experts, top-1 |
| Mistral Small | SlidingWindowGQA | Gated | RMSNorm | RoPE | No | < n_heads | All-local window |
| OLMo 2 | GQA | Gated | RMSNorm | RoPE | No | < n_heads | QK-norm |
| GPT-OSS | GQA | Gated | RMSNorm | RoPE | No | < n_heads | QK-norm, tied embs |
| Grok | GQA | SharedExpertMoE | RMSNorm | RoPE | No | < n_heads | 8 experts + shared |
| Kimi K2.5 | DeltaNet + GQA | SharedExpertMoE | RMSNorm | Conv + RoPE | No | < n_heads | 128 experts + shared |
| SmolLM3 | GQA | Gated | RMSNorm | iRoPE | No | < n_heads | RoPE + NoPE layers |
| Falcon H1 | Mamba-2 + GQA | Gated | RMSNorm | RoPE | No | < n_heads | Hybrid M/* pattern |
| Jamba | Mamba-2 + GQA | Gated / MoE | RMSNorm | RoPE | No | < n_heads | MMMA + MoE alternation |
| Bamba | Mamba-2 + GQA | Gated | RMSNorm | RoPE | No | < n_heads | IBM hybrid M/* |
| GLM-4.5 | MLA | Gated | RMSNorm | NoPE | No | Latent | MLA without RoPE |

GPT

GPT uses standard multi-head attention, LayerNorm, and sinusoidal positional encoding.

from lmxlab.models.gpt import gpt_config

config = gpt_config()
# attention="mha", norm="layer_norm", ffn="standard"
# position="sinusoidal", bias=True

This architecture keeps bias in all linear layers and uses pre-norm LayerNorm (matching GPT-2); it is the only architecture in the table with a standard (non-gated) FFN.

LLaMA

LLaMA uses grouped-query attention for memory efficiency, RMSNorm for speed, and a SwiGLU FFN.

from lmxlab.models.llama import llama_config

config = llama_config()
# attention="gqa", norm="rms_norm", ffn="gated"
# position="rope", bias=False, n_kv_heads=8

No bias terms are used. GQA maps 8 KV heads across 32 query heads, with RoPE for position encoding.
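The KV-head sharing can be sketched in a few lines of numpy (a framework-agnostic illustration, not lmxlab's GQA code): each of the 8 KV heads is broadcast to a group of 4 consecutive query heads, so the KV cache stores a quarter of the heads.

```python
import numpy as np

def repeat_kv(k, n_heads):
    """Expand KV heads to match query heads for GQA.

    k: (n_kv_heads, seq, head_dim). Each KV head is shared by
    n_heads // n_kv_heads consecutive query heads.
    """
    n_kv = k.shape[0]
    return np.repeat(k, n_heads // n_kv, axis=0)

k = np.random.randn(8, 5, 64)       # 8 cached KV heads
k_full = repeat_kv(k, 32)           # expanded to 32 heads at attention time
# Query heads 0-3 all attend against the same cached KV head 0.
```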

Gemma

Google's efficient variant with multi-query attention (single KV head) and tied input/output embeddings.

from lmxlab.models.gemma import gemma_config

config = gemma_config()
# n_kv_heads=1 (multi-query), tie_embeddings=True

When n_kv_heads=1, GQA reduces to Multi-Query Attention (MQA): all query heads share a single set of keys and values.

Qwen

Alibaba's architecture with high RoPE theta for long context and bias in QKV projections.

from lmxlab.models.qwen import qwen_config

config = qwen_config()
# rope_theta=1_000_000.0, bias=True

A higher RoPE theta extends the effective context window by shifting the frequency spectrum of positional encodings toward lower frequencies.
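The effect of theta on the frequency spectrum can be checked directly. The snippet below (a sketch of the standard RoPE frequency formula, not lmxlab's implementation) computes each rotation pair's wavelength in tokens; raising theta from 10,000 to 1,000,000 stretches the longest wavelength nearly 100x.

```python
import math

def rope_wavelengths(theta, head_dim):
    """Wavelength (in tokens) of each RoPE rotation pair.

    Pair i rotates at inv_freq = theta ** (-2*i / head_dim), so its
    wavelength is 2*pi / inv_freq = 2*pi * theta ** (2*i / head_dim).
    """
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

# The slowest pair bounds the longest distance RoPE can distinguish.
base = rope_wavelengths(10_000.0, 128)[-1]
long = rope_wavelengths(1_000_000.0, 128)[-1]
```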

Mixtral (MoE)

Mixtral is a sparse Mixture of Experts model that routes each token to 2 of 8 expert FFNs.

from lmxlab.models.mixtral import mixtral_config

config = mixtral_config()
# Uses MoEFFN: 8 experts, top-2 routing

MoE increases model capacity without proportionally increasing compute, since each token activates only 2 of 8 expert FFN parameter sets.
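The routing math can be sketched as follows (an illustration of top-2 routing, not lmxlab's MoEFFN): take the two largest router logits per token and softmax over just those two to get mixing weights.

```python
import numpy as np

def top2_route(logits):
    """Softmax over only the top-2 expert logits per token.

    logits: (n_tokens, n_experts) router outputs.
    Returns (indices, weights): which 2 experts each token activates
    and the normalized mixing weights for combining their outputs.
    """
    idx = np.argsort(logits, axis=-1)[:, -2:]             # top-2 expert ids
    top = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                 # renormalize
    return idx, w

logits = np.array([[0.1, 2.0, -1.0, 1.5, 0.0, 0.3, 0.2, 0.4]])
idx, w = top2_route(logits)
# This token routes to experts 1 and 3; the other 6 expert FFNs stay idle.
```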

DeepSeek V2 (MLA)

Multi-Head Latent Attention compresses KV representations into a low-rank latent space, reducing KV cache size by approximately 57x relative to MHA.

from lmxlab.models.deepseek import deepseek_config

config = deepseek_config()
# attention="mla", kv_lora_rank=512, rope_dim=64
# q_lora_rank=1536

MLA operates as follows:

  1. Down-project KV from d_model to kv_lora_rank (plus rope_dim for a shared RoPE key)
  2. Cache only the compressed latent (not full K, V)
  3. Up-project the latent to multi-head K and V at attention time
  4. Decoupled RoPE: position information is kept in a separate single-head key

Cache per token: kv_lora_rank + rope_dim = 576 vs 2 × n_heads × head_dim = 32,768 for MHA.
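The arithmetic above can be reproduced directly. The head counts below are inferred from the 32,768 figure (any n_heads × head_dim product of 16,384 would give the same total) and are an assumption of this sketch:

```python
# Per-token, per-layer KV cache entries (floats), following the
# figures quoted above.
kv_lora_rank, rope_dim = 512, 64
n_heads, head_dim = 128, 128        # assumed: 2 * 128 * 128 = 32,768

mla_cache = kv_lora_rank + rope_dim  # compressed latent + shared RoPE key
mha_cache = 2 * n_heads * head_dim   # full multi-head K and V
ratio = mha_cache / mla_cache        # roughly the quoted ~57x reduction
```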

Gemma 3 (Interleaved Attention)

Mixes sliding window (local) and global attention layers. Most layers use a fixed window; every Nth layer attends to the full sequence.

from lmxlab.models.gemma3 import gemma3_config

config = gemma3_config()
# Every 6th layer: global GQA (5:1 local:global ratio)
# Other layers: sliding_window_gqa with window_size=4096
# (Real Gemma 3 uses window_size=1024; adjust to match)

Local attention is O(n * w) instead of O(n^2), making long sequences tractable while periodic global layers maintain long-range dependencies. This architecture uses per-layer block_configs, exercising the ConfigurableBlock system's per-layer configuration.
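The 5:1 interleave can be sketched as a per-layer pattern list. Whether the global layers land on indices 5, 11, ... (as assumed here) or 0, 6, ... is a convention of this sketch, not something the config above pins down:

```python
def gemma3_layer_pattern(n_layers, global_every=6):
    """Per-layer attention kinds for a 5:1 local:global interleave.

    Every `global_every`-th layer attends globally; the rest use
    sliding-window attention. The string values are illustrative,
    not lmxlab's actual config keys.
    """
    return ["gqa" if (i + 1) % global_every == 0 else "sliding_window_gqa"
            for i in range(n_layers)]

pattern = gemma3_layer_pattern(12)
# 12 layers -> 10 sliding-window layers and 2 global layers.
```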

Qwen 3.5 (Hybrid DeltaNet + GQA)

Qwen 3.5 interleaves Gated DeltaNet (linear attention with delta rule) and standard GQA layers in a 3:1 ratio.

from lmxlab.models.qwen35 import qwen35_config

config = qwen35_config()
# 75% gated_deltanet layers + 25% gqa layers
# DeltaNet: causal conv, no RoPE, fixed-size state
# GQA: standard with RoPE, growing KV cache

Gated DeltaNet operates as follows:

  1. Delta rule: the state matrix S predicts v from k, then corrects itself based on prediction error: S = alpha * S - beta * (S@k - v)@k^T
  2. Decay gate (alpha): learned selective forgetting that controls how much old context is discarded
  3. Update gate (beta): learned correction strength that controls how much new information is incorporated
  4. Fixed-size state: O(d^2) per token regardless of sequence length (compared to O(n) for a KV cache)
  5. Short causal convolutions replace RoPE for local context in DeltaNet layers

Pure linear attention loses expressiveness by compressing all history into a fixed-size state. The hybrid 3:1 pattern preserves efficient long-context processing via DeltaNet while periodic GQA layers provide full attention when needed.
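The delta-rule recurrence in step 1 can be sketched in numpy (the causal conv and the learned gate projections are omitted; here alpha and beta are plain scalars):

```python
import numpy as np

def deltanet_step(S, k, v, q, alpha, beta):
    """One step of S = alpha * S - beta * (S @ k - v) @ k^T.

    S: (d_v, d_k) fixed-size state; k, q: (d_k,); v: (d_v,).
    alpha decays old context, beta scales the error correction.
    """
    err = S @ k - v                      # how badly S predicts v from k
    S = alpha * S - beta * np.outer(err, k)
    return S, S @ q                      # new state, output for this token

d_k, d_v = 4, 4
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.ones(d_v)
S, out = deltanet_step(S, k, v, q=k, alpha=1.0, beta=1.0)
# With alpha=beta=1 and a unit-norm key, one update makes S map k -> v,
# so out == v.
```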

Qwen-Next (Gated Attention)

Qwen-Next uses GatedGQA, a variant of GQA with a learned sigmoid gate on the attention output. The gate modulates how much attention information passes through, improving gradient flow and representational capacity.

from lmxlab.models.qwen_next import qwen_next_config

config = qwen_next_config()
# attention="gated_gqa": y = attn_out * sigmoid(W_gate @ x)

The output gate (G1 elementwise variant from arXiv:2505.06708) adds minimal parameters but measurably improves training dynamics by providing a learned bypass around the attention mechanism.
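The gate itself is a one-liner. The sketch below assumes a (d_model, d_model) gate weight applied as x @ W_gate (the doc's W_gate @ x notation, transposed to batch-first convention); it is an illustration, not lmxlab's GatedGQA:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_output(attn_out, x, W_gate):
    """G1 elementwise output gate: y = attn_out * sigmoid(x @ W_gate).

    attn_out, x: (seq, d_model); W_gate: (d_model, d_model).
    The gate lets the model learn, per-channel, how much attention
    output to pass through.
    """
    return attn_out * sigmoid(x @ W_gate)

seq, d = 3, 8
attn_out = np.ones((seq, d))
# A zero gate input gives sigmoid(0) = 0.5 everywhere, halving the output.
y = gated_output(attn_out, np.zeros((seq, d)), np.random.randn(d, d))
```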

DeepSeek V3 (MLA + MoE)

Extends DeepSeek V2's MLA attention with SharedExpertMoE FFN layers, combining KV compression with sparse expert routing.

from lmxlab.models.deepseek import deepseek_v3_config

config = deepseek_v3_config()
# MLA attention + SharedExpertMoE (256 routed experts + 1 shared)
# Sigmoid routing with bias correction

Nemotron (Hybrid Mamba-Transformer MoE)

NVIDIA's hybrid architecture mixing three layer types encoded in a pattern string: M (Mamba-2 SSD), E (LatentMoE), and * (standard attention + dense FFN with squared ReLU).

from lmxlab.models.nemotron import nemotron3_config

config = nemotron3_config()
# hybrid_override_pattern: "M*M*M*M*MEMEMEMEMEMEMEMEMEME..."
# Mamba layers for sequence mixing, MoE for capacity

Each layer type serves a distinct role: Mamba-2 for efficient sequence mixing, attention for precise retrieval, and LatentMoE for routing tokens to specialized experts with reduced dimensionality.
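Expanding the pattern string into per-layer kinds takes only a dictionary lookup; the kind names below are illustrative, not lmxlab's actual config values:

```python
def parse_hybrid_pattern(pattern):
    """Map Nemotron pattern characters to layer kinds.

    'M' -> Mamba-2 SSD, 'E' -> LatentMoE, '*' -> attention + dense FFN.
    """
    kinds = {"M": "mamba2", "E": "latent_moe", "*": "attention_dense"}
    return [kinds[c] for c in pattern]

layers = parse_hybrid_pattern("M*M*ME")
```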

Llama 4 Scout (iRoPE + Chunked Attention)

Uses the iRoPE pattern: most layers use chunked local attention with RoPE positions resetting at chunk boundaries, while every 4th layer uses global GQA without positional encoding (NoPE) for cross-chunk information flow. All layers use SharedExpertMoE FFN.

from lmxlab.models.llama4 import llama4_scout_config

config = llama4_scout_config()
# 75% chunked_gqa (8192 chunk size, local RoPE)
# 25% gqa (no position encoding — NoPE layers)
# All layers: SharedExpertMoE (16 experts + 1 shared)
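"RoPE positions resetting at chunk boundaries" reduces to a modulo over the token index, sketched here (an illustration of the position bookkeeping, not lmxlab's ChunkedGQA):

```python
def chunk_local_positions(n_tokens, chunk_size):
    """RoPE position indices that reset at each chunk boundary.

    Chunked-attention layers see positions 0..chunk_size-1 repeatedly,
    so a token's rotation depends only on its offset within its chunk;
    cross-chunk ordering is left to the NoPE global layers.
    """
    return [t % chunk_size for t in range(n_tokens)]

pos = chunk_local_positions(10, 4)
# pos == [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```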

Mistral Small (Sliding Window)

All layers use sliding-window GQA with a fixed window size. Unlike Gemma 3's mixed approach, Mistral Small applies local attention uniformly with no global layers.

from lmxlab.models.mistral import mistral_small_config

config = mistral_small_config()
# All layers: sliding_window_gqa with window_size=4096

OLMo 2 (QK-Norm)

AllenAI's architecture adds QK-norm (per-head RMSNorm on Q and K after projection, before RoPE) to the standard LLaMA-like template. This stabilizes training at scale.

from lmxlab.models.olmo import olmo2_config

config = olmo2_config()
# Standard GQA + QK-norm (per-head RMSNorm on Q, K)
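QK-norm itself is an RMSNorm over the last axis of the per-head Q and K tensors, sketched below in numpy (not OLMo 2's or lmxlab's code):

```python
import numpy as np

def qk_norm(x, weight, eps=1e-6):
    """Per-head RMSNorm applied to Q or K after projection, before RoPE.

    x: (..., n_heads, head_dim); weight: (head_dim,) learned scale.
    Normalizing over head_dim bounds the scale of attention logits,
    which is what stabilizes training.
    """
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

q = np.random.randn(2, 4, 16)           # (seq, n_heads, head_dim)
q_normed = qk_norm(q, np.ones(16))      # unit RMS per head
```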

GPT-OSS (QK-Norm + Tied Embeddings)

OpenAI's open-source GPT uses a LLaMA-like architecture with QK-norm and tied input/output embeddings.

from lmxlab.models.gpt_oss import gpt_oss_config

config = gpt_oss_config()
# GQA + QK-norm, tied embeddings, no bias

Grok (SharedExpertMoE)

xAI's architecture uses GQA attention with SharedExpertMoE FFN in every layer. All layers are homogeneous (no hybrid pattern).

from lmxlab.models.grok import grok_config

config = grok_config()
# GQA + SharedExpertMoE (8 routed + 1 shared)

Kimi K2.5 (DeltaNet + MoE)

Moonshot AI's hybrid: interleaves GQA and Gated DeltaNet attention layers (every 4th layer is DeltaNet), all with SharedExpertMoE FFN (128 experts).

from lmxlab.models.kimi import kimi_config

config = kimi_config()
# 75% gqa + 25% gated_deltanet layers
# All layers: SharedExpertMoE (128 routed + 1 shared)

SmolLM3 (iRoPE)

HuggingFace's efficient model uses the iRoPE pattern: most layers use GQA with RoPE, while every 4th layer uses GQA without positional encoding (NoPE) for long-range information flow.

from lmxlab.models.smollm import smollm3_config

config = smollm3_config()
# 75% gqa with RoPE + 25% gqa without position encoding
# Tied embeddings

Qwen 3 MoE

Alibaba's MoE variant of Qwen 3 with GQA attention and SharedExpertMoE FFN (64 routed experts, top-8 routing, plus 1 shared expert).

from lmxlab.models.qwen import qwen3_moe_config

config = qwen3_moe_config()
# GQA + SharedExpertMoE (64 experts, top_k=8, 1 shared)

Llama 4 Maverick (iRoPE + 128 Experts)

The larger Llama 4 variant with the same iRoPE pattern as Scout but with 128 routed experts and top-1 routing.

from lmxlab.models.llama4 import llama4_maverick_config

config = llama4_maverick_config()
# Same iRoPE pattern as Scout
# SharedExpertMoE (128 routed + 1 shared, top_k=1)

Falcon H1 (Hybrid Mamba-2 + GQA)

TII's hybrid architecture: most layers use Mamba-2 SSM, with periodic GQA attention layers for global context. All layers use GatedFFN (SwiGLU).

from lmxlab.models.falcon import falcon_h1_config

config = falcon_h1_config()
# hybrid_pattern: "MMMM*MMM*MMM*MMM*"
# Mamba-2 layers (M) + GQA attention layers (*)

Jamba (Mamba-2 + GQA + MoE)

AI21's hybrid using an MMMA pattern (3 Mamba-2 + 1 GQA per cycle) with MoE FFN on alternating attention layers.

from lmxlab.models.jamba import jamba_config

config = jamba_config()
# MMMA pattern: 3 Mamba-2 + 1 GQA per cycle
# MoE (16 experts, top-2) on even attention layers

Bamba (Hybrid Mamba-2 + GQA)

IBM's hybrid variant similar to Falcon H1. Mamba-2 layers handle sequence mixing with periodic GQA layers for global attention.

from lmxlab.models.bamba import bamba_config

config = bamba_config()
# hybrid_pattern: "MMMM*MMM*MMM*MMM*"
# Similar to Falcon H1 with different hyperparameters

GLM-4.5 (MLA NoPE)

Zhipu AI's architecture using MLA attention with rope_dim=0 (no positional encoding). Relies entirely on learned attention patterns for position information.

from lmxlab.models.glm import glm45_config

config = glm45_config()
# MLA attention with rope_dim=0 (NoPE)
# KV compression via kv_lora_rank=512

Setting rope_dim=0 in MLA removes the decoupled RoPE key entirely; position-dependent patterns are learned implicitly through the latent representations.

Loading Pretrained Weights

lmxlab can load pretrained weights from HuggingFace Hub for LLaMA, Gemma, Qwen2, and Mistral models.

Quick load (requires huggingface_hub)

from lmxlab.models.convert import load_from_hf

model, config = load_from_hf('meta-llama/Llama-3.2-1B')

Manual conversion

Given local weights and config files:

import json
from lmxlab.models.convert import config_from_hf, convert_weights
from lmxlab.models.base import LanguageModel
import mlx.core as mx

# Load HF config.json
with open('config.json') as f:
    hf_config = json.load(f)
model_config = config_from_hf(hf_config)

# Load and convert weights
hf_weights = mx.load('model.safetensors')
lmt_weights = convert_weights(hf_weights, 'llama')

# Build model and load
model = LanguageModel(model_config)
model.load_weights(list(lmt_weights.items()))

Weight name mapping

The conversion handles the naming differences between HF and lmxlab:

| HuggingFace | lmxlab |
|---|---|
| model.embed_tokens.weight | embed.weight |
| model.layers.{i}.self_attn.q_proj.weight | blocks.{i}.attention.q_proj.weight |
| model.layers.{i}.mlp.gate_proj.weight | blocks.{i}.ffn.gate.weight |
| model.layers.{i}.mlp.up_proj.weight | blocks.{i}.ffn.up.weight |
| model.layers.{i}.mlp.down_proj.weight | blocks.{i}.ffn.down.weight |
| model.layers.{i}.input_layernorm.weight | blocks.{i}.attn_norm.weight |
| model.layers.{i}.post_attention_layernorm.weight | blocks.{i}.ffn_norm.weight |
| model.norm.weight | final_norm.weight |
| lm_head.weight | head.weight |
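The mapping above can be expressed as a handful of prefix-rewrite rules; the regex renamer below is a sketch of the idea, not the actual convert_weights code:

```python
import re

def hf_name_to_lmxlab(name):
    """Translate an HF parameter name per the mapping table above."""
    rules = [
        (r"^model\.embed_tokens\.", "embed."),
        (r"^model\.layers\.(\d+)\.self_attn\.", r"blocks.\1.attention."),
        (r"^model\.layers\.(\d+)\.mlp\.gate_proj\.", r"blocks.\1.ffn.gate."),
        (r"^model\.layers\.(\d+)\.mlp\.up_proj\.", r"blocks.\1.ffn.up."),
        (r"^model\.layers\.(\d+)\.mlp\.down_proj\.", r"blocks.\1.ffn.down."),
        (r"^model\.layers\.(\d+)\.input_layernorm\.", r"blocks.\1.attn_norm."),
        (r"^model\.layers\.(\d+)\.post_attention_layernorm\.", r"blocks.\1.ffn_norm."),
        (r"^model\.norm\.", "final_norm."),
        (r"^lm_head\.", "head."),
    ]
    for pat, repl in rules:
        name, n = re.subn(pat, repl, name)
        if n:                       # first matching rule wins
            return name
    return name                     # unmapped names pass through

renamed = hf_name_to_lmxlab("model.layers.3.self_attn.q_proj.weight")
# renamed == "blocks.3.attention.q_proj.weight"
```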

Quantization

lmxlab supports post-training quantization via MLX's native affine quantization. This replaces nn.Linear with nn.QuantizedLinear, reducing memory by ~4-8x.

from lmxlab.core.quantize import quantize_model, dequantize_model

# Quantize to 4-bit (default)
quantize_model(model, bits=4, group_size=64)

# Or load from HF already quantized
model, config = load_from_hf('meta-llama/Llama-3.2-1B', quantize=4)

# Dequantize back to float for fine-tuning
dequantize_model(model)

| Bits | Memory reduction | Quality | Use case |
|---|---|---|---|
| 8 | ~4x | Near-lossless | Fine-tuning, high-quality inference |
| 4 | ~8x | Good | Inference, fitting large models in memory |
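The per-group affine scheme can be sketched in numpy (an illustration of the math, not MLX's kernel; group_size matches the quantize_model default above):

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    """Per-group affine quantization: q = round((w - min) / scale).

    Each group of `group_size` weights shares one scale and offset,
    so a 4-bit weight costs 4 bits plus a small per-group overhead.
    """
    w = w.reshape(-1, group_size)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    return q * scale + lo

w = np.random.randn(128).astype(np.float32)
q, scale, lo = affine_quantize(w)
w_hat = affine_dequantize(q, scale, lo).reshape(-1)
# Reconstruction error is at most half a quantization step per group.
```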

LoRA (Low-Rank Adaptation)

Fine-tune pretrained models efficiently by training only small low-rank matrices instead of all weights. Reduces trainable parameters by 10-100x.

from lmxlab.core.lora import apply_lora, merge_lora

# Apply LoRA to attention layers (rank=8)
apply_lora(model, rank=8, targets=['attention'])

# Train — only LoRA params are trainable (~0.1% of total)
trainer = Trainer(model, train_config)
trainer.train(data)

# Merge LoRA back into base weights for inference
merge_lora(model)

Each targeted nn.Linear is replaced with LoRALinear, which computes y = xW^T + scaling * x @ A @ B^T. Matrix B is zero-initialized so that the model initially produces the same output as the base model. Only A and B are trainable; W is frozen.
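That formula can be checked with a numpy sketch of the layer (an illustration, not lmxlab's LoRALinear). Because B starts at zero, the LoRA path contributes nothing until training updates it:

```python
import numpy as np

class LoRALinearSketch:
    """Sketch of y = x @ W.T + scaling * x @ A @ B.T.

    W is the frozen base weight; only A and B would be trained.
    """
    def __init__(self, W, rank=8, alpha=16.0):
        d_out, d_in = W.shape
        self.W = W                                    # frozen
        self.A = np.random.randn(d_in, rank) * 0.01   # small random init
        self.B = np.zeros((d_out, rank))              # zero init
        self.scaling = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scaling * (x @ self.A) @ self.B.T

W = np.random.randn(16, 32)
layer = LoRALinearSketch(W)
x = np.random.randn(4, 32)
y = layer(x)   # identical to the base layer while B == 0
```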

Supported targets:

  - 'attention': q/k/v/o projections
  - 'ffn': gate/up/down projections

QLoRA (Quantized LoRA)

QLoRA combines quantization and LoRA for memory efficiency: base weights remain in 4-bit quantized form while LoRA adapters train in full precision. This permits fine-tuning of models that would otherwise exceed available memory.

from lmxlab.core.quantize import quantize_model
from lmxlab.core.qlora import apply_qlora

# Quantize base model to 4-bit
quantize_model(model, bits=4)

# Apply LoRA on top of quantized layers
apply_qlora(model, rank=8, targets=['attention'])

# Train — only LoRA params are trainable
trainer = Trainer(model, train_config)
trainer.train(data)

Each targeted nn.QuantizedLinear is replaced with LoRAQuantizedLinear, which uses mx.quantized_matmul for the frozen base computation and adds a full-precision low-rank update: y = quantized_matmul(x, W_q) + scaling * x @ A @ B^T.

QLoRA vs LoRA comparison:

| Approach | Base weights | Memory | Use case |
|---|---|---|---|
| LoRA | Float16 | Full model + LoRA | Plenty of memory |
| QLoRA | 4-bit quantized | ~25% of full + LoRA | Large models, tight memory |

Creating a Tiny Model

Every architecture has a _tiny() factory for testing:

from lmxlab.models.gpt import gpt_tiny
from lmxlab.models.deepseek import deepseek_tiny

config = gpt_tiny()       # d_model=64, 2 layers, 4 heads
config = deepseek_tiny()  # d_model=64, 2 layers, kv_lora_rank=16

These use small dimensions (d_model ≤ 128, n_layers ≤ 4, vocab ≤ 1024) to enable fast unit testing and quick experiments.