Model Architectures

lmxlab implements 24 architectures as config factories (functions that return a ModelConfig). The same LanguageModel class handles all of them through ConfigurableBlock.

Architecture Comparison

| Architecture | Attention | FFN | Norm | Position | Bias | KV Heads | Special |
|---|---|---|---|---|---|---|---|
| GPT | MHA | Standard | LayerNorm | Sinusoidal | Yes | = n_heads | Baseline |
| LLaMA | GQA | Gated (SwiGLU) | RMSNorm | RoPE | No | < n_heads | - |
| Gemma | GQA (MQA) | Gated | RMSNorm | RoPE | No | 1 | Tied embeddings |
| Gemma 3 | SlidingWindowGQA + GQA | Gated | RMSNorm | RoPE | No | < n_heads | Sliding window |
| Qwen | GQA | Gated | RMSNorm | RoPE (θ=1M) | Yes | < n_heads | High RoPE theta |
| Qwen 3 MoE | GQA | SharedExpertMoE | RMSNorm | RoPE | No | < n_heads | 64 experts, top-8 |
| Qwen 3.5 | DeltaNet + GQA | Gated | RMSNorm | Conv + RoPE | No | < n_heads | 3:1 hybrid |
| Qwen-Next | GatedGQA | Gated | RMSNorm | RoPE | No | < n_heads | Sigmoid output gate |
| Mixtral | GQA | Gated (MoE) | RMSNorm | RoPE (θ=1M) | No | < n_heads | 8 experts, top-2 |
| DeepSeek V2 | MLA | Gated | RMSNorm | Decoupled RoPE | No | Latent | KV compression |
| DeepSeek V3 | MLA | SharedExpertMoE | RMSNorm | Decoupled RoPE | No | Latent | MLA + MoE |
| Nemotron | Mamba-2 + GQA | ReLU² / MoE | RMSNorm | RoPE | No | < n_heads | M/E/* hybrid pattern |
| Llama 4 Scout | ChunkedGQA | SharedExpertMoE | RMSNorm | iRoPE | No | < n_heads | Chunked attn + NoPE |
| Llama 4 Maverick | ChunkedGQA | SharedExpertMoE | RMSNorm | iRoPE | No | < n_heads | 128 experts, top-1 |
| Mistral Small | SlidingWindowGQA | Gated | RMSNorm | RoPE | No | < n_heads | All-local window |
| OLMo 2 | GQA | Gated | RMSNorm | RoPE | No | < n_heads | QK-norm |
| GPT-OSS | GQA | Gated | RMSNorm | RoPE | No | < n_heads | QK-norm, tied embs |
| Grok | GQA | SharedExpertMoE | RMSNorm | RoPE | No | < n_heads | 8 experts + shared |
| Kimi K2.5 | DeltaNet + GQA | SharedExpertMoE | RMSNorm | Conv + RoPE | No | < n_heads | 128 experts + shared |
| SmolLM3 | GQA | Gated | RMSNorm | iRoPE | No | < n_heads | RoPE + NoPE layers |
| Falcon H1 | Mamba-2 + GQA | Gated | RMSNorm | RoPE | No | < n_heads | Hybrid M/* pattern |
| Jamba | Mamba-2 + GQA | Gated / MoE | RMSNorm | RoPE | No | < n_heads | MMMA + MoE alternation |
| Bamba | Mamba-2 + GQA | Gated | RMSNorm | RoPE | No | < n_heads | IBM hybrid M/* |
| GLM-4.5 | MLA | Gated | RMSNorm | NoPE | No | Latent | MLA without RoPE |

GPT

GPT uses standard multi-head attention, LayerNorm, and sinusoidal positional encoding.

from lmxlab.models.gpt import gpt_config

config = gpt_config()
# attention="mha", norm="layer_norm", ffn="standard"
# position="sinusoidal", bias=True

This architecture keeps bias in all linear layers and uses pre-norm LayerNorm (matching GPT-2); it is the only architecture in the table with a standard (non-gated) FFN.

LLaMA

LLaMA uses grouped-query attention for memory efficiency, RMSNorm for speed, and a SwiGLU FFN.

from lmxlab.models.llama import llama_config

config = llama_config()
# attention="gqa", norm="rms_norm", ffn="gated"
# position="rope", bias=False, n_kv_heads=8

No bias terms are used. GQA maps 8 KV heads across 32 query heads, with RoPE for position encoding.
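The KV-head sharing can be sketched in a few lines of numpy (a framework-agnostic illustration, not lmxlab's GQA code): each of the 8 KV heads is broadcast to a group of 4 consecutive query heads, so the KV cache stores a quarter of the heads.

```python
import numpy as np

def repeat_kv(k, n_heads):
    """Expand KV heads to match query heads for GQA.

    k: (n_kv_heads, seq, head_dim). Each KV head is shared by
    n_heads // n_kv_heads consecutive query heads.
    """
    n_kv = k.shape[0]
    return np.repeat(k, n_heads // n_kv, axis=0)

k = np.random.randn(8, 5, 64)       # 8 cached KV heads
k_full = repeat_kv(k, 32)           # expanded to 32 heads at attention time
# Query heads 0-3 all attend against the same cached KV head 0.
```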

Gemma

Google's efficient variant with multi-query attention (single KV head) and tied input/output embeddings.

from lmxlab.models.gemma import gemma_config

config = gemma_config()
# n_kv_heads=1 (multi-query), tie_embeddings=True

When n_kv_heads=1, GQA reduces to Multi-Query Attention (MQA): all query heads share a single set of keys and values.

Qwen

Alibaba's architecture with high RoPE theta for long context and bias in QKV projections.

from lmxlab.models.qwen import qwen_config

config = qwen_config()
# rope_theta=1_000_000.0, bias=True

A higher RoPE theta extends the effective context window by shifting the frequency spectrum of positional encodings toward lower frequencies.
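The effect of theta on the frequency spectrum can be checked directly. The snippet below (a sketch of the standard RoPE frequency formula, not lmxlab's implementation) computes each rotation pair's wavelength in tokens; raising theta from 10,000 to 1,000,000 stretches the longest wavelength nearly 100x.

```python
import math

def rope_wavelengths(theta, head_dim):
    """Wavelength (in tokens) of each RoPE rotation pair.

    Pair i rotates at inv_freq = theta ** (-2*i / head_dim), so its
    wavelength is 2*pi / inv_freq = 2*pi * theta ** (2*i / head_dim).
    """
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

# The slowest pair bounds the longest distance RoPE can distinguish.
base = rope_wavelengths(10_000.0, 128)[-1]
long = rope_wavelengths(1_000_000.0, 128)[-1]
```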

Mixtral (MoE)

Mixtral is a sparse Mixture of Experts model that routes each token to 2 of 8 expert FFNs.

from lmxlab.models.mixtral import mixtral_config

config = mixtral_config()
# Uses MoEFFN: 8 experts, top-2 routing

MoE increases model capacity without proportionally increasing compute, since each token activates only 2 of 8 expert FFN parameter sets.
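The routing math can be sketched as follows (an illustration of top-2 routing, not lmxlab's MoEFFN): take the two largest router logits per token and softmax over just those two to get mixing weights.

```python
import numpy as np

def top2_route(logits):
    """Softmax over only the top-2 expert logits per token.

    logits: (n_tokens, n_experts) router outputs.
    Returns (indices, weights): which 2 experts each token activates
    and the normalized mixing weights for combining their outputs.
    """
    idx = np.argsort(logits, axis=-1)[:, -2:]             # top-2 expert ids
    top = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                 # renormalize
    return idx, w

logits = np.array([[0.1, 2.0, -1.0, 1.5, 0.0, 0.3, 0.2, 0.4]])
idx, w = top2_route(logits)
# This token routes to experts 1 and 3; the other 6 expert FFNs stay idle.
```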

DeepSeek V2 (MLA)

Multi-Head Latent Attention compresses KV representations into a low-rank latent space, reducing KV cache size by approximately 57x relative to MHA.

from lmxlab.models.deepseek import deepseek_config

config = deepseek_config()
# attention="mla", kv_lora_rank=512, rope_dim=64
# q_lora_rank=1536

MLA operates as follows:

  1. Down-project KV from d_model to kv_lora_rank (plus rope_dim for a shared RoPE key)
  2. Cache only the compressed latent (not full K, V)
  3. Up-project the latent to multi-head K and V at attention time
  4. Decoupled RoPE: position information is kept in a separate single-head key

Cache per token: kv_lora_rank + rope_dim = 576 vs 2 × n_heads × head_dim = 32,768 for MHA.
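The arithmetic above can be reproduced directly. The head counts below are inferred from the 32,768 figure (any n_heads × head_dim product of 16,384 would give the same total) and are an assumption of this sketch:

```python
# Per-token, per-layer KV cache entries (floats), following the
# figures quoted above.
kv_lora_rank, rope_dim = 512, 64
n_heads, head_dim = 128, 128        # assumed: 2 * 128 * 128 = 32,768

mla_cache = kv_lora_rank + rope_dim  # compressed latent + shared RoPE key
mha_cache = 2 * n_heads * head_dim   # full multi-head K and V
ratio = mha_cache / mla_cache        # roughly the quoted ~57x reduction
```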

Gemma 3 (Interleaved Attention)

Mixes sliding window (local) and global attention layers. Most layers use a fixed window; every Nth layer attends to the full sequence.

from lmxlab.models.gemma3 import gemma3_config

config = gemma3_config()
# Every 6th layer: global GQA (5:1 local:global ratio)
# Other layers: sliding_window_gqa with window_size=4096
# (Real Gemma 3 uses window_size=1024; adjust to match)

Local attention is O(n * w) instead of O(n^2), making long sequences tractable while periodic global layers maintain long-range dependencies. This architecture uses per-layer block_configs, exercising the ConfigurableBlock system's per-layer configuration.
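The 5:1 interleave can be sketched as a per-layer pattern list. Whether the global layers land on indices 5, 11, ... (as assumed here) or 0, 6, ... is a convention of this sketch, not something the config above pins down:

```python
def gemma3_layer_pattern(n_layers, global_every=6):
    """Per-layer attention kinds for a 5:1 local:global interleave.

    Every `global_every`-th layer attends globally; the rest use
    sliding-window attention. The string values are illustrative,
    not lmxlab's actual config keys.
    """
    return ["gqa" if (i + 1) % global_every == 0 else "sliding_window_gqa"
            for i in range(n_layers)]

pattern = gemma3_layer_pattern(12)
# 12 layers -> 10 sliding-window layers and 2 global layers.
```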

Qwen 3.5 (Hybrid DeltaNet + GQA)

Qwen 3.5 interleaves Gated DeltaNet (linear attention with delta rule) and standard GQA layers in a 3:1 ratio.

from lmxlab.models.qwen35 import qwen35_config

config = qwen35_config()
# 75% gated_deltanet layers + 25% gqa layers
# DeltaNet: causal conv, no RoPE, fixed-size state
# GQA: standard with RoPE, growing KV cache

Gated DeltaNet operates as follows:

  1. Delta rule: the state matrix S predicts v from k, then corrects itself based on prediction error: S = alpha * S - beta * (S@k - v)@k^T
  2. Decay gate (alpha): learned selective forgetting that controls how much old context is discarded
  3. Update gate (beta): learned correction strength that controls how much new information is incorporated
  4. Fixed-size state: O(d^2) per token regardless of sequence length (compared to O(n) for a KV cache)
  5. Short causal convolutions replace RoPE for local context in DeltaNet layers

Pure linear attention loses expressiveness by compressing all history into a fixed-size state. The hybrid 3:1 pattern preserves efficient long-context processing via DeltaNet while periodic GQA layers provide full attention when needed.
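The delta-rule recurrence in step 1 can be sketched in numpy (the causal conv and the learned gate projections are omitted; here alpha and beta are plain scalars):

```python
import numpy as np

def deltanet_step(S, k, v, q, alpha, beta):
    """One step of S = alpha * S - beta * (S @ k - v) @ k^T.

    S: (d_v, d_k) fixed-size state; k, q: (d_k,); v: (d_v,).
    alpha decays old context, beta scales the error correction.
    """
    err = S @ k - v                      # how badly S predicts v from k
    S = alpha * S - beta * np.outer(err, k)
    return S, S @ q                      # new state, output for this token

d_k, d_v = 4, 4
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.ones(d_v)
S, out = deltanet_step(S, k, v, q=k, alpha=1.0, beta=1.0)
# With alpha=beta=1 and a unit-norm key, one update makes S map k -> v,
# so out == v.
```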

Qwen-Next (Gated Attention)

Qwen-Next uses GatedGQA, a variant of GQA with a learned sigmoid gate on the attention output. The gate modulates how much attention information passes through, improving gradient flow and representational capacity.

from lmxlab.models.qwen_next import qwen_next_config

config = qwen_next_config()
# attention="gated_gqa": y = attn_out * sigmoid(W_gate @ x)

The output gate (G1 elementwise variant from arXiv:2505.06708) adds minimal parameters but measurably improves training dynamics by providing a learned bypass around the attention mechanism.
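The gate itself is a one-liner. The sketch below assumes a (d_model, d_model) gate weight applied as x @ W_gate (the doc's W_gate @ x notation, transposed to batch-first convention); it is an illustration, not lmxlab's GatedGQA:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_output(attn_out, x, W_gate):
    """G1 elementwise output gate: y = attn_out * sigmoid(x @ W_gate).

    attn_out, x: (seq, d_model); W_gate: (d_model, d_model).
    The gate lets the model learn, per-channel, how much attention
    output to pass through.
    """
    return attn_out * sigmoid(x @ W_gate)

seq, d = 3, 8
attn_out = np.ones((seq, d))
# A zero gate input gives sigmoid(0) = 0.5 everywhere, halving the output.
y = gated_output(attn_out, np.zeros((seq, d)), np.random.randn(d, d))
```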

DeepSeek V3 (MLA + MoE)

Extends DeepSeek V2's MLA attention with SharedExpertMoE FFN layers, combining KV compression with sparse expert routing.

from lmxlab.models.deepseek import deepseek_v3_config

config = deepseek_v3_config()
# MLA attention + SharedExpertMoE (256 routed experts + 1 shared)
# Sigmoid routing with bias correction

Nemotron (Hybrid Mamba-Transformer MoE)

NVIDIA's hybrid architecture mixing three layer types encoded in a pattern string: M (Mamba-2 SSD), E (LatentMoE), and * (standard attention + dense FFN with squared ReLU).

from lmxlab.models.nemotron import nemotron3_config

config = nemotron3_config()
# hybrid_override_pattern: "M*M*M*M*MEMEMEMEMEMEMEMEMEME..."
# Mamba layers for sequence mixing, MoE for capacity

Each layer type serves a distinct role: Mamba-2 for efficient sequence mixing, attention for precise retrieval, and LatentMoE for routing tokens to specialized experts with reduced dimensionality.
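Expanding the pattern string into per-layer kinds takes only a dictionary lookup; the kind names below are illustrative, not lmxlab's actual config values:

```python
def parse_hybrid_pattern(pattern):
    """Map Nemotron pattern characters to layer kinds.

    'M' -> Mamba-2 SSD, 'E' -> LatentMoE, '*' -> attention + dense FFN.
    """
    kinds = {"M": "mamba2", "E": "latent_moe", "*": "attention_dense"}
    return [kinds[c] for c in pattern]

layers = parse_hybrid_pattern("M*M*ME")
```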

Llama 4 Scout (iRoPE + Chunked Attention)

Uses the iRoPE pattern: most layers use chunked local attention with RoPE positions resetting at chunk boundaries, while every 4th layer uses global GQA without positional encoding (NoPE) for cross-chunk information flow. All layers use SharedExpertMoE FFN.

from lmxlab.models.llama4 import llama4_scout_config

config = llama4_scout_config()
# 75% chunked_gqa (8192 chunk size, local RoPE)
# 25% gqa (no position encoding — NoPE layers)
# All layers: SharedExpertMoE (16 experts + 1 shared)
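"RoPE positions resetting at chunk boundaries" reduces to a modulo over the token index, sketched here (an illustration of the position bookkeeping, not lmxlab's ChunkedGQA):

```python
def chunk_local_positions(n_tokens, chunk_size):
    """RoPE position indices that reset at each chunk boundary.

    Chunked-attention layers see positions 0..chunk_size-1 repeatedly,
    so a token's rotation depends only on its offset within its chunk;
    cross-chunk ordering is left to the NoPE global layers.
    """
    return [t % chunk_size for t in range(n_tokens)]

pos = chunk_local_positions(10, 4)
# pos == [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```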

Mistral Small (Sliding Window)

All layers use sliding-window GQA with a fixed window size. Unlike Gemma 3's mixed approach, Mistral Small applies local attention uniformly with no global layers.

from lmxlab.models.mistral import mistral_small_config

config = mistral_small_config()
# All layers: sliding_window_gqa with window_size=4096

OLMo 2 (QK-Norm)

AllenAI's architecture adds QK-norm (per-head RMSNorm on Q and K after projection, before RoPE) to the standard LLaMA-like template. This stabilizes training at scale.

from lmxlab.models.olmo import olmo2_config

config = olmo2_config()
# Standard GQA + QK-norm (per-head RMSNorm on Q, K)
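QK-norm itself is an RMSNorm over the last axis of the per-head Q and K tensors, sketched below in numpy (not OLMo 2's or lmxlab's code):

```python
import numpy as np

def qk_norm(x, weight, eps=1e-6):
    """Per-head RMSNorm applied to Q or K after projection, before RoPE.

    x: (..., n_heads, head_dim); weight: (head_dim,) learned scale.
    Normalizing over head_dim bounds the scale of attention logits,
    which is what stabilizes training.
    """
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

q = np.random.randn(2, 4, 16)           # (seq, n_heads, head_dim)
q_normed = qk_norm(q, np.ones(16))      # unit RMS per head
```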

GPT-OSS (QK-Norm + Tied Embeddings)

OpenAI's open-source GPT uses a LLaMA-like architecture with QK-norm and tied input/output embeddings.

from lmxlab.models.gpt_oss import gpt_oss_config

config = gpt_oss_config()
# GQA + QK-norm, tied embeddings, no bias

Grok (SharedExpertMoE)

xAI's architecture uses GQA attention with SharedExpertMoE FFN in every layer. All layers are homogeneous (no hybrid pattern).

from lmxlab.models.grok import grok_config

config = grok_config()
# GQA + SharedExpertMoE (8 routed + 1 shared)

Kimi K2.5 (DeltaNet + MoE)

Moonshot AI's hybrid: interleaves GQA and Gated DeltaNet attention layers (every 4th layer is DeltaNet), all with SharedExpertMoE FFN (128 experts).

from lmxlab.models.kimi import kimi_config

config = kimi_config()
# 75% gqa + 25% gated_deltanet layers
# All layers: SharedExpertMoE (128 routed + 1 shared)

SmolLM3 (iRoPE)

HuggingFace's efficient model uses the iRoPE pattern: most layers use GQA with RoPE, while every 4th layer uses GQA without positional encoding (NoPE) for long-range information flow.

from lmxlab.models.smollm import smollm3_config

config = smollm3_config()
# 75% gqa with RoPE + 25% gqa without position encoding
# Tied embeddings

Qwen 3 MoE

Alibaba's MoE variant of Qwen 3 with GQA attention and SharedExpertMoE FFN (64 routed experts, top-8 routing, plus 1 shared expert).

from lmxlab.models.qwen import qwen3_moe_config

config = qwen3_moe_config()
# GQA + SharedExpertMoE (64 experts, top_k=8, 1 shared)

Llama 4 Maverick (iRoPE + 128 Experts)

The larger Llama 4 variant with the same iRoPE pattern as Scout but with 128 routed experts and top-1 routing.

from lmxlab.models.llama4 import llama4_maverick_config

config = llama4_maverick_config()
# Same iRoPE pattern as Scout
# SharedExpertMoE (128 routed + 1 shared, top_k=1)

Falcon H1 (Hybrid Mamba-2 + GQA)

TII's hybrid architecture: most layers use Mamba-2 SSM, with periodic GQA attention layers for global context. All layers use GatedFFN (SwiGLU).

from lmxlab.models.falcon import falcon_h1_config

config = falcon_h1_config()
# hybrid_pattern: "MMMM*MMM*MMM*MMM*"
# Mamba-2 layers (M) + GQA attention layers (*)

Jamba (Mamba-2 + GQA + MoE)

AI21's hybrid using an MMMA pattern (3 Mamba-2 + 1 GQA per cycle) with MoE FFN on alternating attention layers.

from lmxlab.models.jamba import jamba_config

config = jamba_config()
# MMMA pattern: 3 Mamba-2 + 1 GQA per cycle
# MoE (16 experts, top-2) on even attention layers

Bamba (Hybrid Mamba-2 + GQA)

IBM's hybrid variant similar to Falcon H1. Mamba-2 layers handle sequence mixing with periodic GQA layers for global attention.

from lmxlab.models.bamba import bamba_config

config = bamba_config()
# hybrid_pattern: "MMMM*MMM*MMM*MMM*"
# Similar to Falcon H1 with different hyperparameters

GLM-4.5 (MLA NoPE)

Zhipu AI's architecture using MLA attention with rope_dim=0 (no positional encoding). Relies entirely on learned attention patterns for position information.

from lmxlab.models.glm import glm45_config

config = glm45_config()
# MLA attention with rope_dim=0 (NoPE)
# KV compression via kv_lora_rank=512

Setting rope_dim=0 in MLA removes the decoupled RoPE key entirely; position-dependent patterns are learned implicitly through the latent representations.

Loading Pretrained Weights

lmxlab can load pretrained weights from HuggingFace Hub for LLaMA, Gemma, Qwen2, and Mistral models.

Quick load (requires huggingface_hub)

from lmxlab.models.convert import load_from_hf

model, config = load_from_hf('meta-llama/Llama-3.2-1B')

Manual conversion

Given local weights and config files:

import json
from lmxlab.models.convert import config_from_hf, convert_weights
from lmxlab.models.base import LanguageModel
import mlx.core as mx

# Load HF config.json
with open('config.json') as f:
    hf_config = json.load(f)
model_config = config_from_hf(hf_config)

# Load and convert weights
hf_weights = mx.load('model.safetensors')
lmt_weights = convert_weights(hf_weights, 'llama')

# Build model and load
model = LanguageModel(model_config)
model.load_weights(list(lmt_weights.items()))

Weight name mapping

The conversion handles the naming differences between HF and lmxlab:

| HuggingFace | lmxlab |
|---|---|
| model.embed_tokens.weight | embed.weight |
| model.layers.{i}.self_attn.q_proj.weight | blocks.{i}.attention.q_proj.weight |
| model.layers.{i}.mlp.gate_proj.weight | blocks.{i}.ffn.gate.weight |
| model.layers.{i}.mlp.up_proj.weight | blocks.{i}.ffn.up.weight |
| model.layers.{i}.mlp.down_proj.weight | blocks.{i}.ffn.down.weight |
| model.layers.{i}.input_layernorm.weight | blocks.{i}.attn_norm.weight |
| model.layers.{i}.post_attention_layernorm.weight | blocks.{i}.ffn_norm.weight |
| model.norm.weight | final_norm.weight |
| lm_head.weight | head.weight |
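The mapping above can be expressed as a handful of prefix-rewrite rules; the regex renamer below is a sketch of the idea, not the actual convert_weights code:

```python
import re

def hf_name_to_lmxlab(name):
    """Translate an HF parameter name per the mapping table above."""
    rules = [
        (r"^model\.embed_tokens\.", "embed."),
        (r"^model\.layers\.(\d+)\.self_attn\.", r"blocks.\1.attention."),
        (r"^model\.layers\.(\d+)\.mlp\.gate_proj\.", r"blocks.\1.ffn.gate."),
        (r"^model\.layers\.(\d+)\.mlp\.up_proj\.", r"blocks.\1.ffn.up."),
        (r"^model\.layers\.(\d+)\.mlp\.down_proj\.", r"blocks.\1.ffn.down."),
        (r"^model\.layers\.(\d+)\.input_layernorm\.", r"blocks.\1.attn_norm."),
        (r"^model\.layers\.(\d+)\.post_attention_layernorm\.", r"blocks.\1.ffn_norm."),
        (r"^model\.norm\.", "final_norm."),
        (r"^lm_head\.", "head."),
    ]
    for pat, repl in rules:
        name, n = re.subn(pat, repl, name)
        if n:                       # first matching rule wins
            return name
    return name                     # unmapped names pass through

renamed = hf_name_to_lmxlab("model.layers.3.self_attn.q_proj.weight")
# renamed == "blocks.3.attention.q_proj.weight"
```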

Quantization

lmxlab supports post-training quantization via MLX's native affine quantization. This replaces nn.Linear with nn.QuantizedLinear, reducing memory by ~4-8x.

from lmxlab.core.quantize import quantize_model, dequantize_model

# Quantize to 4-bit (default)
quantize_model(model, bits=4, group_size=64)

# Or load from HF already quantized
model, config = load_from_hf('meta-llama/Llama-3.2-1B', quantize=4)

# Dequantize back to float for fine-tuning
dequantize_model(model)

| Bits | Memory reduction | Quality | Use case |
|---|---|---|---|
| 8 | ~4x | Near-lossless | Fine-tuning, high-quality inference |
| 4 | ~8x | Good | Inference, fitting large models in memory |
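The per-group affine scheme can be sketched in numpy (an illustration of the math, not MLX's kernel; group_size matches the quantize_model default above):

```python
import numpy as np

def affine_quantize(w, bits=4, group_size=64):
    """Per-group affine quantization: q = round((w - min) / scale).

    Each group of `group_size` weights shares one scale and offset,
    so a 4-bit weight costs 4 bits plus a small per-group overhead.
    """
    w = w.reshape(-1, group_size)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    scale = np.where(scale == 0, 1.0, scale)   # guard constant groups
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    return q * scale + lo

w = np.random.randn(128).astype(np.float32)
q, scale, lo = affine_quantize(w)
w_hat = affine_dequantize(q, scale, lo).reshape(-1)
# Reconstruction error is at most half a quantization step per group.
```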

LoRA (Low-Rank Adaptation)

Fine-tune pretrained models efficiently by training only small low-rank matrices instead of all weights. Reduces trainable parameters by 10-100x.

from lmxlab.core.lora import apply_lora, merge_lora

# Apply LoRA to attention layers (rank=8)
apply_lora(model, rank=8, targets=['attention'])

# Train — only LoRA params are trainable (~0.1% of total)
trainer = Trainer(model, train_config)
trainer.train(data)

# Merge LoRA back into base weights for inference
merge_lora(model)

Each targeted nn.Linear is replaced with LoRALinear, which computes y = xW^T + scaling * x @ A @ B^T. Matrix B is zero-initialized so that the model initially produces the same output as the base model. Only A and B are trainable; W is frozen.
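That formula can be checked with a numpy sketch of the layer (an illustration, not lmxlab's LoRALinear). Because B starts at zero, the LoRA path contributes nothing until training updates it:

```python
import numpy as np

class LoRALinearSketch:
    """Sketch of y = x @ W.T + scaling * x @ A @ B.T.

    W is the frozen base weight; only A and B would be trained.
    """
    def __init__(self, W, rank=8, alpha=16.0):
        d_out, d_in = W.shape
        self.W = W                                    # frozen
        self.A = np.random.randn(d_in, rank) * 0.01   # small random init
        self.B = np.zeros((d_out, rank))              # zero init
        self.scaling = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scaling * (x @ self.A) @ self.B.T

W = np.random.randn(16, 32)
layer = LoRALinearSketch(W)
x = np.random.randn(4, 32)
y = layer(x)   # identical to the base layer while B == 0
```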

Supported targets:

  - 'attention': q/k/v/o projections
  - 'ffn': gate/up/down projections

QLoRA (Quantized LoRA)

QLoRA combines quantization and LoRA for memory efficiency: base weights remain in 4-bit quantized form while LoRA adapters train in full precision. This permits fine-tuning of models that would otherwise exceed available memory.

from lmxlab.core.quantize import quantize_model
from lmxlab.core.qlora import apply_qlora

# Quantize base model to 4-bit
quantize_model(model, bits=4)

# Apply LoRA on top of quantized layers
apply_qlora(model, rank=8, targets=['attention'])

# Train — only LoRA params are trainable
trainer = Trainer(model, train_config)
trainer.train(data)

Each targeted nn.QuantizedLinear is replaced with LoRAQuantizedLinear, which uses mx.quantized_matmul for the frozen base computation and adds a full-precision low-rank update: y = quantized_matmul(x, W_q) + scaling * x @ A @ B^T.

QLoRA vs LoRA comparison:

| Approach | Base weights | Memory | Use case |
|---|---|---|---|
| LoRA | Float16 | Full model + LoRA | Plenty of memory |
| QLoRA | 4-bit quantized | ~25% of full + LoRA | Large models, tight memory |

Creating a Tiny Model

Every architecture has a _tiny() factory for testing:

from lmxlab.models.gpt import gpt_tiny
from lmxlab.models.deepseek import deepseek_tiny

config = gpt_tiny()       # d_model=64, 2 layers, 4 heads
config = deepseek_tiny()  # d_model=64, 2 layers, kv_lora_rank=16

These use small dimensions (d_model ≤ 128, n_layers ≤ 4, vocab ≤ 1024) to enable fast unit testing and quick experiments.