Models
Language model base class and architecture config factories.
LanguageModel
lmxlab.models.base.LanguageModel
Bases: Module
Transformer language model assembled from config.
Uses ConfigurableBlock for each layer. Supports tied input/output embeddings and KV caching for generation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `ModelConfig` | Full model configuration. | *required* |
Source code in src/lmxlab/models/base.py
_sinusoidal = block_cfg.position == 'sinusoidal'
instance-attribute
blocks = [(ConfigurableBlock(config.get_block_config(i))) for i in (range(config.n_layers))]
instance-attribute
config = config
instance-attribute
embed = nn.Embedding(config.vocab_size, block_cfg.d_model)
instance-attribute
embed_dropout = nn.Dropout(p=(block_cfg.dropout))
instance-attribute
final_norm = final_norm_cls(block_cfg)
instance-attribute
head = nn.Linear(block_cfg.d_model, config.vocab_size, bias=False)
instance-attribute
__call__(x, cache=None, return_hidden=False)
Forward pass.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `array` | Token IDs of shape `(batch, seq_len)`. | *required* |
| `cache` | `list \| None` | Optional list of caches per layer. Cache types may be heterogeneous in hybrid models (KV tuples for attention, SSM state tuples for Mamba, None for identity layers). | `None` |
| `return_hidden` | `bool` | If True, also return hidden states from `final_norm` (before lm_head projection). Used by Multi-Token Prediction. | `False` |

Returns:

| Type | Description |
|---|---|
| `tuple[array, list] \| tuple[array, list, array]` | Tuple of `(logits, updated_caches)` by default. If `return_hidden` is True, returns `(logits, updated_caches, hidden_states)`. |
Source code in src/lmxlab/models/base.py
__init__(config)
Source code in src/lmxlab/models/base.py
_apply_mup_init(width_mult)
Rescale hidden layer weights for μP.
Scales hidden layer weight init by 1/√width_mult. Embedding weights are left unchanged (μP prescribes constant embedding init across widths).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `width_mult` | `float` | `d_model / base_d_model` ratio. | *required* |
Source code in src/lmxlab/models/base.py
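The μP rescaling rule above can be sketched as plain arithmetic: hidden-layer init std shrinks by 1/√width_mult while embedding init stays constant. The function name and base std below are illustrative, not part of the lmxlab API.

```python
import math

def mup_init_std(base_std: float, width_mult: float, is_embedding: bool) -> float:
    """Illustrative sketch of the muP init rule: hidden weights are
    scaled by 1/sqrt(width_mult); embeddings keep a width-independent std."""
    if is_embedding:
        return base_std  # muP prescribes constant embedding init across widths
    return base_std / math.sqrt(width_mult)

# Quadrupling the width (width_mult=4) halves hidden init std,
# leaving embedding init untouched.
print(mup_init_std(0.02, 4.0, is_embedding=False))  # 0.01
print(mup_init_std(0.02, 4.0, is_embedding=True))   # 0.02
```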
Generation
lmxlab.models.generate
Autoregressive text generation with sampling strategies.
generate(model, prompt, max_tokens=100, temperature=1.0, top_k=0, top_p=1.0, repetition_penalty=1.0, stop_tokens=None)
Generate tokens autoregressively with KV caching.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `LanguageModel` | Language model to generate from. | *required* |
| `prompt` | `array` | Input token IDs of shape `(batch, prompt_len)`. | *required* |
| `max_tokens` | `int` | Maximum number of new tokens to generate. | `100` |
| `temperature` | `float` | Sampling temperature (0 = greedy). | `1.0` |
| `top_k` | `int` | If > 0, only sample from the top-k tokens. | `0` |
| `top_p` | `float` | If < 1.0, use nucleus sampling. | `1.0` |
| `repetition_penalty` | `float` | Penalty for repeating tokens (> 1.0 discourages repetition, 1.0 = no effect). | `1.0` |
| `stop_tokens` | `list[int] \| None` | List of token IDs that stop generation. When any batch element generates a stop token, generation stops for all. | `None` |

Returns:

| Type | Description |
|---|---|
| `array` | Generated token IDs of shape `(batch, prompt_len + generated_len)`. |
Source code in src/lmxlab/models/generate.py
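The sampling knobs above compose as successive transforms on the logits. The sketch below illustrates one plausible ordering (repetition penalty, then temperature, then top-k, then top-p); the actual order inside `generate` may differ, and the function name is hypothetical.

```python
import math

def filter_logits(logits, generated, temperature=1.0, top_k=0,
                  top_p=1.0, repetition_penalty=1.0):
    """Sketch of the logit transforms behind sampling (assumed order)."""
    logits = list(logits)
    # Repetition penalty: push already-generated tokens toward lower probability.
    for t in set(generated):
        logits[t] = (logits[t] / repetition_penalty if logits[t] > 0
                     else logits[t] * repetition_penalty)
    # Temperature scaling (temperature == 0 would be special-cased as greedy).
    logits = [l / temperature for l in logits]
    # Top-k: keep only the k largest logits.
    if top_k > 0:
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= kth else float('-inf') for l in logits]
    # Top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability reaches top_p.
    if top_p < 1.0:
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        order = sorted(range(len(logits)), key=lambda i: -logits[i])
        keep, cum = set(), 0.0
        for i in order:
            keep.add(i)
            cum += exps[i] / z
            if cum >= top_p:
                break
        logits = [l if i in keep else float('-inf')
                  for i, l in enumerate(logits)]
    return logits

out = filter_logits([2.0, 1.0, 0.5, -1.0], generated=[], top_k=2)
# only the two largest logits survive: [2.0, 1.0, -inf, -inf]
```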
stream_generate(model, prompt, max_tokens=100, temperature=1.0, top_k=0, top_p=1.0, repetition_penalty=1.0, stop_tokens=None)
Generate tokens one at a time, yielding each as produced.
This is the standard interface for interactive/streaming applications. Each token is yielded immediately after generation, enabling real-time display.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `LanguageModel` | Language model to generate from. | *required* |
| `prompt` | `array` | Input token IDs of shape `(1, prompt_len)`. | *required* |
| `max_tokens` | `int` | Maximum number of new tokens. | `100` |
| `temperature` | `float` | Sampling temperature (0 = greedy). | `1.0` |
| `top_k` | `int` | If > 0, only sample from top-k. | `0` |
| `top_p` | `float` | If < 1.0, use nucleus sampling. | `1.0` |
| `repetition_penalty` | `float` | Penalty for repeating tokens. | `1.0` |
| `stop_tokens` | `list[int] \| None` | Token IDs that stop generation. | `None` |

Yields:

| Type | Description |
|---|---|
| `int` | Generated token IDs, one at a time. |
Source code in src/lmxlab/models/generate.py
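The streaming loop reduces to a generator that yields until a stop token or the token budget is hit. The sketch below uses a stand-in token source instead of a model, and assumes (not confirmed by the source) that the stop token itself is consumed rather than yielded.

```python
def stream_until_stop(token_source, max_tokens, stop_tokens=None):
    """Sketch of a streaming generation loop. `token_source` stands in
    for the model's per-step sampler (an iterator of token IDs)."""
    stop = set(stop_tokens or [])
    for _, tok in zip(range(max_tokens), token_source):
        if tok in stop:
            return          # assumption: stop token is consumed, not yielded
        yield tok

toks = iter([5, 7, 2, 9, 4])
print(list(stream_until_stop(toks, max_tokens=10, stop_tokens=[2])))  # [5, 7]
```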
Config Factories
Each factory returns a ModelConfig that builds the corresponding
architecture when passed to LanguageModel.
GPT
lmxlab.models.gpt.gpt_config(vocab_size=50257, d_model=768, n_heads=12, n_layers=12, d_ff=3072, max_seq_len=1024, tie_embeddings=True, dropout=0.0, mup_base_width=None)
Create a GPT-style model configuration.
GPT uses: LayerNorm, standard MHA, standard FFN (GELU), sinusoidal positional encoding, pre-norm, bias everywhere.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size (default: GPT-2 BPE vocab). | `50257` |
| `d_model` | `int` | Hidden dimension. | `768` |
| `n_heads` | `int` | Number of attention heads. | `12` |
| `n_layers` | `int` | Number of transformer layers. | `12` |
| `d_ff` | `int` | Feed-forward intermediate dimension. | `3072` |
| `max_seq_len` | `int` | Maximum sequence length. | `1024` |
| `tie_embeddings` | `bool` | Whether to tie input/output embeddings. | `True` |
| `dropout` | `float` | Dropout rate. | `0.0` |
| `mup_base_width` | `int \| None` | Base width for μP. When set, enables μP attention scaling and logit scaling. | `None` |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig for a GPT-style model. |
Source code in src/lmxlab/models/gpt.py
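The default arguments reproduce the familiar GPT-2 small shape; a rough parameter count follows directly from them. The sketch below ignores the small LayerNorm and bias terms, and the function name is illustrative.

```python
def gpt_param_count(vocab_size=50257, d_model=768, n_layers=12,
                    d_ff=3072, tie_embeddings=True):
    """Back-of-envelope parameter count for the GPT-style defaults
    (ignores LayerNorm scales and biases)."""
    embed = vocab_size * d_model
    attn = 4 * d_model * d_model       # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff           # up and down projections
    head = 0 if tie_embeddings else vocab_size * d_model
    return embed + n_layers * (attn + ffn) + head

print(f"{gpt_param_count() / 1e6:.0f}M parameters")  # 124M, as for GPT-2 small
```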
lmxlab.models.gpt.gpt_tiny()
Tiny GPT for testing (d=64, 2 layers, 2 heads).
lmxlab.models.gpt.gpt_small()
lmxlab.models.gpt.gpt_medium()
LLaMA
lmxlab.models.llama.llama_config(vocab_size=32000, d_model=4096, n_heads=32, n_kv_heads=8, n_layers=32, d_ff=11008, max_seq_len=4096, rope_theta=10000.0, tie_embeddings=False, dropout=0.0, mup_base_width=None)
Create a LLaMA-style model configuration.
LLaMA uses: RMSNorm, GQA, GatedFFN (SwiGLU), RoPE, pre-norm, no bias.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size. | `32000` |
| `d_model` | `int` | Hidden dimension. | `4096` |
| `n_heads` | `int` | Number of query heads. | `32` |
| `n_kv_heads` | `int` | Number of KV heads (for GQA). | `8` |
| `n_layers` | `int` | Number of transformer layers. | `32` |
| `d_ff` | `int` | Feed-forward intermediate dimension. | `11008` |
| `max_seq_len` | `int` | Maximum sequence length. | `4096` |
| `rope_theta` | `float` | RoPE base frequency. | `10000.0` |
| `tie_embeddings` | `bool` | Whether to tie input/output embeddings. | `False` |
| `dropout` | `float` | Dropout rate. | `0.0` |
| `mup_base_width` | `int \| None` | Base width for μP. When set, enables μP attention scaling and logit scaling. | `None` |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig for a LLaMA-style model. |
Source code in src/lmxlab/models/llama.py
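The point of GQA's `n_kv_heads < n_heads` is KV-cache size, which scales with the number of KV heads, not query heads. A quick sketch of the saving at the defaults (assuming `head_dim = d_model / n_heads` and fp16 cache entries):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elt=2):
    """KV cache per sequence: K and V (the factor 2), per layer,
    per KV head, per head dim, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

head_dim = 4096 // 32  # 128, assuming head_dim = d_model / n_heads
mha = kv_cache_bytes(4096, 32, n_kv_heads=32, head_dim=head_dim)
gqa = kv_cache_bytes(4096, 32, n_kv_heads=8, head_dim=head_dim)
print(mha // gqa)  # 4: 8 of 32 KV heads shrinks the cache 4x
```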
lmxlab.models.llama.llama_tiny()
Tiny LLaMA for testing (d=64, 2 layers, 4 heads, 2 kv).
lmxlab.models.llama.llama_7b()
lmxlab.models.llama.llama_13b()
Gemma
lmxlab.models.gemma.gemma_config(vocab_size=256000, d_model=2048, n_heads=8, n_kv_heads=1, n_layers=18, d_ff=16384, max_seq_len=8192, rope_theta=10000.0, tie_embeddings=True)
Create a Gemma-style model configuration.
Gemma uses: RMSNorm, GQA (multi-query), GatedFFN (GeGLU), RoPE, pre-norm, no bias, tied embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size. | `256000` |
| `d_model` | `int` | Hidden dimension. | `2048` |
| `n_heads` | `int` | Number of query heads. | `8` |
| `n_kv_heads` | `int` | Number of KV heads. | `1` |
| `n_layers` | `int` | Number of transformer layers. | `18` |
| `d_ff` | `int` | Feed-forward intermediate dimension. | `16384` |
| `max_seq_len` | `int` | Maximum sequence length. | `8192` |
| `rope_theta` | `float` | RoPE base frequency. | `10000.0` |
| `tie_embeddings` | `bool` | Whether to tie embeddings. | `True` |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig for a Gemma-style model. |
Source code in src/lmxlab/models/gemma.py
lmxlab.models.gemma.gemma_tiny()
Gemma 3
lmxlab.models.gemma3.gemma3_config(vocab_size=256000, d_model=2048, n_heads=8, n_kv_heads=4, n_layers=26, d_ff=16384, max_seq_len=8192, rope_theta=10000.0, window_size=4096, global_every=6, tie_embeddings=True)
Create a Gemma 3-style model configuration.
Gemma 3 interleaves local (sliding window) and global attention layers. Every global_every-th layer (0-indexed layers 5, 11, 17, ... with the default global_every=6) uses full global GQA; all other layers use sliding-window GQA with the given window size.
Uses: RMSNorm, GatedFFN (GeGLU), RoPE, pre-norm, no bias, tied embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size. | `256000` |
| `d_model` | `int` | Hidden dimension. | `2048` |
| `n_heads` | `int` | Number of query heads. | `8` |
| `n_kv_heads` | `int` | Number of KV heads. | `4` |
| `n_layers` | `int` | Number of transformer layers. | `26` |
| `d_ff` | `int` | Feed-forward intermediate dimension. | `16384` |
| `max_seq_len` | `int` | Maximum sequence length. | `8192` |
| `rope_theta` | `float` | RoPE base frequency. | `10000.0` |
| `window_size` | `int` | Sliding window size for local layers. | `4096` |
| `global_every` | `int` | Place a global attention layer every N layers (0-indexed layers global_every-1, 2*global_every-1, ...). | `6` |
| `tie_embeddings` | `bool` | Whether to tie embeddings. | `True` |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig for a Gemma 3-style model. |
Source code in src/lmxlab/models/gemma3.py
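The local/global interleave described above can be spelled out as a one-liner over layer indices; the function name below is illustrative, not part of the API.

```python
def gemma3_layer_kinds(n_layers=26, global_every=6):
    """Sketch of the Gemma 3 interleave: every global_every-th layer
    (0-indexed) is full global attention, the rest sliding-window."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(n_layers)]

kinds = gemma3_layer_kinds()
print([i for i, k in enumerate(kinds) if k == "global"])  # [5, 11, 17, 23]
```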
lmxlab.models.gemma3.gemma3_tiny()
Tiny Gemma 3 for testing (4 layers, global every 4th).
Source code in src/lmxlab/models/gemma3.py
Qwen
lmxlab.models.qwen.qwen_config(vocab_size=151936, d_model=4096, n_heads=32, n_kv_heads=32, n_layers=32, d_ff=11008, max_seq_len=32768, rope_theta=1000000.0, tie_embeddings=False)
Create a Qwen-style model configuration.
Qwen uses: RMSNorm, GQA, GatedFFN (SwiGLU), RoPE (high theta for long context), pre-norm, bias in QKV.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size. | `151936` |
| `d_model` | `int` | Hidden dimension. | `4096` |
| `n_heads` | `int` | Number of query heads. | `32` |
| `n_kv_heads` | `int` | Number of KV heads. | `32` |
| `n_layers` | `int` | Number of transformer layers. | `32` |
| `d_ff` | `int` | Feed-forward intermediate dimension. | `11008` |
| `max_seq_len` | `int` | Maximum sequence length. | `32768` |
| `rope_theta` | `float` | RoPE base frequency. | `1000000.0` |
| `tie_embeddings` | `bool` | Whether to tie embeddings. | `False` |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig for a Qwen-style model. |
Source code in src/lmxlab/models/qwen.py
lmxlab.models.qwen.qwen_tiny()
Mixtral
lmxlab.models.mixtral.mixtral_config(vocab_size=32000, d_model=4096, n_heads=32, n_kv_heads=8, n_layers=32, d_ff=14336, n_experts=8, top_k_experts=2, max_seq_len=32768, rope_theta=1000000.0, tie_embeddings=False)
Create a Mixtral-style model configuration.
Mixtral uses GQA attention with MoE FFN: each token is routed to top-k of n_experts GatedFFN (SwiGLU) experts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size. | `32000` |
| `d_model` | `int` | Hidden dimension. | `4096` |
| `n_heads` | `int` | Number of query heads. | `32` |
| `n_kv_heads` | `int` | Number of KV heads. | `8` |
| `n_layers` | `int` | Number of transformer layers. | `32` |
| `d_ff` | `int` | Per-expert feed-forward dimension. | `14336` |
| `n_experts` | `int` | Number of expert FFNs. | `8` |
| `top_k_experts` | `int` | Experts per token. | `2` |
| `max_seq_len` | `int` | Maximum sequence length. | `32768` |
| `rope_theta` | `float` | RoPE base frequency. | `1000000.0` |
| `tie_embeddings` | `bool` | Whether to tie embeddings. | `False` |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig for a Mixtral-style model. |
Source code in src/lmxlab/models/mixtral.py
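The MoE trade-off is easiest to see as arithmetic: each layer carries all n_experts FFNs as parameters, but each token only computes through top_k of them. The sketch below assumes a gated FFN has three `d_model × d_ff` projections (gate, up, down); the function name is illustrative.

```python
def moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2):
    """Per-layer MoE FFN parameters: total stored vs. active per token.
    Assumes 3 projections per gated (SwiGLU) expert."""
    per_expert = 3 * d_model * d_ff  # gate, up, down projections
    return n_experts * per_expert, top_k * per_expert

total, active = moe_ffn_params()
print(total // active)  # 4: each token touches 2 of 8 experts
```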
lmxlab.models.mixtral.mixtral_tiny()
Tiny Mixtral for testing (with MoE).
Source code in src/lmxlab/models/mixtral.py
Qwen 3.5 (Hybrid DeltaNet)
lmxlab.models.qwen35.qwen35_config(vocab_size=151936, d_model=2048, n_heads=16, n_kv_heads=4, n_layers=28, d_ff=5504, max_seq_len=32768, rope_theta=1000000.0, global_every=4, tie_embeddings=False)
Create a Qwen 3.5-style model configuration.
Qwen 3.5 interleaves Gated DeltaNet (linear attention) and
standard GQA layers. Every global_every-th layer uses
full GQA; all other layers use Gated DeltaNet.
Uses: RMSNorm, GatedFFN (SwiGLU), RoPE (for GQA layers), short causal convolutions (for DeltaNet layers), no bias.
The 3:1 hybrid ratio (75% DeltaNet, 25% GQA) balances efficiency and expressiveness:

- DeltaNet: O(d^2) per token, fixed-size state, no KV cache
- GQA: O(n^2) per token, growing KV cache, global context
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size. | `151936` |
| `d_model` | `int` | Hidden dimension. | `2048` |
| `n_heads` | `int` | Number of attention heads. | `16` |
| `n_kv_heads` | `int` | Number of KV heads (for GQA layers). | `4` |
| `n_layers` | `int` | Number of transformer layers. | `28` |
| `d_ff` | `int` | Feed-forward intermediate dimension. | `5504` |
| `max_seq_len` | `int` | Maximum sequence length. | `32768` |
| `rope_theta` | `float` | RoPE base frequency (for GQA layers). | `1000000.0` |
| `global_every` | `int` | Place a GQA layer every N layers. | `4` |
| `tie_embeddings` | `bool` | Whether to tie embeddings. | `False` |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig for a Qwen 3.5-style model. |
Source code in src/lmxlab/models/qwen35.py
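The cache consequence of the hybrid layout can be sketched numerically: only the GQA layers contribute sequence-length-dependent cache, while DeltaNet layers hold a fixed-size state. The per-layer DeltaNet state size used below (`d_model × head_dim`) and `head_dim = 128` are assumptions for illustration, not values taken from the source.

```python
def hybrid_cache_elements(seq_len, n_layers=28, global_every=4,
                          n_kv_heads=4, head_dim=128, d_model=2048):
    """Sketch: cache elements for the DeltaNet/GQA hybrid.
    GQA layers: KV cache growing with seq_len.
    DeltaNet layers: fixed state (size here is an assumed stand-in)."""
    n_gqa = n_layers // global_every              # 7 of 28 layers
    n_delta = n_layers - n_gqa                    # 21 of 28 layers
    gqa_cache = n_gqa * 2 * n_kv_heads * head_dim * seq_len
    delta_state = n_delta * d_model * head_dim    # seq-length independent
    return gqa_cache + delta_state

# Only the 7 GQA layers make the cache grow with context length.
growth = hybrid_cache_elements(32768) - hybrid_cache_elements(1024)
```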
lmxlab.models.qwen35.qwen35_tiny()
Tiny Qwen 3.5 for testing (4 layers, global every 4th).
Source code in src/lmxlab/models/qwen35.py
DeepSeek
lmxlab.models.deepseek.deepseek_config(vocab_size=102400, d_model=5120, n_heads=128, n_layers=60, d_ff=12288, kv_lora_rank=512, q_lora_rank=1536, rope_dim=64, max_seq_len=4096, rope_theta=10000.0, tie_embeddings=False)
Create a DeepSeek V2-style model configuration.
DeepSeek V2 uses: RMSNorm, MLA, GatedFFN (SwiGLU), decoupled RoPE, pre-norm, no bias.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab_size` | `int` | Vocabulary size. | `102400` |
| `d_model` | `int` | Hidden dimension. | `5120` |
| `n_heads` | `int` | Number of attention heads. | `128` |
| `n_layers` | `int` | Number of transformer layers. | `60` |
| `d_ff` | `int` | Feed-forward intermediate dimension. | `12288` |
| `kv_lora_rank` | `int` | Latent dimension for KV compression. | `512` |
| `q_lora_rank` | `int` | Latent dimension for Q compression. | `1536` |
| `rope_dim` | `int` | Number of head dims for RoPE. | `64` |
| `max_seq_len` | `int` | Maximum sequence length. | `4096` |
| `rope_theta` | `float` | RoPE base frequency. | `10000.0` |
| `tie_embeddings` | `bool` | Whether to tie embeddings. | `False` |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig for a DeepSeek V2-style model. |
Source code in src/lmxlab/models/deepseek.py
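MLA's win is the KV cache: instead of full per-head keys and values, each position caches one compressed latent of size `kv_lora_rank` plus the decoupled RoPE key of size `rope_dim`. The comparison below against an uncompressed MHA baseline assumes `head_dim = 128` for that baseline, which is an assumption for illustration only.

```python
def mla_cache_per_token(kv_lora_rank=512, rope_dim=64):
    """Per-token, per-layer MLA cache: compressed KV latent plus
    the decoupled RoPE key (sketch of the V2 scheme)."""
    return kv_lora_rank + rope_dim

def mha_cache_per_token(n_heads=128, head_dim=128):
    """Uncompressed baseline: full K and V for every head
    (head_dim=128 is an assumed value here)."""
    return 2 * n_heads * head_dim

print(mha_cache_per_token() // mla_cache_per_token())
# ~56x fewer cached elements per token under these assumptions
```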
lmxlab.models.deepseek.deepseek_tiny()
Tiny DeepSeek for testing (d=64, 2 layers, 4 heads).
Source code in src/lmxlab/models/deepseek.py
Weight Conversion
lmxlab.models.convert.load_from_hf(repo_id, revision=None, dtype=mx.float16, quantize=None)
Download and load a HuggingFace model into lmxlab.
Requires the huggingface_hub package.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `repo_id` | `str` | HuggingFace repo ID (e.g., 'meta-llama/Llama-3.2-1B'). | *required* |
| `revision` | `str \| None` | Git revision (branch, tag, or commit hash). | `None` |
| `dtype` | `Dtype` | Target dtype for weights (default: float16). | `float16` |
| `quantize` | `int \| None` | If set, quantize the model to this many bits (4 or 8) after loading. Reduces memory usage. | `None` |

Returns:

| Type | Description |
|---|---|
| `tuple[LanguageModel, ModelConfig]` | Tuple of (loaded LanguageModel, ModelConfig). |

Raises:

| Type | Description |
|---|---|
| `ImportError` | If huggingface_hub is not installed. |
| `ValueError` | If model_type is not supported. |
Source code in src/lmxlab/models/convert.py
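The memory effect of the `quantize` option is simple arithmetic on bits per weight. The sketch below ignores the overhead of quantization scales/biases that group-quantized formats carry, so real savings are slightly smaller.

```python
def weight_bytes(n_params, bits):
    """Approximate weight memory at a given precision (ignores the
    per-group scale/bias overhead of real quantized formats)."""
    return n_params * bits // 8

n = 1_000_000_000  # a hypothetical 1B-parameter model
fp16 = weight_bytes(n, 16)
q4 = weight_bytes(n, 4)
print(fp16 // q4)  # 4: quantize=4 cuts weight memory ~4x vs float16
```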
lmxlab.models.convert.config_from_hf(hf_config)
Create a ModelConfig from a HuggingFace config dict.
Reads config.json fields and maps them to lmxlab's
BlockConfig and ModelConfig.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `hf_config` | `dict[str, Any]` | Parsed HuggingFace config.json dict. | *required* |

Returns:

| Type | Description |
|---|---|
| `ModelConfig` | ModelConfig matching the HF model architecture. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If model_type is not supported. |
Source code in src/lmxlab/models/convert.py
lmxlab.models.convert.convert_weights(hf_weights, arch, pattern=None)
Convert HuggingFace weight dict to lmxlab naming.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `hf_weights` | `dict[str, array]` | Dictionary of HF parameter names to arrays. | *required* |
| `arch` | `str` | Architecture name (e.g., 'llama', 'nemotron_h'). | *required* |
| `pattern` | `str \| None` | Hybrid override pattern (required for nemotron_h architecture). | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[str, array]` | Dictionary with lmxlab parameter names. |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If arch is not supported. |
| `ValueError` | If pattern is required but not provided. |
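A key-renaming pass of this kind typically reduces to a list of regex rules applied to each HF parameter name. The sketch below is illustrative only: the target prefixes are guessed from the `LanguageModel` attribute names documented above (`embed`, `blocks`, `final_norm`, `head`), and the real mapping in `convert_weights` may differ per architecture.

```python
import re

def rename_hf_key(key: str) -> str:
    """Illustrative sketch of an HF -> lmxlab key mapping; target
    names are guessed from LanguageModel's attributes, not taken
    from convert_weights itself."""
    rules = [
        (r"^model\.embed_tokens\.", "embed."),
        (r"^model\.layers\.(\d+)\.", r"blocks.\1."),
        (r"^model\.norm\.", "final_norm."),
        (r"^lm_head\.", "head."),
    ]
    for pat, repl in rules:
        key = re.sub(pat, repl, key)
    return key

print(rename_hf_key("model.layers.0.self_attn.q_proj.weight"))
# blocks.0.self_attn.q_proj.weight
```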