Models
Language model base class and architecture config factories.
LanguageModel
lmxlab.models.base.LanguageModel
Bases: Module
Transformer language model assembled from config.
Uses ConfigurableBlock for each layer. Supports tied input/output embeddings and KV caching for generation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | ModelConfig | Full model configuration. | required |
Source code in src/lmxlab/models/base.py
Instance attributes:

- `_sinusoidal = block_cfg.position == 'sinusoidal'`
- `blocks = [ConfigurableBlock(config.get_block_config(i)) for i in range(config.n_layers)]`
- `config = config`
- `embed = nn.Embedding(config.vocab_size, block_cfg.d_model)`
- `embed_dropout = nn.Dropout(p=block_cfg.dropout)`
- `final_norm = final_norm_cls(block_cfg)`
- `head = nn.Linear(block_cfg.d_model, config.vocab_size, bias=False)`
__call__(x, cache=None, return_hidden=False)
Forward pass.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | array | Token IDs of shape (batch, seq_len). | required |
| cache | list \| None | Optional list of caches per layer. Cache types may be heterogeneous in hybrid models (KV tuples for attention, SSM state tuples for Mamba, None for identity layers). | None |
| return_hidden | bool | If True, also return hidden states from final_norm (before lm_head projection). Used by Multi-Token Prediction. | False |
Returns:

| Type | Description |
|---|---|
| tuple[array, list] \| tuple[array, list, array] | Tuple of (logits, updated_caches) by default. If return_hidden is True, returns (logits, updated_caches, hidden_states). |
Source code in src/lmxlab/models/base.py
__init__(config)
Source code in src/lmxlab/models/base.py
_apply_mup_init(width_mult)
Rescale hidden layer weights for μP.
Scales hidden layer weight init by 1/√width_mult. Embedding weights are left unchanged (μP prescribes constant embedding init across widths).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| width_mult | float | d_model / base_d_model ratio. | required |
Source code in src/lmxlab/models/base.py
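The rescaling rule above is simple enough to check by hand. The helper below is a sketch of the factor only, not lmxlab code:

```python
import math

def mup_hidden_scale(d_model: int, base_d_model: int) -> float:
    # muP: hidden-layer weight init is scaled by 1/sqrt(width_mult),
    # where width_mult = d_model / base_d_model. Embedding init is
    # left unchanged across widths.
    width_mult = d_model / base_d_model
    return 1.0 / math.sqrt(width_mult)

print(mup_hidden_scale(1024, 256))  # 0.5: quadrupling width halves the scale
print(mup_hidden_scale(256, 256))   # 1.0: at base width, no rescaling
```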
Generation
lmxlab.models.generate
Autoregressive text generation with sampling strategies.
generate(model, prompt, max_tokens=100, temperature=1.0, top_k=0, top_p=1.0, repetition_penalty=1.0, stop_tokens=None)
Generate tokens autoregressively with KV caching.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | LanguageModel | Language model to generate from. | required |
| prompt | array | Input token IDs of shape (batch, prompt_len). | required |
| max_tokens | int | Maximum number of new tokens to generate. | 100 |
| temperature | float | Sampling temperature (0 = greedy). | 1.0 |
| top_k | int | If > 0, only sample from top-k tokens. | 0 |
| top_p | float | If < 1.0, use nucleus sampling. | 1.0 |
| repetition_penalty | float | Penalty for repeating tokens (> 1.0 discourages repetition, 1.0 = no effect). | 1.0 |
| stop_tokens | list[int] \| None | List of token IDs that stop generation. When any batch element generates a stop token, generation stops for all. | None |
Returns:

| Type | Description |
|---|---|
| array | Generated token IDs of shape (batch, prompt_len + generated_len). |
Source code in src/lmxlab/models/generate.py
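The sampling knobs above compose into a logits filter applied before each draw. The following is a minimal sketch of one plausible reading of those knobs, not lmxlab's actual implementation:

```python
import numpy as np

def filter_logits(logits, generated, temperature=1.0, top_k=0, top_p=1.0,
                  repetition_penalty=1.0):
    """Sketch of generate()'s sampling filters (illustrative only)."""
    logits = logits.astype(np.float64).copy()
    # Repetition penalty: push down tokens already generated.
    for t in set(generated):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    if temperature == 0:  # greedy: keep only the argmax
        mask = np.full_like(logits, -np.inf)
        mask[np.argmax(logits)] = logits[np.argmax(logits)]
        return mask
    logits = logits / temperature
    if top_k > 0:  # keep only the k largest logits
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    if top_p < 1.0:  # nucleus: smallest prefix with cumulative mass >= top_p
        order = np.argsort(logits)[::-1]
        probs = np.exp(logits[order] - logits[order][0])
        probs /= probs.sum()
        cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
        logits[order[cutoff:]] = -np.inf
    return logits

out = filter_logits(np.array([2.0, 1.0, 0.5, -1.0]), generated=[], top_k=2)
# only the two largest logits survive; the rest are masked to -inf
```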
stream_generate(model, prompt, max_tokens=100, temperature=1.0, top_k=0, top_p=1.0, repetition_penalty=1.0, stop_tokens=None)
Generate tokens one at a time, yielding each as produced.
This is the standard interface for interactive/streaming applications. Each token is yielded immediately after generation, enabling real-time display.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | LanguageModel | Language model to generate from. | required |
| prompt | array | Input token IDs of shape (1, prompt_len). | required |
| max_tokens | int | Maximum number of new tokens. | 100 |
| temperature | float | Sampling temperature (0 = greedy). | 1.0 |
| top_k | int | If > 0, only sample from top-k. | 0 |
| top_p | float | If < 1.0, use nucleus sampling. | 1.0 |
| repetition_penalty | float | Penalty for repeating tokens. | 1.0 |
| stop_tokens | list[int] \| None | Token IDs that stop generation. | None |
Yields:

| Type | Description |
|---|---|
| int | Generated token IDs one at a time. |
Source code in src/lmxlab/models/generate.py
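Because the generator yields plain ints, the caller can decode and display incrementally, and can also stop early on its own criteria. A consumption pattern, with `fake_stream` standing in for `stream_generate` (only the yielded-int interface is assumed):

```python
def fake_stream(tokens):
    # Stand-in for stream_generate(model, prompt, ...): yields token ids.
    yield from tokens

received = []
for token_id in fake_stream([5, 9, 2, 7]):
    received.append(token_id)  # e.g. decode and print incrementally here
    if token_id == 2:          # caller-side stop on a chosen end-of-text id
        break

print(received)  # [5, 9, 2]
```

Breaking out of the loop simply abandons the generator, so no further tokens are computed.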
Config Factories
Each factory returns a ModelConfig that builds the corresponding
architecture when passed to LanguageModel.
GPT
lmxlab.models.gpt.gpt_config(vocab_size=50257, d_model=768, n_heads=12, n_layers=12, d_ff=3072, max_seq_len=1024, tie_embeddings=True, dropout=0.0, mup_base_width=None)
Create a GPT-style model configuration.
GPT uses: LayerNorm, standard MHA, standard FFN (GELU), sinusoidal positional encoding, pre-norm, bias everywhere.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size (default: GPT-2 BPE vocab). | 50257 |
| d_model | int | Hidden dimension. | 768 |
| n_heads | int | Number of attention heads. | 12 |
| n_layers | int | Number of transformer layers. | 12 |
| d_ff | int | Feed-forward intermediate dimension. | 3072 |
| max_seq_len | int | Maximum sequence length. | 1024 |
| tie_embeddings | bool | Whether to tie input/output embeddings. | True |
| dropout | float | Dropout rate. | 0.0 |
| mup_base_width | int \| None | Base width for μP. When set, enables μP attention scaling and logit scaling. | None |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a GPT-style model. |
Source code in src/lmxlab/models/gpt.py
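As a sanity check on a config, the defaults above can be sized by hand with the standard transformer estimate (ignoring biases, norms, and positional encodings; tied embeddings counted once). This is plain arithmetic, not an lmxlab API:

```python
def gpt_param_estimate(vocab_size=50257, d_model=768, n_layers=12, d_ff=3072):
    attn = 4 * d_model * d_model   # q, k, v, and output projections
    ffn = 2 * d_model * d_ff       # up and down projections
    embed = vocab_size * d_model   # tied with the output head: counted once
    return n_layers * (attn + ffn) + embed

print(f"{gpt_param_estimate() / 1e6:.0f}M")  # 124M, matching GPT-2 small
```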
lmxlab.models.gpt.gpt_tiny()
Tiny GPT for testing (d=64, 2 layers, 2 heads).
LLaMA
lmxlab.models.llama.llama_config(vocab_size=32000, d_model=4096, n_heads=32, n_kv_heads=8, n_layers=32, d_ff=11008, max_seq_len=4096, rope_theta=10000.0, tie_embeddings=False, dropout=0.0, mup_base_width=None)
Create a LLaMA-style model configuration.
LLaMA uses: RMSNorm, GQA, GatedFFN (SwiGLU), RoPE, pre-norm, no bias.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 32000 |
| d_model | int | Hidden dimension. | 4096 |
| n_heads | int | Number of query heads. | 32 |
| n_kv_heads | int | Number of KV heads (for GQA). | 8 |
| n_layers | int | Number of transformer layers. | 32 |
| d_ff | int | Feed-forward intermediate dimension. | 11008 |
| max_seq_len | int | Maximum sequence length. | 4096 |
| rope_theta | float | RoPE base frequency. | 10000.0 |
| tie_embeddings | bool | Whether to tie input/output embeddings. | False |
| dropout | float | Dropout rate. | 0.0 |
| mup_base_width | int \| None | Base width for μP. When set, enables μP attention scaling and logit scaling. | None |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a LLaMA-style model. |
Source code in src/lmxlab/models/llama.py
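The GQA defaults (8 KV heads shared by 32 query heads) shrink the KV cache 4x relative to full MHA. A back-of-envelope cache size, assuming fp16 and head_dim = d_model / n_heads:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, d_model=4096,
                   n_heads=32, bytes_per_el=2):
    head_dim = d_model // n_heads
    # K and V, per layer, per cached position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el

mha = kv_cache_bytes(4096, n_kv_heads=32)  # full MHA caches every head
gqa = kv_cache_bytes(4096)                 # GQA caches only the 8 KV heads

print(mha // gqa)  # 4: the cache shrinks by n_heads / n_kv_heads
```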
lmxlab.models.llama.llama_tiny()
Tiny LLaMA for testing (d=64, 2 layers, 4 heads, 2 kv).
Gemma
lmxlab.models.gemma.gemma_config(vocab_size=256000, d_model=2048, n_heads=8, n_kv_heads=1, n_layers=18, d_ff=16384, max_seq_len=8192, rope_theta=10000.0, tie_embeddings=True)
Create a Gemma-style model configuration.
Gemma uses: RMSNorm, GQA (multi-query), GatedFFN (GeGLU), RoPE, pre-norm, no bias, tied embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 256000 |
| d_model | int | Hidden dimension. | 2048 |
| n_heads | int | Number of query heads. | 8 |
| n_kv_heads | int | Number of KV heads. | 1 |
| n_layers | int | Number of transformer layers. | 18 |
| d_ff | int | Feed-forward intermediate dimension. | 16384 |
| max_seq_len | int | Maximum sequence length. | 8192 |
| rope_theta | float | RoPE base frequency. | 10000.0 |
| tie_embeddings | bool | Whether to tie embeddings. | True |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a Gemma-style model. |
Source code in src/lmxlab/models/gemma.py
lmxlab.models.gemma.gemma_tiny()
Gemma 3 (Sliding Window)
lmxlab.models.gemma3.gemma3_config(vocab_size=256000, d_model=2048, n_heads=8, n_kv_heads=4, n_layers=26, d_ff=16384, max_seq_len=8192, rope_theta=10000.0, window_size=4096, global_every=6, tie_embeddings=True)
Create a Gemma 3-style model configuration.
Gemma 3 interleaves local (sliding window) and global attention layers. Every global_every-th layer (0-indexed, i.e. layers 5, 11, 17, ...) uses full global GQA; all other layers use sliding window GQA with the given window size.
Uses: RMSNorm, GatedFFN (GeGLU), RoPE, pre-norm, no bias, tied embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 256000 |
| d_model | int | Hidden dimension. | 2048 |
| n_heads | int | Number of query heads. | 8 |
| n_kv_heads | int | Number of KV heads. | 4 |
| n_layers | int | Number of transformer layers. | 26 |
| d_ff | int | Feed-forward intermediate dimension. | 16384 |
| max_seq_len | int | Maximum sequence length. | 8192 |
| rope_theta | float | RoPE base frequency. | 10000.0 |
| window_size | int | Sliding window size for local layers. | 4096 |
| global_every | int | Place a global attention layer every N layers (1-indexed: layer global_every-1, 2*global_every-1, ...). | 6 |
| tie_embeddings | bool | Whether to tie embeddings. | True |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a Gemma 3-style model. |
Source code in src/lmxlab/models/gemma3.py
lmxlab.models.gemma3.gemma3_tiny()
Tiny Gemma 3 for testing (4 layers, global every 4th).
Source code in src/lmxlab/models/gemma3.py
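The interleaving rule can be written out explicitly. A sketch of the layer schedule implied by the docstring (not lmxlab code):

```python
def gemma3_layer_kinds(n_layers=26, global_every=6):
    # Layers global_every-1, 2*global_every-1, ... use full global GQA;
    # all other layers use sliding-window GQA.
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(n_layers)]

kinds = gemma3_layer_kinds()
print([i for i, k in enumerate(kinds) if k == "global"])  # [5, 11, 17, 23]
```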
DeepSeek V2 (MLA)
lmxlab.models.deepseek.deepseek_config(vocab_size=102400, d_model=5120, n_heads=128, n_layers=60, d_ff=12288, kv_lora_rank=512, q_lora_rank=1536, rope_dim=64, max_seq_len=4096, rope_theta=10000.0, tie_embeddings=False)
Create a DeepSeek V2-style model configuration.
DeepSeek V2 uses: RMSNorm, MLA, GatedFFN (SwiGLU), decoupled RoPE, pre-norm, no bias.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 102400 |
| d_model | int | Hidden dimension. | 5120 |
| n_heads | int | Number of attention heads. | 128 |
| n_layers | int | Number of transformer layers. | 60 |
| d_ff | int | Feed-forward intermediate dimension. | 12288 |
| kv_lora_rank | int | Latent dimension for KV compression. | 512 |
| q_lora_rank | int | Latent dimension for Q compression. | 1536 |
| rope_dim | int | Number of head dims for RoPE. | 64 |
| max_seq_len | int | Maximum sequence length. | 4096 |
| rope_theta | float | RoPE base frequency. | 10000.0 |
| tie_embeddings | bool | Whether to tie embeddings. | False |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a DeepSeek V2-style model. |
Source code in src/lmxlab/models/deepseek.py
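The point of MLA is cache compression: instead of full per-head K/V, only the kv_lora_rank latent plus the decoupled RoPE keys are cached. Rough per-token, per-layer cache widths for the defaults above (head_dim taken as d_model / n_heads, an assumption for illustration):

```python
d_model, n_heads = 5120, 128
kv_lora_rank, rope_dim = 512, 64
head_dim = d_model // n_heads        # 40, assuming head_dim = d_model / n_heads

full_kv = 2 * n_heads * head_dim     # standard MHA: K and V for every head
mla = kv_lora_rank + rope_dim        # compressed KV latent + decoupled RoPE keys

print(full_kv, mla, round(full_kv / mla, 1))  # 10240 576 17.8
```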
lmxlab.models.deepseek.deepseek_tiny()
Tiny DeepSeek for testing (d=64, 2 layers, 4 heads).
Source code in src/lmxlab/models/deepseek.py
DeepSeek V3 (MLA + MoE)
lmxlab.models.deepseek.deepseek_v3_config(vocab_size=129280, d_model=7168, n_heads=128, n_layers=61, d_ff=18432, kv_lora_rank=512, q_lora_rank=1536, rope_dim=64, n_experts=256, top_k_experts=8, n_shared_experts=1, n_dense_layers=1, max_seq_len=4096, rope_theta=10000.0, tie_embeddings=False)
Create a DeepSeek V3 model configuration.
DeepSeek V3 uses MLA attention with SharedExpertMoE FFN
for most layers. The first n_dense_layers use dense
GatedFFN instead of MoE.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 129280 |
| d_model | int | Hidden dimension. | 7168 |
| n_heads | int | Number of attention heads. | 128 |
| n_layers | int | Number of transformer layers. | 61 |
| d_ff | int | Per-expert FFN intermediate dimension. | 18432 |
| kv_lora_rank | int | Latent dimension for KV compression. | 512 |
| q_lora_rank | int | Latent dimension for Q compression. | 1536 |
| rope_dim | int | Number of head dims for RoPE. | 64 |
| n_experts | int | Number of routed experts. | 256 |
| top_k_experts | int | Experts activated per token. | 8 |
| n_shared_experts | int | Number of shared (always-on) experts. | 1 |
| n_dense_layers | int | First N layers use dense FFN. | 1 |
| max_seq_len | int | Maximum sequence length. | 4096 |
| rope_theta | float | RoPE base frequency. | 10000.0 |
| tie_embeddings | bool | Whether to tie embeddings. | False |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a DeepSeek V3 model. |
References
DeepSeek-V3 (DeepSeek-AI, 2025, arXiv:2412.19437).
Source code in src/lmxlab/models/deepseek.py
lmxlab.models.deepseek.deepseek_v3_tiny()
Tiny DeepSeek V3 for testing.
4 layers (1 dense + 3 MoE), d=64, 4 experts.
Source code in src/lmxlab/models/deepseek.py
Mixtral (MoE)
lmxlab.models.mixtral.mixtral_config(vocab_size=32000, d_model=4096, n_heads=32, n_kv_heads=8, n_layers=32, d_ff=14336, n_experts=8, top_k_experts=2, max_seq_len=32768, rope_theta=1000000.0, tie_embeddings=False)
Create a Mixtral-style model configuration.
Mixtral uses GQA attention with MoE FFN: each token is routed to top-k of n_experts GatedFFN (SwiGLU) experts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 32000 |
| d_model | int | Hidden dimension. | 4096 |
| n_heads | int | Number of query heads. | 32 |
| n_kv_heads | int | Number of KV heads. | 8 |
| n_layers | int | Number of transformer layers. | 32 |
| d_ff | int | Per-expert feed-forward dimension. | 14336 |
| n_experts | int | Number of expert FFNs. | 8 |
| top_k_experts | int | Experts per token. | 2 |
| max_seq_len | int | Maximum sequence length. | 32768 |
| rope_theta | float | RoPE base frequency. | 1000000.0 |
| tie_embeddings | bool | Whether to tie embeddings. | False |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a Mixtral-style model. |
Source code in src/lmxlab/models/mixtral.py
lmxlab.models.mixtral.mixtral_tiny()
Tiny Mixtral for testing (with MoE).
Source code in src/lmxlab/models/mixtral.py
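Sparse MoE decouples total from active parameters: each token runs only top_k_experts of the n_experts FFNs. A rough per-layer FFN count for the defaults, assuming the usual three d_model x d_ff matrices per SwiGLU expert:

```python
d_model, d_ff = 4096, 14336
n_experts, top_k = 8, 2

per_expert = 3 * d_model * d_ff   # gate, up, and down projections (SwiGLU)
total_ffn = n_experts * per_expert
active_ffn = top_k * per_expert

print(total_ffn // active_ffn)  # 4: 8 experts stored, only 2 executed per token
```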
Llama 4 (iRoPE + Chunked Attention + MoE)
lmxlab.models.llama4.llama4_scout_config(vocab_size=202048, d_model=5120, n_heads=40, n_kv_heads=8, n_layers=48, d_ff=8192, n_experts=16, top_k_experts=1, n_shared_experts=1, attention_chunk_size=8192, nope_every=4, max_seq_len=65536, rope_theta=500000.0, tie_embeddings=False)
Create a Llama 4 Scout model configuration.
Uses iRoPE pattern: chunked local attention (3/4 layers) interleaved with full NoPE attention (1/4 layers). All layers use SharedExpertMoE FFN.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 202048 |
| d_model | int | Hidden dimension. | 5120 |
| n_heads | int | Number of query heads. | 40 |
| n_kv_heads | int | Number of KV heads. | 8 |
| n_layers | int | Number of transformer layers. | 48 |
| d_ff | int | Per-expert FFN intermediate dimension. | 8192 |
| n_experts | int | Number of routed experts. | 16 |
| top_k_experts | int | Experts activated per token. | 1 |
| n_shared_experts | int | Number of shared experts. | 1 |
| attention_chunk_size | int | Chunk size for local attention. | 8192 |
| nope_every | int | Place a NoPE (full GQA) layer every N. | 4 |
| max_seq_len | int | Maximum sequence length. | 65536 |
| rope_theta | float | RoPE base frequency. | 500000.0 |
| tie_embeddings | bool | Whether to tie embeddings. | False |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a Llama 4 Scout model. |
Source code in src/lmxlab/models/llama4.py
lmxlab.models.llama4.llama4_scout_tiny()
Tiny Llama 4 Scout for testing.
4 layers (3 chunked + 1 NoPE), d=64, 4 experts.
Source code in src/lmxlab/models/llama4.py
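Chunked local attention restricts each query to keys in its own chunk, causally. A sketch of the visibility rule, assuming attention_chunk_size partitions positions into contiguous chunks:

```python
def can_attend(q_pos, k_pos, chunk_size=8192):
    # Causal + same chunk: position q_pos sees k_pos iff k_pos <= q_pos
    # and both fall inside the same chunk of length chunk_size.
    return k_pos <= q_pos and (q_pos // chunk_size) == (k_pos // chunk_size)

print(can_attend(8191, 0))     # True: both in the first chunk
print(can_attend(8192, 8191))  # False: the chunk boundary blocks it
```

The NoPE layers placed every nope_every layers use full causal attention instead, which is what restores information flow across chunk boundaries.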
Nemotron (Mamba-2 + LatentMoE + Attention)
lmxlab.models.nemotron.nemotron3_config(hybrid_override_pattern='MEME*ME*', vocab_size=256000, d_model=4096, n_heads=32, n_kv_heads=2, d_ff=16384, mamba_n_heads=128, mamba_head_dim=64, ssm_state_size=128, mamba_expand=2, mamba_n_groups=8, mamba_chunk_size=128, n_experts=512, top_k_experts=22, moe_latent_size=1024, moe_d_ff=1024, shared_expert_d_ff=16384, moe_routed_scaling_factor=5.0, moe_n_groups=8, moe_topk_groups=4, max_seq_len=4096, rope_theta=10000.0, tie_embeddings=False, conv_kernel_size=4)
Create a Nemotron 3 hybrid model configuration.
Builds three base BlockConfigs (attention, MoE, Mamba) and maps the pattern string to per-layer configs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hybrid_override_pattern | str | Layer type pattern (M/E/-/*). | 'MEME*ME*' |
| vocab_size | int | Vocabulary size. | 256000 |
| d_model | int | Hidden dimension. | 4096 |
| n_heads | int | Number of attention heads (for * layers). | 32 |
| n_kv_heads | int | Number of KV heads (for * layers). | 2 |
| d_ff | int | Dense FFN intermediate dimension. | 16384 |
| mamba_n_heads | int | Number of Mamba SSM heads. | 128 |
| mamba_head_dim | int | Dimension per Mamba head. | 64 |
| ssm_state_size | int | SSM state dimension N. | 128 |
| mamba_expand | int | Mamba inner dimension multiplier. | 2 |
| mamba_n_groups | int | Number of B/C sharing groups. | 8 |
| mamba_chunk_size | int | Chunk size for SSD parallel form. | 128 |
| n_experts | int | Total number of routed experts. | 512 |
| top_k_experts | int | Number of experts per token. | 22 |
| moe_latent_size | int | Latent dimension for MoE routing. | 1024 |
| moe_d_ff | int | Per-expert FFN intermediate dimension. | 1024 |
| shared_expert_d_ff | int | Shared expert FFN dimension. | 16384 |
| moe_routed_scaling_factor | float | Routed expert output scale. | 5.0 |
| moe_n_groups | int | Number of expert groups for selection. | 8 |
| moe_topk_groups | int | Number of top groups to select from. | 4 |
| max_seq_len | int | Maximum sequence length. | 4096 |
| rope_theta | float | RoPE base frequency. | 10000.0 |
| tie_embeddings | bool | Whether to tie embeddings. | False |
| conv_kernel_size | int | Mamba conv kernel size. | 4 |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a Nemotron 3 hybrid model. |
Source code in src/lmxlab/models/nemotron.py
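The pattern string maps one character to one layer. From the parameter docs above, M is a Mamba-2 mixer, E an expert (MoE) layer, and * an attention layer; '-' is also accepted but not described here, so this sketch handles only the three documented kinds:

```python
KIND = {"M": "mamba2", "E": "moe", "*": "attention"}

def expand_pattern(pattern="MEME*ME*"):
    # One layer per character; unknown characters (e.g. '-') raise KeyError.
    return [KIND[c] for c in pattern]

layers = expand_pattern()
print(layers.count("mamba2"), layers.count("moe"), layers.count("attention"))
# 3 3 2
```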
lmxlab.models.nemotron.nemotron3_tiny()
Tiny Nemotron 3 for testing.
4 layers (MEM*), d=64, small experts.
Source code in src/lmxlab/models/nemotron.py
Falcon H1 (Mamba-2 Hybrid)
lmxlab.models.falcon.falcon_h1_config(hybrid_pattern='MMMM*MMM*MMM*MMM*', vocab_size=65024, d_model=4096, n_heads=32, n_kv_heads=8, d_ff=14336, mamba_n_heads=64, mamba_head_dim=64, ssm_state_size=128, mamba_expand=2, mamba_n_groups=8, mamba_chunk_size=256, conv_kernel_size=4, max_seq_len=8192, rope_theta=500000.0, tie_embeddings=False)
Create a Falcon H1 hybrid model configuration.
Falcon H1 is a hybrid Mamba-2 + GQA model. Mamba-2 layers handle most of the sequence mixing, with periodic GQA layers for global attention. Both layer types have their own GatedFFN (SwiGLU).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hybrid_pattern | str | Layer type pattern (M=Mamba-2, *=GQA). | 'MMMM*MMM*MMM*MMM*' |
| vocab_size | int | Vocabulary size. | 65024 |
| d_model | int | Hidden dimension. | 4096 |
| n_heads | int | Number of attention heads (for GQA layers). | 32 |
| n_kv_heads | int | Number of KV heads (for GQA layers). | 8 |
| d_ff | int | Feed-forward intermediate dimension. | 14336 |
| mamba_n_heads | int | Number of Mamba SSM heads. | 64 |
| mamba_head_dim | int | Dimension per Mamba head. | 64 |
| ssm_state_size | int | SSM state dimension N. | 128 |
| mamba_expand | int | Mamba inner dimension multiplier. | 2 |
| mamba_n_groups | int | Number of B/C sharing groups. | 8 |
| mamba_chunk_size | int | Chunk size for SSD parallel form. | 256 |
| conv_kernel_size | int | Mamba conv kernel size. | 4 |
| max_seq_len | int | Maximum sequence length. | 8192 |
| rope_theta | float | RoPE base frequency. | 500000.0 |
| tie_embeddings | bool | Whether to tie embeddings. | False |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a Falcon H1 hybrid model. |
Source code in src/lmxlab/models/falcon.py
lmxlab.models.falcon.falcon_h1_tiny()
Tiny Falcon H1 for testing.
4 layers (MMM*), d=64.
Source code in src/lmxlab/models/falcon.py
Jamba (Mamba-2 + MoE)
lmxlab.models.jamba.jamba_config(vocab_size=65536, d_model=4096, n_heads=32, n_kv_heads=8, n_layers=32, d_ff=14336, mamba_n_heads=64, mamba_head_dim=64, ssm_state_size=16, mamba_expand=2, mamba_n_groups=1, mamba_chunk_size=256, conv_kernel_size=4, n_experts=16, top_k_experts=2, moe_every=2, attn_every=4, max_seq_len=4096, rope_theta=10000.0, tie_embeddings=False)
Create a Jamba model configuration.
Jamba alternates Mamba-2 and GQA layers in an MMMA pattern (3 Mamba + 1 GQA per cycle). Even-indexed attention layers use MoE FFN, others use dense GatedFFN.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 65536 |
| d_model | int | Hidden dimension. | 4096 |
| n_heads | int | Number of attention heads (for GQA layers). | 32 |
| n_kv_heads | int | Number of KV heads (for GQA layers). | 8 |
| n_layers | int | Number of layers. | 32 |
| d_ff | int | Feed-forward intermediate dimension. | 14336 |
| mamba_n_heads | int | Number of Mamba SSM heads. | 64 |
| mamba_head_dim | int | Dimension per Mamba head. | 64 |
| ssm_state_size | int | SSM state dimension N. | 16 |
| mamba_expand | int | Mamba inner dimension multiplier. | 2 |
| mamba_n_groups | int | Number of B/C sharing groups. | 1 |
| mamba_chunk_size | int | Chunk size for SSD parallel form. | 256 |
| conv_kernel_size | int | Mamba conv kernel size. | 4 |
| n_experts | int | Number of routed experts (for MoE layers). | 16 |
| top_k_experts | int | Experts activated per token. | 2 |
| moe_every | int | Place MoE FFN every N attention layers. | 2 |
| attn_every | int | Place an attention layer every N layers. | 4 |
| max_seq_len | int | Maximum sequence length. | 4096 |
| rope_theta | float | RoPE base frequency. | 10000.0 |
| tie_embeddings | bool | Whether to tie embeddings. | False |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a Jamba model. |
Source code in src/lmxlab/models/jamba.py
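The MMMA cycle plus MoE interleaving can be written out explicitly. This is one plausible reading of attn_every and moe_every (attention closes each cycle; among attention layers, every moe_every-th carries the MoE FFN), not lmxlab's actual layout code:

```python
def jamba_layers(n_layers=32, attn_every=4, moe_every=2):
    # Sketch under stated assumptions: layers come in MMMA cycles of length
    # attn_every; every moe_every-th attention layer gets an MoE FFN.
    out, attn_seen = [], 0
    for i in range(n_layers):
        if (i + 1) % attn_every == 0:
            ffn = "moe" if attn_seen % moe_every == 0 else "dense"
            out.append(("gqa", ffn))
            attn_seen += 1
        else:
            out.append(("mamba2", "dense"))
    return out

layers = jamba_layers()
print(sum(mixer == "gqa" for mixer, _ in layers))  # 8 attention layers of 32
```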
lmxlab.models.jamba.jamba_tiny()
Tiny Jamba for testing.
8 layers (MMMA pattern x2), d=64, 4 MoE experts.
Source code in src/lmxlab/models/jamba.py
Qwen 3.5 (DeltaNet Hybrid)
lmxlab.models.qwen35.qwen35_config(vocab_size=151936, d_model=2048, n_heads=16, n_kv_heads=4, n_layers=28, d_ff=5504, max_seq_len=32768, rope_theta=1000000.0, global_every=4, tie_embeddings=False)
Create a Qwen 3.5-style model configuration.
Qwen 3.5 interleaves Gated DeltaNet (linear attention) and standard GQA layers. Every global_every-th layer uses full GQA; all other layers use Gated DeltaNet.
Uses: RMSNorm, GatedFFN (SwiGLU), RoPE (for GQA layers), short causal convolutions (for DeltaNet layers), no bias.
The 3:1 hybrid ratio (75% DeltaNet, 25% GQA) balances efficiency and expressiveness:

- DeltaNet: O(d^2) per token, fixed-size state, no KV cache
- GQA: O(n^2) per token, growing KV cache, global context
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| vocab_size | int | Vocabulary size. | 151936 |
| d_model | int | Hidden dimension. | 2048 |
| n_heads | int | Number of attention heads. | 16 |
| n_kv_heads | int | Number of KV heads (for GQA layers). | 4 |
| n_layers | int | Number of transformer layers. | 28 |
| d_ff | int | Feed-forward intermediate dimension. | 5504 |
| max_seq_len | int | Maximum sequence length. | 32768 |
| rope_theta | float | RoPE base frequency (for GQA layers). | 1000000.0 |
| global_every | int | Place a GQA layer every N layers. | 4 |
| tie_embeddings | bool | Whether to tie embeddings. | False |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig for a Qwen 3.5-style model. |
Source code in src/lmxlab/models/qwen35.py
lmxlab.models.qwen35.qwen35_tiny()
Tiny Qwen 3.5 for testing (4 layers, global every 4th).
Source code in src/lmxlab/models/qwen35.py
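The "fixed-size state vs growing KV cache" trade-off is easy to quantify. Below, the DeltaNet state is modeled as one (head_dim x head_dim) matrix per head; that shape is an assumption for illustration, not lmxlab's actual state layout:

```python
d_model, n_heads, n_kv_heads = 2048, 16, 4
head_dim = d_model // n_heads      # 128

def gqa_cache(seq_len):
    # K and V entries grow linearly with context length
    return 2 * n_kv_heads * head_dim * seq_len

def deltanet_state():
    # Assumed fixed-size recurrent state: one head_dim x head_dim
    # matrix per head, regardless of context length
    return n_heads * head_dim * head_dim

print(gqa_cache(32768) // deltanet_state())
# 128: at the 32k default context, a GQA layer caches 128x more
# elements than a DeltaNet layer's fixed state (under these assumptions)
```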
Weight Conversion
lmxlab.models.convert.load_from_hf(repo_id, revision=None, dtype=mx.float16, quantize=None)
Download and load a HuggingFace model into lmxlab.
Requires the huggingface_hub package.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | HuggingFace repo ID (e.g., 'meta-llama/Llama-3.2-1B'). | required |
| revision | str \| None | Git revision (branch, tag, or commit hash). | None |
| dtype | Dtype | Target dtype for weights (default: float16). | float16 |
| quantize | int \| None | If set, quantize the model to this many bits (4 or 8) after loading. Reduces memory usage. | None |
Returns:

| Type | Description |
|---|---|
| tuple[LanguageModel, ModelConfig] | Tuple of (loaded LanguageModel, ModelConfig). |
Raises:

| Type | Description |
|---|---|
| ImportError | If huggingface_hub is not installed. |
| ValueError | If model_type is not supported. |
Source code in src/lmxlab/models/convert.py
lmxlab.models.convert.config_from_hf(hf_config)
Create a ModelConfig from a HuggingFace config dict.
Reads config.json fields and maps them to lmxlab's
BlockConfig and ModelConfig.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hf_config | dict[str, Any] | Parsed HuggingFace config.json dict. | required |
Returns:

| Type | Description |
|---|---|
| ModelConfig | ModelConfig matching the HF model architecture. |
Raises:

| Type | Description |
|---|---|
| ValueError | If model_type is not supported. |
Source code in src/lmxlab/models/convert.py
lmxlab.models.convert.convert_weights(hf_weights, arch, pattern=None)
Convert HuggingFace weight dict to lmxlab naming.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hf_weights | dict[str, array] | Dictionary of HF parameter names to arrays. | required |
| arch | str | Architecture name (e.g., 'llama', 'nemotron_h'). | required |
| pattern | str \| None | Hybrid override pattern (required for nemotron_h architecture). | None |
Returns:

| Type | Description |
|---|---|
| dict[str, array] | Dictionary with lmxlab parameter names. |
Raises:

| Type | Description |
|---|---|
| KeyError | If arch is not supported. |
| ValueError | If pattern is required but not provided. |
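At its core, convert_weights is pure key renaming. As an illustration only, here is the general shape of such a mapping, using the documented LanguageModel attribute names (embed, blocks, final_norm, head) on the lmxlab side; the HF-side names are standard Llama checkpoint keys, and the block-internal names are left untouched because the real rules live in src/lmxlab/models/convert.py:

```python
import re

def rename_llama_key(hf_key: str) -> str:
    # Hypothetical sketch of an HF -> lmxlab rename, not the actual table.
    rules = [
        (r"^model\.embed_tokens\.", "embed."),
        (r"^lm_head\.", "head."),
        (r"^model\.norm\.", "final_norm."),
        (r"^model\.layers\.(\d+)\.", r"blocks.\1."),
    ]
    for pat, repl in rules:
        hf_key = re.sub(pat, repl, hf_key)
    return hf_key

print(rename_llama_key("model.layers.7.self_attn.q_proj.weight"))
# blocks.7.self_attn.q_proj.weight
```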