Developer Log
Design decisions, trade-offs, and lessons from building lmxlab.
Design Decisions
Config Factories Over Class Hierarchies
GPT, LLaMA, DeepSeek, etc. are not separate model classes. They are
config factory functions returning ModelConfig.
After studying the original LMT (PyTorch version), it became clear that
transformer architectures differ in configuration, not structure. A
LlamaModel and a GPTModel perform the same operations (embed, attend,
feed-forward, project) with different component choices. Encoding this
as inheritance (LlamaModel(BaseModel)) creates artificial boundaries.
With config factories, switching from GPT to LLaMA is changing a function call, not a class hierarchy:
# These produce different configs, not different model classes
config = gpt_config(d_model=512, n_heads=8, n_layers=6)
config = llama_config(d_model=512, n_heads=8, n_kv_heads=4, n_layers=6)
# Same model class for both
model = LanguageModel(config)
Trade-off: this design makes individual architecture code harder to find
(no class LlamaModel to grep for). Well-documented config factories in
models/*.py mitigate this.
Registry Pattern for Components
Components (attention, FFN, norm, position) register themselves by string
name. ConfigurableBlock resolves them at construction time.
This decouples component implementation from block assembly. Adding a new
attention type (MLA, sliding window) requires only writing the module and
registering it, with no changes to ConfigurableBlock or existing code.
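A minimal sketch of the pattern in plain Python (names like ATTENTION_REGISTRY, register_attention, and build_attention are illustrative, not lmxlab's actual API):

```python
# Registry-pattern sketch; names are illustrative, not lmxlab's real API.
ATTENTION_REGISTRY: dict = {}

def register_attention(name):
    """Decorator: register an attention class under a string name."""
    def wrap(cls):
        ATTENTION_REGISTRY[name] = cls
        return cls
    return wrap

@register_attention("mha")
class MultiHeadAttention:
    def __init__(self, config):
        self.n_heads = config["n_heads"]

@register_attention("gqa")
class GroupedQueryAttention:
    def __init__(self, config):
        self.n_heads = config["n_heads"]
        self.n_kv_heads = config["n_kv_heads"]

def build_attention(config):
    """What a ConfigurableBlock does at construction: resolve by string name."""
    return ATTENTION_REGISTRY[config["attention"]](config)

# Adding a new variant is one decorated class; no assembly code changes.
attn = build_attention({"attention": "gqa", "n_heads": 8, "n_kv_heads": 4})
```

Because resolution happens at construction time, a typo in the config fails loudly with a KeyError rather than silently building the wrong block.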
An early mistake: MoE FFNs needed special registration because their
constructors take extra arguments (n_experts, top_k). The fix was to
have BlockConfig carry MoE fields, with MoE constructors reading from
the config with optional overrides.
Explicit mx.eval() Boundaries
MLX operations are lazy: nothing computes until mx.eval() is called.
This requires explicit eval boundaries at specific points:
- After model construction (mx.eval(model.parameters()))
- During generation (eval each token before feeding it back)
- At training step boundaries (eval loss for logging)
Omitting mx.eval() after model construction can cause the first
training step to be extremely slow, because it triggers both
initialization and the first forward pass in one graph.
Lessons Learned
Unified Memory Changes the Trade-offs
On CUDA, the CPU-GPU memory boundary is a constant concern: batch sizes, gradient accumulation, offloading. On Apple Silicon with MLX, unified memory means:
- No .to(device) calls: arrays live in one shared memory pool
- No data transfer bottleneck between CPU and GPU
- The memory ceiling is system RAM (not separate GPU VRAM)
- Speculative decoding benefits from shared memory between draft and verify models
Open questions remain: whether gradient accumulation matters less without transfer overhead, and whether different optimizer choices win when memory access patterns change. These are targets for the experiment framework.
mx.compile Is Not Free
Compiling the training step with mx.compile provides significant speedups
(1.3-2x typical, potentially more for larger models), but it constrains
the available operations:
- No Python control flow that depends on array values
- Must declare inputs and outputs explicitly
- Graph changes (like modifying model structure) require recompilation
For educational code, we default to compile_step=True but make it easy
to disable for debugging. The benchmark_compile.py recipe measures the
actual speedup on a given machine.
LoRA and QLoRA on Unified Memory
Parameter-efficient fine-tuning has different economics on unified memory. On CUDA, QLoRA's primary value is fitting a larger model in limited VRAM. On Apple Silicon, the trade-offs differ:
- The memory savings still matter (system RAM is shared with the OS)
- But the performance characteristics may differ (no quantize-dequantize transfer between devices)
- Adapter save/load is valuable regardless of memory architecture, as small adapter files can be shared independently of multi-GB base models
Testing ML Code
Standard unit testing (assertEqual) does not apply well to ML code
because outputs are stochastic. Behavioral tests are used instead:
- Invariance tests: same input and seed produce the same output
- Directional tests: loss decreases after training steps
- Shape tests: output dimensions match (batch, seq_len, vocab_size)
- Minimum functionality tests: outputs are finite (no NaN, no Inf)
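In pure-Python form, with a stand-in model (the real tests exercise lmxlab modules), three of the four styles look like this; the directional test additionally needs a training loop and is omitted here:

```python
import math
import random

def toy_model(tokens, seed):
    """Stand-in for a model forward pass: deterministic given a seed."""
    rng = random.Random(seed)
    vocab_size = 16
    # One row of vocab_size "logits" per input token.
    return [[rng.uniform(-1, 1) for _ in range(vocab_size)] for _ in tokens]

tokens = [3, 1, 4, 1, 5]
out = toy_model(tokens, seed=0)

# Invariance: same input and seed -> same output.
assert toy_model(tokens, seed=0) == toy_model(tokens, seed=0)

# Shape: one logit row per token, each of width vocab_size.
assert len(out) == len(tokens) and all(len(row) == 16 for row in out)

# Minimum functionality: all outputs finite (no NaN, no Inf).
assert all(math.isfinite(v) for row in out for v in row)
```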
Architecture Notes
Attention Variants Are Configurations, Not Architectures
| Variant | Distinguishing property |
|---|---|
| MHA | Full attention: each head has independent K, V |
| GQA | Share K, V across groups of query heads |
| MLA | Compress KV into a low-rank latent, reconstruct at attention time |
| Sliding Window | Limit attention span per layer (local vs global) |
| GatedDeltaNet | Linear attention with gated delta rule (no softmax) |
These all implement the same interface: (x, mask, cache) -> (output, cache).
The difference is how they compute and store key-value state.
MoE Is Just a Different FFN
Mixture of Experts replaces the dense FFN with a routed sparse FFN.
The router selects top-k experts per token, and the outputs are combined
by router weights. From ConfigurableBlock's perspective, it's just
another FFN, registered as "moe" instead of "gated".
The interesting part is load balancing. Without it, the router collapses to always picking the same experts. We implement bias-based balancing (SharedExpertMoEFFN) which avoids the auxiliary loss used in some implementations.
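The routing step itself is small; a plain-Python sketch of generic top-k routing (softmax over router logits, renormalized over the selected experts; this shows the mechanism, not lmxlab's SharedExpertMoEFFN):

```python
import math

def route(router_logits, top_k):
    """Pick top_k experts for one token and renormalize their softmax weights."""
    m = max(router_logits)                      # shift for numerical stability
    exps = [math.exp(l - m) for l in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    # (expert index, combination weight) pairs; weights sum to 1.
    return [(i, probs[i] / norm) for i in top]

# One token, 4 experts, top-2 routing.
choices = route([0.1, 2.0, -0.5, 1.0], top_k=2)
```

Load balancing enters as a correction to these router logits (a learned per-expert bias in the bias-based scheme) so the top-k selection does not collapse onto the same experts.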
HuggingFace Integration
Components
Connecting to the HuggingFace ecosystem required three components:
- load_from_hf(): downloads and converts pretrained weights from the Hub. Maps HF config keys to ModelConfig, converts weight names, and handles architecture-specific details (rotated QKV in LLaMA, fused gate-up in Gemma).
- HFTokenizer: wraps AutoTokenizer with the lmxlab Tokenizer protocol, since pretrained models expect their own tokenizer rather than a character or tiktoken tokenizer.
- HFDataset: wraps datasets.load_dataset with streaming support. Yields (input, target) batches by tokenizing on-the-fly from a token buffer.
All three use lazy imports (from transformers import ... inside
__init__), keeping the base library free of heavy dependencies.
The transformers package is only required when HF features are used.
Weight conversion proved to be the most difficult component. Different architectures store weights in different formats (some fuse QKV, some do not; some transpose FFN weights, some do not). The solution was a mapping table per architecture, with clear error messages for missing keys.
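The mapping-table idea, sketched in plain Python (key names here are illustrative, not the real checkpoint keys):

```python
def convert_weights(hf_weights, key_map):
    """Rename HF weight keys via a per-architecture mapping table.

    Unknown keys raise immediately with a clear message instead of
    silently producing a half-converted checkpoint.
    """
    out = {}
    for hf_key, tensor in hf_weights.items():
        if hf_key not in key_map:
            raise KeyError(f"no mapping for HF weight '{hf_key}'; add it to the table")
        out[key_map[hf_key]] = tensor
    return out

# Illustrative key names only; real tables cover every layer and handle
# per-architecture quirks (fused QKV, transposed FFN weights) separately.
key_map = {"model.embed_tokens.weight": "embedding.weight"}
converted = convert_weights({"model.embed_tokens.weight": [0.0]}, key_map)
```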
Streaming for Large Datasets
HFDataset.batch_iterator() uses a token buffer pattern: accumulate
tokens from the dataset stream until there are enough for one batch, yield
it, and keep the remainder. The full dataset need not fit in memory,
which matters for multi-GB corpora.
The streaming=True flag enables HuggingFace's iterable dataset mode,
which downloads data on demand instead of caching the full dataset locally.
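The buffer pattern itself is simple; a stripped-down sketch (this batch_iterator is a hypothetical stand-in, not the real HFDataset method, and it consumes pre-tokenized chunks rather than tokenizing on the fly):

```python
def batch_iterator(token_stream, batch_size, seq_len):
    """Yield (input, target) batches from a token stream without materializing it.

    Accumulate tokens until one batch's worth (+1 for the shifted targets)
    is available, yield it, and keep the remainder in the buffer.
    """
    need = batch_size * seq_len + 1
    buffer = []
    for tokens in token_stream:
        buffer.extend(tokens)
        while len(buffer) >= need:
            chunk = buffer[:need]
            inputs = [chunk[i * seq_len:(i + 1) * seq_len] for i in range(batch_size)]
            targets = [chunk[i * seq_len + 1:(i + 1) * seq_len + 1] for i in range(batch_size)]
            yield inputs, targets
            buffer = buffer[batch_size * seq_len:]  # keep the overlap token

# Documents arrive as variable-length token lists; batches come out fixed-size.
stream = iter([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
x, y = next(batch_iterator(stream, batch_size=2, seq_len=2))
```

Memory stays bounded by the buffer size, not the corpus size, which is the property that matters for multi-GB streams.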
Advanced Training Features
DPO, GRPO, and MTP
These three training objectives improve model quality in different ways:
| Method | Signal Source | Key Idea |
|---|---|---|
| DPO | Preference pairs | Learn from "A is better than B" without a reward model |
| GRPO | Scalar rewards | Group-relative normalization of per-completion rewards |
| MTP | Same data, richer targets | Predict multiple future tokens, not just the next one |
DPO replaces RLHF's reward model + PPO pipeline with a single loss function. The mathematical insight: the optimal policy under the KL-constrained reward maximization objective has a closed-form solution that only needs the policy and reference model log probabilities.
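The resulting loss is compact. In scalar form, per preference pair, with sequence log-probabilities already summed (a sketch of the standard DPO formula, not lmxlab's exact code):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. The implicit reward of each completion is
    beta * (log pi(y|x) - log pi_ref(y|x)); the loss is -log sigmoid of
    the chosen-minus-rejected reward margin.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy prefers the chosen completion more than the reference does,
# the loss drops below -log(0.5) ~= 0.693.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)
```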
GRPO is closer to classic policy gradient but normalizes rewards within each group of completions (zero mean, unit variance). This removes the need for a value function baseline and makes training more stable.
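The group-relative normalization is a few lines (a sketch of the advantage computation, assuming scalar rewards for K completions of one prompt):

```python
import math

def group_advantages(rewards):
    """Normalize per-completion rewards within one group: zero mean, unit variance."""
    k = len(rewards)
    mean = sum(rewards) / k
    var = sum((r - mean) ** 2 for r in rewards) / k
    std = math.sqrt(var) + 1e-8  # epsilon guards all-equal-reward groups
    return [(r - mean) / std for r in rewards]

# Two passing and two failing completions: advantages of +1 and -1.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```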
MTP is orthogonal: it does not change the objective but enriches the training signal. Each position predicts not just the next token but the next 2-4 tokens via lightweight auxiliary heads. This provides richer gradients and enables speculative decoding at inference time (the auxiliary heads serve as draft predictors).
Curriculum Learning: Start Easy
Length curriculum (short sequences → long sequences) follows a simple principle: let the model learn basic patterns on short context before tackling long-range dependencies. Empirically, this often converges faster than training on the final sequence length from the start.
The implementation uses linear interpolation of sequence length across stages, with no complex scheduling required.
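A sketch of that interpolation (the function name and default lengths are illustrative):

```python
def curriculum_seq_len(step, total_steps, min_len=64, max_len=512):
    """Linearly interpolate sequence length over training, clamped to [min, max]."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(min_len + frac * (max_len - min_len))

# Short contexts early, full length by the end of training.
lengths = [curriculum_seq_len(s, 1000) for s in (0, 500, 1000)]
```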
Pre-Registered Experiment Plans
Following Platt's strong inference and Chamberlin's multiple working hypotheses, each experiment below specifies competing hypotheses and predictions before running. This guards against confirmation bias and the garden of forking paths (Gelman & Loken, 2013).
Experiment 1: GPT-to-LLaMA Feature Ablation
Question: When adding LLaMA-style features to a GPT baseline one at a time, which individual change contributes most to improved training dynamics?
Competing hypotheses:
- H1 (Attention dominates): GQA provides the largest improvement because it enables more efficient use of the parameter budget (sharing KV heads frees capacity for other components).
- H2 (FFN dominates): SwiGLU provides the largest improvement because the gating mechanism gives the network better gradient flow and expressiveness.
- H3 (Normalization dominates): RMSNorm + no bias provides the largest improvement because it stabilizes training, allowing higher learning rates.
- H4 (Interactions dominate): No single change provides more than 20% of the total improvement; the benefit comes from combining features (non-linear interaction).
Design:
| Run | Attention | FFN | Norm | Position | Bias |
|---|---|---|---|---|---|
| Baseline | MHA | Standard | LayerNorm | Sinusoidal | Yes |
| +GQA | GQA | Standard | LayerNorm | Sinusoidal | Yes |
| +SwiGLU | MHA | Gated | LayerNorm | Sinusoidal | Yes |
| +RMSNorm | MHA | Standard | RMSNorm | Sinusoidal | No |
| +RoPE | MHA | Standard | LayerNorm | RoPE | Yes |
| Full LLaMA | GQA | Gated | RMSNorm | RoPE | No |
Protocol: 5-minute time budget per run (autoresearch pattern),
3 seeds each, d_model=256, n_layers=6, Shakespeare dataset. Report
val_bpb mean +/- std. Use the ablation_gpt_to_llama.py recipe.
Analysis plan: ANOVA across single-feature runs, then compare sum of individual improvements to Full LLaMA improvement to test H4. Report effect sizes (Cohen's d) relative to baseline, not just statistical significance.
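For concreteness, the effect-size computation is standard Cohen's d with a pooled standard deviation (the val_bpb numbers below are made-up; lower is better, so baseline minus treatment is positive when the treatment helps):

```python
import math

def cohens_d(baseline_runs, treatment_runs):
    """Effect size: mean difference over pooled (sample) standard deviation."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):  # unbiased sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    n1, n2 = len(baseline_runs), len(treatment_runs)
    pooled = math.sqrt(((n1 - 1) * var(baseline_runs)
                        + (n2 - 1) * var(treatment_runs)) / (n1 + n2 - 2))
    return (mean(baseline_runs) - mean(treatment_runs)) / pooled

# Three seeds per run, as in the protocol above; illustrative values only.
d = cohens_d([1.60, 1.62, 1.58], [1.50, 1.52, 1.48])
```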
What would falsify each hypothesis:
- H1: If +GQA improvement < 20% of total (baseline to full LLaMA)
- H2: If +SwiGLU improvement < 20% of total
- H3: If +RMSNorm improvement < 20% of total
- H4: If any single feature contributes > 50% of total
Experiment 2: mx.compile Coverage Analysis
Question: How does the speedup from mx.compile scale as we
progressively compile more of the training pipeline?
Competing hypotheses:
- H1 (Graph size dominates): Speedup scales roughly linearly with the fraction of computation compiled, because each fused operation saves one memory round-trip.
- H2 (Diminishing returns): The first compilation (the training step) captures most of the benefit; additional compilation of evaluation or data preprocessing provides negligible speedup.
- H3 (Overhead at small scale): For tiny models, compilation overhead (tracing, first-step latency) dominates, and compiled code is actually slower for the first N steps.
Design:
| Config | What's compiled |
|---|---|
| None | Nothing compiled |
| Step only | Training step (_single_step) |
| Step + eval | Training step + evaluation forward pass |
Measure: steps/second (excluding first 5 warmup steps), peak memory, time-to-first-step. Three model sizes: tiny (64d/2L), small (256d/6L), medium (512d/8L). 3 seeds each.
Analysis plan: Plot speedup ratio vs model size. Report time-to-first-step separately (compilation overhead). If H3 is correct, there should be a crossover point where compilation becomes net positive.
Experiment 3: Optimizer Comparison on Unified Memory
Question: Does Apple Silicon's unified memory architecture change which optimizers work best, compared to published CUDA results?
Competing hypotheses:
- H1 (Same story): AdamW dominates regardless of hardware, because optimizer dynamics depend on loss landscape geometry not memory architecture.
- H2 (Memory-efficient wins): SGD with momentum or Adafactor perform comparatively better on Apple Silicon because they use less optimizer state, leaving more unified memory for larger batch sizes.
- H3 (Bandwidth matters): Optimizers with fewer memory accesses per step (SGD) gain a disproportionate advantage because Apple Silicon has lower memory bandwidth than datacenter GPUs.
Design: Train LLaMA-small (256d/6L) on Shakespeare with AdamW, SGD+momentum, Adafactor, and Lion. Same learning rate sweep for each (log-scale: 1e-4, 3e-4, 1e-3, 3e-3). 5-minute time budget, 3 seeds.
Metrics: Best val_bpb achieved, steps/second, peak memory usage.
Analysis plan: Compare best-run val_bpb across optimizers (paired across seeds). Report both best-of-sweep and mean-across-sweep to distinguish "works with tuning" from "works robustly." Compare steps/second ratios to published CUDA ratios for the same optimizers.
Experiment 4: KV Cache Reduction with MLA
Question: Does MLA's ~57x KV cache compression translate to meaningful practical benefits on unified memory?
Competing hypotheses:
- H1 (Memory benefit): MLA enables substantially longer generation (higher max_tokens before OOM) because KV cache is the binding memory constraint during inference.
- H2 (No practical benefit): On unified memory, the total memory pool is large enough that KV cache is not the binding constraint for typical sequence lengths (< 8K tokens). MLA's benefit only appears at very long contexts.
- H3 (Speed benefit): MLA is faster per-token during generation because reading a smaller KV cache from memory is faster (bandwidth-bound operation).
Design: Compare DeepSeek-style MLA vs standard MHA at matched parameter counts. Generate sequences of increasing length (512, 1K, 2K, 4K, 8K tokens). Measure: tokens/second, peak memory, and maximum achievable sequence length before OOM.
Analysis plan: Plot tokens/second and memory vs sequence length for both. If H1 is correct, there should be a clear divergence point. If H3 is correct, MLA should be consistently faster even at short lengths.
Experiment 5: What Can You Train in 5 Minutes?
Question: What is the best validation BPB achievable in exactly 5 minutes of wall-clock training on an M-series Mac?
A fixed time budget eliminates timing confounds and makes all experiments directly comparable regardless of architecture complexity.
Protocol:
- Start with GPT-tiny on Shakespeare (baseline)
- Iterate: modify one thing (architecture, hyperparameter, data), train for 5 minutes, record val_bpb
- Keep changes that improve; discard changes that don't
- Git-as-experiment-infra: each run is a commit on an experiments/5min-* branch
Metrics: val_bpb (primary), training loss curve shape, parameter count (prefer simpler models at equal performance).
Simplicity bias: If two configurations achieve similar val_bpb, prefer the one with fewer parameters or simpler architecture. A 0.001 improvement from deleting a feature is worth more than the same improvement from adding complexity.
Research Directions
Longer-term areas for investigation, each building on existing lmxlab infrastructure.
Test-Time Compute Scaling
Rather than training a larger model, additional compute can be spent at inference time. This is the principle behind chain-of-thought prompting, tree search, best-of-N sampling, and recent "thinking" paradigms.
Existing primitives:
- best_of_n(): generate N completions, select the highest log-probability candidate
- majority_vote(): generate N answers, return the most common
- speculative_decoding: draft + verify for faster generation
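In spirit, the first two primitives reduce to a few lines (a sketch, not the lmxlab implementations; here completions arrive pre-scored as (text, log-prob) pairs):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among N sampled completions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(completions):
    """Pick the completion with the highest total log-probability.

    completions: list of (text, log_prob) pairs from N sampled generations.
    """
    return max(completions, key=lambda c: c[1])[0]

winner = majority_vote(["42", "41", "42"])
best = best_of_n([("a", -5.0), ("b", -2.5), ("c", -7.1)])
```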
Open directions:
- Scaling laws for inference compute: whether quality scales log-linearly with N in best-of-N, or plateaus.
- Tree search at generation time: comparing MCTS over token sequences against best-of-N at matched compute.
- Iterative refinement (generate, critique, regenerate): whether self-evaluation loops improve output, and the optimal iteration count.
- Budget-optimal allocation: whether one large model or many small models with voting is preferable at fixed compute. On unified memory, there is no transfer overhead between model copies.
This extends Experiment 5 from training to inference: the best output quality achievable under a fixed inference compute budget.
Knowledge Distillation
A smaller "student" model is trained to mimic a larger "teacher" model. The student learns from the teacher's soft probability distributions, which carry more information than hard labels alone (Hinton et al., 2015).
Distillation connects training objectives, model capacity, and information theory. On Apple Silicon with limited memory, distilling a large model into a smaller one is directly useful. DPO and GRPO already implement reference-model patterns (comparing policy vs reference log-probabilities), and distillation uses the same machinery.
Planned components:
- Basic distillation loss: KL divergence between teacher and student logits (temperature-scaled), as a training module alongside dpo.py and grpo.py.
- Online vs offline distillation. Offline: pre-compute teacher logits and store them. Online: run teacher forward pass during student training. The trade-off is memory vs flexibility.
- Layer-wise distillation: matching intermediate representations, not just final logits. This requires compatible architectures (same hidden dimension or a projection layer).
- Self-distillation: a model distills into a smaller version of itself, with potential for iterative compression.
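The basic loss in scalar form (a pure-Python sketch of temperature-scaled distillation over one position's logits; real code would operate on batched arrays):

```python
import math

def softmax(logits, temperature):
    """Temperature-softened softmax, shifted for numerical stability."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by T^2 so gradient magnitudes stay comparable as T varies
    (Hinton et al., 2015).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Identical logits -> zero loss; the loss grows as the student diverges.
zero = distill_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```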
Experiment: train a LLaMA-small teacher on Shakespeare, distill into a GPT-tiny student. Compare student quality against training from scratch at the same compute budget, and test whether the gap widens as the student-teacher size ratio increases.
RL with Verifiable Rewards (Code Generation)
Reinforcement learning works best when rewards are unambiguous. Code is a natural domain because correctness is verifiable: run the tests, check whether they pass. This eliminates the reward model noise that affects RLHF on open-ended text.
Existing primitives:
- grpo_loss(): Group Relative Policy Optimization, taking scalar rewards per completion
- pass_at_k() / evaluate_pass_at_k(): code generation metrics scoring completions by test passage
- best_of_n(): generate multiple completions, select the best
Planned pipeline:
- Problem format: function-completion tasks (given a docstring with examples, generate the function body), starting with arithmetic and string problems where test cases are easy to generate.
- Reward function: execute generated code in a sandbox, run test cases, return pass rate as the reward signal (binary or fractional).
- Training loop: for each problem, generate K completions, score each with the reward function, compute GRPO loss. This connects grpo_loss directly to pass_at_k as the reward signal.
- Difficulty curriculum: start with single-function simple logic, progress to harder problems.
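A minimal reward function along these lines (a sketch only: it exec()s model output in-process, which is unsafe; the real pipeline would isolate execution in a subprocess or sandbox with timeouts):

```python
def code_reward(completion: str, test_cases: list[str]) -> float:
    """Fractional reward: share of test cases the generated code passes.

    WARNING: exec() of model output is unsafe; this sketch only
    illustrates the reward shape fed to GRPO.
    """
    namespace: dict = {}
    try:
        exec(completion, namespace)  # define the generated function
    except Exception:
        return 0.0  # code that doesn't even parse earns nothing
    passed = 0
    for case in test_cases:
        try:
            exec(case, namespace)  # e.g. "assert add(2, 3) == 5"
            passed += 1
        except Exception:
            pass
    return passed / len(test_cases)

# A completion passing 2 of 3 tests earns a fractional reward of 2/3.
reward = code_reward(
    "def add(a, b):\n    return a + b",
    ["assert add(2, 3) == 5", "assert add(-1, 1) == 0", "assert add(0, 0) == 1"],
)
```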
Research questions:
- How many completions per problem (K) are needed for stable GRPO training? Theory favors more for variance reduction, but compute scales linearly.
- Does the model generalize from easy problems to hard ones, or overfit to the reward signal on seen problem types?
- How does RL fine-tuning compare to SFT on verified solutions? SFT is simpler but requires curated correct solutions; RL can learn from the model's own attempts.
- On unified memory, can code execution and model inference run simultaneously without contention?
This extends GRPO from the optimizer comparison in Experiment 3 to code
tasks, and extends the pass_at_k metric from passive measurement to
active training signal.
Open Questions
- How do training dynamics change between Apple Silicon and CUDA for the same architecture? Are learning rate sensitivities the same?
- What's the optimal mx.eval() frequency? Too many evals waste time on synchronization; too few let the computation graph grow unboundedly.
- Does MLA's KV compression provide real memory benefits on unified memory, or is the bottleneck elsewhere? (See Experiment 4.)
- Can we use MLX's compilation to automatically fuse attention + FFN in a single block (like FlashAttention does for attention alone)?
- How does the 3:1 DeltaNet:GQA ratio in Qwen 3.5 compare to other ratios (2:1, 4:1) on our educational-scale models?
- At what model size does compilation speedup become worth the first-step overhead? (See Experiment 2.)
- What are the scaling laws for inference compute on Apple Silicon? (See Test-Time Compute Scaling above.)
- Can distillation produce models that are better than training from scratch at the same size? (See Knowledge Distillation above.)
- How effective is RL with verifiable rewards compared to SFT on curated solutions? (See RL with Verifiable Rewards above.)