Developer Log
Design decisions, trade-offs, and lessons from building lmxlab.
Design Decisions
Config Factories Over Class Hierarchies
GPT, LLaMA, DeepSeek, etc. are not separate model classes. They are
config factory functions returning ModelConfig.
After studying the original LMT (PyTorch version), it became clear that
transformer architectures differ in configuration, not structure. A
LlamaModel and a GPTModel perform the same operations (embed, attend,
feed-forward, project) with different component choices. Encoding this
as inheritance (LlamaModel(BaseModel)) creates artificial boundaries.
With config factories, switching from GPT to LLaMA is changing a function call, not a class hierarchy:
# These produce different configs, not different model classes
config = gpt_config(d_model=512, n_heads=8, n_layers=6)
config = llama_config(d_model=512, n_heads=8, n_kv_heads=4, n_layers=6)
# Same model class for both
model = LanguageModel(config)
Trade-off: this design makes individual architecture code harder to find
(no class LlamaModel to grep for). Well-documented config factories in
models/*.py mitigate this.
Registry Pattern for Components
Components (attention, FFN, norm, position) register themselves by string
name. ConfigurableBlock resolves them at construction time.
This decouples component implementation from block assembly. Adding a new
attention type (MLA, sliding window) requires only writing the module and
registering it, with no changes to ConfigurableBlock or existing code.
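A minimal sketch of the pattern in plain Python (names like ATTENTION_REGISTRY, register_attention, and build_attention are illustrative, not lmxlab's actual API):

```python
# Registry-pattern sketch; names are illustrative, not lmxlab's real API.
ATTENTION_REGISTRY: dict = {}

def register_attention(name):
    """Decorator: register an attention class under a string name."""
    def wrap(cls):
        ATTENTION_REGISTRY[name] = cls
        return cls
    return wrap

@register_attention("mha")
class MultiHeadAttention:
    def __init__(self, config):
        self.n_heads = config["n_heads"]

@register_attention("gqa")
class GroupedQueryAttention:
    def __init__(self, config):
        self.n_heads = config["n_heads"]
        self.n_kv_heads = config["n_kv_heads"]

def build_attention(config):
    """What a ConfigurableBlock does at construction: resolve by string name."""
    return ATTENTION_REGISTRY[config["attention"]](config)

# Adding a new variant is one decorated class; no assembly code changes.
attn = build_attention({"attention": "gqa", "n_heads": 8, "n_kv_heads": 4})
```

Because resolution happens at construction time, a typo in the config fails loudly with a KeyError rather than silently building the wrong block.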
An early mistake: MoE FFNs needed special registration because their
constructors take extra arguments (n_experts, top_k). The fix was to
have BlockConfig carry MoE fields, with MoE constructors reading from
the config with optional overrides.
Explicit mx.eval() Boundaries
MLX operations are lazy: nothing computes until mx.eval() is called.
This requires explicit eval boundaries at specific points:
- After model construction (mx.eval(model.parameters()))
- During generation (eval each token before feeding it back)
- At training step boundaries (eval loss for logging)
Omitting mx.eval() after model construction can cause the first
training step to be extremely slow, because it triggers both
initialization and the first forward pass in one graph.
Lessons Learned
Unified Memory Changes the Trade-offs
On CUDA, the CPU-GPU memory boundary is a constant concern: batch sizes, gradient accumulation, offloading. On Apple Silicon with MLX, unified memory means:
- No .to(device) calls: arrays live in one shared memory pool
- No data transfer bottleneck between CPU and GPU
- The memory ceiling is system RAM (not separate GPU VRAM)
- Speculative decoding benefits from shared memory between draft and verify models
Open questions remain: whether gradient accumulation matters less without transfer overhead, and whether different optimizer choices win when memory access patterns change. These are targets for the experiment framework.
mx.compile Is Not Free
Compiling the training step with mx.compile provides significant speedups
(1.3-2x typical, potentially more for larger models), but it constrains
the available operations:
- No Python control flow that depends on array values
- Must declare inputs and outputs explicitly
- Graph changes (like modifying model structure) require recompilation
For educational code, we default to compile_step=True but make it easy
to disable for debugging. The benchmark_compile.py recipe measures the
actual speedup on a given machine.
LoRA and QLoRA on Unified Memory
Parameter-efficient fine-tuning has different economics on unified memory. On CUDA, QLoRA's primary value is fitting a larger model in limited VRAM. On Apple Silicon, the trade-offs differ:
- The memory savings still matter (system RAM is shared with the OS)
- But the performance characteristics may differ (no quantize-dequantize transfer between devices)
- Adapter save/load is valuable regardless of memory architecture, as small adapter files can be shared independently of multi-GB base models
Testing ML Code
Standard unit testing (assertEqual) does not apply well to ML code
because outputs are stochastic. Behavioral tests are used instead:
- Invariance tests: same input and seed produce the same output
- Directional tests: loss decreases after training steps
- Shape tests: output dimensions match (batch, seq_len, vocab_size)
- Minimum functionality tests: outputs are finite (no NaN, no Inf)
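In pure-Python form, with a stand-in model (the real tests exercise lmxlab modules), three of the four styles look like this; the directional test additionally needs a training loop and is omitted here:

```python
import math
import random

def toy_model(tokens, seed):
    """Stand-in for a model forward pass: deterministic given a seed."""
    rng = random.Random(seed)
    vocab_size = 16
    # One row of vocab_size "logits" per input token.
    return [[rng.uniform(-1, 1) for _ in range(vocab_size)] for _ in tokens]

tokens = [3, 1, 4, 1, 5]
out = toy_model(tokens, seed=0)

# Invariance: same input and seed -> same output.
assert toy_model(tokens, seed=0) == toy_model(tokens, seed=0)

# Shape: one logit row per token, each of width vocab_size.
assert len(out) == len(tokens) and all(len(row) == 16 for row in out)

# Minimum functionality: all outputs finite (no NaN, no Inf).
assert all(math.isfinite(v) for row in out for v in row)
```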
Architecture Notes
Attention Variants Are Configurations, Not Architectures
| Variant | Distinguishing property |
|---|---|
| MHA | Full attention: each head has independent K, V |
| GQA | Share K, V across groups of query heads |
| MLA | Compress KV into a low-rank latent, reconstruct at attention time |
| Sliding Window | Limit attention span per layer (local vs global) |
| GatedDeltaNet | Linear attention with gated delta rule (no softmax) |
These all implement the same interface: (x, mask, cache) -> (output, cache).
The difference is how they compute and store key-value state.
MoE Is Just a Different FFN
Mixture of Experts replaces the dense FFN with a routed sparse FFN.
The router selects top-k experts per token, and the outputs are combined
by router weights. From ConfigurableBlock's perspective, it's just
another FFN, registered as "moe" instead of "gated".
The interesting part is load balancing. Without it, the router collapses to always picking the same experts. We implement bias-based balancing (SharedExpertMoEFFN) which avoids the auxiliary loss used in some implementations.
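The routing step itself is small; a plain-Python sketch of generic top-k routing (softmax over router logits, renormalized over the selected experts; this shows the mechanism, not lmxlab's SharedExpertMoEFFN):

```python
import math

def route(router_logits, top_k):
    """Pick top_k experts for one token and renormalize their softmax weights."""
    m = max(router_logits)                      # shift for numerical stability
    exps = [math.exp(l - m) for l in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    # (expert index, combination weight) pairs; weights sum to 1.
    return [(i, probs[i] / norm) for i in top]

# One token, 4 experts, top-2 routing.
choices = route([0.1, 2.0, -0.5, 1.0], top_k=2)
```

Load balancing enters as a correction to these router logits (a learned per-expert bias in the bias-based scheme) so the top-k selection does not collapse onto the same experts.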
HuggingFace Integration
Components
Connecting to the HuggingFace ecosystem required three components:
- load_from_hf(): downloads and converts pretrained weights from the Hub. Maps HF config keys to ModelConfig, converts weight names, and handles architecture-specific details (rotated QKV in LLaMA, fused gate-up in Gemma).
- HFTokenizer: wraps AutoTokenizer with the lmxlab Tokenizer protocol, since pretrained models expect their own tokenizer rather than a character or tiktoken tokenizer.
- HFDataset: wraps datasets.load_dataset with streaming support. Yields (input, target) batches by tokenizing on-the-fly from a token buffer.
All three use lazy imports (from transformers import ... inside
__init__), keeping the base library free of heavy dependencies.
The transformers package is only required when HF features are used.
Weight conversion proved to be the most difficult component. Different architectures store weights in different formats (some fuse QKV, some do not; some transpose FFN weights, some do not). The solution was a mapping table per architecture, with clear error messages for missing keys.
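The mapping-table idea, sketched in plain Python (key names here are illustrative, not the real checkpoint keys):

```python
def convert_weights(hf_weights, key_map):
    """Rename HF weight keys via a per-architecture mapping table.

    Unknown keys raise immediately with a clear message instead of
    silently producing a half-converted checkpoint.
    """
    out = {}
    for hf_key, tensor in hf_weights.items():
        if hf_key not in key_map:
            raise KeyError(f"no mapping for HF weight '{hf_key}'; add it to the table")
        out[key_map[hf_key]] = tensor
    return out

# Illustrative key names only; real tables cover every layer and handle
# per-architecture quirks (fused QKV, transposed FFN weights) separately.
key_map = {"model.embed_tokens.weight": "embedding.weight"}
converted = convert_weights({"model.embed_tokens.weight": [0.0]}, key_map)
```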
Streaming for Large Datasets
HFDataset.batch_iterator() uses a token buffer pattern: accumulate
tokens from the dataset stream until there are enough for one batch, yield
it, and keep the remainder. The full dataset need not fit in memory,
which matters for multi-GB corpora.
The streaming=True flag enables HuggingFace's iterable dataset mode,
which downloads data on demand instead of caching the full dataset locally.
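The buffer pattern itself is simple; a stripped-down sketch (this batch_iterator is a hypothetical stand-in, not the real HFDataset method, and it consumes pre-tokenized chunks rather than tokenizing on the fly):

```python
def batch_iterator(token_stream, batch_size, seq_len):
    """Yield (input, target) batches from a token stream without materializing it.

    Accumulate tokens until one batch's worth (+1 for the shifted targets)
    is available, yield it, and keep the remainder in the buffer.
    """
    need = batch_size * seq_len + 1
    buffer = []
    for tokens in token_stream:
        buffer.extend(tokens)
        while len(buffer) >= need:
            chunk = buffer[:need]
            inputs = [chunk[i * seq_len:(i + 1) * seq_len] for i in range(batch_size)]
            targets = [chunk[i * seq_len + 1:(i + 1) * seq_len + 1] for i in range(batch_size)]
            yield inputs, targets
            buffer = buffer[batch_size * seq_len:]  # keep the overlap token

# Documents arrive as variable-length token lists; batches come out fixed-size.
stream = iter([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
x, y = next(batch_iterator(stream, batch_size=2, seq_len=2))
```

Memory stays bounded by the buffer size, not the corpus size, which is the property that matters for multi-GB streams.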
Advanced Training Features
DPO, GRPO, and MTP
These three training objectives improve model quality in different ways:
| Method | Signal Source | Key Idea |
|---|---|---|
| DPO | Preference pairs | Learn from "A is better than B" without a reward model |
| GRPO | Scalar rewards | Group-relative normalization of per-completion rewards |
| MTP | Same data, richer targets | Predict multiple future tokens, not just the next one |
DPO replaces RLHF's reward model + PPO pipeline with a single loss function. The mathematical insight: the optimal policy under the KL-constrained reward maximization objective has a closed-form solution that only needs the policy and reference model log probabilities.
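The resulting loss is compact. In scalar form, per preference pair, with sequence log-probabilities already summed (a sketch of the standard DPO formula, not lmxlab's exact code):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. The implicit reward of each completion is
    beta * (log pi(y|x) - log pi_ref(y|x)); the loss is -log sigmoid of
    the chosen-minus-rejected reward margin.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy prefers the chosen completion more than the reference does,
# the loss drops below -log(0.5) ~= 0.693.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)
```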
GRPO is closer to classic policy gradient but normalizes rewards within each group of completions (zero mean, unit variance). This removes the need for a value function baseline and makes training more stable.
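The group-relative normalization is a few lines (a sketch of the advantage computation, assuming scalar rewards for K completions of one prompt):

```python
import math

def group_advantages(rewards):
    """Normalize per-completion rewards within one group: zero mean, unit variance."""
    k = len(rewards)
    mean = sum(rewards) / k
    var = sum((r - mean) ** 2 for r in rewards) / k
    std = math.sqrt(var) + 1e-8  # epsilon guards all-equal-reward groups
    return [(r - mean) / std for r in rewards]

# Two passing and two failing completions: advantages of +1 and -1.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```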
MTP is orthogonal: it does not change the objective but enriches the training signal. Each position predicts not just the next token but the next 2-4 tokens via lightweight auxiliary heads. This provides richer gradients and enables speculative decoding at inference time (the auxiliary heads serve as draft predictors).
Curriculum Learning: Start Easy
Length curriculum (short sequences → long sequences) follows a simple principle: let the model learn basic patterns on short context before tackling long-range dependencies. Empirically, this often converges faster than training on the final sequence length from the start.
The implementation uses linear interpolation of sequence length across stages, with no complex scheduling required.
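A sketch of that interpolation (the function name and default lengths are illustrative):

```python
def curriculum_seq_len(step, total_steps, min_len=64, max_len=512):
    """Linearly interpolate sequence length over training, clamped to [min, max]."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(min_len + frac * (max_len - min_len))

# Short contexts early, full length by the end of training.
lengths = [curriculum_seq_len(s, 1000) for s in (0, 500, 1000)]
```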
Pre-Registered Experiment Plans
Following Platt's strong inference and Chamberlin's multiple working hypotheses, each experiment below specifies competing hypotheses and predictions before running. This guards against confirmation bias and the garden of forking paths (Gelman & Loken, 2013).
Experiment 1: GPT-to-LLaMA Feature Ablation
Question: When adding LLaMA-style features to a GPT baseline one at a time, which individual change contributes most to improved training dynamics?
Competing hypotheses:
- H1 (Attention dominates): GQA provides the largest improvement because it enables more efficient use of the parameter budget (sharing KV heads frees capacity for other components).
- H2 (FFN dominates): SwiGLU provides the largest improvement because the gating mechanism gives the network better gradient flow and expressiveness.
- H3 (Normalization dominates): RMSNorm + no bias provides the largest improvement because it stabilizes training, allowing higher learning rates.
- H4 (Interactions dominate): No single change provides more than 20% of the total improvement; the benefit comes from combining features (non-linear interaction).
Design:
| Run | Attention | FFN | Norm | Position | Bias |
|---|---|---|---|---|---|
| Baseline | MHA | Standard | LayerNorm | Sinusoidal | Yes |
| +GQA | GQA | Standard | LayerNorm | Sinusoidal | Yes |
| +SwiGLU | MHA | Gated | LayerNorm | Sinusoidal | Yes |
| +RMSNorm | MHA | Standard | RMSNorm | Sinusoidal | No |
| +RoPE | MHA | Standard | LayerNorm | RoPE | Yes |
| Full LLaMA | GQA | Gated | RMSNorm | RoPE | No |
Protocol: 5-minute time budget per run (autoresearch pattern),
3 seeds each, d_model=256, n_layers=6, Shakespeare dataset. Report
val_bpb mean +/- std. Use the ablation_gpt_to_llama.py recipe.
Analysis plan: ANOVA across single-feature runs, then compare sum of individual improvements to Full LLaMA improvement to test H4. Report effect sizes (Cohen's d) relative to baseline, not just statistical significance.
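For concreteness, the effect-size computation is standard Cohen's d with a pooled standard deviation (the val_bpb numbers below are made-up; lower is better, so baseline minus treatment is positive when the treatment helps):

```python
import math

def cohens_d(baseline_runs, treatment_runs):
    """Effect size: mean difference over pooled (sample) standard deviation."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):  # unbiased sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    n1, n2 = len(baseline_runs), len(treatment_runs)
    pooled = math.sqrt(((n1 - 1) * var(baseline_runs)
                        + (n2 - 1) * var(treatment_runs)) / (n1 + n2 - 2))
    return (mean(baseline_runs) - mean(treatment_runs)) / pooled

# Three seeds per run, as in the protocol above; illustrative values only.
d = cohens_d([1.60, 1.62, 1.58], [1.50, 1.52, 1.48])
```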
What would falsify each hypothesis:
- H1: If +GQA improvement < 20% of total (baseline to full LLaMA)
- H2: If +SwiGLU improvement < 20% of total
- H3: If +RMSNorm improvement < 20% of total
- H4: If any single feature contributes > 50% of total
Experiment 2: mx.compile Coverage Analysis
Question: How does the speedup from mx.compile scale as we
progressively compile more of the training pipeline?
Competing hypotheses:
- H1 (Graph size dominates): Speedup scales roughly linearly with the fraction of computation compiled, because each fused operation saves one memory round-trip.
- H2 (Diminishing returns): The first compilation (the training step) captures most of the benefit; additional compilation of evaluation or data preprocessing provides negligible speedup.
- H3 (Overhead at small scale): For tiny models, compilation overhead (tracing, first-step latency) dominates, and compiled code is actually slower for the first N steps.
Design:
| Config | What's compiled |
|---|---|
| None | Nothing compiled |
| Step only | Training step (_single_step) |
| Step + eval | Training step + evaluation forward pass |
Measure: steps/second (excluding first 5 warmup steps), peak memory, time-to-first-step. Three model sizes: tiny (64d/2L), small (256d/6L), medium (512d/8L). 3 seeds each.
Analysis plan: Plot speedup ratio vs model size. Report time-to-first-step separately (compilation overhead). If H3 is correct, there should be a crossover point where compilation becomes net positive.
Experiment 3: Optimizer Comparison on Unified Memory
Question: Does Apple Silicon's unified memory architecture change which optimizers work best, compared to published CUDA results?
Competing hypotheses:
- H1 (Same story): AdamW dominates regardless of hardware, because optimizer dynamics depend on loss landscape geometry not memory architecture.
- H2 (Memory-efficient wins): SGD with momentum or Adafactor perform comparatively better on Apple Silicon because they use less optimizer state, leaving more unified memory for larger batch sizes.
- H3 (Bandwidth matters): Optimizers with fewer memory accesses per step (SGD) gain a disproportionate advantage because Apple Silicon has lower memory bandwidth than datacenter GPUs.
Design: Train LLaMA-small (256d/6L) on Shakespeare with AdamW, SGD+momentum, Adafactor, and Lion. Same learning rate sweep for each (log-scale: 1e-4, 3e-4, 1e-3, 3e-3). 5-minute time budget, 3 seeds.
Metrics: Best val_bpb achieved, steps/second, peak memory usage.
Analysis plan: Compare best-run val_bpb across optimizers (paired across seeds). Report both best-of-sweep and mean-across-sweep to distinguish "works with tuning" from "works robustly." Compare steps/second ratios to published CUDA ratios for the same optimizers.
Experiment 4: KV Cache Reduction with MLA
Question: Does MLA's ~57x KV cache compression translate to meaningful practical benefits on unified memory?
Competing hypotheses:
- H1 (Memory benefit): MLA enables substantially longer generation (higher max_tokens before OOM) because KV cache is the binding memory constraint during inference.
- H2 (No practical benefit): On unified memory, the total memory pool is large enough that KV cache is not the binding constraint for typical sequence lengths (< 8K tokens). MLA's benefit only appears at very long contexts.
- H3 (Speed benefit): MLA is faster per-token during generation because reading a smaller KV cache from memory is faster (bandwidth-bound operation).
Design: Compare DeepSeek-style MLA vs standard MHA at matched parameter counts. Generate sequences of increasing length (512, 1K, 2K, 4K, 8K tokens). Measure: tokens/second, peak memory, and maximum achievable sequence length before OOM.
Analysis plan: Plot tokens/second and memory vs sequence length for both. If H1 is correct, there should be a clear divergence point. If H3 is correct, MLA should be consistently faster even at short lengths.
Experiment 5: What Can You Train in 5 Minutes?
Question: What is the best validation BPB achievable in exactly 5 minutes of wall-clock training on an M-series Mac?
A fixed time budget eliminates timing confounds and makes all experiments directly comparable regardless of architecture complexity.
Protocol:
- Start with GPT-tiny on Shakespeare (baseline)
- Iterate: modify one thing (architecture, hyperparameter, data), train for 5 minutes, record val_bpb
- Keep changes that improve; discard changes that don't
- Git-as-experiment-infra: each run is a commit on an experiments/5min-* branch
Metrics: val_bpb (primary), training loss curve shape, parameter count (prefer simpler models at equal performance).
Simplicity bias: If two configurations achieve similar val_bpb, prefer the one with fewer parameters or simpler architecture. A 0.001 improvement from deleting a feature is worth more than the same improvement from adding complexity.
Research Directions
Longer-term areas for investigation, each building on existing lmxlab infrastructure.
Test-Time Compute Scaling
Rather than training a larger model, additional compute can be spent at inference time. This is the principle behind chain-of-thought prompting, tree search, best-of-N sampling, and recent "thinking" paradigms.
Existing primitives:
- best_of_n(): generate N completions, select the highest log-probability candidate
- majority_vote(): generate N answers, return the most common
- speculative_decoding: draft + verify for faster generation
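In spirit, the first two primitives reduce to a few lines (a sketch, not the lmxlab implementations; here completions arrive pre-scored as (text, log-prob) pairs):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among N sampled completions."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(completions):
    """Pick the completion with the highest total log-probability.

    completions: list of (text, log_prob) pairs from N sampled generations.
    """
    return max(completions, key=lambda c: c[1])[0]

winner = majority_vote(["42", "41", "42"])
best = best_of_n([("a", -5.0), ("b", -2.5), ("c", -7.1)])
```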
Open directions:
- Scaling laws for inference compute: whether quality scales log-linearly with N in best-of-N, or plateaus.
- Tree search at generation time: comparing MCTS over token sequences against best-of-N at matched compute.
- Iterative refinement (generate, critique, regenerate): whether self-evaluation loops improve output, and the optimal iteration count.
- Budget-optimal allocation: whether one large model or many small models with voting is preferable at fixed compute. On unified memory, there is no transfer overhead between model copies.
This extends Experiment 5 from training to inference: the best output quality achievable under a fixed inference compute budget.
Knowledge Distillation
A smaller "student" model is trained to mimic a larger "teacher" model. The student learns from the teacher's soft probability distributions, which carry more information than hard labels alone (Hinton et al., 2015).
Distillation connects training objectives, model capacity, and information theory. On Apple Silicon with limited memory, distilling a large model into a smaller one is directly useful. DPO and GRPO already implement reference-model patterns (comparing policy vs reference log-probabilities), and distillation uses the same machinery.
Planned components:
- Basic distillation loss: KL divergence between teacher and student logits (temperature-scaled), as a training module alongside dpo.py and grpo.py.
- Online vs offline distillation. Offline: pre-compute teacher logits and store them. Online: run teacher forward pass during student training. The trade-off is memory vs flexibility.
- Layer-wise distillation: matching intermediate representations, not just final logits. This requires compatible architectures (same hidden dimension or a projection layer).
- Self-distillation: a model distills into a smaller version of itself, with potential for iterative compression.
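The basic loss in scalar form (a pure-Python sketch of temperature-scaled distillation over one position's logits; real code would operate on batched arrays):

```python
import math

def softmax(logits, temperature):
    """Temperature-softened softmax, shifted for numerical stability."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by T^2 so gradient magnitudes stay comparable as T varies
    (Hinton et al., 2015).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Identical logits -> zero loss; the loss grows as the student diverges.
zero = distill_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```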
Experiment: train a LLaMA-small teacher on Shakespeare, distill into a GPT-tiny student. Compare student quality against training from scratch at the same compute budget, and test whether the gap widens as the student-teacher size ratio increases.
RL with Verifiable Rewards (Code Generation)
Reinforcement learning works best when rewards are unambiguous. Code is a natural domain because correctness is verifiable: run the tests, check whether they pass. This eliminates the reward model noise that affects RLHF on open-ended text.
Existing primitives:
- grpo_loss(): Group Relative Policy Optimization, taking scalar rewards per completion
- pass_at_k() / evaluate_pass_at_k(): code generation metrics scoring completions by test passage
- best_of_n(): generate multiple completions, select the best
Planned pipeline:
- Problem format: function-completion tasks (given a docstring with examples, generate the function body), starting with arithmetic and string problems where test cases are easy to generate.
- Reward function: execute generated code in a sandbox, run test cases, return pass rate as the reward signal (binary or fractional).
- Training loop: for each problem, generate K completions, score each with the reward function, compute GRPO loss. This connects grpo_loss directly to pass_at_k as the reward signal.
- Difficulty curriculum: start with single-function simple logic, progress to harder problems.
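A minimal reward function along these lines (a sketch only: it exec()s model output in-process, which is unsafe; the real pipeline would isolate execution in a subprocess or sandbox with timeouts):

```python
def code_reward(completion: str, test_cases: list[str]) -> float:
    """Fractional reward: share of test cases the generated code passes.

    WARNING: exec() of model output is unsafe; this sketch only
    illustrates the reward shape fed to GRPO.
    """
    namespace: dict = {}
    try:
        exec(completion, namespace)  # define the generated function
    except Exception:
        return 0.0  # code that doesn't even parse earns nothing
    passed = 0
    for case in test_cases:
        try:
            exec(case, namespace)  # e.g. "assert add(2, 3) == 5"
            passed += 1
        except Exception:
            pass
    return passed / len(test_cases)

# A completion passing 2 of 3 tests earns a fractional reward of 2/3.
reward = code_reward(
    "def add(a, b):\n    return a + b",
    ["assert add(2, 3) == 5", "assert add(-1, 1) == 0", "assert add(0, 0) == 1"],
)
```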
Research questions:
- How many completions per problem (K) are needed for stable GRPO training? Theory favors more for variance reduction, but compute scales linearly.
- Does the model generalize from easy problems to hard ones, or overfit to the reward signal on seen problem types?
- How does RL fine-tuning compare to SFT on verified solutions? SFT is simpler but requires curated correct solutions; RL can learn from the model's own attempts.
- On unified memory, can code execution and model inference run simultaneously without contention?
This extends GRPO from the optimizer comparison in Experiment 3 to code
tasks, and extends the pass_at_k metric from passive measurement to
active training signal.
Open Questions
- How do training dynamics change between Apple Silicon and CUDA for the same architecture? Are learning rate sensitivities the same?
- What's the optimal mx.eval() frequency? Too many evals waste time on synchronization; too few let the computation graph grow unboundedly.
- Does MLA's KV compression provide real memory benefits on unified memory, or is the bottleneck elsewhere? (See Experiment 4.)
- Can we use MLX's compilation to automatically fuse attention + FFN in a single block (like FlashAttention does for attention alone)?
- How does the 3:1 DeltaNet:GQA ratio in Qwen 3.5 compare to other ratios (2:1, 4:1) on our educational-scale models?
- At what model size does compilation speedup become worth the first-step overhead? (See Experiment 2.)
- What are the scaling laws for inference compute on Apple Silicon? (See Test-Time Compute Scaling above.)
- Can distillation produce models that are better than training from scratch at the same size? (See Knowledge Distillation above.)
- How effective is RL with verifiable rewards compared to SFT on curated solutions? (See RL with Verifiable Rewards above.)