# Recipes
Each script in the `recipes/` directory creates its own data, builds models, and prints results. External downloads are noted where applicable.
```bash
# Run any recipe
uv run python recipes/<recipe_name>.py

# Most recipes accept --help for CLI options
uv run python recipes/train_tiny_gpt.py --help
```
## Training Basics
These scripts train small models on Shakespeare text and demonstrate the core training loop.
| Recipe | Description |
|---|---|
| `train_tiny_gpt.py` | End-to-end workflow: build, train, generate |
| `train_llama_shakespeare.py` | BPE tokenization, compiled training |
| `train_curriculum.py` | Length curriculum: short sequences before long |
| `checkpoint_resume.py` | Save/load checkpoints, resume training |
| `train_with_callbacks.py` | Logging, throughput monitoring, early stopping |
| `train_with_datasets.py` | `TextDataset` vs `TokenDataset` classes |
| `compare_schedules.py` | LR schedules and optimizer comparison |
### train_tiny_gpt
Trains a tiny GPT on Shakespeare with character-level tokenization. Serves as an installation verification test.
### train_llama_shakespeare
Same task, better architecture. Uses a LLaMA config (GQA, SwiGLU, RoPE) with tiktoken BPE tokenization. Demonstrates compiled training with `mx.compile`.
### train_curriculum
Curriculum learning: start with short sequences (easy), gradually increase to full length. Often converges faster than training on full length from the start.
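The schedule itself can be a simple function of training progress. This is an illustrative sketch, not the recipe's actual API; the function name and default lengths are made up for the example:

```python
# Illustrative length curriculum: ramp the sequence length from min_len
# to max_len over the first warmup_frac of training, then hold it there.
def curriculum_seq_len(step, total_steps, min_len=64, max_len=512, warmup_frac=0.5):
    ramp_steps = int(total_steps * warmup_frac)
    if step >= ramp_steps:
        return max_len
    frac = step / ramp_steps
    return int(min_len + frac * (max_len - min_len))

# Early steps use short (cheap, easy) windows; later steps use full length.
print(curriculum_seq_len(0, 1000))    # 64
print(curriculum_seq_len(250, 1000))  # 288
print(curriculum_seq_len(900, 1000))  # 512
```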
### checkpoint_resume
Save training state (weights, optimizer, step count) to disk, then resume from the checkpoint. Demonstrates lmxlab's safetensors-based checkpointing.
### train_with_callbacks
Runs all three callback types simultaneously: `MetricsLogger` prints loss at intervals, `ThroughputMonitor` reports tokens/sec, and `EarlyStopping` halts training when eval loss plateaus.
### train_with_datasets
Compares `TextDataset` (raw text in, handles tokenization) with `TokenDataset` (pre-tokenized arrays). Shows that both produce identical training windows and when to use each.
### compare_schedules
Trains the same model with different LR schedule and optimizer combinations. Compares cosine, linear, and constant schedules, and AdamW vs Lion vs Adafactor. Shows loss curves at checkpoints.
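The closed-form schedules being compared are small functions of training progress. A minimal sketch (illustrative values; the base and minimum learning rates here are made-up defaults, not the recipe's):

```python
import math

# frac is training progress in [0, 1].
def cosine_lr(frac, base_lr=3e-4, min_lr=3e-5):
    # smooth decay from base_lr to min_lr following half a cosine wave
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * frac))

def linear_lr(frac, base_lr=3e-4, min_lr=3e-5):
    # straight-line decay from base_lr to min_lr
    return base_lr - frac * (base_lr - min_lr)

def constant_lr(frac, base_lr=3e-4):
    return base_lr

for frac in (0.0, 0.5, 1.0):
    print(f"{frac:.1f}  cosine={cosine_lr(frac):.2e}  linear={linear_lr(frac):.2e}")
```

Cosine spends more time near the base rate early and near the floor late, which is the main behavioral difference the loss curves in the recipe expose.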
## Fine-Tuning
Parameter-efficient methods for adapting pretrained models.
| Recipe | Description |
|---|---|
| `finetune_lora.py` | LoRA: train ~0.1% of parameters |
| `finetune_qlora.py` | QLoRA: 4-bit base + LoRA adapters |
| `load_pretrained.py` | Load HuggingFace models into lmxlab |
### finetune_lora
Applies low-rank adapters to the attention layers, freezing all base weights and training only the small A and B matrices. After training, the adapters can be merged back into the base model for inference.
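The parameter savings follow directly from the shapes: for a `d_out x d_in` weight, LoRA trains an `r x d_in` matrix A and a `d_out x r` matrix B, whose product (scaled by `alpha / r`) is added to the frozen weight. A quick back-of-envelope check, with illustrative dimensions:

```python
# Fraction of a layer's parameters that LoRA actually trains.
def lora_trainable_fraction(d_in, d_out, r):
    full = d_in * d_out          # frozen base weight W
    lora = r * d_in + d_out * r  # trainable A (r x d_in) and B (d_out x r)
    return lora / full

# e.g. a 4096x4096 attention projection with rank-8 adapters:
frac = lora_trainable_fraction(4096, 4096, 8)
print(f"trainable fraction: {frac:.4%}")  # 0.3906%
```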
### finetune_qlora
Combines quantization and LoRA: base weights are stored in 4-bit (or 8-bit), while the LoRA adapters train in full precision. Maximum memory efficiency for fine-tuning large models.
### load_pretrained
Download a pretrained model from the HuggingFace Hub, convert the weights to lmxlab format, and generate text. Requires the `huggingface_hub` and `transformers` packages.
## Advanced Training
Specialized training objectives beyond standard next-token prediction.
| Recipe | Description |
|---|---|
| `train_dpo.py` | Direct Preference Optimization |
| `train_grpo.py` | Group Relative Policy Optimization |
| `train_mtp.py` | Multi-Token Prediction (auxiliary heads) |
| `train_moe.py` | Mixture of Experts routing |
| `train_deltanet.py` | Hybrid linear + softmax attention |
| `distill_model.py` | Knowledge distillation (teacher → student) |
### train_dpo
Two-phase training: supervised fine-tuning (SFT) first, then DPO with preference pairs. Learns "A is better than B" without a reward model.
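The per-pair objective is compact enough to sketch directly. This is the standard DPO loss, written in plain Python for illustration (the variable names are mine, not the recipe's):

```python
import math

# DPO loss for one preference pair (chosen > rejected).
# logp_* are summed token log-probs of each completion under the trainable
# policy; ref_* are the same quantities under the frozen reference model.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # how much MORE the policy prefers chosen over rejected than the reference does
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log sigmoid(margin): small when the policy's preference gap is large
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen answer more than the reference does,
# so the loss is below the no-preference value log(2):
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))
```

Note there is no reward model anywhere: the frozen reference and the sigmoid of the log-prob margin play that role, which is exactly the "A is better than B" signal the text describes.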
### train_grpo
SFT followed by GRPO. Generates multiple completions per prompt, scores them, and normalizes rewards within each group. Closer to classic policy gradient than DPO.
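The within-group normalization is the distinguishing step. An illustrative sketch of how raw rewards become advantages (plain Python, not the recipe's code):

```python
# GRPO advantage: rewards for N completions of the SAME prompt are
# normalized within the group, so only relative quality matters and no
# learned value baseline is needed.
def group_advantages(rewards, eps=1e-8):
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two of four completions scored well: winners positive, losers negative.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```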
### train_mtp
Multi-Token Prediction: each position predicts the next 2-4 tokens via lightweight auxiliary heads. Provides richer gradients and enables speculative decoding at inference time.
### train_moe
Mixture of Experts: sparse routing where each token uses only top-k of N expert FFNs. Compares dense vs MoE at matched compute budgets.
### train_deltanet
Hybrid attention from Qwen 3.5: interleaves Gated DeltaNet (linear attention with delta rule) and standard GQA layers. DeltaNet uses fixed-size state, O(d^2) per token regardless of sequence length.
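The delta rule itself is a rank-1 correction to a fixed d×d memory: retrieve what the state currently stores for key k, compare against the target value v, and write back the error. A tiny pure-Python illustration (d=2 for clarity; real implementations are batched and gated):

```python
# Illustrative delta-rule state update. The state S is d x d and
# fixed-size, which is why per-token cost is O(d^2) at any sequence length.
def delta_update(S, k, v, beta):
    d = len(k)
    # prediction: what S currently retrieves for key k
    pred = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]
    # error against the target value v
    err = [v[i] - pred[i] for i in range(d)]
    # rank-1 correction: S += beta * err * k^T
    for i in range(d):
        for j in range(d):
            S[i][j] += beta * err[i] * k[j]
    return S

S = [[0.0, 0.0], [0.0, 0.0]]
delta_update(S, k=[1.0, 0.0], v=[2.0, 3.0], beta=1.0)
# With beta=1 and a unit key, retrieving with the same key now returns v:
print([sum(S[i][j] * [1.0, 0.0][j] for j in range(2)) for i in range(2)])  # [2.0, 3.0]
```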
### distill_model
Knowledge distillation (Hinton et al., 2015): train a larger teacher (LLaMA-tiny), then transfer its knowledge to a smaller student (GPT-tiny) via temperature-scaled soft targets. Compares distilled student against a baseline student trained without distillation.
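Temperature scaling is the core trick: dividing the teacher's logits by T > 1 before the softmax flattens the distribution, so the student also learns from "wrong but plausible" tokens. A minimal illustration with made-up logits:

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((l - m) / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

teacher_logits = [4.0, 2.0, 1.0]
hard = softmax(teacher_logits, T=1.0)  # sharp: nearly all mass on the argmax
soft = softmax(teacher_logits, T=4.0)  # flat: more mass on runner-up tokens
print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

The student is then trained to match the softened distribution (typically via KL divergence at the same temperature), alongside the usual hard-label loss.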
## Inference & Sampling
Different strategies for generating text from trained models.
| Recipe | Description |
|---|---|
| `interactive_generate.py` | Streaming token-by-token generation |
| `advanced_sampling.py` | Best-of-N and majority vote |
| `speculative_decoding.py` | Draft-then-verify for faster generation |
### interactive_generate
Streaming generation with `stream_generate`. Tokens appear one at a time as they are produced. Demonstrates repetition penalty and temperature control.
### advanced_sampling
Inference-time compute scaling: generate N completions and pick the best (by log-probability), or generate N answers and take the majority vote.
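Both selection rules are one-liners once the N samples exist. An illustrative sketch with dummy completions (the helper names here are mine, not the recipe's):

```python
from collections import Counter

# Best-of-N: pick the completion the model itself scores highest.
def best_of_n(completions_with_logprobs):
    return max(completions_with_logprobs, key=lambda cl: cl[1])[0]

# Majority vote: pick the most common final answer across samples.
def majority_vote(answers):
    return Counter(answers).most_common(1)[0][0]

print(best_of_n([("A", -5.2), ("B", -3.1), ("C", -7.8)]))  # B
print(majority_vote(["42", "17", "42", "42", "9"]))        # 42
```

Best-of-N needs the model's own log-probabilities; majority vote only needs a way to extract and compare final answers, which makes it the natural choice for tasks with short, checkable outputs.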
### speculative_decoding
A small draft model proposes K tokens, then a larger target model verifies them in a single forward pass. The procedure is lossless: the output distribution is identical to the target model's. On unified memory, both models share the same memory pool, avoiding transfer overhead.
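For the special case of greedy decoding, the verify step reduces to a prefix match, which makes the mechanics easy to see. A simplified sketch (the full algorithm uses a probabilistic accept/reject rule to stay lossless under sampling; this greedy variant is illustrative only):

```python
# Greedy draft-then-verify: the draft proposes K tokens; one target
# forward pass checks them all. Keep the longest agreeing prefix, plus
# the target's own token at the first disagreement. For greedy decoding
# this reproduces the target model's output exactly.
def verify(draft_tokens, target_tokens):
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction at the mismatch
            break
    return accepted

# Draft guessed 4 tokens; target agrees on the first two, so one target
# pass yields 3 tokens instead of 1:
print(verify([5, 9, 3, 7], [5, 9, 1, 7]))  # [5, 9, 1]
```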
## Architecture Comparison
Scripts for understanding the differences between transformer architectures.
| Recipe | Description |
|---|---|
| `compare_architectures.py` | Parameter counts, KV cache sizes for all 8 architectures |
| `compare_training.py` | Training dynamics: loss curves across architectures |
| `ablation_gpt_to_llama.py` | Feature ablation: which LLaMA innovation matters most? |
| `compare_kv_cache.py` | MLA vs GQA KV cache and generation speed (Exp 4) |
### compare_architectures
Instantiates all 8 architectures at matched dimensions and compares parameter counts, KV cache sizes, and structural differences. No training is performed; only model construction and analysis.
### compare_training
Trains GPT, LLaMA, and DeepSeek on the same data with the same seed. Compares loss curves and convergence behavior.
### ablation_gpt_to_llama
The pre-registered Experiment 1 from the devlog. Starts from GPT baseline and adds LLaMA features one at a time (RMSNorm, RoPE, SwiGLU, GQA, no-bias) to measure individual contributions.
### compare_kv_cache
Pre-registered Experiment 4: compares DeepSeek-style MLA against standard GQA at matched dimensions. Profiles forward pass throughput and autoregressive generation at increasing sequence lengths. Tests whether MLA's compressed KV cache provides speed or memory benefits on unified memory.
## Benchmarks & Profiling
Performance measurement on local hardware.
| Recipe | Description |
|---|---|
| `benchmark_compile.py` | `mx.compile` speedup measurement |
| `profile_models.py` | Memory, throughput, and generation speed |
| `evaluate_model.py` | Perplexity and bits-per-byte metrics |
| `quantize_and_generate.py` | 4-bit/8-bit quantization and memory comparison |
### benchmark_compile
Measures training step time with and without `mx.compile` at multiple model sizes, quantifying the speedup from kernel fusion on a given Apple Silicon chip.
### profile_models
Benchmarks forward pass, backward pass, and generation across five architectures. Reports memory usage estimates and tokens/second.
### evaluate_model
Trains on a train/val split and reports perplexity and bits-per-byte (BPB). Compares GPT and LLaMA on the same evaluation set.
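Both metrics are simple functions of the summed negative log-likelihood; the difference is the denominator. A sketch of the standard definitions (illustrative numbers):

```python
import math

# Perplexity: exponential of the mean per-token NLL (in nats).
# Depends on the tokenizer, since "per token" changes with the vocabulary.
def perplexity(mean_nll):
    return math.exp(mean_nll)

# Bits-per-byte: total NLL converted to bits, divided by the raw byte
# count of the text. Tokenizer-independent, so comparable across vocabularies.
def bits_per_byte(total_nll_nats, n_bytes):
    return total_nll_nats / (math.log(2) * n_bytes)

print(perplexity(1.5))                  # ~4.48
print(bits_per_byte(1.5 * 1000, 4000))  # 1000 tokens of NLL over 4000 bytes
```

BPB is the fairer metric when comparing models with different tokenizers (e.g. character-level GPT vs BPE LLaMA), which is why the recipe reports both.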
### quantize_and_generate
Trains a model, then quantizes to 4-bit and 8-bit. Compares memory usage and generation quality across precisions. Also demonstrates `dequantize_model` for converting back to float (useful before fine-tuning).
## Experiments
Structured experiment infrastructure for reproducible research.
| Recipe | Description |
|---|---|
| `run_experiment.py` | Time-budgeted experiments with logging |
| `sweep_learning_rate.py` | Grid and random hyperparameter sweeps |
| `analyze_experiments.py` | Statistical analysis: CI, Cohen's d, simplicity score |
| `compare_optimizers.py` | Optimizer comparison on unified memory (Exp 3) |
### run_experiment
Uses `ExperimentRunner` for time-budgeted experiments following the autoresearch methodology. Logs results to `results.jsonl`, supports multi-seed runs, and tracks keep/discard status.
### sweep_learning_rate
Grid sweep and random sweep over learning rate ranges. Reports best configuration and trial-by-trial results table.
### analyze_experiments
Demonstrates the experiment analysis toolkit on synthetic data, so no training is required. Covers `compare_experiments`, `compute_statistics`, `confidence_interval`, `cohens_d`, and `simplicity_score`.
### compare_optimizers
Pre-registered Experiment 3: compares AdamW, SGD+momentum, Adafactor, and Lion across a learning rate sweep. Tests whether unified memory changes which optimizer wins. Reports best-of-sweep per optimizer, throughput (steps/s), and evaluates three competing hypotheses.