# Recipes
Each script in the `recipes/` directory creates its own data, builds models, and prints results. External downloads are noted where applicable.
```bash
# Run any recipe
uv run python recipes/<recipe_name>.py

# Most recipes accept --help for CLI options
uv run python recipes/train_tiny_gpt.py --help
```
## Training Basics
These scripts train small models on Shakespeare text and demonstrate the core training loop.
| Recipe | Description |
|---|---|
| `train_tiny_gpt.py` | End-to-end workflow: build, train, generate |
| `train_llama_shakespeare.py` | BPE tokenization, compiled training |
| `train_curriculum.py` | Length curriculum: short sequences before long |
| `checkpoint_resume.py` | Save/load checkpoints, resume training |
| `train_with_callbacks.py` | Logging, throughput monitoring, early stopping |
| `train_with_datasets.py` | `TextDataset` vs `TokenDataset` classes |
| `compare_schedules.py` | LR schedules and optimizer comparison |
### train_tiny_gpt
Trains a tiny GPT on Shakespeare with character-level tokenization. Serves as an installation verification test.
### train_llama_shakespeare
Same task, better architecture. Uses a LLaMA config (GQA, SwiGLU, RoPE) with tiktoken BPE tokenization. Demonstrates compiled training with `mx.compile`.
### train_curriculum
Curriculum learning: start with short sequences (easy), gradually increase to full length. Often converges faster than training on full length from the start.
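The schedule itself can be a simple function of training progress. This is an illustrative sketch, not the recipe's actual API; the function name and default lengths are made up for the example:

```python
# Illustrative length curriculum: ramp the sequence length from min_len
# to max_len over the first warmup_frac of training, then hold it there.
def curriculum_seq_len(step, total_steps, min_len=64, max_len=512, warmup_frac=0.5):
    ramp_steps = int(total_steps * warmup_frac)
    if step >= ramp_steps:
        return max_len
    frac = step / ramp_steps
    return int(min_len + frac * (max_len - min_len))

# Early steps use short (cheap, easy) windows; later steps use full length.
print(curriculum_seq_len(0, 1000))    # 64
print(curriculum_seq_len(250, 1000))  # 288
print(curriculum_seq_len(900, 1000))  # 512
```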
### checkpoint_resume
Save training state (weights, optimizer, step count) to disk, then resume from the checkpoint. Demonstrates lmxlab's safetensors-based checkpointing.
### train_with_callbacks
Runs all three callback types simultaneously: `MetricsLogger` prints loss at intervals, `ThroughputMonitor` reports tokens/sec, and `EarlyStopping` halts training when eval loss plateaus.
### train_with_datasets
Compares `TextDataset` (raw text in, handles tokenization) with `TokenDataset` (pre-tokenized arrays). Shows that both produce identical training windows and when to use each.
### compare_schedules
Trains the same model with different LR schedule and optimizer combinations. Compares cosine, linear, and constant schedules, and AdamW vs Lion vs Adafactor. Shows loss curves at checkpoints.
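The closed-form schedules being compared are small functions of training progress. A minimal sketch (illustrative values; the base and minimum learning rates here are made-up defaults, not the recipe's):

```python
import math

# frac is training progress in [0, 1].
def cosine_lr(frac, base_lr=3e-4, min_lr=3e-5):
    # smooth decay from base_lr to min_lr following half a cosine wave
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * frac))

def linear_lr(frac, base_lr=3e-4, min_lr=3e-5):
    # straight-line decay from base_lr to min_lr
    return base_lr - frac * (base_lr - min_lr)

def constant_lr(frac, base_lr=3e-4):
    return base_lr

for frac in (0.0, 0.5, 1.0):
    print(f"{frac:.1f}  cosine={cosine_lr(frac):.2e}  linear={linear_lr(frac):.2e}")
```

Cosine spends more time near the base rate early and near the floor late, which is the main behavioral difference the loss curves in the recipe expose.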
## Fine-Tuning
Parameter-efficient methods for adapting pretrained models.
| Recipe | Description |
|---|---|
| `finetune_lora.py` | LoRA: train ~0.1% of parameters |
| `finetune_qlora.py` | QLoRA: 4-bit base + LoRA adapters |
| `load_pretrained.py` | Load HuggingFace models into lmxlab |
### finetune_lora
Applies low-rank adapters to the attention layers, freezing all base weights and training only the small A and B matrices. After training, the adapters can be merged back into the base model for inference.
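The parameter savings follow directly from the shapes: for a `d_out x d_in` weight, LoRA trains an `r x d_in` matrix A and a `d_out x r` matrix B, whose product (scaled by `alpha / r`) is added to the frozen weight. A quick back-of-envelope check, with illustrative dimensions:

```python
# Fraction of a layer's parameters that LoRA actually trains.
def lora_trainable_fraction(d_in, d_out, r):
    full = d_in * d_out          # frozen base weight W
    lora = r * d_in + d_out * r  # trainable A (r x d_in) and B (d_out x r)
    return lora / full

# e.g. a 4096x4096 attention projection with rank-8 adapters:
frac = lora_trainable_fraction(4096, 4096, 8)
print(f"trainable fraction: {frac:.4%}")  # 0.3906%
```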
### finetune_qlora
Combines quantization and LoRA: base weights are stored in 4-bit (or 8-bit), while the LoRA adapters train in full precision. Maximum memory efficiency for fine-tuning large models.
### load_pretrained
Download a pretrained model from the HuggingFace Hub, convert the weights to lmxlab format, and generate text. Requires the `huggingface_hub` and `transformers` packages.
## Advanced Training
Specialized training objectives beyond standard next-token prediction.
| Recipe | Description |
|---|---|
| `train_dpo.py` | Direct Preference Optimization |
| `train_grpo.py` | Group Relative Policy Optimization |
| `train_mtp.py` | Multi-Token Prediction (auxiliary heads) |
| `train_moe.py` | Mixture of Experts routing |
| `train_deltanet.py` | Hybrid linear + softmax attention |
| `distill_model.py` | Knowledge distillation (teacher → student) |
### train_dpo
Two-phase training: supervised fine-tuning (SFT) first, then DPO with preference pairs. Learns "A is better than B" without a reward model.
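The per-pair objective is compact enough to sketch directly. This is the standard DPO loss, written in plain Python for illustration (the variable names are mine, not the recipe's):

```python
import math

# DPO loss for one preference pair (chosen > rejected).
# logp_* are summed token log-probs of each completion under the trainable
# policy; ref_* are the same quantities under the frozen reference model.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # how much MORE the policy prefers chosen over rejected than the reference does
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log sigmoid(margin): small when the policy's preference gap is large
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen answer more than the reference does,
# so the loss is below the no-preference value log(2):
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))
```

Note there is no reward model anywhere: the frozen reference and the sigmoid of the log-prob margin play that role, which is exactly the "A is better than B" signal the text describes.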
### train_grpo
SFT followed by GRPO. Generates multiple completions per prompt, scores them, and normalizes rewards within each group. Closer to classic policy gradient than DPO.
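The within-group normalization is the distinguishing step. An illustrative sketch of how raw rewards become advantages (plain Python, not the recipe's code):

```python
# GRPO advantage: rewards for N completions of the SAME prompt are
# normalized within the group, so only relative quality matters and no
# learned value baseline is needed.
def group_advantages(rewards, eps=1e-8):
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two of four completions scored well: winners positive, losers negative.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```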
### train_mtp
Multi-Token Prediction: each position predicts the next 2-4 tokens via lightweight auxiliary heads. Provides richer gradients and enables speculative decoding at inference time.
### train_moe
Mixture of Experts: sparse routing where each token uses only top-k of N expert FFNs. Compares dense vs MoE at matched compute budgets.
### train_deltanet
Hybrid attention from Qwen 3.5: interleaves Gated DeltaNet (linear attention with delta rule) and standard GQA layers. DeltaNet uses fixed-size state, O(d^2) per token regardless of sequence length.
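The delta rule itself is a rank-1 correction to a fixed d×d memory: retrieve what the state currently stores for key k, compare against the target value v, and write back the error. A tiny pure-Python illustration (d=2 for clarity; real implementations are batched and gated):

```python
# Illustrative delta-rule state update. The state S is d x d and
# fixed-size, which is why per-token cost is O(d^2) at any sequence length.
def delta_update(S, k, v, beta):
    d = len(k)
    # prediction: what S currently retrieves for key k
    pred = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]
    # error against the target value v
    err = [v[i] - pred[i] for i in range(d)]
    # rank-1 correction: S += beta * err * k^T
    for i in range(d):
        for j in range(d):
            S[i][j] += beta * err[i] * k[j]
    return S

S = [[0.0, 0.0], [0.0, 0.0]]
delta_update(S, k=[1.0, 0.0], v=[2.0, 3.0], beta=1.0)
# With beta=1 and a unit key, retrieving with the same key now returns v:
print([sum(S[i][j] * [1.0, 0.0][j] for j in range(2)) for i in range(2)])  # [2.0, 3.0]
```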
### distill_model
Knowledge distillation (Hinton et al., 2015): train a larger teacher (LLaMA-tiny), then transfer its knowledge to a smaller student (GPT-tiny) via temperature-scaled soft targets. Compares distilled student against a baseline student trained without distillation.
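Temperature scaling is the core trick: dividing the teacher's logits by T > 1 before the softmax flattens the distribution, so the student also learns from "wrong but plausible" tokens. A minimal illustration with made-up logits:

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((l - m) / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

teacher_logits = [4.0, 2.0, 1.0]
hard = softmax(teacher_logits, T=1.0)  # sharp: nearly all mass on the argmax
soft = softmax(teacher_logits, T=4.0)  # flat: more mass on runner-up tokens
print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

The student is then trained to match the softened distribution (typically via KL divergence at the same temperature), alongside the usual hard-label loss.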
## Inference & Sampling
Different strategies for generating text from trained models.
| Recipe | Description |
|---|---|
| `interactive_generate.py` | Streaming token-by-token generation |
| `advanced_sampling.py` | Best-of-N and majority vote |
| `speculative_decoding.py` | Draft-then-verify for faster generation |
### interactive_generate
Streaming generation with `stream_generate`. Tokens appear one at a time as they are produced. Demonstrates repetition penalty and temperature control.
### advanced_sampling
Inference-time compute scaling: generate N completions and pick the best (by log-probability), or generate N answers and take the majority vote.
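Both selection rules are one-liners once the N samples exist. An illustrative sketch with dummy completions (the helper names here are mine, not the recipe's):

```python
from collections import Counter

# Best-of-N: pick the completion the model itself scores highest.
def best_of_n(completions_with_logprobs):
    return max(completions_with_logprobs, key=lambda cl: cl[1])[0]

# Majority vote: pick the most common final answer across samples.
def majority_vote(answers):
    return Counter(answers).most_common(1)[0][0]

print(best_of_n([("A", -5.2), ("B", -3.1), ("C", -7.8)]))  # B
print(majority_vote(["42", "17", "42", "42", "9"]))        # 42
```

Best-of-N needs the model's own log-probabilities; majority vote only needs a way to extract and compare final answers, which makes it the natural choice for tasks with short, checkable outputs.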
### speculative_decoding
A small draft model proposes K tokens, then a larger target model verifies them in a single forward pass. The procedure is lossless: the output distribution is identical to the target model's. On unified memory, both models share the same memory pool, avoiding transfer overhead.
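For the special case of greedy decoding, the verify step reduces to a prefix match, which makes the mechanics easy to see. A simplified sketch (the full algorithm uses a probabilistic accept/reject rule to stay lossless under sampling; this greedy variant is illustrative only):

```python
# Greedy draft-then-verify: the draft proposes K tokens; one target
# forward pass checks them all. Keep the longest agreeing prefix, plus
# the target's own token at the first disagreement. For greedy decoding
# this reproduces the target model's output exactly.
def verify(draft_tokens, target_tokens):
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction at the mismatch
            break
    return accepted

# Draft guessed 4 tokens; target agrees on the first two, so one target
# pass yields 3 tokens instead of 1:
print(verify([5, 9, 3, 7], [5, 9, 1, 7]))  # [5, 9, 1]
```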
## Architecture Comparison
Scripts for understanding the differences between transformer architectures.
| Recipe | Description |
|---|---|
| `compare_architectures.py` | Parameter counts, KV cache sizes for all 8 architectures |
| `compare_training.py` | Training dynamics: loss curves across architectures |
| `ablation_gpt_to_llama.py` | Feature ablation: which LLaMA innovation matters most? |
| `compare_kv_cache.py` | MLA vs GQA KV cache and generation speed (Exp 4) |
### compare_architectures
Instantiates all 8 architectures at matched dimensions and compares parameter counts, KV cache sizes, and structural differences. No training is performed; only model construction and analysis.
### compare_training
Trains GPT, LLaMA, and DeepSeek on the same data with the same seed. Compares loss curves and convergence behavior.
### ablation_gpt_to_llama
The pre-registered Experiment 1 from the devlog. Starts from GPT baseline and adds LLaMA features one at a time (RMSNorm, RoPE, SwiGLU, GQA, no-bias) to measure individual contributions.
### compare_kv_cache
Pre-registered Experiment 4: compares DeepSeek-style MLA against standard GQA at matched dimensions. Profiles forward pass throughput and autoregressive generation at increasing sequence lengths. Tests whether MLA's compressed KV cache provides speed or memory benefits on unified memory.
## Benchmarks & Profiling
Performance measurement on local hardware.
| Recipe | Description |
|---|---|
| `benchmark_compile.py` | `mx.compile` speedup measurement |
| `profile_models.py` | Memory, throughput, and generation speed |
| `evaluate_model.py` | Perplexity and bits-per-byte metrics |
| `quantize_and_generate.py` | 4-bit/8-bit quantization and memory comparison |
### benchmark_compile
Measures training step time with and without `mx.compile` at multiple model sizes, quantifying the speedup from kernel fusion on a given Apple Silicon chip.
### profile_models
Benchmarks forward pass, backward pass, and generation across five architectures. Reports memory usage estimates and tokens/second.
### evaluate_model
Trains on a train/val split and reports perplexity and bits-per-byte (BPB). Compares GPT and LLaMA on the same evaluation set.
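Both metrics are simple functions of the summed negative log-likelihood; the difference is the denominator. A sketch of the standard definitions (illustrative numbers):

```python
import math

# Perplexity: exponential of the mean per-token NLL (in nats).
# Depends on the tokenizer, since "per token" changes with the vocabulary.
def perplexity(mean_nll):
    return math.exp(mean_nll)

# Bits-per-byte: total NLL converted to bits, divided by the raw byte
# count of the text. Tokenizer-independent, so comparable across vocabularies.
def bits_per_byte(total_nll_nats, n_bytes):
    return total_nll_nats / (math.log(2) * n_bytes)

print(perplexity(1.5))                  # ~4.48
print(bits_per_byte(1.5 * 1000, 4000))  # 1000 tokens of NLL over 4000 bytes
```

BPB is the fairer metric when comparing models with different tokenizers (e.g. character-level GPT vs BPE LLaMA), which is why the recipe reports both.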
### quantize_and_generate
Trains a model, then quantizes to 4-bit and 8-bit. Compares memory usage and generation quality across precisions. Also demonstrates `dequantize_model` for converting back to float (useful before fine-tuning).
## Experiments
Structured experiment infrastructure for reproducible research.
| Recipe | Description |
|---|---|
| `run_experiment.py` | Time-budgeted experiments with logging |
| `sweep_learning_rate.py` | Grid and random hyperparameter sweeps |
| `analyze_experiments.py` | Statistical analysis: CI, Cohen's d, simplicity score |
| `compare_optimizers.py` | Optimizer comparison on unified memory (Exp 3) |
### run_experiment
Uses `ExperimentRunner` for time-budgeted experiments following the autoresearch methodology. Logs results to `results.jsonl`, supports multi-seed runs, and tracks keep/discard status.
### sweep_learning_rate
Grid sweep and random sweep over learning rate ranges. Reports best configuration and trial-by-trial results table.
### analyze_experiments
Demonstrates the experiment analysis toolkit on synthetic data, so no training is required. Covers `compare_experiments`, `compute_statistics`, `confidence_interval`, `cohens_d`, and `simplicity_score`.
### compare_optimizers
Pre-registered Experiment 3: compares AdamW, SGD+momentum, Adafactor, and Lion across a learning rate sweep. Tests whether unified memory changes which optimizer wins. Reports best-of-sweep per optimizer, throughput (steps/s), and evaluates three competing hypotheses.