# Experiment Methodology
lmxlab includes an experiment framework inspired by Karpathy's autoresearch patterns. This page explains how experiments are designed, run, and tracked.
## Core Principles
### FLOP-Matched Comparisons
Architecture comparisons use FLOP-matched compute budgets (DEC-004). Each architecture trains until it consumes the same total floating-point operations, isolating architectural quality from implementation speed.
```python
from lmxlab.training.callbacks import FLOPCounter

# Each architecture gets the same compute budget
flop_counter = FLOPCounter(
    flops_per_step=estimate_flops_per_step(model, seq_len, batch_size),
    flop_budget=1e15,  # 1 PFLOP
)
```
FLOPs are estimated analytically using Megatron-LM-style formulas (`6 * N * D` for dense transformers, with corrections for SwiGLU gated FFNs). See `experiments/flops.py` for details.
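The `6 * N * D` rule of thumb translates directly into a quick budget check. The sketch below is illustrative only — `estimate_total_flops` is a hypothetical helper, and it omits the SwiGLU corrections that `experiments/flops.py` applies:

```python
def estimate_total_flops(n_params: int, n_tokens: int) -> int:
    """Dense-transformer training FLOPs via the 6 * N * D rule of thumb.

    N = model parameters, D = training tokens. Illustrative sketch;
    the real estimator corrects for gated FFNs and other details.
    """
    return 6 * n_params * n_tokens

# Under this rule, a 10M-parameter model exhausts a 1 PFLOP budget
# after roughly 16.7M training tokens
budget_tokens = 1e15 / (6 * 10_000_000)
```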
### Time Budgets as a Secondary Metric
Wall-clock time budgets (DEC-001) are still available for efficiency benchmarks where speed is the metric of interest, but FLOP-matched is the primary method for architecture comparisons.
### Validation Split
Every experiment uses a train/val split (DEC-008). Validation loss is the primary metric; training loss is a secondary diagnostic only.
- Shakespeare char-level: 90/10 sequential split (~1.0M train tokens, ~111K val tokens), matching nanoGPT convention
- TinyStories BPE: Uses the dataset's built-in train/val splits from HuggingFace
Evaluation uses `shuffle=False` for deterministic results. Periodic eval runs every 500 steps, plus a final eval at the end of training.
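The sequential 90/10 split above amounts to cutting the token stream at a fixed point. A minimal sketch (`sequential_split` is a hypothetical helper, not the lmxlab API):

```python
def sequential_split(tokens, train_frac=0.9):
    # First train_frac of the token stream is train, the rest validation;
    # no shuffling, matching the nanoGPT-style convention described above.
    n_train = int(len(tokens) * train_frac)
    return tokens[:n_train], tokens[n_train:]

train, val = sequential_split(list(range(1000)))
# len(train) == 900, len(val) == 100; val is the *tail* of the stream
```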
#### Superseded Experiments
HYP-001 and HYP-001b had no validation split and reported training loss as the primary metric. This masked severe overfitting. Results from these runs are superseded; see Results for trusted findings.
### Git-as-Experiment-Infra
Each experiment records the git commit hash, so results are tied to a specific code version. This provides natural reproducibility without external tooling.
### Simplicity Bias
When two approaches achieve similar metrics, prefer the simpler one (fewer parameters, less code, fewer hyperparameters). The `simplicity_score` function quantifies this:
```python
from lmxlab.experiments.analysis import simplicity_score

# Rewards improvements that use fewer parameters
score = simplicity_score(
    entry,
    baseline_params=1_000_000,
    baseline_metric=3.5,
)
```
### Multi-Seed Runs
Single-seed results are unreliable. Run experiments with multiple seeds and report mean +/- std.
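The aggregation itself needs nothing beyond the standard library. The values below are made up; in practice each loss would come from a full training run with a different seed:

```python
import statistics

# Final validation losses from three seeds of the same config (illustrative)
val_losses = [2.48, 2.52, 2.50]

mean = statistics.mean(val_losses)
std = statistics.stdev(val_losses)  # sample std (n-1 denominator)
print(f"val_loss = {mean:.2f} +/- {std:.2f}")  # val_loss = 2.50 +/- 0.02
```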
## Tracking
### results.jsonl
All experiments log to `experiments/results.jsonl`, a line-delimited JSON file suitable for parsing, version control, and analysis. Each entry records:
| Field | Description |
|---|---|
| `experiment` | Name/tag |
| `commit` | Git commit hash |
| `status` | `keep`, `discard`, or `crash` |
| `val_loss` | Validation loss |
| `val_bpb` | Bits per byte |
| `train_loss` | Final training loss |
| `param_count` | Model parameters |
| `wall_time_s` | Total wall-clock time |
| `seed` | Random seed |
| `config` | Full experiment config dict |
| `metrics` | All collected metrics |
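Because each line is a standalone JSON object, analysis needs only the standard library. A minimal sketch using the field names above (`best_kept_run` is hypothetical, not an lmxlab helper):

```python
import json

def best_kept_run(path="experiments/results.jsonl"):
    """Return the kept entry with the lowest validation loss, or None."""
    best = None
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("status") != "keep" or "val_loss" not in entry:
                continue
            if best is None or entry["val_loss"] < best["val_loss"]:
                best = entry
    return best
```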
### MLflow Integration
Experiments can optionally log to MLflow for interactive visualization. MLflow uses a local SQLite backend by default:
```python
from lmxlab.experiments.mlflow import MLflowExperimentRunner

runner = MLflowExperimentRunner(config)
runner.start()  # logs to sqlite:///mlflow.db
```
### Status: Keep vs Discard
After each experiment, compare against the previous best. If the new result improves the target metric, mark it `keep`; otherwise `discard`. Crashed experiments are marked `crash`. This provides a quick way to filter results.
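The decision rule can be written down directly. A sketch (`decide_status` is hypothetical; lmxlab may apply it manually rather than in code):

```python
def decide_status(new_val_loss, best_val_loss, crashed=False):
    # Keep only strict improvements over the previous best; a missing
    # previous best (the first run) always counts as an improvement.
    if crashed:
        return "crash"
    if best_val_loss is None or new_val_loss < best_val_loss:
        return "keep"
    return "discard"

decide_status(2.4, 2.5)  # "keep"
decide_status(2.6, 2.5)  # "discard"
```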
## Sweep Utilities
### Grid Sweep
Exhaustive search over discrete parameter values:
```python
from lmxlab.experiments.sweep import grid_sweep

for params in grid_sweep({
    "lr": [1e-4, 3e-4, 1e-3],
    "d_model": [64, 128, 256],
}):
    # params = {"lr": 1e-4, "d_model": 64}, etc.
    train_model(**params)
```
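The semantics amount to a Cartesian product over the discrete values. A sketch of the idea (not lmxlab's actual source):

```python
import itertools

def grid_sweep_sketch(grid):
    # Yield every combination of the listed values as a keyword dict,
    # varying the last key fastest (itertools.product order).
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(grid_sweep_sketch({
    "lr": [1e-4, 3e-4, 1e-3],
    "d_model": [64, 128, 256],
}))
# 9 combinations for a 3 x 3 grid, starting at {"lr": 1e-4, "d_model": 64}
```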
### Random Sweep
Sample from continuous ranges (often more efficient than grid search for high-dimensional spaces):
```python
from lmxlab.experiments.sweep import random_sweep

for params in random_sweep(
    param_ranges={"lr": (1e-4, 5e-3), "d_model": (32, 256)},
    n_trials=20,
):
    train_model(**params)
```
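Random search can likewise be sketched. The sampling choices here are assumptions, not lmxlab's actual behavior: learning rates drawn log-uniformly (the usual choice for scale parameters) and integer dimensions uniformly:

```python
import math
import random

def random_sweep_sketch(param_ranges, n_trials, seed=0):
    rng = random.Random(seed)
    for _ in range(n_trials):
        lo, hi = param_ranges["lr"]
        lr = math.exp(rng.uniform(math.log(lo), math.log(hi)))  # log-uniform
        d_lo, d_hi = param_ranges["d_model"]
        yield {"lr": lr, "d_model": rng.randint(d_lo, d_hi)}

trials = list(random_sweep_sketch(
    {"lr": (1e-4, 5e-3), "d_model": (32, 256)}, n_trials=20))
```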
## Profiling
MLX-specific profiling tools for understanding performance on Apple Silicon:
```python
from lmxlab.experiments.profiling import (
    benchmark_fn,
    memory_estimate,
    profile_forward,
    profile_generation,
)

# Time any function
timing = benchmark_fn(lambda: model(tokens), n_iter=10)

# Model memory footprint
mem = memory_estimate(model)

# Forward pass throughput
fwd = profile_forward(model, tokens)
print(f"{fwd['tokens_per_sec']:.0f} tokens/sec")

# Generation speed (prefill + decode)
gen = profile_generation(model, prompt, max_tokens=50)
print(f"Prefill: {gen['prefill_ms']:.1f}ms")
print(f"Decode: {gen['decode_ms_per_token']:.1f}ms/token")
```
## Recipe Scripts
Experiment scripts:
| Recipe | Experiment | Description |
|---|---|---|
| `run_experiment.py` | General | Structured experiment with tracking |
| `hyp006_dropout_norm.py` | HYP-006 | Dropout x normalization at 30M params |
| `hybrid_baselines.py` | Hybrid | 5-architecture comparison at 10M params |
| `sweep_learning_rate.py` | General | Grid/random learning rate sweep |
| `benchmark_compile.py` | General | `mx.compile` speedup measurement |
| `profile_models.py` | General | Architecture profiling comparison |
| `compare_training.py` | General | Architecture training dynamics |
| `compare_architectures.py` | General | Side-by-side architecture comparison |
| `ablation_gpt_to_llama.py` | HYP-001 | Feature ablation study |
| `compare_schedules.py` | General | LR schedules and optimizer comparison |
| `analyze_experiments.py` | General | Statistical analysis tools |
## Pre-Registered Experiments
Following Platt's strong inference and Chamberlin's multiple working hypotheses, lmxlab pre-registers experiments with competing hypotheses and falsification criteria before running them. This guards against confirmation bias and the garden of forking paths (Gelman & Loken, 2013).
Each pre-registered experiment specifies:
- A question (what is to be learned)
- Competing hypotheses (at least 2-4 plausible explanations)
- Design (controlled experimental conditions)
- Analysis plan (how results will be interpreted)
- Falsification criteria (what would disprove each hypothesis)
Completed experiments with trusted results:
- HYP-001c/d: GPT-to-LLaMA feature ablation at 3M params
- HYP-006: Dropout x normalization interaction at 30M params
- Hybrid baselines: 5-architecture comparison at 10M params
See Results for findings from these experiments.
### Why Pre-Registration Matters
Without pre-registration, hypotheses may be unconsciously adjusted after seeing results, fitting a narrative to the data rather than testing a prediction against it. Pre-registration commits to the analysis before results are known, which:
- Makes positive results more credible (they were predicted, not retrofitted)
- Makes negative results informative (they falsify a stated hypothesis)
- Forces clearer reasoning about expectations and their justifications
## Running Experiments
```python
from lmxlab.experiments.runner import ExperimentConfig, ExperimentRunner

# 1. Configure
config = ExperimentConfig(
    name="my-experiment",
    description="Testing new learning rate",
    time_budget_s=300.0,
    seed=42,
)

# 2. Run
runner = ExperimentRunner(config)
runner.start()
# ... train your model ...

# 3. Check time budget
if runner.is_time_up():
    print("Time's up!")

# 4. Log results
entry = runner.finish(
    metrics={"val_loss": 2.5, "val_bpb": 1.8},
    param_count=model.count_parameters(),
    config_dict={"lr": 1e-3, "arch": "llama"},
    status="keep",
)
```