Experiment Methodology

lmxlab includes an experiment framework inspired by Karpathy's autoresearch patterns. This page explains how experiments are designed, run, and tracked.

Core Principles

FLOP-Matched Comparisons

Architecture comparisons use FLOP-matched compute budgets (DEC-004). Each architecture trains until it consumes the same total floating-point operations, isolating architectural quality from implementation speed.

from lmxlab.training.callbacks import FLOPCounter

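# Each architecture gets the same compute budget.
# estimate_flops_per_step, model, seq_len, and batch_size are assumed
# to be defined elsewhere (see experiments/flops.py).
flop_counter = FLOPCounter(
    flops_per_step=estimate_flops_per_step(model, seq_len, batch_size),
    flop_budget=1e15,  # 1 PFLOP total (1e15 floating-point operations)
)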

FLOPs are estimated analytically using Megatron-LM-style formulas (6 * N * D for dense transformers, with corrections for SwiGLU gated FFNs). See experiments/flops.py for details.
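
As a rough illustration of the 6 * N * D rule, the sketch below computes an approximate per-step cost and the number of steps a budget allows. The helper names `approx_flops_per_step` and `steps_within_budget` are hypothetical and are not lmxlab's API; the estimator in experiments/flops.py also applies SwiGLU corrections, which this sketch omits.

```python
def approx_flops_per_step(n_params: int, seq_len: int, batch_size: int) -> float:
    """Approximate training FLOPs for one step using 6 * N * D.

    N is the parameter count and D is the number of tokens processed in the
    step (batch_size * seq_len). The factor 6 covers roughly 2 FLOPs per
    parameter per token in the forward pass and 4 in the backward pass.
    """
    tokens_per_step = batch_size * seq_len
    return 6.0 * n_params * tokens_per_step


def steps_within_budget(flop_budget: float, flops_per_step: float) -> int:
    """Number of whole training steps that fit inside a FLOP budget."""
    return int(flop_budget // flops_per_step)


# Illustrative numbers only: 10M parameters, 256-token sequences, batch of 32.
per_step = approx_flops_per_step(n_params=10_000_000, seq_len=256, batch_size=32)
n_steps = steps_within_budget(flop_budget=1e15, flops_per_step=per_step)
print(f"{per_step:.3e} FLOPs/step, {n_steps} steps within 1e15 FLOPs")
```

Under this approximation a model with fewer parameters gets proportionally more steps from the same budget, which is the point of FLOP matching.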

Time budgets as secondary metric

Wall-clock time budgets (DEC-001) remain available for efficiency benchmarks where speed is the metric of interest, but FLOP matching is the primary method for architecture comparisons.

Validation Split

Every experiment uses a train/val split (DEC-008). Validation loss is the primary metric; training loss is a secondary diagnostic only.

Shakespeare char-level: 90/10 sequential split (~1.0M train tokens, ~111K val tokens), matching nanoGPT convention
TinyStories BPE: Uses the dataset's built-in train/val splits from HuggingFace
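
A sequential split keeps the first part of the token stream for training and the tail for validation, with no shuffling. The following is a minimal sketch of that idea; `sequential_split` is a hypothetical helper written for this page, not the function lmxlab uses.

```python
from typing import Sequence, Tuple


def sequential_split(
    tokens: Sequence[int], train_frac: float = 0.9
) -> Tuple[Sequence[int], Sequence[int]]:
    """Split a token sequence into contiguous train and validation parts.

    The first train_frac of the tokens become the training set and the
    remainder become the validation set; no shuffling is applied.
    """
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")
    cut = int(len(tokens) * train_frac)
    return tokens[:cut], tokens[cut:]


tokens = list(range(1000))  # stand-in for an encoded corpus
train, val = sequential_split(tokens)
print(len(train), len(val))
```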

Evaluation uses shuffle=False for deterministic results. Periodic eval runs every 500 steps plus a final eval at the end of training.

Superseded experiments

HYP-001 and HYP-001b had no validation split and reported training loss as the primary metric. This masked severe overfitting. Results from these runs are superseded; see Results for trusted findings.

Git-as-Experiment-Infra

Each experiment records the git commit hash, so results are tied to a specific code version. This provides natural reproducibility without external tooling.
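
One way to capture the commit hash is to call `git rev-parse HEAD` from Python. The sketch below is an assumption about how this could be done, not a description of lmxlab's internals; `current_commit` is a hypothetical helper.

```python
import subprocess


def current_commit(default: str = "unknown") -> str:
    """Return the current git commit hash, or `default` if it is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory has no commit to report
        return default
    return result.stdout.strip()


commit = current_commit()
print(commit)
```

A recorded hash only identifies the code version if the working tree was clean at run time, so it may be worth checking for uncommitted changes as well.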

Simplicity Bias

When two approaches achieve similar metrics, prefer the simpler one (fewer parameters, less code, fewer hyperparameters). The simplicity_score function quantifies this:

from lmxlab.experiments.analysis import simplicity_score

# Rewards improvements that use fewer parameters
score = simplicity_score(
    entry,
    baseline_params=1_000_000,
    baseline_metric=3.5,
)

Multi-Seed Runs

Single-seed results are unreliable. Run experiments with multiple seeds and report mean +/- std:

uv run python recipes/hyp006_dropout_norm.py  # runs 3 seeds per config
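
Aggregating per-seed results can be as simple as the sketch below, which uses the standard library's sample standard deviation. The `summarize` helper and the loss values are illustrative only and do not come from any lmxlab run.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) for per-seed metric values."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(values), statistics.stdev(values)


# Made-up validation losses from three seeds, for illustration only.
val_losses = [2.50, 2.60, 2.70]
mean, std = summarize(val_losses)
print(f"val_loss = {mean:.3f} +/- {std:.3f}")
```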

Tracking

results.jsonl

All experiments log to experiments/results.jsonl, a line-delimited JSON file suitable for parsing, version control, and analysis. Each entry records:

Field	Description
`experiment`	Name/tag
`commit`	Git commit hash
`status`	`keep`, `discard`, or `crash`
`val_loss`	Validation loss
`val_bpb`	Bits per byte
`train_loss`	Final training loss
`param_count`	Model parameters
`wall_time_s`	Total wall-clock time
`seed`	Random seed
`config`	Full experiment config dict
`metrics`	All collected metrics
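
Because the file is line-delimited JSON, it can be read with the standard library alone. The sketch below parses entries and picks the best kept run by `val_loss`; it uses an in-memory stand-in with made-up values rather than a real experiments/results.jsonl, and the helper names are hypothetical.

```python
import io
import json


def load_results(lines):
    """Parse line-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def best_kept(entries, metric="val_loss"):
    """Return the kept entry with the lowest value of `metric`, or None."""
    kept = [
        e for e in entries
        if e.get("status") == "keep" and e.get(metric) is not None
    ]
    return min(kept, key=lambda e: e[metric]) if kept else None


# In-memory stand-in for experiments/results.jsonl; the values are made up.
sample = io.StringIO(
    '{"experiment": "a", "status": "keep", "val_loss": 2.7, "seed": 0}\n'
    '{"experiment": "b", "status": "discard", "val_loss": 2.9, "seed": 0}\n'
    '{"experiment": "c", "status": "keep", "val_loss": 2.5, "seed": 1}\n'
)
entries = load_results(sample)
best = best_kept(entries)
print(best["experiment"], best["val_loss"])
```

To read a real file, pass an open file handle to `load_results` in place of the `StringIO` object.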

MLflow Integration

Experiments can optionally log to MLflow for interactive visualization. MLflow uses a local SQLite backend by default:

from lmxlab.experiments.mlflow import MLflowExperimentRunner

runner = MLflowExperimentRunner(config)
runner.start()  # logs to sqlite:///mlflow.db

Status: Keep vs Discard

After each experiment, compare the result against the previous best. If the new result improves the target metric, mark it keep; otherwise mark it discard. Crashed experiments are marked crash. The status field gives a quick way to filter results.
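
The keep/discard rule can be sketched as a small function. This is a hedged illustration of the rule as described above; `decide_status` is a hypothetical name, and treating the first result as keep is an assumption.

```python
from typing import Optional


def decide_status(
    new_value: float,
    best_so_far: Optional[float],
    lower_is_better: bool = True,
) -> str:
    """Return "keep" if new_value improves on best_so_far, else "discard".

    Assumption: the first result (no previous best) is always kept. Crashes
    are handled separately by the caller and are not represented here.
    """
    if best_so_far is None:
        return "keep"
    if lower_is_better:
        improved = new_value < best_so_far
    else:
        improved = new_value > best_so_far
    return "keep" if improved else "discard"


print(decide_status(2.4, best_so_far=2.5))  # lower validation loss than before
print(decide_status(2.6, best_so_far=2.5))  # higher validation loss than before
```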

Sweep Utilities

Grid Sweep

Exhaustive search over discrete parameter values:

from lmxlab.experiments.sweep import grid_sweep

for params in grid_sweep({
    "lr": [1e-4, 3e-4, 1e-3],
    "d_model": [64, 128, 256],
}):
    # params = {"lr": 1e-4, "d_model": 64}, etc.
    train_model(**params)
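
Conceptually, a grid sweep is the Cartesian product of the value lists. The sketch below shows that idea with `itertools.product`; `simple_grid` is a hypothetical stand-in and may differ from how `grid_sweep` is implemented.

```python
import itertools


def simple_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    names = list(param_grid)
    for combo in itertools.product(*(param_grid[name] for name in names)):
        yield dict(zip(names, combo))


configs = list(simple_grid({"lr": [1e-4, 3e-4, 1e-3], "d_model": [64, 128, 256]}))
print(len(configs))
print(configs[0])
```

The number of configurations is the product of the list lengths, so grids grow quickly as parameters are added.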

Random Sweep

Sample from continuous ranges (often more efficient than grid search for high-dimensional spaces):

from lmxlab.experiments.sweep import random_sweep

for params in random_sweep(
    param_ranges={"lr": (1e-4, 5e-3), "d_model": (32, 256)},
    n_trials=20,
):
    train_model(**params)
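
The source does not state how `random_sweep` samples each range or whether it rounds values such as `d_model` to integers. The sketch below is one possible approach under explicit assumptions: log-uniform sampling for parameters named in `log_scale`, rounding for those named in `integer`, and a fixed seed for repeatability. `simple_random_sweep` and both keyword arguments are hypothetical.

```python
import math
import random


def simple_random_sweep(param_ranges, n_trials, seed=0, log_scale=(), integer=()):
    """Yield n_trials parameter dicts sampled from (low, high) ranges."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    for _ in range(n_trials):
        params = {}
        for name, (low, high) in param_ranges.items():
            if name in log_scale:
                # Sample uniformly in log space, useful for learning rates
                value = math.exp(rng.uniform(math.log(low), math.log(high)))
            else:
                value = rng.uniform(low, high)
            value = min(max(value, low), high)  # guard against rounding drift
            params[name] = int(round(value)) if name in integer else value
        yield params


trials = list(
    simple_random_sweep(
        {"lr": (1e-4, 5e-3), "d_model": (32, 256)},
        n_trials=20,
        log_scale=("lr",),
        integer=("d_model",),
    )
)
print(trials[0])
```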

Profiling

MLX-specific profiling tools for understanding performance on Apple Silicon:

from lmxlab.experiments.profiling import (
    benchmark_fn,
    memory_estimate,
    profile_forward,
    profile_generation,
)

# Time any function
timing = benchmark_fn(lambda: model(tokens), n_iter=10)

# Model memory footprint
mem = memory_estimate(model)

# Forward pass throughput
fwd = profile_forward(model, tokens)
print(f"{fwd['tokens_per_sec']:.0f} tokens/sec")

# Generation speed (prefill + decode)
gen = profile_generation(model, prompt, max_tokens=50)
print(f"Prefill: {gen['prefill_ms']:.1f}ms")
print(f"Decode: {gen['decode_ms_per_token']:.1f}ms/token")
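
For intuition about what a timing helper does, here is a framework-agnostic sketch using `time.perf_counter` with warm-up iterations. It is not the implementation of `benchmark_fn`; `time_fn` and the keys of its return value are hypothetical.

```python
import statistics
import time


def time_fn(fn, n_warmup=2, n_iter=10):
    """Time a zero-argument callable and return summary statistics in ms."""
    for _ in range(n_warmup):
        fn()  # warm-up runs are excluded from the measurements
    samples_ms = []
    for _ in range(n_iter):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
        "n_iter": n_iter,
    }


# Pure-Python workload used as a stand-in for a model forward pass.
timing = time_fn(lambda: sum(i * i for i in range(10_000)))
print(f"{timing['mean_ms']:.3f} ms per call")
```

MLX evaluates arrays lazily, so when timing MLX code by hand the callable should force evaluation (for example with `mx.eval` on its output); otherwise only graph construction is measured.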

Recipe Scripts

The following experiment scripts are available under recipes/:

Recipe	Experiment	Description
`run_experiment.py`	General	Structured experiment with tracking
`hyp006_dropout_norm.py`	HYP-006	Dropout x normalization at 30M params
`hybrid_baselines.py`	Hybrid	5-architecture comparison at 10M params
`sweep_learning_rate.py`	General	Grid/random learning rate sweep
`benchmark_compile.py`	General	`mx.compile` speedup measurement
`profile_models.py`	General	Architecture profiling comparison
`compare_training.py`	General	Architecture training dynamics
`compare_architectures.py`	General	Side-by-side architecture comparison
`ablation_gpt_to_llama.py`	HYP-001	Feature ablation study
`compare_schedules.py`	General	LR schedules and optimizer comparison
`analyze_experiments.py`	General	Statistical analysis tools

Pre-Registered Experiments

Following Platt's strong inference and Chamberlin's multiple working hypotheses, lmxlab pre-registers experiments with competing hypotheses and falsification criteria before running them. This guards against confirmation bias and the garden of forking paths (Gelman & Loken, 2013).

Each pre-registered experiment specifies:

A question (what is to be learned)
Competing hypotheses (at least two plausible explanations, typically 2-4)
Design (controlled experimental conditions)
Analysis plan (how results will be interpreted)
Falsification criteria (what would disprove each hypothesis)

Completed experiments with trusted results:

HYP-001c/d: GPT-to-LLaMA feature ablation at 3M params
HYP-006: Dropout x normalization interaction at 30M params
Hybrid baselines: 5-architecture comparison at 10M params

See Results for findings from these experiments.

Why Pre-Registration Matters

Without pre-registration, hypotheses may be unconsciously adjusted after seeing results, fitting a narrative to the data rather than testing a prediction against it. Pre-registration commits to the analysis before results are known, which:

Makes positive results more credible (they were predicted, not retrofitted)
Makes negative results informative (they falsify a stated hypothesis)
Forces clearer reasoning about expectations and their justifications

Running Experiments

from lmxlab.experiments.runner import ExperimentConfig, ExperimentRunner

# 1. Configure
config = ExperimentConfig(
    name="my-experiment",
    description="Testing new learning rate",
    time_budget_s=300.0,
    seed=42,
)

# 2. Run
runner = ExperimentRunner(config)
runner.start()

# ... train your model ...

# 3. Check the time budget (typically polled inside the training loop)
if runner.is_time_up():
    print("Time's up!")

# 4. Log results
entry = runner.finish(
    metrics={"val_loss": 2.5, "val_bpb": 1.8},
    param_count=model.count_parameters(),
    config_dict={"lr": 1e-3, "arch": "llama"},
    status="keep",
)