# Experiments

Experiment runner, sweeps, tracking, and analysis.
## Runner

### `lmxlab.experiments.runner.ExperimentConfig`

*dataclass*

Configuration for an experiment run.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Experiment name/tag. | `'experiment'` |
| `description` | `str` | Human-readable description. | `''` |
| `time_budget_s` | `float` | Maximum wall-clock time in seconds. | `300.0` |
| `flop_budget` | `float \| None` | Maximum FLOPs budget (e.g. `1e15` for 1 PFLOPs). | `None` |
| `seed` | `int` | Random seed. | `42` |
| `output_dir` | `str` | Directory for outputs. | `'experiments'` |
Source code in `src/lmxlab/experiments/runner.py`

Attributes (all class-attribute, instance-attribute):

- `description = ''`
- `flop_budget = None`
- `name = 'experiment'`
- `output_dir = 'experiments'`
- `seed = 42`
- `time_budget_s = 300.0`

#### `__init__(name='experiment', description='', time_budget_s=300.0, flop_budget=None, seed=42, output_dir='experiments')`
### `lmxlab.experiments.runner.ExperimentRunner`

Run experiments with autoresearch patterns.

Enforces fixed time budgets, logs results to `results.jsonl`, and tracks git commits for reproducibility.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `ExperimentConfig` | Experiment configuration. | *required* |
| `log` | `ExperimentLog \| None` | Experiment log (defaults to `results.jsonl` in `output_dir`). | `None` |
Source code in `src/lmxlab/experiments/runner.py`

Attributes (instance-attribute):

- `config = config`
- `log = log or ExperimentLog(output / 'results.jsonl')`
- `_start_time = 0.0`

#### `__init__(config, log=None)`

Source code in `src/lmxlab/experiments/runner.py`
#### `finish(metrics, param_count=0, config_dict=None, status='keep')`

Finish the experiment and log results.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `dict[str, Any]` | Dict of result metrics (must include `'val_loss'` or `'val_bpb'`). | *required* |
| `param_count` | `int` | Number of model parameters. | `0` |
| `config_dict` | `dict[str, Any] \| None` | Full experiment config for logging. | `None` |
| `status` | `str` | `'keep'`, `'discard'`, or `'crash'`. | `'keep'` |

Returns:

| Type | Description |
|---|---|
| `LogEntry` | The logged entry. |
Source code in `src/lmxlab/experiments/runner.py`

#### `is_time_up()`

#### `start()`

#### `time_remaining()`

Seconds remaining in the time budget.

Source code in `src/lmxlab/experiments/runner.py`
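The `start()` / `is_time_up()` / `time_remaining()` lifecycle is, at its core, a wall-clock loop. A minimal stdlib sketch of that pattern (`run_with_budget` and `step` are illustrative names, not part of the lmxlab API):

```python
import time

def run_with_budget(step, time_budget_s):
    """Run step() repeatedly until the wall-clock budget is spent.

    Illustrative sketch of the start()/is_time_up()/time_remaining()
    lifecycle described above; not the lmxlab source.
    """
    start = time.monotonic()                           # start(): record start time
    while (time.monotonic() - start) < time_budget_s:  # is_time_up() check
        step()
    # time_remaining() clamps at zero once the budget is exhausted.
    return max(0.0, time_budget_s - (time.monotonic() - start))

# A 50 ms budget with a 10 ms step runs a handful of steps, then stops.
remaining = run_with_budget(lambda: time.sleep(0.01), time_budget_s=0.05)
```

Using a monotonic clock matters here: `time.time()` can jump backwards under NTP adjustments, which would corrupt the budget check.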
## Sweep

### `lmxlab.experiments.sweep`

Hyperparameter sweep utilities.

### `grid_sweep(param_grid)`

Generate all combinations from a parameter grid.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `param_grid` | `dict[str, list[Any]]` | Dict mapping parameter names to lists of values to try. | *required* |

Yields:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dicts with one value per parameter. |

Example:

```
>>> list(grid_sweep({'lr': [1e-3, 1e-4], 'layers': [2, 4]}))
[{'lr': 0.001, 'layers': 2}, {'lr': 0.001, 'layers': 4}, {'lr': 0.0001, 'layers': 2}, {'lr': 0.0001, 'layers': 4}]
```

Source code in `src/lmxlab/experiments/sweep.py`
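The documented behavior maps directly onto `itertools.product`. A minimal re-implementation sketch (the shipped lmxlab version may differ in detail):

```python
from itertools import product

def grid_sweep(param_grid):
    """Yield every combination from a parameter grid, last parameter varying fastest."""
    names = list(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        yield dict(zip(names, values))

configs = list(grid_sweep({'lr': [1e-3, 1e-4], 'layers': [2, 4]}))
# A 2 x 2 grid yields 4 configs, matching the doctest above.
```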
### `random_sweep(param_ranges, n_trials=10, seed=42, log_scale=None)`

Generate random parameter combinations.

Samples uniformly from continuous ranges by default. Parameters listed in `log_scale` are sampled in log-space, which is standard for learning rates and other parameters spanning multiple orders of magnitude. Uses Python's `random` module (no MLX dependency), so sweep configuration can be computed without Apple Silicon hardware.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `param_ranges` | `dict[str, tuple[float, float]]` | Dict mapping parameter names to (min, max) tuples. | *required* |
| `n_trials` | `int` | Number of random combinations. | `10` |
| `seed` | `int` | Random seed for reproducibility. | `42` |
| `log_scale` | `set[str] \| None` | Set of parameter names to sample in log-space. For these, (min, max) must both be positive. | `None` |

Yields:

| Type | Description |
|---|---|
| `dict[str, float]` | Dicts with one sampled value per parameter. |

Example:

```
>>> configs = list(random_sweep(
...     param_ranges={"lr": (1e-5, 1e-1), "d_model": (64, 512)},
...     n_trials=5,
...     log_scale={"lr"},
... ))
```

Source code in `src/lmxlab/experiments/sweep.py`
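The log-space trick is worth seeing concretely: sampling `exp(uniform(log(min), log(max)))` spends equal probability mass on each order of magnitude, so a learning-rate range like (1e-5, 1e-1) is covered evenly instead of being dominated by values near 1e-1. A stdlib-only sketch of the documented behavior (the shipped implementation may differ):

```python
import math
import random

def random_sweep(param_ranges, n_trials=10, seed=42, log_scale=None):
    """Yield n_trials random configs; log_scale params are sampled in log-space."""
    rng = random.Random(seed)   # local RNG: reproducible, no global-state pollution
    log_scale = log_scale or set()
    for _ in range(n_trials):
        config = {}
        for name, (lo, hi) in param_ranges.items():
            if name in log_scale:
                # Uniform in log-space covers each order of magnitude equally.
                config[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
            else:
                config[name] = rng.uniform(lo, hi)
        yield config

configs = list(random_sweep({"lr": (1e-5, 1e-1), "d_model": (64, 512)},
                            n_trials=5, log_scale={"lr"}))
```

Note that `d_model` comes back as a float here; real sweeps typically round such architectural parameters to integers before use.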
## Tracking

### `lmxlab.experiments.tracking.LogEntry`

*dataclass*

A single experiment result entry.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `experiment` | `str` | Experiment name/tag. | `''` |
| `commit` | `str` | Git commit hash. | `''` |
| `status` | `str` | Outcome (`'keep'`, `'discard'`, `'crash'`). | `'keep'` |
| `val_bpb` | `float` | Validation bits-per-byte. | `0.0` |
| `val_loss` | `float` | Validation loss. | `0.0` |
| `train_loss` | `float` | Final training loss. | `0.0` |
| `param_count` | `int` | Number of model parameters. | `0` |
| `total_flops` | `float` | Total FLOPs consumed during training. | `0.0` |
| `peak_memory_mb` | `float` | Peak memory usage in MB. | `0.0` |
| `wall_time_s` | `float` | Wall-clock time in seconds. | `0.0` |
| `description` | `str` | Human-readable description. | `''` |
| `config` | `dict[str, Any]` | Full experiment config dict. | `dict()` |
| `metrics` | `dict[str, Any]` | Additional metrics dict. | `dict()` |
| `timestamp` | `float` | Unix timestamp (auto-filled). | `time()` |
| `seed` | `int` | Random seed used. | `42` |
Source code in `src/lmxlab/experiments/tracking.py`

Attributes (all class-attribute, instance-attribute):

- `commit = ''`
- `config = field(default_factory=dict)`
- `description = ''`
- `experiment = ''`
- `metrics = field(default_factory=dict)`
- `param_count = 0`
- `peak_memory_mb = 0.0`
- `seed = 42`
- `status = 'keep'`
- `timestamp = field(default_factory=time.time)`
- `total_flops = 0.0`
- `train_loss = 0.0`
- `val_bpb = 0.0`
- `val_loss = 0.0`
- `wall_time_s = 0.0`

#### `__init__(experiment='', commit='', status='keep', val_bpb=0.0, val_loss=0.0, train_loss=0.0, param_count=0, total_flops=0.0, peak_memory_mb=0.0, wall_time_s=0.0, description='', config=dict(), metrics=dict(), timestamp=time.time(), seed=42)`
### `lmxlab.experiments.tracking.ExperimentLog`

Append-only experiment log backed by `results.jsonl`.

This is the ground truth for all experiments. Zero dependencies, git-trackable, easy for agents to parse.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Path to `results.jsonl` file. | `'results.jsonl'` |

Source code in `src/lmxlab/experiments/tracking.py`

Attributes (instance-attribute):

- `path = Path(path)`

#### `__init__(path='results.jsonl')`
#### `best(metric='val_bpb', lower_is_better=True)`

Find the best entry by a metric.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metric` | `str` | Name of the metric field. | `'val_bpb'` |
| `lower_is_better` | `bool` | If True, minimize; else maximize. | `True` |

Returns:

| Type | Description |
|---|---|
| `LogEntry \| None` | Best LogEntry, or None if log is empty. |

Source code in `src/lmxlab/experiments/tracking.py`
#### `load()`

Load all entries from the log.

Returns:

| Type | Description |
|---|---|
| `list[LogEntry]` | List of LogEntry objects. |

Source code in `src/lmxlab/experiments/tracking.py`
#### `log(entry)`

Append an entry to the log.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entry` | `LogEntry` | Experiment result to log. | *required* |

Source code in `src/lmxlab/experiments/tracking.py`
#### `summary()`

Get summary statistics of all experiments.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with counts, best metrics, etc. |

Source code in `src/lmxlab/experiments/tracking.py`
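The append-only `results.jsonl` pattern can be sketched with nothing but the stdlib: one JSON object per line, appended and never rewritten. Plain dicts stand in for `LogEntry` here, and the helper names (`append_entry`, `load_entries`, `best_entry`) are illustrative, not the lmxlab API:

```python
import json
import os
import tempfile
from pathlib import Path

def append_entry(path, entry):
    """Append one experiment result as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def load_entries(path):
    """Read all entries back; a missing file is an empty log."""
    p = Path(path)
    if not p.exists():
        return []
    return [json.loads(line) for line in p.read_text().splitlines() if line]

def best_entry(path, metric="val_bpb", lower_is_better=True):
    """Best entry by a metric, or None if the log is empty."""
    entries = load_entries(path)
    if not entries:
        return None
    pick = min if lower_is_better else max
    return pick(entries, key=lambda e: e[metric])

# Demo in a throwaway directory.
log_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
append_entry(log_path, {"experiment": "baseline", "val_bpb": 1.50})
append_entry(log_path, {"experiment": "rope", "val_bpb": 1.42})
```

Appending whole lines is what makes the format crash-safe and git-friendly: a killed run can at worst leave one truncated final line, and diffs are always pure additions.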
## Analysis

### `lmxlab.experiments.analysis`

Analysis utilities for experiment results.

### `cohens_d(group_a, group_b)`

Compute Cohen's d effect size between two groups.

Uses pooled standard deviation (equal-variance assumption). Useful for reporting effect sizes alongside p-values, as recommended in the pre-registered experiment plans.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `group_a` | `list[float]` | Values from the first group. | *required* |
| `group_b` | `list[float]` | Values from the second group. | *required* |

Returns:

| Type | Description |
|---|---|
| `float` | Cohen's d. Positive means `group_a` > `group_b`. Conventions: \|d\| < 0.2 small, 0.5 medium, 0.8 large. |

Source code in `src/lmxlab/experiments/analysis.py`
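Under the equal-variance assumption, the pooled standard deviation weights each group's sample variance by its degrees of freedom. A stdlib sketch of the formula (not the lmxlab source):

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation (equal-variance assumption)."""
    na, nb = len(group_a), len(group_b)
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (na - 1)  # sample variance
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (nb - 1)
    # Pooled std: variances weighted by degrees of freedom.
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled

# Means differ by exactly one pooled standard deviation -> d = 1.0 (large).
d = cohens_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```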
### `compare_experiments(log, metric='val_bpb')`

Compare all kept experiments by a metric.

Returns experiments sorted by the metric (ascending).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `log` | `ExperimentLog` | Experiment log to analyze. | *required* |
| `metric` | `str` | Metric name to compare. | `'val_bpb'` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of dicts with experiment name, metric value, param_count, and wall_time. |

Source code in `src/lmxlab/experiments/analysis.py`
### `compute_statistics(values)`

Compute basic statistics for a list of values.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `values` | `list[float]` | List of numeric values. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | Dict with mean, std, min, max, n. |

Source code in `src/lmxlab/experiments/analysis.py`
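The documented return shape maps directly onto the stdlib `statistics` module. A minimal sketch, assuming the `std` field is the sample standard deviation (the exact field semantics are the lmxlab source's to define):

```python
import statistics

def compute_statistics(values):
    """Mean, sample std, min, max, and count for a list of values."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
        "n": float(len(values)),
    }

stats = compute_statistics([1.0, 2.0, 3.0, 4.0])
```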
### `confidence_interval(values, confidence=0.95)`

Compute a confidence interval for the mean.

Uses the t-distribution for small samples. Falls back to z-approximation for n >= 30.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `values` | `list[float]` | Sample values. | *required* |
| `confidence` | `float` | Confidence level (default 0.95). | `0.95` |

Returns:

| Type | Description |
|---|---|
| `tuple[float, float]` | (lower, upper) bounds of the confidence interval. |

Source code in `src/lmxlab/experiments/analysis.py`
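The interval is `mean ± critical_value * stderr`. The documented lmxlab version uses a t-distribution for small samples; the stdlib-only sketch below uses the z-approximation throughout (via `statistics.NormalDist`), so it comes out slightly too narrow for small n:

```python
import math
import statistics
from statistics import NormalDist

def confidence_interval_z(values, confidence=0.95):
    """CI for the mean using the normal (z) approximation; sketch only."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    # Two-sided critical value, e.g. ~1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem

# Five repeated val_bpb measurements of the same config.
lo, hi = confidence_interval_z([1.42, 1.45, 1.40, 1.44, 1.43])
```

For five samples the proper t critical value is about 2.78 rather than 1.96, which is exactly why the documented function prefers the t-distribution at small n.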
### `simplicity_score(entry, baseline_params, baseline_metric, metric='val_bpb')`

Score an experiment by the simplicity bias principle.

Rewards improvements that use fewer parameters: `score = metric_improvement * (baseline_params / param_count)`.

Higher is better. Positive means improvement over baseline.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entry` | `LogEntry` | Experiment entry to score. | *required* |
| `baseline_params` | `int` | Baseline parameter count. | *required* |
| `baseline_metric` | `float` | Baseline metric value. | *required* |
| `metric` | `str` | Metric name (lower is better). | `'val_bpb'` |

Returns:

| Type | Description |
|---|---|
| `float` | Simplicity-weighted improvement score. |

Source code in `src/lmxlab/experiments/analysis.py`
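A direct transcription of the formula for a lower-is-better metric makes the parameter weighting concrete (the argument order here is illustrative, not the lmxlab signature, which takes a `LogEntry`):

```python
def simplicity_score(metric_value, param_count, baseline_metric, baseline_params):
    """score = metric_improvement * (baseline_params / param_count)."""
    improvement = baseline_metric - metric_value  # positive = better than baseline
    return improvement * (baseline_params / param_count)

# The same 0.05 bpb improvement scores twice as high at half the parameters.
big = simplicity_score(1.45, 10_000_000, 1.50, 10_000_000)
small = simplicity_score(1.45, 5_000_000, 1.50, 10_000_000)
```

The parameter ratio is what encodes the simplicity bias: a regression (negative improvement) is penalized, and a tie at fewer parameters still scores zero, so only genuine improvements are rewarded.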
## Profiling

### `lmxlab.experiments.profiling.benchmark_fn(fn, n_warmup=3, n_iter=10)`

Time a function with warmup iterations.

Runs the function `n_warmup` times (discarded), then `n_iter` times (timed). Returns timing statistics.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `fn` | `Callable[[], Any]` | Callable to benchmark (should include `mx.eval`). | *required* |
| `n_warmup` | `int` | Number of warmup iterations. | `3` |
| `n_iter` | `int` | Number of timed iterations. | `10` |

Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | Dict with mean_ms, std_ms, min_ms, max_ms, n_iter. |

Source code in `src/lmxlab/experiments/profiling.py`
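The warmup-then-measure pattern can be sketched with the stdlib alone. On MLX the callable must include `mx.eval`, since computation is lazy and an un-evaluated graph would make the timed region meaningless; this sketch (field names assumed from the docs above) works for any callable:

```python
import statistics
import time

def benchmark_fn(fn, n_warmup=3, n_iter=10):
    """Time fn after warmup iterations; returns stats in milliseconds."""
    for _ in range(n_warmup):   # warmup absorbs JIT/compile/cache effects
        fn()
    times_ms = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    return {
        "mean_ms": statistics.fmean(times_ms),
        "std_ms": statistics.stdev(times_ms) if n_iter > 1 else 0.0,
        "min_ms": min(times_ms),
        "max_ms": max(times_ms),
        "n_iter": float(n_iter),
    }

result = benchmark_fn(lambda: sum(range(10_000)), n_iter=5)
```

`time.perf_counter` is the right clock here: it is monotonic and has the highest available resolution for short intervals.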
### `lmxlab.experiments.profiling.memory_estimate(model)`

Estimate model memory usage from parameter shapes and dtypes.

This is a static estimate based on parameter tensors. Actual memory usage during inference includes activations, KV cache, and MLX graph overhead.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | Model to estimate. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with total_bytes, total_mb, param_count, and per-dtype breakdown. |

Source code in `src/lmxlab/experiments/profiling.py`
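The estimate is just "element count times dtype width" summed over parameters. A sketch where plain `(shape, dtype)` tuples stand in for MLX parameter tensors, with an assumed dtype-size table:

```python
# Bytes per element for common dtypes (assumed table, not an MLX API).
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def memory_estimate(params):
    """params: dict of name -> (shape tuple, dtype string)."""
    total_bytes = 0
    param_count = 0
    by_dtype = {}
    for shape, dtype in params.values():
        n = 1
        for dim in shape:
            n *= dim
        nbytes = n * DTYPE_BYTES[dtype]
        param_count += n
        total_bytes += nbytes
        by_dtype[dtype] = by_dtype.get(dtype, 0) + nbytes
    return {
        "total_bytes": total_bytes,
        "total_mb": total_bytes / (1024 * 1024),
        "param_count": param_count,
        "by_dtype": by_dtype,
    }

# A toy model: a 1024x512 fp16 embedding plus a 512-dim fp32 bias.
est = memory_estimate({
    "embedding": ((1024, 512), "float16"),
    "bias": ((512,), "float32"),
})
```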
### `lmxlab.experiments.profiling.count_parameters_by_module(model)`

Count parameters per top-level submodule.

Returns a dict mapping module names to their parameter counts, useful for understanding where parameters are concentrated.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | Model to analyze. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, int]` | Dict mapping module name to parameter count. |

Source code in `src/lmxlab/experiments/profiling.py`
### `lmxlab.experiments.profiling.profile_forward(model, tokens, n_warmup=2, n_iter=5)`

Profile forward pass throughput.

Times the model's forward pass and computes tokens/second.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | Language model to profile. | *required* |
| `tokens` | `array` | Input token IDs (batch, seq_len). | *required* |
| `n_warmup` | `int` | Warmup iterations. | `2` |
| `n_iter` | `int` | Timed iterations. | `5` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with timing stats, tokens_per_sec, batch_size, seq_len. |

Source code in `src/lmxlab/experiments/profiling.py`
### `lmxlab.experiments.profiling.profile_generation(model, prompt, max_tokens=50)`

Profile autoregressive generation throughput.

Measures time-to-first-token (prompt processing) and per-token generation speed.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | Language model. | *required* |
| `prompt` | `array` | Prompt token IDs (1, prompt_len). | *required* |
| `max_tokens` | `int` | Number of tokens to generate. | `50` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with prefill_ms, decode_ms_per_token, total_ms, tokens_generated. |
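Turning those returned fields into throughput numbers is simple arithmetic, and worth doing per-phase, since prefill processes the whole prompt in parallel while decode produces one token at a time. A sketch with made-up example timings (the helper name `generation_throughput` is illustrative, not lmxlab API):

```python
def generation_throughput(prefill_ms, decode_ms_per_token,
                          prompt_len, tokens_generated):
    """Derive tokens/sec for the prefill and decode phases separately."""
    prefill_tps = prompt_len * 1000.0 / prefill_ms   # prompt tokens per second
    decode_tps = 1000.0 / decode_ms_per_token        # generated tokens per second
    total_ms = prefill_ms + decode_ms_per_token * tokens_generated
    return {"prefill_tps": prefill_tps, "decode_tps": decode_tps,
            "total_ms": total_ms}

# E.g. a 128-token prompt prefilled in 40 ms, then 20 ms per decoded token.
tp = generation_throughput(40.0, 20.0, prompt_len=128, tokens_generated=50)
```

The gap between the two rates (3200 vs 50 tokens/sec in this made-up example) is typical: prefill is compute-bound and batched over the sequence, while decode is latency-bound per token.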