Experiment Results
This page presents findings from lmxlab's trusted experiments. All results use proper methodology: train/val splits, validation loss as the primary metric (DEC-008), and FLOP-matched compute budgets (DEC-004).
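FLOP-matching a compute budget (DEC-004) can be sketched with the standard ≈6·N·D training-FLOPs approximation (6 FLOPs per parameter per token). This is a minimal illustration, not the lab's actual scheduling code, and the batch size and sequence length below are assumed values, not the experiments' real settings:

```python
def flop_matched_steps(n_params: float, flop_budget: float,
                       batch_size: int, seq_len: int) -> int:
    """Steps that exhaust a FLOP budget, using the ~6*N FLOPs/token rule."""
    tokens = flop_budget / (6 * n_params)        # total tokens the budget allows
    return int(tokens / (batch_size * seq_len))  # tokens per step -> step count

# e.g. a 3M-param model with a 1 PFLOP budget (illustrative batch/seq):
steps = flop_matched_steps(3e6, 1e15, batch_size=64, seq_len=256)
```

Because steps are derived from the budget, a larger model automatically gets fewer steps, keeping total compute equal across configs.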
Methodology evolution
Early experiments (HYP-001, HYP-001b) had no validation split and reported training loss as the primary metric, which masked overfitting. Results from HYP-001c onward use corrected methodology and are the ones presented here. See Methodology for details.
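The corrected protocol's 90/10 split (introduced in HYP-001c) can be sketched as a simple hold-out over the token stream. The contiguous tail split and the placeholder data below are assumptions for illustration, not the experiments' actual loader:

```python
def train_val_split(tokens, val_fraction=0.1):
    """Hold out the final val_fraction of the token stream (90/10 split)."""
    n_val = int(len(tokens) * val_fraction)
    return tokens[:-n_val], tokens[-n_val:]

data = list(range(1000))          # placeholder for the encoded corpus
train, val = train_val_split(data)
# 900 training tokens, 100 held-out validation tokens
```

Evaluating on the held-out tail is what exposes the train-val gaps reported in the tables below; training loss alone cannot.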
HYP-001 Series: GPT vs LLaMA Ablation
Setup: 3M params, char-level Shakespeare, 1 PFLOP compute budget, 3 seeds per config.
This series tested whether LLaMA-style features (RMSNorm, RoPE, SwiGLU, GQA, no bias) improve over a GPT-2 baseline at small scale. After two rounds with methodology issues, HYP-001c and HYP-001d produced reliable results.
HYP-001c: Feature Ablation (lr=3e-4 vs 1e-4)
| Config | Val Loss | Train Loss | Gap |
|---|---|---|---|
| GPT baseline (lr=3e-4) | 1.609 +/- 0.008 | 0.781 | -0.83 |
| + RMSNorm (lr=1e-4) | 1.667 +/- 0.013 | 0.864 | -0.80 |
| + RoPE (lr=3e-4) | 1.607 +/- 0.005 | 0.767 | -0.84 |
| + SwiGLU FFN (lr=3e-4) | 1.614 +/- 0.011 | 0.682 | -0.93 |
| + GQA (lr=3e-4) | 1.608 +/- 0.012 | 0.687 | -0.92 |
| + No bias = LLaMA (lr=1e-4) | 1.670 +/- 0.012 | 0.819 | -0.85 |
All configs show massive overfitting: train loss runs 0.80-0.93 below val loss (the Gap column is train minus val) after 8-9 epochs over the repeated Shakespeare data. Architecture differences are within noise on val loss.
HYP-001d: Dropout x Architecture
| Architecture | Dropout | Val Loss | Train Loss | Gap |
|---|---|---|---|---|
| GPT | 0.0 | 1.611 +/- 0.009 | 0.778 | -0.83 |
| GPT | 0.1 | 1.573 +/- 0.002 | 1.188 | -0.38 |
| GPT | 0.2 | 1.560 +/- 0.002 | 1.265 | -0.30 |
| LLaMA | 0.0 | 1.671 +/- 0.013 | 0.813 | -0.86 |
| LLaMA | 0.1 | 1.570 +/- 0.001 | 1.283 | -0.29 |
| LLaMA | 0.2 | 1.586 +/- 0.009 | 1.354 | -0.23 |
Key findings:
- Dropout dominates architecture. Adding dropout=0.1 improves val loss by 0.04-0.10 (Cohen's d > 5), dwarfing all architecture effects.
- At dropout=0.1, architectures equalize. GPT and LLaMA reach nearly identical val loss (1.573 vs 1.570).
- Non-monotonic interaction: LLaMA's val loss bottoms out at dropout=0.1 and degrades at 0.2, while GPT keeps improving through 0.2. This was later shown to be a data-repetition artifact (see HYP-006).
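The effect sizes quoted in these findings can be reproduced from the per-config means and spreads. A minimal sketch, assuming the ± values in the tables are per-seed standard deviations and using the pooled-standard-deviation form of Cohen's d for equal group sizes:

```python
import math

def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d with a pooled standard deviation (equal group sizes)."""
    pooled = math.sqrt((std_a**2 + std_b**2) / 2)
    return (mean_a - mean_b) / pooled

# GPT dropout=0.0 vs dropout=0.1 (HYP-001d table above):
d = cohens_d(1.611, 0.009, 1.573, 0.002)  # large effect: |d| > 5
```

With only 3 seeds per config these d values are rough, but magnitudes this large are unambiguous even at tiny sample sizes.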
HYP-006: Dropout x Normalization at Scale
Setup: 30M params, TinyStories BPE tokenization, 2000 steps (single epoch), 3 seeds per config.
This experiment tested whether the dropout x normalization interaction from HYP-001d replicates at larger scale with proper tokenization.
GPT-30M Results
| Dropout | Val Loss | Train Loss | Gap |
|---|---|---|---|
| 0.0 | 3.049 +/- 0.025 | 3.009 | -0.04 |
| 0.1 | 3.150 +/- 0.044 | 3.162 | +0.01 |
| 0.2 | 3.335 +/- 0.058 | 3.382 | +0.05 |
| 0.3 | 3.493 +/- 0.050 | 3.538 | +0.05 |
LLaMA-30M Results
| Dropout | Val Loss | Train Loss | Gap |
|---|---|---|---|
| 0.0 | 2.512 +/- 0.013 | 3.309 | +0.80 |
| 0.1 | 2.578 +/- 0.012 | 3.408 | +0.83 |
| 0.2 | 2.656 +/- 0.008 | 3.499 | +0.84 |
| 0.3 | 2.745 +/- 0.005 | 3.593 | +0.85 |
Architecture Comparison
| Dropout | GPT Val Loss | LLaMA Val Loss | Gap | Cohen's d |
|---|---|---|---|---|
| 0.0 | 3.049 | 2.512 | -0.54 | -32.8 |
| 0.1 | 3.150 | 2.578 | -0.57 | -17.6 |
| 0.2 | 3.335 | 2.656 | -0.68 | -16.5 |
| 0.3 | 3.493 | 2.745 | -0.75 | -21.3 |
Key findings:
- LLaMA dominates at 30M. With BPE tokenization and realistic data, LLaMA beats GPT by 0.54 val loss at dropout=0 (growing to 0.75 at dropout=0.3), reversing the 3M char-level result where the two architectures were within noise.
- Dropout hurts in the undertrained regime. Both architectures degrade with higher dropout when training for less than one epoch. This is consistent with the literature: dropout only helps with data repetition.
- The HYP-001d interaction does not replicate. The non-monotonic dropout x normalization phenomenon was specific to the multi-epoch char-level regime. It was a data-repetition artifact, not a fundamental architecture property.
Hybrid Architecture Baselines
Setup: 10M params, TinyStories BPE tokenization, 2000 steps, single seed.
Five architectures compared: two pure-attention (GPT, LLaMA) and three SSM+attention hybrids (Falcon-H1, Jamba, Bamba).
| Rank | Architecture | Val Loss | Params | Wall Time |
|---|---|---|---|---|
| 1 | Falcon-H1 | 2.616 | 9.30M | 314s |
| 1 | Bamba | 2.616 | 9.30M | 343s |
| 3 | Jamba | 2.629 | 10.19M | 383s |
| 4 | LLaMA | 2.710 | 9.88M | 250s |
| 5 | GPT | 3.132 | 9.61M | 255s |
Key findings:
- Modern transformer features contribute more than SSM mixing. The GPT-to-LLaMA gap (0.42 val loss, 16%) is much larger than the LLaMA-to-hybrid gap (0.09 val loss, 4%).
- Falcon-H1 and Bamba tie at 2.616 val loss, with Falcon-H1 being faster (314s vs 343s).
- SSM hybrids are slower than pure-attention models (314-383s vs 250-255s) due to sequential SSM scan operations.
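The wall-time gap reflects the sequential dependency in SSM scans. A toy linear recurrence illustrates the structural point; this is a didactic sketch, not any of these models' actual SSM kernels (which use gated, input-dependent state updates and parallel-scan tricks):

```python
def linear_scan(x, a=0.9, b=0.1):
    """Toy SSM-style recurrence: h[t] = a*h[t-1] + b*x[t].
    Each step depends on the previous hidden state, so this loop is
    inherently sequential -- unlike attention, which computes all
    positions' outputs in parallel from the full sequence at once."""
    h, out = 0.0, []
    for xt in x:
        h = a * h + b * xt
        out.append(h)
    return out

ys = linear_scan([1.0, 1.0, 1.0])
```

Production SSM layers recover much of the lost parallelism with associative-scan implementations, but at these model sizes the scan overhead still shows up in wall time.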
Key Findings Across Experiments
Architecture matters at 10-30M, not at 3M
At 3M params with char-level tokenization, architecture differences wash out entirely; regularization (dropout) is the dominant factor. At 10-30M with BPE tokenization, architecture produces massive effect sizes (Cohen's d > 15).
Modern transformer features >> SSM mixing
The biggest gains come from adopting LLaMA-style features (RMSNorm, RoPE, SwiGLU, GQA) over the GPT-2 baseline. Adding SSM layers provides a further 4% improvement, but the bulk of the gain is from the attention-side improvements.
Dropout hurts in undertrained regimes
Dropout is highly beneficial when training on repeated data (multi-epoch) but actively harmful when training for less than one epoch. This explains the conflicting findings between HYP-001d (8-9 epochs, dropout helps) and HYP-006 (single epoch, dropout hurts).
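One practical way to apply this finding is to estimate how many epochs a run will make over its dataset and enable dropout only when data is repeated. A sketch with illustrative numbers (the batch size, sequence length, dataset size, and the 0.1 default are assumptions, not the experiments' actual values):

```python
def epochs_seen(steps, batch_size, seq_len, dataset_tokens):
    """How many passes over the dataset a training run makes."""
    return steps * batch_size * seq_len / dataset_tokens

def suggested_dropout(epochs):
    """Heuristic from these results: no dropout below one epoch of data."""
    return 0.0 if epochs <= 1.0 else 0.1

# HYP-006-style single-pass run: well under one epoch, so dropout stays off
e = epochs_seen(steps=2000, batch_size=32, seq_len=512, dataset_tokens=5e8)
```

In the HYP-001d regime (8-9 epochs over Shakespeare) the same heuristic would switch dropout on, matching the observed 0.04-0.10 val-loss improvement there.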
The dropout x normalization interaction was an artifact
The non-monotonic interaction between dropout rate and normalization type (ANOM-009/010/011) was specific to the multi-epoch char-level regime and does not generalize. It was caused by data repetition, not a fundamental property of LayerNorm vs RMSNorm.
Methodology Notes
See the Methodology page for full details on how experiments are designed, controlled, and tracked. Key methodology improvements over the course of this work:
- HYP-001 / HYP-001b: No validation split. Training loss reported as primary metric. Results unreliable; superseded.
- HYP-001c: Added 90/10 train/val split. Revealed massive overfitting that was invisible in prior rounds.
- HYP-001d: Val loss as primary metric (DEC-008). Discovered dropout as dominant intervention.
- HYP-006: Scaled to 30M params with BPE tokenization. Showed that small-scale char-level findings don't generalize.
- Hybrid baselines: Extended to SSM+attention architectures. Confirmed modern features matter more than mixing strategy.