Evaluation
Metrics for language model evaluation.
Overview
lmxlab provides standard evaluation metrics:
- Perplexity: exponential of the average cross-entropy loss. Lower is better. A perplexity of 10 means the model is as uncertain as choosing uniformly among 10 tokens.
- Bits-per-byte (BPB): cross-entropy loss normalized by bytes. This is tokenizer-independent, making it useful for comparing models with different vocabularies.
- pass@k: estimates the probability that at least one of k code samples passes a set of tests (Chen et al., 2021). Used for code generation evaluation.
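Perplexity and BPB both reduce to functions of the same average next-token cross-entropy. As a quick numeric sketch (plain Python; the loss value and the 3.5 bytes-per-token figure are illustrative, not lmxlab defaults):

```python
import math

avg_loss_nats = 2.0                        # illustrative average cross-entropy per token
ppl = math.exp(avg_loss_nats)              # ~7.39: as uncertain as ~7 equally likely tokens
bpb = avg_loss_nats / (math.log(2) * 3.5)  # ~0.82 bits per byte at 3.5 bytes/token
print(ppl, bpb)
```

pass@k is covered separately under Code Generation Evaluation below.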
Usage
```python
import mlx.core as mx
from lmxlab.models.base import LanguageModel
from lmxlab.models.gpt import gpt_tiny
from lmxlab.eval import perplexity, bits_per_byte, pass_at_k

model = LanguageModel(gpt_tiny())
mx.eval(model.parameters())

# Token IDs, shape (batch, seq_len); the metrics take a list of such arrays.
tokens = mx.array([[1, 2, 3, 4, 5, 6, 7, 8]])

# Perplexity on a sequence
ppl = perplexity(model, [tokens])
print(f"Perplexity: {ppl:.1f}")

# Bits-per-byte (tokenizer-independent)
bpb = bits_per_byte(model, [tokens], bytes_per_token=3.5)
print(f"BPB: {bpb:.3f}")

# pass@k for code generation
# If 3 out of 10 samples pass, estimate pass@1
p1 = pass_at_k(n=10, c=3, k=1)
print(f"pass@1: {p1:.3f}")
```
Metrics
lmxlab.eval.metrics.perplexity(model, data)
Compute perplexity over a dataset.
PPL = exp(average cross-entropy loss)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | LanguageModel | Language model. | required |
| data | list[array] | List of token ID arrays, each (batch, seq_len). | required |

Returns:

| Type | Description |
|---|---|
| float | Perplexity score (lower is better). |
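As a rough mental model of what this metric computes (a minimal sketch, not the lmxlab source, assuming the model maps (batch, seq_len) token IDs to (batch, seq_len, vocab) logits and that each position predicts the following token):

```python
import math

import mlx.nn as nn

def perplexity_sketch(model, data):
    """exp of the average next-token cross-entropy over all sequences in data."""
    total_loss, total_tokens = 0.0, 0
    for tokens in data:                      # each array: (batch, seq_len)
        logits = model(tokens)               # assumed shape: (batch, seq_len, vocab)
        # Positions 0..T-2 predict tokens 1..T-1.
        loss = nn.losses.cross_entropy(
            logits[:, :-1, :], tokens[:, 1:], reduction="sum"
        )
        total_loss += loss.item()
        total_tokens += tokens[:, 1:].size
    return math.exp(total_loss / total_tokens)
```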
lmxlab.eval.metrics.bits_per_byte(model, data, bytes_per_token=1.0)
Compute bits-per-byte (BPB).
BPB = (cross-entropy in nats) / (ln(2) * bytes_per_token)
For character-level tokenizers, bytes_per_token ≈ 1.0. For BPE tokenizers, estimate from data.
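One way to get that estimate is to divide UTF-8 byte count by token count over a sample of the evaluation text. A minimal sketch, where encode stands in for whatever tokenizer function you use (hypothetical here, not part of lmxlab):

```python
def estimate_bytes_per_token(texts, encode):
    """Average UTF-8 bytes per token over a sample corpus (hypothetical helper)."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_bytes / total_tokens

# e.g. bpt = estimate_bytes_per_token(sample_texts, tokenizer.encode)
```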
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | LanguageModel | Language model. | required |
| data | list[array] | List of token ID arrays. | required |
| bytes_per_token | float | Average bytes per token. | 1.0 |

Returns:

| Type | Description |
|---|---|
| float | BPB score (lower is better). |
Code Generation Evaluation
lmxlab.eval.metrics.pass_at_k(n, c, k)
Compute pass@k metric (Chen et al., 2021, arXiv:2107.03374).
Estimates the probability that at least one of k samples passes a given test, given that c of n total samples pass. Uses the unbiased estimator from the Codex paper.
pass@k = 1 - C(n-c, k) / C(n, k)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Total number of generated samples. | required |
| c | int | Number of samples that pass the test. | required |
| k | int | Number of samples to consider. | required |

Returns:

| Type | Description |
|---|---|
| float | pass@k probability in [0, 1]. |
Example:

```python
# 10 samples generated, 3 pass the test
p1 = pass_at_k(n=10, c=3, k=1)  # ~0.300
p5 = pass_at_k(n=10, c=3, k=5)  # ~0.917
```
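The formula above translates directly into code; a minimal sketch (not the lmxlab source), including the edge case where fewer than k samples fail:

```python
from math import comb

def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k_sketch(10, 3, 5)  # ~0.917, matching the example above
```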
lmxlab.eval.metrics.evaluate_pass_at_k(completions, test_fn, k_values=None)
Evaluate pass@k over multiple problems.
For each problem, counts how many of its completions pass using the provided test function, then averages pass@k over all problems.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| completions | list[list[str]] | List of problems, each a list of N completion strings. | required |
| test_fn | Callable[[str], bool] | Function that returns True if a completion is correct. | required |
| k_values | list[int] \| None | Values of k to evaluate. Default: [1, 5, 10]. | None |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | Dict mapping 'pass@k' to the average score across problems. |
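A usage sketch with toy data (the completion strings and the syntax-only test function are invented for illustration):

```python
from lmxlab.eval.metrics import evaluate_pass_at_k

# Two toy problems, four completion strings each.
completions = [
    ["def solve(): return 1", "def solve(: return 1", "def solve(): return 2", "oops("],
    ["print(1)", "print(2", "print(3)", "print(4)"],
]

def test_fn(code: str) -> bool:
    """Count a completion as passing if it is at least syntactically valid Python."""
    try:
        compile(code, "<completion>", "exec")
        return True
    except SyntaxError:
        return False

scores = evaluate_pass_at_k(completions, test_fn, k_values=[1, 2, 4])
print(scores)  # e.g. {'pass@1': ..., 'pass@2': ..., 'pass@4': ...}
```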