# Inference
lmxlab provides generation utilities ranging from simple greedy decoding to advanced strategies like speculative decoding and best-of-N sampling.
## Basic Generation
The `generate` function handles autoregressive generation with KV caching:
```python
import mlx.core as mx
from lmxlab.models import LanguageModel, generate
from lmxlab.models.llama import llama_tiny

model = LanguageModel(llama_tiny())
mx.eval(model.parameters())

prompt = mx.array([[1, 2, 3]])
output = generate(model, prompt, max_tokens=20, temperature=0.0)
# output shape: (1, 23) -- prompt + generated
```
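Conceptually, `generate` runs an autoregressive loop: run the model on the sequence so far, pick the next token, append it, and repeat, with the KV cache avoiding recomputation of earlier positions. A minimal pure-Python sketch of the greedy case (`temperature=0.0`), where a toy `next_token_logits` stands in for the model forward pass:

```python
# Toy "model": always favors (last_token + 1) mod 10, so the sketch runs
# without a real model. In practice this is a model forward pass.
def next_token_logits(tokens):
    target = (tokens[-1] + 1) % 10
    return [1.0 if t == target else 0.0 for t in range(10)]

def greedy_generate(prompt, max_tokens, stop_tokens=()):
    tokens = list(prompt)
    for _ in range(max_tokens):
        logits = next_token_logits(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id in stop_tokens:
            break
    return tokens

print(greedy_generate([1, 2, 3], max_tokens=20))  # 23 tokens: prompt + 20
```

The returned sequence includes the prompt, which is why the example above has shape `(1, 23)` for a 3-token prompt and `max_tokens=20`.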
## Sampling Parameters
| Parameter | Default | Description |
|---|---|---|
| `temperature` | 1.0 | Controls randomness (0 = greedy, higher = more random) |
| `top_k` | 0 | Restrict sampling to the top-k tokens (0 = disabled) |
| `top_p` | 1.0 | Nucleus sampling threshold (< 1.0 to enable) |
| `repetition_penalty` | 1.0 | Penalize repeated tokens (> 1.0 to enable) |
| `stop_tokens` | None | Token IDs that halt generation |
| `max_tokens` | 100 | Maximum number of new tokens to generate |
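These parameters compose as filters over the model's next-token logits: temperature rescales them, top-k keeps only the k most likely candidates, and top-p keeps the smallest set whose cumulative probability reaches the threshold. A pure-Python sketch of that pipeline (illustrative only, not lmxlab's actual implementation):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    # temperature == 0 means greedy: just take the argmax.
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temperature for l in logits]
    # Candidates sorted by descending score.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the kept candidates.
    m = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - m) for i in order]
    probs = [e / sum(exps) for e in exps]
    if top_p < 1.0:
        # Keep the smallest prefix whose cumulative probability >= top_p.
        cum, cut = 0.0, len(order)
        for j, p in enumerate(probs):
            cum += p
            if cum >= top_p:
                cut = j + 1
                break
        order, probs = order[:cut], probs[:cut]
        total = sum(probs)
        probs = [p / total for p in probs]  # renormalize
    return rng.choices(order, weights=probs, k=1)[0]
```

With `top_k=1` or a very low `top_p`, sampling collapses to greedy decoding, which is why those settings produce deterministic output.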
### Examples
```python
# Creative writing: high temperature + nucleus sampling
output = generate(
    model, prompt, max_tokens=200,
    temperature=0.9, top_p=0.95,
)

# Focused completion: low temperature + top-k
output = generate(
    model, prompt, max_tokens=50,
    temperature=0.3, top_k=10,
)

# Chat: stop at EOS token, discourage repetition
output = generate(
    model, prompt, max_tokens=512,
    temperature=0.7, stop_tokens=[2],
    repetition_penalty=1.1,
)
```
## Streaming Generation
For interactive applications, `stream_generate` yields tokens one at a time:
```python
from lmxlab.models import stream_generate

for token_id in stream_generate(
    model, prompt, max_tokens=100,
    temperature=0.8, stop_tokens=[2],
):
    # Process each token as it's produced
    print(token_id, end=' ', flush=True)
```
This is useful for real-time display, where tokens are shown as they are generated rather than after the full sequence completes.
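Under the hood, a streaming generator is the same decoding loop with `yield` in place of accumulating the output. A sketch (a hypothetical `stream_greedy` helper, not lmxlab's implementation, with a toy logits function standing in for the model):

```python
def stream_greedy(next_token_logits, prompt, max_tokens, stop_tokens=()):
    tokens = list(prompt)
    for _ in range(max_tokens):
        logits = next_token_logits(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        yield next_id            # caller sees the token immediately
        if next_id in stop_tokens:
            return

# Toy model: always predicts (last_token + 1) mod 5.
toy = lambda toks: [1.0 if t == (toks[-1] + 1) % 5 else 0.0 for t in range(5)]
print(list(stream_greedy(toy, [0], max_tokens=10, stop_tokens=(2,))))  # [1, 2]
```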
## Best-of-N Sampling
Generate multiple candidates and select the highest-scoring one:
```python
from lmxlab.inference import best_of_n

# Generate 8 candidates, return the best by log probability
best = best_of_n(
    model, prompt, n=8,
    max_tokens=100, temperature=0.8,
)

# Length-normalized scoring (fairer for varying lengths)
best = best_of_n(
    model, prompt, n=8,
    max_tokens=100, temperature=0.8,
    score_fn='length_normalized',
)
```
Each candidate is scored by its total log probability under the model, optionally normalized by length. Selecting the highest-scoring candidate tends to improve generation quality over single-sample decoding.
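The scoring above can be sketched in a few lines, assuming per-token log probabilities are available (an illustrative `sequence_score` helper, not lmxlab's API). Note how raw total log probability favors the shorter candidate, while length normalization flips the choice:

```python
import math

def sequence_score(token_logprobs, length_normalized=False):
    # Total log probability of the generated tokens; optionally divided
    # by length so longer candidates aren't penalized just for length.
    total = sum(token_logprobs)
    return total / len(token_logprobs) if length_normalized else total

candidates = {
    "short": [math.log(0.9), math.log(0.8)],   # 2 tokens
    "long":  [math.log(0.9)] * 6,              # 6 confident tokens
}
best = max(candidates, key=lambda k: sequence_score(candidates[k]))
best_norm = max(candidates,
                key=lambda k: sequence_score(candidates[k], True))
print(best, best_norm)  # short long
```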
## Majority Vote
For tasks with discrete answers (math, classification, code), generate multiple completions and count the most common:
```python
from lmxlab.inference import majority_vote

results = majority_vote(
    model, prompt, n=10,
    max_tokens=20, temperature=0.8,
)

# Returns: [(token_list, count), ...] sorted by count
most_common, count = results[0]
print(f'Most common answer ({count}/10): {most_common}')
```
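The counting itself amounts to `collections.Counter` over hashable completions; a sketch (a hypothetical helper, not lmxlab's implementation):

```python
from collections import Counter

def majority_vote_sketch(completions):
    # completions: lists of token IDs; tuples make them hashable.
    counts = Counter(tuple(c) for c in completions)
    return counts.most_common()  # [(tokens, count), ...] sorted by count

samples = [[7, 2], [7, 2], [4, 2], [7, 2], [4, 2]]
results = majority_vote_sketch(samples)
most_common, count = results[0]
print(most_common, count)  # (7, 2) 3
```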
## Speculative Decoding
Use a small draft model to propose tokens, which a larger target model then verifies in a single forward pass. This can speed up inference when the draft model is much faster and agrees with the target most of the time.
```python
from lmxlab.inference import speculative_decode
from lmxlab.models.gpt import gpt_tiny, gpt_config

# Small draft model
draft = LanguageModel(gpt_tiny())
mx.eval(draft.parameters())

# Larger target model
target = LanguageModel(gpt_config(d_model=256, n_heads=4, n_layers=4))
mx.eval(target.parameters())

output, stats = speculative_decode(
    target, draft, prompt,
    max_tokens=50, draft_tokens=4,
)
print(f'Acceptance rate: {stats["acceptance_rate"]:.1%}')
```
### Unified Memory Advantage
On Apple Silicon, both models share unified memory, so there is no data transfer overhead between CPU and GPU. The draft and target models coexist in the same memory pool without copies.
### How It Works
- The small model generates `draft_tokens` candidate tokens.
- The target model processes the full sequence (prompt + drafts) in one forward pass.
- Each drafted token is compared against the target model's distribution.
- Matching tokens are accepted and retained.
- On the first mismatch, the target model's token is used and the remaining drafts are discarded.
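The accept/reject step can be sketched for the simplest greedy case, where a draft token counts as matching if it equals the target model's argmax at that position (a simplified illustration; verification against sampled distributions uses a probabilistic acceptance rule instead):

```python
def verify_drafts(draft_ids, target_argmax_ids):
    # target_argmax_ids[i] is the target model's argmax at each draft
    # position, plus one extra position after the last draft.
    accepted = []
    for d, t in zip(draft_ids, target_argmax_ids):
        if d == t:
            accepted.append(d)            # draft matches: keep it
        else:
            accepted.append(t)            # first mismatch: use target's token
            return accepted, False        # remaining drafts are discarded
    # All drafts accepted; the target's extra prediction comes free.
    accepted.append(target_argmax_ids[len(draft_ids)])
    return accepted, True

print(verify_drafts([5, 6, 7], [5, 6, 9, 0]))  # ([5, 6, 9], False)
print(verify_drafts([5, 6, 7], [5, 6, 7, 8]))  # ([5, 6, 7, 8], True)
```

Even a full rejection yields one new token (the target's own prediction), so progress is never slower than one token per target pass.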
When the acceptance rate is high, speculative decoding generates multiple tokens per target model forward pass, providing a wall-clock speedup roughly proportional to `draft_tokens * acceptance_rate` (e.g. with `draft_tokens=4` and a 75% acceptance rate, about 3 accepted tokens per target pass).
## Evaluation Metrics
After generation, evaluate quality with built-in metrics:
```python
from lmxlab.eval import perplexity, bits_per_byte

# Perplexity on evaluation data
ppl = perplexity(model, eval_batches)

# Bits per byte (comparable across tokenizers)
bpb = bits_per_byte(model, eval_batches, bytes_per_token=3.5)
```
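Both metrics derive from the mean per-token negative log-likelihood: perplexity exponentiates it, and bits per byte converts nats per token into bits, then divides by the average bytes each token covers. A sketch of the arithmetic (hypothetical helpers, NLLs in nats):

```python
import math

def perplexity_from_nll(token_nlls):
    # perplexity = exp(mean NLL per token)
    return math.exp(sum(token_nlls) / len(token_nlls))

def bits_per_byte_from_nll(token_nlls, bytes_per_token):
    # nats/token -> bits/token (divide by ln 2) -> bits/byte
    mean_bits = sum(token_nlls) / len(token_nlls) / math.log(2)
    return mean_bits / bytes_per_token

print(perplexity_from_nll([math.log(4)] * 3))          # ~4.0
print(bits_per_byte_from_nll([math.log(2)] * 3, 1.0))  # ~1.0
```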
For code generation tasks, use pass@k:
```python
from lmxlab.eval import pass_at_k, evaluate_pass_at_k

# Single problem: 3 correct out of 10 samples
p = pass_at_k(n=10, c=3, k=1)  # 0.3

# Evaluate across problems with a test function
results = evaluate_pass_at_k(
    completions,        # list of list of code strings
    test_fn=run_tests,  # returns True if code passes
    k_values=[1, 5, 10],
)
print(results)  # {'pass@1': 0.42, 'pass@5': 0.78, 'pass@10': 0.91}
```
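The value returned for a single problem matches the standard unbiased pass@k estimator, `1 - C(n-c, k) / C(n, k)`: the probability that at least one of k samples drawn from n total (c of them correct) passes. A self-contained sketch of the arithmetic:

```python
from math import comb

def pass_at_k_estimate(n, c, k):
    # 1 - C(n-c, k) / C(n, k): chance that k draws without replacement
    # from n samples include at least one of the c correct ones.
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k_estimate(10, 3, 1))  # ~0.3, matching the example above
```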