# Data Pipeline
The lmxlab data pipeline passes raw text through a tokenizer, wraps the result in a dataset, and yields batches for training.
## Tokenizers
All tokenizers implement the `Tokenizer` protocol: `encode()`, `decode()`, and `vocab_size`. This makes them interchangeable.
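In `typing.Protocol` terms, the interface looks roughly like this (a sketch reconstructed from the three members named above, not the exact source definition):

```python
from typing import Protocol

class Tokenizer(Protocol):
    @property
    def vocab_size(self) -> int: ...  # number of distinct token ids

    def encode(self, text: str) -> list[int]: ...  # text -> token ids
    def decode(self, tokens: list[int]) -> str: ...  # token ids -> text
```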
### CharTokenizer
Character-level tokenizer for testing and small experiments:
```python
from lmxlab.data import CharTokenizer

# Build the vocabulary from your text
text = open('data/shakespeare.txt').read()
tok = CharTokenizer(text)
print(f'Vocab size: {tok.vocab_size}')  # ~65 for Shakespeare

ids = tok.encode('To be or not to be')
print(tok.decode(ids))  # 'To be or not to be'
```
A default ASCII character set (no input text required) is also available, presumably via the no-argument constructor:
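```python
# Assumption: CharTokenizer() with no text falls back to a printable-ASCII vocabulary
tok = CharTokenizer()
print(tok.vocab_size)  # ~95 printable ASCII characters
```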
### TiktokenTokenizer
BPE tokenizer using OpenAI's tiktoken (requires `pip install tiktoken`):
```python
from lmxlab.data import TiktokenTokenizer

# GPT-2 encoding (50,257 tokens, a good default)
tok = TiktokenTokenizer('gpt2')
ids = tok.encode('Hello, world!')
print(ids)  # [15496, 11, 995, 0]

# GPT-4 encoding (larger vocabulary)
tok = TiktokenTokenizer('cl100k_base')
```
### HFTokenizer
Wraps a HuggingFace `AutoTokenizer` (requires `pip install transformers`):
```python
from lmxlab.data import HFTokenizer

# Use the tokenizer from a pretrained model
tok = HFTokenizer('meta-llama/Llama-3.2-1B')
ids = tok.encode('Hello, world!')
print(tok.decode(ids))

# Access special tokens
print(f'EOS: {tok.eos_token_id}, BOS: {tok.bos_token_id}')
```
Use this tokenizer when working with models loaded via `load_from_hf()`, so the token ids match the model's embedding table.
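For example (a sketch; the import path for `load_from_hf` is assumed here):

```python
from lmxlab.data import HFTokenizer
from lmxlab.models import load_from_hf  # assumed import path

repo = 'meta-llama/Llama-3.2-1B'
model = load_from_hf(repo)  # pretrained weights
tok = HFTokenizer(repo)     # matching tokenizer for the same repo
```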
### Custom Tokenizers
Any object implementing the `Tokenizer` protocol works:
```python
from lmxlab.data import Tokenizer

class MyTokenizer:
    @property
    def vocab_size(self) -> int:
        return 1000

    def encode(self, text: str) -> list[int]:
        ...

    def decode(self, tokens: list[int]) -> str:
        ...
```
## Datasets
### TextDataset
Takes raw text and a tokenizer and creates sliding windows of (input, target) pairs:
```python
from lmxlab.data import TextDataset, CharTokenizer

text = open('data/shakespeare.txt').read()
tok = CharTokenizer(text)
dataset = TextDataset(text, tok, seq_len=128)
print(f'{len(dataset)} training windows')

# Get a single (input, target) pair
x, y = dataset[0]
# x: tokens[0:128], y: tokens[1:129]
```
The target is the input shifted by one position (standard next-token prediction).
### TokenDataset
For pre-tokenized data (token IDs already available):
```python
import mlx.core as mx
from lmxlab.data import TokenDataset

tokens = mx.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
dataset = TokenDataset(tokens, seq_len=4)

x, y = dataset[0]
# x: [1, 2, 3, 4], y: [2, 3, 4, 5]
```
### HFDataset
Load data directly from HuggingFace datasets (requires `pip install datasets`):
```python
from lmxlab.data import HFDataset, HFTokenizer

# Use a HuggingFace tokenizer with a HuggingFace dataset
tok = HFTokenizer('meta-llama/Llama-3.2-1B')
ds = HFDataset('wikitext', tok, seq_len=128, config_name='wikitext-2-raw-v1')

# Stream batches for training
for inputs, targets in ds.batch_iterator(batch_size=8, max_batches=100):
    # inputs shape: (8, 128)
    # targets shape: (8, 128)
    logits, _ = model(inputs)
    ...
```
`HFDataset` supports streaming mode for large datasets that don't fit in memory. A sketch, assuming the flag is forwarded to the underlying `datasets` library:
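```python
# Assumption: streaming=True is passed through to datasets.load_dataset,
# so shards are fetched lazily instead of materializing the whole dataset
ds = HFDataset('wikitext', tok, seq_len=128,
               config_name='wikitext-2-raw-v1', streaming=True)
```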
Individual tokens can also be iterated. A sketch, assuming the dataset object itself is iterable over token ids:
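```python
# Assumption: iterating an HFDataset yields token ids one at a time
for token_id in ds:
    ...
```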
## Batching
`batch_iterator` creates non-overlapping windows from a flat token array and yields shuffled batches:
```python
import mlx.core as mx
from lmxlab.data import batch_iterator, CharTokenizer

text = open('data/shakespeare.txt').read()
tok = CharTokenizer(text)
tokens = mx.array(tok.encode(text))

# Iterate through batches
for x, y in batch_iterator(tokens, batch_size=32, seq_len=128):
    # x shape: (32, 128)
    # y shape: (32, 128)
    logits, _ = model(x)
    ...
```
The iterator:

- Splits the token array into non-overlapping sequences of length `seq_len`
- Shuffles the sequences (if `shuffle=True`, the default)
- Yields batches of `batch_size` sequences
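In spirit, the behavior is roughly the following (a minimal sketch, not lmxlab's actual implementation):

```python
import random
import mlx.core as mx

def simple_batch_iterator(tokens, batch_size, seq_len, shuffle=True):
    # One extra token per window is needed for the shifted target
    n_windows = (tokens.shape[0] - 1) // seq_len
    starts = [i * seq_len for i in range(n_windows)]  # non-overlapping
    if shuffle:
        random.shuffle(starts)
    for i in range(0, n_windows - batch_size + 1, batch_size):
        batch = starts[i:i + batch_size]
        x = mx.stack([tokens[s:s + seq_len] for s in batch])
        y = mx.stack([tokens[s + 1:s + seq_len + 1] for s in batch])
        yield x, y
```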
**No DataLoader needed.** Unlike PyTorch, no `DataLoader` with `num_workers` is required. MLX uses unified memory, so the same arrays reside in both CPU and GPU address spaces; a plain Python iterator suffices.
## End-to-End Example
Putting it all together to train a small model:
```python
import mlx.core as mx
from lmxlab.data import CharTokenizer, batch_iterator
from lmxlab.models import LanguageModel
from lmxlab.models.gpt import gpt_config
from lmxlab.training import Trainer, TrainConfig

# 1. Load and tokenize text
text = open('data/shakespeare.txt').read()
tok = CharTokenizer(text)

# 2. Create a model matching the tokenizer vocab
config = gpt_config(
    vocab_size=tok.vocab_size,
    d_model=128, n_heads=4, n_layers=4,
)
model = LanguageModel(config)

# 3. Set up training
tokens = mx.array(tok.encode(text))
train_config = TrainConfig(
    learning_rate=1e-3,
    max_steps=500,
    batch_size=32,
)
trainer = Trainer(model, train_config)

# 4. Train
batches = batch_iterator(tokens, batch_size=32, seq_len=128)
history = trainer.train(batches)

# 5. Generate
from lmxlab.models import stream_generate

prompt = mx.array([tok.encode('HAMLET:\n')])
for token_id in stream_generate(
    model, prompt, max_tokens=200,
    temperature=0.8,
):
    print(tok.decode([token_id]), end='', flush=True)
```
## Choosing a Tokenizer
| Tokenizer | Vocab Size | Best For |
|---|---|---|
| `CharTokenizer` | ~65-95 | Testing, tiny experiments, debugging |
| `TiktokenTokenizer('gpt2')` | 50,257 | General text, GPT-style models |
| `TiktokenTokenizer('cl100k_base')` | 100,256 | Large models, multilingual |
| `HFTokenizer('repo-id')` | Varies | Pretrained HuggingFace models |
For real training, BPE tokenizers produce better results because they capture subword patterns. Use `HFTokenizer` when working with pretrained models from HuggingFace (loaded via `load_from_hf`). Character tokenizers are useful for fast iteration and testing.
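Because all tokenizers share the `Tokenizer` protocol, switching between them is a one-line change:

```python
from lmxlab.data import CharTokenizer, TiktokenTokenizer, TextDataset

text = open('data/shakespeare.txt').read()

tok = CharTokenizer(text)          # fast iteration while debugging
# tok = TiktokenTokenizer('gpt2')  # switch to BPE for real training

# Remember to rebuild the model with the new tok.vocab_size
dataset = TextDataset(text, tok, seq_len=128)
```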