# Data

Tokenizers, datasets, and batching utilities for feeding data to models.

## Overview

The data pipeline follows a simple flow: tokenize raw text, wrap the tokens in a dataset, and iterate over batches during training.

Three tokenizer implementations are provided:

- `CharTokenizer`: character-level tokenization (good for learning, no dependencies)
- `TiktokenTokenizer`: OpenAI's BPE tokenizer (GPT-2/GPT-4 vocabularies)
- `HFTokenizer`: wraps any HuggingFace `AutoTokenizer`
## Usage

```python
import mlx.core as mx

from lmxlab.data import CharTokenizer, TextDataset, batch_iterator

# Character-level tokenizer (the vocabulary must cover every
# character you later encode, or encode() raises KeyError)
sample = "the quick brown fox jumps over the lazy dog"
tok = CharTokenizer(sample)
ids = tok.encode("the fox")
print(tok.decode(ids))  # "the fox"

# Create dataset with next-token prediction targets
ds = TextDataset(sample, tok, seq_len=16)
x, y = ds[0]  # x = input tokens, y = shifted targets

# Batch iterator for training (repeat the sample to get enough tokens)
tokens = mx.array(tok.encode(sample * 100), dtype=mx.int32)
for x_batch, y_batch in batch_iterator(tokens, batch_size=4, seq_len=32):
    # x_batch.shape == (4, 32), y_batch.shape == (4, 32)
    pass
```
## Tokenizer

`lmxlab.data.tokenizer`

Tokenizer protocol and implementations.

### CharTokenizer

Character-level tokenizer.

Simple tokenizer that maps each unique character to an ID. Useful for testing and small-scale experiments.

Can be initialized with text directly, or created with the default ASCII printable characters (no args). Use `fit()` to rebuild the vocabulary from new text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str \| None` | Text to build vocabulary from. If None, uses ASCII printable characters (32-126). | `None` |
Source code in src/lmxlab/data/tokenizer.py
#### `vocab_size` *(property)*

Size of the vocabulary.
#### `decode(tokens)`

Decode token IDs back to text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[int]` | List of token IDs. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Decoded string. |
#### `encode(text)`

Encode text to character-level token IDs.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input string. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[int]` | List of token IDs. |

Raises:

| Type | Description |
|---|---|
| `KeyError` | If text contains unknown characters. |
#### `fit(text)`

Build vocabulary from text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Text to extract characters from. | *required* |
### HFTokenizer

HuggingFace tokenizer wrapper.

Wraps a HuggingFace `AutoTokenizer` for use with lmxlab. Use this when working with pretrained models loaded via `load_from_hf`.

Requires `transformers` to be installed:

    pip install transformers
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `repo_id` | `str` | HuggingFace model repo ID or local path (e.g., 'meta-llama/Llama-3.2-1B'). | *required* |
Example:

```python
>>> tok = HFTokenizer('meta-llama/Llama-3.2-1B')
>>> tok.encode('hello world')
[15339, 1917]
>>> tok.decode([15339, 1917])
'hello world'
```
#### `bos_token_id` *(property)*

Beginning-of-sequence token ID, if available.

#### `eos_token_id` *(property)*

End-of-sequence token ID, if available.

#### `vocab_size` *(property)*

Size of the vocabulary.
#### `decode(tokens)`

Decode token IDs back to text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[int]` | List of token IDs. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Decoded string. |
#### `encode(text)`

Encode text to token IDs.

Does not add special tokens (BOS/EOS) by default, so the output matches what the model expects for continuation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input string. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[int]` | List of token IDs. |
### TiktokenTokenizer

BPE tokenizer using OpenAI's tiktoken.

Wraps a tiktoken encoding for use with lmxlab. Supports any tiktoken encoding name (e.g. 'gpt2', 'cl100k_base', 'o200k_base').

Requires `tiktoken` to be installed:

    pip install tiktoken
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `encoding_name` | `str` | Name of the tiktoken encoding. Defaults to 'gpt2' (50257 tokens). | `'gpt2'` |
Example:

```python
>>> tok = TiktokenTokenizer('gpt2')
>>> tok.encode('hello world')
[31373, 995]
>>> tok.decode([31373, 995])
'hello world'
```
#### `vocab_size` *(property)*

Size of the vocabulary.
#### `decode(tokens)`

Decode BPE token IDs back to text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[int]` | List of token IDs. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Decoded string. |
#### `encode(text)`

Encode text to BPE token IDs.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input string. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[int]` | List of token IDs. |
### Tokenizer

Bases: `Protocol`

Protocol for tokenizers.

All tokenizers must implement encode/decode and expose their vocabulary size.
#### `vocab_size` *(property)*

Size of the vocabulary.
#### `decode(tokens)`

Decode token IDs to text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[int]` | List of token IDs. | *required* |

Returns:

| Type | Description |
|---|---|
| `str` | Decoded string. |
#### `encode(text)`

Encode text to token IDs.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input string. | *required* |

Returns:

| Type | Description |
|---|---|
| `list[int]` | List of token IDs. |
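Because `Tokenizer` is a `Protocol`, any class with a matching `encode`/`decode`/`vocab_size` satisfies it structurally; no inheritance is needed. A minimal sketch (the protocol body here is reconstructed from the documented interface, and `WordTokenizer` is an invented example, not part of lmxlab):

```python
from typing import Protocol


class Tokenizer(Protocol):
    """Structural interface: encode/decode plus a vocab_size property."""

    @property
    def vocab_size(self) -> int: ...
    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


class WordTokenizer:
    """Toy whitespace tokenizer; satisfies Tokenizer without subclassing it."""

    def __init__(self, text: str) -> None:
        words = sorted(set(text.split()))
        self.stoi = {w: i for i, w in enumerate(words)}
        self.itos = {i: w for w, i in self.stoi.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.stoi)

    def encode(self, text: str) -> list[int]:
        return [self.stoi[w] for w in text.split()]

    def decode(self, tokens: list[int]) -> str:
        return " ".join(self.itos[t] for t in tokens)


# Type-checks as a Tokenizer thanks to structural subtyping.
tok: Tokenizer = WordTokenizer("the quick brown fox")
assert tok.decode(tok.encode("quick fox")) == "quick fox"
```

Anything shaped like this, including a wrapper around an external library, can be passed wherever the datasets below expect a `Tokenizer`.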
## Datasets

### lmxlab.data.dataset.TextDataset

Dataset that tokenizes raw text.

Tokenizes text and stores it as a flat array of token IDs. Yields overlapping windows of (input, target) pairs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Raw text to tokenize. | *required* |
| `tokenizer` | `Tokenizer` | Tokenizer to use. | *required* |
| `seq_len` | `int` | Sequence length for training windows. | `128` |
Source code in src/lmxlab/data/dataset.py
Attributes:

- `seq_len = seq_len` *(instance attribute)*
- `tokenizer = tokenizer` *(instance attribute)*
- `tokens = mx.array(tokens, dtype=mx.int32)` *(instance attribute)*
#### `__getitem__(idx)`

Get an (input, target) pair at the given index.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Starting position in the token array. | *required* |

Returns:

| Type | Description |
|---|---|
| `tuple[array, array]` | Tuple of (input_tokens, target_tokens), each of shape (seq_len,). |
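The (input, target) windowing amounts to a one-token shift. A plain-Python sketch of the indexing, using lists in place of `mx.array` (this mirrors the documented behavior, not the library source; the function name is invented):

```python
def get_window(tokens: list[int], idx: int, seq_len: int) -> tuple[list[int], list[int]]:
    # Input is seq_len tokens starting at idx; target is the same
    # window shifted one position right (next-token prediction).
    x = tokens[idx : idx + seq_len]
    y = tokens[idx + 1 : idx + seq_len + 1]
    return x, y


tokens = list(range(10))
x, y = get_window(tokens, idx=0, seq_len=4)
assert x == [0, 1, 2, 3]
assert y == [1, 2, 3, 4]
```

Each position in `y` is the token the model should predict after seeing the corresponding prefix of `x`, which is exactly what a cross-entropy next-token loss consumes.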
#### `__init__(text, tokenizer, seq_len=128)`
### lmxlab.data.dataset.TokenDataset

Dataset from pre-tokenized data.

Wraps an existing array of token IDs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `array` | Array of token IDs. | *required* |
| `seq_len` | `int` | Sequence length for training windows. | `128` |
Attributes:

- `seq_len = seq_len` *(instance attribute)*
- `tokens = tokens` *(instance attribute)*
#### `__getitem__(idx)`

Get an (input, target) pair.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Starting position. | *required* |

Returns:

| Type | Description |
|---|---|
| `tuple[array, array]` | Tuple of (input_tokens, target_tokens). |
#### `__init__(tokens, seq_len=128)`
### lmxlab.data.dataset.HFDataset

Dataset backed by a HuggingFace dataset.

Streams or loads a HuggingFace dataset, tokenizes on the fly, and yields batches of (input, target) pairs.

Requires the `datasets` package (`pip install datasets`).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | HuggingFace dataset name. | *required* |
| `tokenizer` | `Tokenizer` | Tokenizer implementing the Tokenizer protocol. | *required* |
| `seq_len` | `int` | Sequence length for training windows. | `128` |
| `split` | `str` | Dataset split to use. | `'train'` |
| `text_field` | `str` | Name of the text column in the dataset. | `'text'` |
| `config_name` | `str \| None` | Optional dataset configuration name. | `None` |
| `streaming` | `bool` | Whether to stream the dataset. | `False` |
Attributes:

- `_dataset = load_dataset(name, config_name, split=split, streaming=streaming)` *(instance attribute)*
- `_streaming = streaming` *(instance attribute)*
- `seq_len = seq_len` *(instance attribute)*
- `text_field = text_field` *(instance attribute)*
- `tokenizer = tokenizer` *(instance attribute)*
#### `__init__(name, tokenizer, seq_len=128, split='train', text_field='text', config_name=None, streaming=False)`
#### `batch_iterator(batch_size=8, max_batches=None)`

Yield (input, target) batches from the dataset.

Accumulates tokens into a buffer and yields batches of shape (batch_size, seq_len).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int` | Number of sequences per batch. | `8` |
| `max_batches` | `int \| None` | Stop after this many batches. | `None` |

Yields:

| Type | Description |
|---|---|
| `tuple[array, array]` | Tuple of (inputs, targets), each of shape (batch_size, seq_len). |
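The buffering pattern described here can be sketched in plain Python: pull tokens from a stream, accumulate until a full batch plus one extra token (for the shifted targets) is available, then emit. The function name, the exact flush policy, and keeping the overlap token are assumptions for illustration, not lmxlab's actual code:

```python
from collections.abc import Iterator


def buffered_batches(
    token_stream: Iterator[int], batch_size: int, seq_len: int
) -> Iterator[tuple[list[list[int]], list[list[int]]]]:
    # Assumed policy: need one token beyond the batch so every
    # target window can shift right by one position.
    need = batch_size * seq_len + 1
    buffer: list[int] = []
    for tok in token_stream:
        buffer.append(tok)
        if len(buffer) >= need:
            xs = [buffer[i * seq_len : (i + 1) * seq_len] for i in range(batch_size)]
            ys = [buffer[i * seq_len + 1 : (i + 1) * seq_len + 1] for i in range(batch_size)]
            yield xs, ys
            # Keep the last token so the next batch's targets line up.
            buffer = buffer[batch_size * seq_len :]


batches = list(buffered_batches(iter(range(20)), batch_size=2, seq_len=4))
xs, ys = batches[0]
assert xs == [[0, 1, 2, 3], [4, 5, 6, 7]]
assert ys == [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Buffering like this is what makes streaming datasets usable: documents of uneven length are flattened into one token stream, and batches are cut from it at fixed strides.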
#### `token_iterator()`

Yield token IDs one at a time from the dataset.
## Batching

`lmxlab.data.batching`

Batch iterator for MLX training.

### `batch_iterator(tokens, batch_size, seq_len, shuffle=True)`

Yield batches of (input, target) pairs from a token array.

Creates non-overlapping windows from the token array, optionally shuffles them, and yields batches.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `array` | Flat array of token IDs. | *required* |
| `batch_size` | `int` | Number of sequences per batch. | *required* |
| `seq_len` | `int` | Length of each sequence. | *required* |
| `shuffle` | `bool` | Whether to shuffle windows each epoch. | `True` |
Yields:

| Type | Description |
|---|---|
| `tuple[array, array]` | Tuples of (input_batch, target_batch), each of shape (batch_size, seq_len). |
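The non-overlapping windowing and optional shuffling can be sketched as follows, using plain Python lists instead of `mx.array`. The seeding, shuffle mechanics, and drop-remainder behavior are assumptions based on the description above, not the library's source:

```python
import random


def batch_windows(tokens, batch_size, seq_len, shuffle=True, seed=0):
    # Non-overlapping window start positions; the "-1" leaves room
    # for the one-token-shifted targets at the end of the array.
    n_windows = (len(tokens) - 1) // seq_len
    starts = [i * seq_len for i in range(n_windows)]
    if shuffle:
        random.Random(seed).shuffle(starts)
    # Drop the trailing remainder that cannot fill a whole batch.
    for b in range(0, n_windows - n_windows % batch_size, batch_size):
        xs = [tokens[s : s + seq_len] for s in starts[b : b + batch_size]]
        ys = [tokens[s + 1 : s + seq_len + 1] for s in starts[b : b + batch_size]]
        yield xs, ys


tokens = list(range(101))  # 100 usable positions -> 25 windows of length 4
batches = list(batch_windows(tokens, batch_size=5, seq_len=4, shuffle=False))
assert len(batches) == 5
assert batches[0][0][0] == [0, 1, 2, 3]  # first input window
assert batches[0][1][0] == [1, 2, 3, 4]  # its shifted target
```

Shuffling the start positions rather than the tokens themselves keeps each window contiguous while still randomizing batch composition between epochs.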