OVERVIEW
A zero-dependency browser tool for visualizing how different language models tokenize text. Paste any string and see token boundaries, IDs, and fertility scores side-by-side across GPT-4, Claude, and Llama tokenizers.
MOTIVATION
When debugging prompt-engineering issues, token counts and boundaries matter enormously. Existing tools either require API calls or support only a single tokenizer; this one runs entirely client-side via WASM-compiled Rust tokenizers.
TECHNICAL APPROACH
Input text
│
▼
┌──────────────────────────────────┐
│ Tokenizer WASM modules (Rust) │
│ • tiktoken (GPT-4 / cl100k) │
│ • sentencepiece (Llama) │
│ • huggingface tokenizers │
└──────────────────────────────────┘
│
▼
Color-coded token spans + stats
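A minimal sketch of the last step above. The output shape ({ id, start, end } tokens with [start, end) character offsets) and the function name are assumptions for illustration, not the project's actual API:

```javascript
// Map tokenizer output (id + character offsets) to renderable spans.
// The { id, start, end } token shape is an assumed interface.
function toSpans(text, tokens, palette = ["#fde68a", "#bfdbfe", "#bbf7d0"]) {
  return tokens.map((tok, i) => ({
    id: tok.id,
    text: text.slice(tok.start, tok.end), // the substring this token covers
    color: palette[i % palette.length],   // cycle colors so adjacent spans differ
  }));
}
```

Rendering then reduces to emitting one highlighted element per span, so boundaries are visible even for tokens that begin with whitespace.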
Each tokenizer is compiled to WebAssembly and loaded lazily: the first load fetches ~400 KB over the network, and subsequently loaded tokenizers are served from an IndexedDB cache.
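The load-then-cache flow can be sketched as follows. The function name and the pluggable cache interface are hypothetical (in the browser the cache would be backed by IndexedDB; any async get/put store works, which also keeps the logic testable):

```javascript
// Fetch WASM bytes on first use, then serve repeat loads from the cache.
// `cache` is any object with async get(key)/put(key, bytes); in the real
// app this would wrap IndexedDB. `fetchBytes` is the network fetch.
async function loadTokenizerBytes(name, url, cache, fetchBytes) {
  const hit = await cache.get(name);
  if (hit) return hit;                  // cache hit: no network round trip
  const bytes = await fetchBytes(url);  // first load: ~400 KB download
  await cache.put(name, bytes);         // persist for subsequent sessions
  return bytes;
}
```

The returned bytes would then be passed to WebAssembly instantiation; keeping fetch and instantiation separate is what makes the IndexedDB caching straightforward.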
FEATURES
- Side-by-side comparison of up to 4 tokenizers simultaneously
- Token fertility ratio (tokens per word)
- Special token highlighting
- Copy token IDs as JSON array
- Shareable URL state
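The fertility ratio listed above can be computed directly from any tokenizer's output; a minimal sketch, where whitespace word-splitting is a simplifying assumption:

```javascript
// Fertility = tokens per word. Higher values mean the tokenizer fragments
// this text more (common for non-English or code-heavy input).
// Splitting words on whitespace is a simplification for illustration.
function fertility(text, tokenCount) {
  const words = text.split(/\s+/).filter(Boolean).length;
  return words === 0 ? 0 : tokenCount / words;
}
```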