A Triadic Suffix Tokenization Scheme for Numerical Reasoning

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

Triadic Suffix Tokenization (TST) is a novel, deterministic scheme designed to improve large language models' (LLMs) numerical reasoning by addressing inconsistent number fragmentation in standard subword tokenization. TST partitions digits into three-digit triads and annotates each with an explicit magnitude marker, providing a consistent gradient signal for numerical relationships. For integers, suffixes like 'k' for thousands or 'm' for millions are used, while fractional parts use replicated 'p' markers (e.g., 'p', 'pp') to denote decimal depth. The scheme supports a range of $10^{-15}$ to $10^{18}$ (33 orders of magnitude) and offers two implementation variants: a vocabulary-based approach adding up to 10,000 fixed tokens, or a suffix-marker approach using a small set of special tokens dynamically. TST is architecture-agnostic, preserving exact digits and making order-of-magnitude relationships transparent at the token level, and can be integrated as a preprocessing step.

Key takeaway

For research scientists developing or fine-tuning LLMs for numerical tasks, implementing Triadic Suffix Tokenization (TST) as a preprocessing step could significantly enhance arithmetic and scientific reasoning. TST's explicit magnitude encoding and deterministic fractional representation offer a stronger inductive bias, potentially leading to faster, more stable convergence and reduced inference errors compared to standard tokenization methods. Consider evaluating TST on benchmarks like NumericBench to validate its impact on your model's numerical capabilities.

Key insights

Triadic Suffix Tokenization improves LLM numerical reasoning by explicitly encoding magnitude and decimal depth into number tokens.

Principles

Group digits into triads.
Annotate each triad with explicit magnitude markers.
Preserve exact digits.

Method

TST groups digits into three-digit triads, annotating integer triads with magnitude suffixes (e.g., 'k', 'm') and fractional triads with replicated 'p' markers, right-padding to three digits for canonical representation.

In practice

Integrate TST as a drop-in preprocessing step.
Combine TST with Number Token Loss (NTL) for synergistic benefits.
Extend TST range by adding new suffix tokens.

Topics

Triadic Suffix Tokenization
Numerical Reasoning
Large Language Models
Subword Tokenization
Magnitude Annotation

Code references

olgachetverina/triadic-suffix-tokenization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.