TOTEN: Knowledge-Based Ontological Tokenization of Physical Quantities and Technical Notation in Brazilian Portuguese

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Knowledge Representation & Ontological Engineering, Data Science & Analytics · Depth: Expert, extended

Summary

TOTEN is a knowledge-based ontological tokenization framework for Brazilian Portuguese that addresses the semantic blindness of statistical tokenizers such as Byte-Pair Encoding (BPE). It prevents the fragmentation of structured technical entities such as physical quantities, numbers, units, and symbolic expressions. The system formalizes tokenization as a declarative classification grounded in a formal ontology of engineering entities (OEE) and relies on external oracles: Pint for dimensional analysis, the Unicode Character Database (UCD) for typography, and RSLP for Portuguese morphology. In an intrinsic evaluation on an internal benchmark (EngQuant, N=800) and four external Brazilian Portuguese corpora (N=1 771 cases), TOTEN outperformed the baselines. It achieved unit ontological atomicity in all contrasts and numerical reconstruction scores of 0.775 to 0.904 on the external corpora, compared with 0.627 to 0.703 for the best baseline, Quantulum3. On the internal benchmark, TOTEN scored 0.780 against Quantulum3's 0.340, with statistically significant differences in atomicity and reconstruction.

Key takeaway

For NLP engineers developing models for technical Brazilian Portuguese: statistical tokenization fragments quantities and technical notation, which destroys their semantic structure. Consider integrating knowledge-based ontological tokenization such as TOTEN to keep technical entities intact. In the reported intrinsic evaluation, this approach improves numerical reconstruction and atomicity, which gives downstream models a more robust input representation and may improve their performance on specialized texts.

Key insights

TOTEN offers knowledge-based ontological tokenization for technical text, preserving semantic structure of physical quantities and notation in Brazilian Portuguese.

Method

TOTEN classifies raw text into typed regions using a formal ontology (OEE) and external oracles (Pint, UCD, RSLP), then instantiates a structured representation for each type.
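
The classify-then-instantiate idea can be illustrated with a minimal Python sketch. It is not TOTEN's implementation: the `Region` type, the `classify` function, the region labels, and the regular expressions are all hypothetical, and the small `KNOWN_UNITS` table is only a stand-in for a unit oracle such as Pint. The Unicode lookup uses the standard-library `unicodedata` module, which exposes the Unicode Character Database. RSLP-based morphology is left out.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import List, Optional

# ASSUMPTION: a tiny hand-written symbol table standing in for a unit oracle
# such as Pint. A real system would query the oracle instead.
KNOWN_UNITS = {"m", "km", "kg", "s", "h", "N", "Pa", "kPa", "MPa", "V", "W", "°C"}

# Brazilian Portuguese numerals: "." groups thousands, "," marks decimals.
NUMBER = r"\d+(?:\.\d{3})*(?:,\d+)?"
UNIT = r"°?[A-Za-z]+(?:/[A-Za-z]+)*"
QUANTITY_RE = re.compile(rf"({NUMBER})\s?({UNIT})")
NUMBER_RE = re.compile(NUMBER)
GENERIC_RE = re.compile(r"\w+|\S")


@dataclass
class Region:
    kind: str                      # hypothetical labels: QUANTITY, NUMBER, WORD, SYMBOL
    text: str                      # surface form, kept as one atomic token
    value: Optional[float] = None  # parsed magnitude, when there is one
    unit: Optional[str] = None     # unit expression, when there is one


def parse_br_number(text: str) -> float:
    """Convert '1.250,5' to 1250.5."""
    return float(text.replace(".", "").replace(",", "."))


def is_known_unit(expr: str) -> bool:
    """Accept a unit expression only if every '/'-separated part is known."""
    return all(part in KNOWN_UNITS for part in expr.split("/"))


def classify(text: str) -> List[Region]:
    """Split text into typed regions, trying the most structured type first."""
    regions: List[Region] = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = QUANTITY_RE.match(text, pos)
        if m and is_known_unit(m.group(2)):
            regions.append(Region("QUANTITY", m.group(0),
                                  parse_br_number(m.group(1)), m.group(2)))
            pos = m.end()
            continue
        m = NUMBER_RE.match(text, pos)
        if m:
            regions.append(Region("NUMBER", m.group(0), parse_br_number(m.group(0))))
            pos = m.end()
            continue
        m = GENERIC_RE.match(text, pos)
        # Unicode general category: L* and N* count as words, the rest as symbols.
        major = unicodedata.category(m.group(0)[0])[0]
        regions.append(Region("WORD" if major in "LN" else "SYMBOL", m.group(0)))
        pos = m.end()
    return regions


if __name__ == "__main__":
    for region in classify("A bomba opera a 12,5 kPa e 1.250 kg/h."):
        print(region)
```

In this sketch, "12,5 kPa" and "1.250 kg/h" each come out as a single `QUANTITY` region carrying a parsed value and a unit, rather than being split at the comma, the dot, or the slash. The ordering (quantity, then bare number, then generic token) is a design choice of the sketch, so that the most structured interpretation wins.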

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.