TOTEN: Knowledge-Based Ontological Tokenization of Physical Quantities and Technical Notation in Brazilian Portuguese

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

TOTEN is a knowledge-based ontological tokenization framework for physical quantities and technical notation in Brazilian Portuguese. It addresses the semantic blindness of Byte-Pair Encoding by replacing statistical derivation with declarative classification. The framework is formalized as a triple: an ontology of engineering entities (OEE), a classification function, and an instantiator family. It aims for robustness through deterministic coupling with external oracles such as Pint (dimensional), the Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation on the EngQuant benchmark (N=800) and four external Brazilian Portuguese corpora (N=1771) showed ontological atomicity of units and superior numerical reconstruction. TOTEN scored 0.775-0.904 on the external corpora versus 0.627-0.703 for the best baseline (Quantulum3), and 0.780 versus 0.340 on EngQuant; the differences were statistically significant.

Key takeaway

For NLP and machine learning engineers working with Brazilian Portuguese technical documents, TOTEN is an alternative to statistical tokenization worth evaluating. In the reported evaluation, its knowledge-based ontological approach handled physical quantities and technical notation more accurately than the baselines, with better numerical reconstruction and ontological atomicity of units. Consider integrating such a framework when your models process scientific or engineering texts.

Key insights

TOTEN offers knowledge-based ontological tokenization, outperforming statistical methods for technical entities in Brazilian Portuguese.

Principles

Statistical tokenization is semantically blind to structured technical entities.
Knowledge-based ontological tokenization improves accuracy for physical quantities.
Deterministic coupling with external oracles enhances robustness.
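
To make the idea of an external oracle concrete, here is a minimal sketch that consults the Unicode Character Database through Python's standard `unicodedata` module. The function name `split_exponent` and the decision to read superscript digits as a unit exponent are assumptions made for illustration; the summary does not describe how TOTEN actually queries the UCD.

```python
import unicodedata


def split_exponent(unit_text):
    """Split a unit symbol from superscript digits, asking the UCD what each
    character is instead of keeping a hand-written table (illustrative only)."""
    base = []
    exponent_digits = []
    for ch in unit_text:
        # For U+00B2 the UCD reports the decomposition "<super> 0032".
        parts = unicodedata.decomposition(ch).split()
        if len(parts) == 2 and parts[0] == "<super>":
            mapped = chr(int(parts[1], 16))
            if mapped.isdigit():
                exponent_digits.append(mapped)
                continue
        # Plain characters and non-digit superscripts (e.g. the ordinal
        # indicator in "1ª") stay in the base symbol.
        base.append(ch)
    exponent = int("".join(exponent_digits)) if exponent_digits else 1
    return "".join(base), exponent
```

Under these assumptions, `split_exponent("m²")` returns `("m", 2)`. The result depends only on the database shipped with the interpreter, which is the sense in which the coupling is deterministic.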

Method

TOTEN formalizes tokenization as a triple: an ontology (types, relations), a classification function (raw text to typed regions), and an instantiator family (structured representation).
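
The triple can be sketched in a few lines of Python. Everything named here (`ONTOLOGY`, `Region`, `classify`, `INSTANTIATORS`, `tokenize`, and the tiny unit list) is hypothetical and far smaller than the real OEE; the sketch only shows how the three parts fit together, not how TOTEN implements them.

```python
import re
from dataclasses import dataclass

# 1. Ontology: entity types and the attributes each one carries (toy version).
ONTOLOGY = {
    "QUANTITY": ("magnitude", "unit"),
    "TEXT": ("tokens",),
}


@dataclass
class Region:
    kind: str   # a key of ONTOLOGY
    text: str   # the raw span covered by this region


# Brazilian Portuguese writes decimals with a comma, e.g. "12,5 kN".
QUANTITY_RE = re.compile(r"(\d+(?:,\d+)?)\s*(kN|MPa|mm|kg|m)\b")


# 2. Classification function: raw text -> typed regions.
def classify(text):
    regions = []
    pos = 0
    for match in QUANTITY_RE.finditer(text):
        if match.start() > pos:
            regions.append(Region("TEXT", text[pos:match.start()]))
        regions.append(Region("QUANTITY", match.group(0)))
        pos = match.end()
    if pos < len(text):
        regions.append(Region("TEXT", text[pos:]))
    return regions


# 3. Instantiator family: one function per type, region -> structured value.
def instantiate_quantity(region):
    match = QUANTITY_RE.fullmatch(region.text)
    magnitude = float(match.group(1).replace(",", "."))
    return {"type": "QUANTITY", "magnitude": magnitude, "unit": match.group(2)}


def instantiate_text(region):
    return {"type": "TEXT", "tokens": region.text.split()}


INSTANTIATORS = {"QUANTITY": instantiate_quantity, "TEXT": instantiate_text}


def tokenize(text):
    return [INSTANTIATORS[region.kind](region) for region in classify(text)]
```

With this sketch, `tokenize("A viga suporta 12,5 kN de carga.")` yields three items, the middle one being `{"type": "QUANTITY", "magnitude": 12.5, "unit": "kN"}`. The quantity survives as a single typed unit whose value can be reconstructed, rather than being split into subword fragments.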

In practice

Improve NLP for Brazilian Portuguese technical texts.
Accurately process physical quantities and units.

Topics

Tokenization
Natural Language Processing
Brazilian Portuguese
Ontologies
Physical Quantities
Technical Notation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.