TOTEN: Knowledge-Based Ontological Tokenization of Physical Quantities and Technical Notation in Brazilian Portuguese
Summary
TOTEN is a knowledge-based ontological tokenization framework for physical quantities and technical notation in Brazilian Portuguese. It addresses the semantic blindness of Byte-Pair Encoding by replacing statistically derived token boundaries with declarative classification. The framework is formalized as a triple: an ontology of engineering entities (OEE), a classification function, and an instantiator family. For robustness, it couples deterministically with external oracles: Pint for dimensional analysis, the Unicode Character Database for typography, and RSLP for Portuguese morphology. Intrinsic evaluation on the EngQuant benchmark (N=800) and on four external Brazilian Portuguese corpora (N=1771) showed ontological atomicity of units and superior numerical reconstruction. TOTEN scored 0.775 to 0.904 on the external corpora versus 0.627 to 0.703 for the best baseline (Quantulum3), and 0.780 versus 0.340 on EngQuant; the differences were statistically significant.
Key takeaway
For NLP and machine learning engineers working with Brazilian Portuguese technical documents, TOTEN is an alternative to statistical tokenization worth evaluating. Its knowledge-based ontological approach handled physical quantities and technical notation more accurately than the baselines it was compared against, with better numerical reconstruction and atomic treatment of units. Consider a framework of this kind when your models process scientific or engineering text.
Key insights
TOTEN offers knowledge-based ontological tokenization, outperforming statistical methods for technical entities in Brazilian Portuguese.
Principles
- Statistical tokenization is semantically blind to structured technical entities.
- Knowledge-based ontological tokenization improves accuracy for physical quantities.
- Deterministic coupling with external oracles enhances robustness.
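As a rough illustration of the last principle, the sketch below uses Python's standard `unicodedata` module as a stand-in typographic oracle. It is not TOTEN's implementation: the helper name `canonical_unit_symbol` and the choice of NFKC normalization are assumptions made for this example. The point is only that the answer comes from a fixed external database rather than from corpus statistics, so the same input always yields the same result.

```python
import unicodedata


def canonical_unit_symbol(symbol: str) -> str:
    """Fold typographic variants of a unit symbol (hypothetical helper).

    NFKC normalization is an assumption of this sketch; it maps
    compatibility characters to their plain equivalents using the
    Unicode Character Database.
    """
    return unicodedata.normalize("NFKC", symbol)


# Two visually identical spellings of "micrometre":
micro_sign = "\u00b5m"   # MICRO SIGN + m
greek_mu = "\u03bcm"     # GREEK SMALL LETTER MU + m

# The database distinguishes the code points by name...
print(unicodedata.name(micro_sign[0]))
print(unicodedata.name(greek_mu[0]))

# ...and normalization folds both spellings into a single form.
same_unit = canonical_unit_symbol(micro_sign) == canonical_unit_symbol(greek_mu)
print(same_unit)
```

Note that NFKC also flattens superscripts (for example `m²` becomes `m2`), so a real system would need to handle exponents before or instead of this step.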
Method
TOTEN formalizes tokenization as a triple: an ontology (types, relations), a classification function (raw text to typed regions), and an instantiator family (structured representation).
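The triple can be pictured with a toy sketch. Everything named here (`ONTOLOGY`, `TypedRegion`, `Quantity`, `classify`, `INSTANTIATORS`, `tokenize`) is hypothetical and chosen for illustration; the summary does not describe TOTEN's actual data structures. A plain dictionary stands in for the ontology and for the dimensional oracle, and a regular expression stands in for the classification function. The number pattern assumes the Brazilian convention of `.` as thousands separator and `,` as decimal separator.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# 1) Ontology (toy): entity type -> unit symbol -> physical dimension.
ONTOLOGY: Dict[str, Dict[str, str]] = {
    "PhysicalQuantity": {"kN": "force", "MPa": "pressure", "mm": "length", "m": "length"},
}


@dataclass(frozen=True)
class TypedRegion:
    start: int
    end: int
    entity_type: str
    text: str


@dataclass(frozen=True)
class Quantity:
    magnitude: float
    unit: str
    dimension: str


# Longest unit symbols first, so "mm" is tried before "m".
_UNITS = "|".join(sorted(map(re.escape, ONTOLOGY["PhysicalQuantity"]), key=len, reverse=True))
# Brazilian number format: "1.250,5" (grouped) or "3,5" / "12" (ungrouped).
_NUMBER = r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?"
_PATTERN = re.compile(r"(?P<number>" + _NUMBER + r")\s?(?P<unit>" + _UNITS + r")\b")


# 2) Classification function: raw text -> typed regions.
def classify(text: str) -> List[TypedRegion]:
    return [
        TypedRegion(m.start(), m.end(), "PhysicalQuantity", m.group(0))
        for m in _PATTERN.finditer(text)
    ]


# 3) Instantiator family: one instantiator per entity type.
def instantiate_quantity(region: TypedRegion) -> Quantity:
    m = _PATTERN.fullmatch(region.text)
    if m is None:
        raise ValueError(f"not a physical quantity: {region.text!r}")
    # Drop thousands separators, then turn the decimal comma into a point.
    magnitude = float(m.group("number").replace(".", "").replace(",", "."))
    unit = m.group("unit")
    return Quantity(magnitude, unit, ONTOLOGY[region.entity_type][unit])


INSTANTIATORS: Dict[str, Callable[[TypedRegion], Quantity]] = {
    "PhysicalQuantity": instantiate_quantity,
}


def tokenize(text: str) -> List[Quantity]:
    return [INSTANTIATORS[r.entity_type](r) for r in classify(text)]


for quantity in tokenize("A carga de 1.250,5 kN atua sobre a viga de 3,5 m."):
    print(quantity)
```

Keeping the three parts separate means a new entity type only needs an ontology entry, a classification rule, and its own instantiator, without touching the others.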
In practice
- Improve NLP for Brazilian Portuguese technical texts.
- Accurately process physical quantities and units.
Topics
- Tokenization
- Natural Language Processing
- Brazilian Portuguese
- Ontologies
- Physical Quantities
- Technical Notation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.