propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

2026-02-07 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, extended

Summary

ellamind introduces propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) designed for multi-property document annotation, moving beyond single scalar quality scores. These Qwen-3-based models annotate text across 18 properties in six categories, supporting 57 languages and producing structured JSON outputs. The 4B model achieves an overall agreement score of 0.779 against Gemini-3-Pro, outperforming larger general-purpose models, and processes 27.0 documents per second on a single H100 GPU. The project also releases "propella-annotations," a dataset of over three billion document annotations covering major pretraining corpora such as FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC, under permissive commercial-use licenses. This multi-dimensional annotation reveals significant differences in quality, reasoning depth, and content composition across datasets and languages, which single-score methods cannot capture.

Key takeaway

For AI Scientists and NLP Engineers curating pretraining datasets, relying solely on single-score quality metrics is insufficient. You should integrate multi-property annotation tools like propella-1 to gain granular insights into content integrity, reasoning depth, commercial bias, and safety across diverse languages. This approach enables more precise, task-specific data filtering and mixture design, potentially leading to more efficient model training and improved downstream performance by addressing specific quality dimensions that single scores mask.

Key insights

Multi-property LLM annotation offers granular, multilingual data curation beyond single-score quality metrics.

Principles

Data quality is multi-dimensional.
Small, specialized LLMs can outperform larger general models for specific tasks.
Structured annotations enable flexible, compositional filtering.

Method

Fine-tune decoder-only LLMs (Qwen-3 architecture) with a 64K context length on frontier LLM-generated labels, using a detailed rubric, to produce structured JSON annotations for 18 properties across 57 languages.
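
Because the models emit structured JSON, a downstream pipeline would typically parse and validate each record before using it. The following is a minimal sketch of that step, not the paper's implementation: the four property names and their label sets are hypothetical stand-ins loosely based on the dimensions named in this summary, and the real rubric defines 18 properties whose exact names and values are not given here.

```python
import json

# Hypothetical schema: property name -> allowed labels. These four names and
# label sets are illustrative assumptions, not the actual propella-1 rubric.
SCHEMA = {
    "content_integrity": {"complete", "fragment", "corrupted"},
    "reasoning_depth": {"none", "basic", "deep"},
    "commercial_bias": {"none", "moderate", "heavy"},
    "safety": {"safe", "unsafe"},
}


def parse_annotation(raw: str) -> dict:
    """Parse one model output string and check it against SCHEMA."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("annotation must be a JSON object")

    # Every expected property must be present.
    missing = SCHEMA.keys() - record.keys()
    if missing:
        raise ValueError(f"missing properties: {sorted(missing)}")

    # Every expected property must carry one of its allowed string labels.
    for name, allowed in SCHEMA.items():
        value = record[name]
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"invalid value for {name}: {value!r}")
    return record


# Example model output (made up for illustration).
raw_output = (
    '{"content_integrity": "complete", "reasoning_depth": "deep", '
    '"commercial_bias": "none", "safety": "safe"}'
)
annotation = parse_annotation(raw_output)
print(annotation["reasoning_depth"])
```

Rejecting malformed records early keeps later filtering logic simple, since every record that passes is known to carry the full set of expected properties.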

In practice

Use propella-1 models for detailed, multilingual data quality assessment.
Filter pretraining data using compositional predicates across 18 properties.
Analyze dataset composition with propella-annotations to inform data mixture design.
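
Compositional filtering, as described above, amounts to combining small per-property predicates into a single keep/drop rule. The sketch below illustrates the idea under stated assumptions: the property names, labels, and document records are hypothetical, and the helper functions (`equals`, `one_of`, `all_of`, `any_of`, `negate`) are introduced here for illustration rather than taken from any propella-1 tooling.

```python
from typing import Callable, Iterable

Annotation = dict
Predicate = Callable[[Annotation], bool]


def equals(prop: str, value: str) -> Predicate:
    """True when the annotation's property equals the given label."""
    return lambda ann: ann.get(prop) == value


def one_of(prop: str, values: Iterable[str]) -> Predicate:
    """True when the annotation's property is any of the given labels."""
    allowed = set(values)
    return lambda ann: ann.get(prop) in allowed


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND over predicates."""
    return lambda ann: all(p(ann) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR over predicates."""
    return lambda ann: any(p(ann) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of a predicate."""
    return lambda ann: not pred(ann)


# Made-up annotated documents; property names and labels are assumptions.
docs = [
    {"id": "a", "safety": "safe", "reasoning_depth": "deep", "commercial_bias": "none"},
    {"id": "b", "safety": "safe", "reasoning_depth": "none", "commercial_bias": "none"},
    {"id": "c", "safety": "safe", "reasoning_depth": "basic", "commercial_bias": "heavy"},
    {"id": "d", "safety": "unsafe", "reasoning_depth": "deep", "commercial_bias": "none"},
    {"id": "e", "safety": "safe", "reasoning_depth": "basic", "commercial_bias": "moderate"},
]

# Keep safe documents that show some reasoning and are not heavily commercial.
keep = all_of(
    equals("safety", "safe"),
    one_of("reasoning_depth", {"basic", "deep"}),
    negate(equals("commercial_bias", "heavy")),
)

kept_ids = [doc["id"] for doc in docs if keep(doc)]
print(kept_ids)
```

Expressing each condition as a separate predicate means a different task can reuse the same annotations with a different rule, which is the flexibility a single scalar quality score does not offer.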

Topics

LLM Data Curation
Multi-Property Annotation
Multilingual LLMs
Pretraining Datasets
Data Quality Assessment

Code references

guidance-ai/llguidance

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.