propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Summary
ellamind introduces propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) designed for multi-property document annotation, moving beyond single scalar quality scores. These Qwen-3-based models annotate text across 18 properties in six categories, supporting 57 languages and producing structured JSON outputs. The 4B model achieves an overall agreement score of 0.779 against Gemini-3-Pro, outperforming larger general-purpose models, and processes 27.0 documents per second on a single H100 GPU. The project also releases "propella-annotations," a dataset of over three billion document annotations covering major pretraining corpora like FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC, under permissive commercial-use licenses. This multi-dimensional annotation reveals significant differences in quality, reasoning depth, and content composition across datasets and languages, which single-score methods cannot capture.
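To put the reported throughput in context, here is a back-of-envelope estimate of the GPU time needed to annotate a corpus of this size. It is a sketch, not a figure from the paper. It assumes the 27.0 documents per second reported for the 4B model holds for every document, and it treats "over three billion" as exactly three billion. The 64-GPU fleet size is a hypothetical value chosen for illustration.

```python
# Figures taken from the summary above.
DOCS_PER_SECOND = 27.0          # 4B model on a single H100
TOTAL_DOCS = 3_000_000_000      # "over three billion", used as a lower bound

# Assumption for illustration only: number of GPUs running in parallel.
ASSUMED_GPUS = 64


def gpu_hours(total_docs: int, docs_per_second: float) -> float:
    """Single-GPU hours needed to annotate total_docs at a constant rate."""
    seconds = total_docs / docs_per_second
    return seconds / 3600.0


total_gpu_hours = gpu_hours(TOTAL_DOCS, DOCS_PER_SECOND)
# Wall-clock time if the work is spread evenly with perfect scaling.
wall_clock_days = total_gpu_hours / ASSUMED_GPUS / 24.0

print(f"{total_gpu_hours:,.0f} GPU-hours")
print(f"{wall_clock_days:.1f} days on {ASSUMED_GPUS} GPUs")
```

Under these assumptions the total comes to roughly 30,900 GPU-hours. Real runs would differ with document length, batching, and which model size is used.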
Key takeaway
For AI Scientists and NLP Engineers curating pretraining datasets, a single quality score is not enough. Multi-property annotation tools such as propella-1 report content integrity, reasoning depth, commercial bias, and safety as separate signals across many languages. This enables more precise, task-specific data filtering and mixture design, and may lead to more efficient training and better downstream performance by targeting the quality dimensions that a single score masks.
Key insights
Multi-property LLM annotation offers granular, multilingual data curation beyond single-score quality metrics.
Principles
- Data quality is multi-dimensional.
- Small, specialized LLMs can outperform larger general models for specific tasks.
- Structured annotations enable flexible, compositional filtering.
Method
Fine-tune decoder-only LLMs (Qwen-3 architecture) with a 64K context length on frontier LLM-generated labels, using a detailed rubric, to produce structured JSON annotations for 18 properties across 57 languages.
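Because the models emit structured JSON, each output can be checked mechanically before it enters a curation pipeline. The sketch below shows one way to do that. The property names and allowed values in `ASSUMED_SCHEMA` are hypothetical placeholders, since this summary does not list the 18 properties or their label sets, and `parse_annotation` is an illustrative helper, not part of any propella-1 release.

```python
import json

# Hypothetical subset of the schema. The real propella-1 schema has 18
# properties; their names and allowed values are not given in this summary.
ASSUMED_SCHEMA = {
    "content_integrity": {"complete", "mostly_complete", "fragment"},
    "reasoning_depth": {"none", "basic", "moderate", "deep"},
    "commercial_bias": {"none", "minimal", "moderate", "heavy"},
}


def parse_annotation(raw: str) -> dict:
    """Parse one JSON annotation and validate it against ASSUMED_SCHEMA.

    Raises ValueError if the text is not valid JSON, is not a JSON object,
    lacks a required property, or carries a value outside the allowed set.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("annotation must be a JSON object")
    for prop, allowed in ASSUMED_SCHEMA.items():
        if prop not in data:
            raise ValueError(f"missing property: {prop}")
        value = data[prop]
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"invalid value for {prop}: {value!r}")
    return data


example = '{"content_integrity": "complete", "reasoning_depth": "deep", "commercial_bias": "none"}'
annotation = parse_annotation(example)
```

Rejecting malformed outputs up front keeps downstream filters simple, since they can assume that every property is present and takes a known value.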
In practice
- Use propella-1 models for detailed, multilingual data quality assessment.
- Filter pretraining data using compositional predicates across 18 properties.
- Analyze dataset composition with propella-annotations to inform data mixture design.
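Filtering with compositional predicates can be sketched as small functions that each test one property and are then combined with AND, OR, and NOT. Everything here is illustrative. The property names, label values, and the `annotation` field on each document are assumptions, and the helper names (`has`, `all_of`, `any_of`, `negate`, `filter_docs`) are hypothetical rather than a propella-1 API.

```python
from typing import Callable, Dict, Iterable, List

Annotation = Dict[str, str]
Predicate = Callable[[Annotation], bool]


def has(prop: str, *allowed: str) -> Predicate:
    """True when the annotation's value for prop is one of the allowed labels."""
    allowed_set = set(allowed)
    return lambda ann: ann.get(prop) in allowed_set


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND over predicates."""
    return lambda ann: all(p(ann) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR over predicates."""
    return lambda ann: any(p(ann) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of a predicate."""
    return lambda ann: not pred(ann)


def filter_docs(docs: Iterable[dict], pred: Predicate) -> List[dict]:
    """Keep documents whose 'annotation' dict satisfies the predicate."""
    return [d for d in docs if pred(d["annotation"])]


# Hypothetical rule: complete documents that either reason in depth or are
# reference material, excluding heavily commercial content.
keep = all_of(
    has("content_integrity", "complete"),
    any_of(
        has("reasoning_depth", "moderate", "deep"),
        has("content_type", "reference"),
    ),
    negate(has("commercial_bias", "heavy")),
)

docs = [
    {"id": "a", "annotation": {"content_integrity": "complete", "reasoning_depth": "deep",
                               "commercial_bias": "none", "content_type": "article"}},
    {"id": "b", "annotation": {"content_integrity": "complete", "reasoning_depth": "deep",
                               "commercial_bias": "heavy", "content_type": "article"}},
    {"id": "c", "annotation": {"content_integrity": "fragment", "reasoning_depth": "deep",
                               "commercial_bias": "none", "content_type": "article"}},
    {"id": "d", "annotation": {"content_integrity": "complete", "reasoning_depth": "none",
                               "commercial_bias": "none", "content_type": "reference"}},
]

kept_ids = [d["id"] for d in filter_docs(docs, keep)]
```

A single scalar score would force one threshold on all four documents. With per-property predicates, the same annotations can serve different rules for different training goals without re-annotating anything.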
Topics
- LLM Data Curation
- Multi-Property Annotation
- Multilingual LLMs
- Pretraining Datasets
- Data Quality Assessment
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.