Assessing the Business Process Modeling Competences of Large Language Models
Summary
BEF4LLM is a novel evaluation framework for assessing the competence of Large Language Models (LLMs) in generating Business Process Model and Notation (BPMN) models from natural language descriptions. The framework comprises four perspectives: syntactic quality, pragmatic quality, semantic quality, and validity. A comprehensive analysis of 17 open-source LLMs, including Llama 3, Qwen 2.5, Qwen 3, and DeepSeek-R1, was conducted on 105 curated text-BPMN pairs, with 5 runs per sample. Results indicate that LLMs excel in syntactic and pragmatic quality (scores consistently above 0.75 and 0.8, respectively), but human experts outperform LLMs in semantic aspects (human score 0.5152). Validity, especially generating valid BPMN-XML files, remains a major challenge for most LLMs, with Llama 3.1 70B achieving the highest validity at 97.33%. Larger LLMs do not always yield better results, and increased size sometimes degrades pragmatic quality.
Key takeaway
AI scientists and ML engineers developing LLM-driven BPM tools should prioritize instruction-tuned models and implement robust validation and refinement for BPMN-XML output. While larger models boost syntactic and semantic quality, smaller LLMs often deliver superior pragmatic results. Parameter count alone doesn't guarantee overall quality, so select a model based on specific needs. Focus fine-tuning efforts on improving semantic accuracy and ensuring valid XML generation to enhance practical deployment.
Key insights
LLMs show strong syntactic and pragmatic BPMN generation but struggle with semantic accuracy and XML validity.
Principles
- LLM parameter count does not monotonically improve BPMN quality.
- Instruction-tuned LLMs consistently outperform non-instruction-tuned ones.
- Human experts maintain consistent BPMN quality, with lower variance than LLMs.
Method
BEF4LLM is a four-perspective framework (syntactic, pragmatic, semantic, validity) using 39 metrics to evaluate LLM-generated BPMN models against ground truth and human experts.
In practice
- Prioritize instruction-tuned LLMs for BPMN generation tasks.
- Consider smaller LLMs such as Llama 3 8B when pragmatic quality matters most.
- Implement refinement loops to improve the validity of LLM-generated BPMN-XML.
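
The refinement-loop recommendation can be illustrated with a minimal Python sketch. It is not the paper's method: the function names `check_bpmn_xml` and `generate_with_refinement`, the `generate` callback (a stand-in for whatever LLM call is used), and the retry limit are all hypothetical. The check is deliberately shallow: it only tests XML well-formedness, the root element, and the presence of a process element, and does not perform full BPMN schema validation.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Tuple

# Namespace of the BPMN 2.0 model schema.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"


def check_bpmn_xml(xml_text: str) -> Optional[str]:
    """Return None if the text passes the basic checks, else an error message.

    Shallow check only (assumption): well-formed XML, a BPMN <definitions>
    root, and at least one <process> child. Not a full schema validation.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return f"not well-formed XML: {exc}"
    if root.tag != f"{{{BPMN_NS}}}definitions":
        return f"unexpected root element: {root.tag}"
    if root.find(f"{{{BPMN_NS}}}process") is None:
        return "no <process> element under <definitions>"
    return None


def generate_with_refinement(
    description: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Call `generate` until its output passes the check or attempts run out.

    `generate(description, feedback)` receives the previous error message
    (or None on the first attempt) so the prompt can ask for a fix.
    Returns the accepted XML and the number of attempts used.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(description, feedback)
        error = check_bpmn_xml(candidate)
        if error is None:
            return candidate, attempt
        feedback = error  # pass the error back into the next attempt
    raise ValueError(
        f"no valid BPMN-XML after {max_attempts} attempts: {feedback}"
    )
```

In use, `generate` would wrap the model call and include the feedback string in the follow-up prompt. A production setup would likely replace the shallow check with validation against the official BPMN XSD.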
Topics
- Business Process Model and Notation (BPMN)
- Large Language Models
- LLM Evaluation Frameworks
- Text-to-Process Generation
- Process Model Quality
- BPMN-XML Validity
Best for: AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.