Assessing the Business Process Modeling Competences of Large Language Models

2026-06-30 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Business Process Management · Depth: Expert, extended

Summary

BEF4LLM is a novel evaluation framework for assessing how well Large Language Models (LLMs) generate Business Process Model and Notation (BPMN) models from natural language descriptions. The framework covers four perspectives: syntactic quality, pragmatic quality, semantic quality, and validity. The authors evaluated 17 open-source LLMs, including Llama 3, Qwen 2.5, Qwen 3, and DeepSeek-R1, on 105 curated text-BPMN pairs with 5 runs per sample. The results indicate that LLMs excel in syntactic and pragmatic quality (scores consistently above 0.75 and 0.8, respectively), but human experts still outperform them on semantic quality (human score: 0.5152). Validity, especially producing valid BPMN-XML files, remains a major challenge for most LLMs; Llama 3.1 70B achieved the highest validity at 97.33%. Larger LLMs do not always yield better results and sometimes degrade pragmatic quality.

Key takeaway

For AI Scientists and ML Engineers developing LLM-driven BPM tools: prioritize instruction-tuned models, and implement robust validation and refinement for BPMN-XML output. Larger models tend to improve syntactic and semantic quality, but smaller LLMs often deliver better pragmatic results, so parameter count alone does not guarantee overall quality; select a model based on the quality perspective that matters most for your use case. Focus fine-tuning efforts on improving semantic accuracy and on generating valid XML to make practical deployment easier.

Key insights

LLMs show strong syntactic and pragmatic BPMN generation but struggle with semantic accuracy and XML validity.

Principles

LLM parameter count does not monotonically improve BPMN quality.
Instruction-tuned LLMs consistently outperform non-instruction-tuned ones.
Human experts produce BPMN models of more consistent quality, with lower variance than LLMs.

Method

BEF4LLM is a four-perspective framework (syntactic, pragmatic, semantic, validity) using 39 metrics to evaluate LLM-generated BPMN models against ground truth and human experts.
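
To make the repeated-run setup concrete, here is a minimal sketch of how per-perspective scores from several runs could be summarized as a mean and a spread. It is not the BEF4LLM implementation: the `RunScore` type, the `summarize` function, and the assumption that each perspective is reduced to one score in [0, 1] per run are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# The four perspectives named in the summary above.
PERSPECTIVES = ("syntactic", "pragmatic", "semantic", "validity")


@dataclass(frozen=True)
class RunScore:
    """Scores for one generation run on one text-BPMN pair (hypothetical type)."""
    sample_id: str
    run: int
    scores: dict  # perspective name -> score, assumed to lie in [0, 1]


def summarize(runs):
    """Return {perspective: (mean, population std dev)} over all given runs."""
    summary = {}
    for perspective in PERSPECTIVES:
        values = [r.scores[perspective] for r in runs]
        summary[perspective] = (mean(values), pstdev(values))
    return summary
```

Reporting the spread next to the mean is what makes a statement such as "lower variance" checkable: two models with the same mean score can differ substantially in how consistent they are across runs.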

In practice

Prioritize instruction-tuned LLMs for BPMN generation tasks.
Consider smaller LLMs such as Llama 3 8B when pragmatic quality matters most.
Implement refinement loops for LLM-generated BPMN-XML validity.
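
A refinement loop of the kind recommended above could look like the following minimal sketch. It assumes a caller-supplied `generate(description, feedback)` function that wraps whatever LLM is in use; that function, `check_bpmn_xml`, and `generate_with_refinement` are hypothetical names, not part of the paper or of any library. The checks are deliberately lightweight (well-formed XML, a BPMN 2.0 `definitions` root, at least one `process` child) and are not a substitute for full schema validation.

```python
import xml.etree.ElementTree as ET

# Namespace of the BPMN 2.0 model schema.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"


def check_bpmn_xml(text):
    """Return None if the text passes the lightweight checks, else an error message."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        return f"not well-formed XML: {exc}"
    if root.tag != f"{{{BPMN_NS}}}definitions":
        return f"unexpected root element: {root.tag}"
    if root.find(f"{{{BPMN_NS}}}process") is None:
        return "no <process> element under <definitions>"
    return None


def generate_with_refinement(description, generate, max_attempts=3):
    """Call generate() until its output passes the checks or attempts run out.

    `generate` is a hypothetical callable (description, feedback) -> str;
    feedback is None on the first attempt and the last error message afterwards.
    Returns the first passing BPMN-XML string, or None if every attempt fails.
    """
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(description, feedback)
        feedback = check_bpmn_xml(candidate)
        if feedback is None:
            return candidate
    return None
```

Passing the error message back to the generator lets the model correct a specific defect instead of regenerating blindly, and capping the attempts keeps cost bounded when a model cannot produce a valid file at all.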

Topics

Business Process Model and Notation (BPMN)
Large Language Models
LLM Evaluation Frameworks
Text-to-Process Generation
Process Model Quality
BPMN-XML Validity

Best for: AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.