Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
Summary
The Chat2Workflow benchmark evaluates how well large language models (LLMs) can generate executable visual workflows from natural language. Such workflows are prevalent in industrial deployments because they are reliable and controllable, but today they are engineered by hand, which is costly, time-consuming, and error-prone. Chat2Workflow comprises real-world business workflows and is designed for direct deployment on platforms like Dify and Coze. Initial experiments show that LLMs grasp high-level intent but struggle to generate correct, stable, and executable workflows, particularly when requirements are complex or evolving. An agentic framework proposed alongside the benchmark improved the resolve rate by up to 5.34%, yet a significant gap remains before industrial-grade automation is reached.
Key takeaway
For research scientists developing automation solutions, the Chat2Workflow benchmark highlights a critical gap in LLM capabilities: generating industrial-grade executable visual workflows. Focus on improving the stability and correctness of LLM output under complex, evolving requirements, potentially by integrating agentic frameworks that reduce execution errors.
Key insights
Automating executable visual workflow generation from natural language remains a significant challenge for current LLMs.
Principles
- Manual workflow engineering is costly.
- LLMs capture high-level intent well.
- Complex requirements challenge LLM workflow generation.
Method
The Chat2Workflow benchmark uses real-world business workflows to test whether LLMs can generate deployable visual workflows from natural language. The benchmark is accompanied by an agentic framework intended to reduce execution errors.
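The summary does not describe how the agentic framework works internally. As a rough illustration of the general idea of reducing execution errors, the following is a minimal sketch of a generate, validate, and repair loop. Everything in it is an assumption made for illustration: the `Workflow` structure, the `validate` checks, and the `generate` callable (a stand-in for an LLM call) are hypothetical and do not reflect the benchmark's format, the authors' framework, or the Dify or Coze schemas.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Workflow:
    """Hypothetical workflow graph: named nodes plus directed edges."""
    nodes: set[str]
    edges: list[tuple[str, str]]


def validate(wf: Workflow) -> list[str]:
    """Return structural errors; an empty list means these checks passed."""
    errors: list[str] = []
    # Check 1: every edge endpoint must be a declared node.
    for src, dst in wf.edges:
        for name in (src, dst):
            if name not in wf.nodes:
                errors.append(f"edge ({src} -> {dst}) references unknown node '{name}'")

    # Check 2: the graph must be acyclic (Kahn's algorithm over the valid edges).
    valid = [(s, d) for s, d in wf.edges if s in wf.nodes and d in wf.nodes]
    indegree = {n: 0 for n in wf.nodes}
    for _, d in valid:
        indegree[d] += 1
    ready = [n for n, deg in indegree.items() if deg == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for s, d in valid:
            if s == node:
                indegree[d] -= 1
                if indegree[d] == 0:
                    ready.append(d)
    if visited != len(wf.nodes):
        errors.append("workflow graph contains a cycle")
    return errors


# Assumed generator signature: (request, feedback from the previous round) -> Workflow.
Generator = Callable[[str, list[str]], Workflow]


def generate_with_repair(request: str, generate: Generator, max_rounds: int = 3):
    """Call the generator, feeding validation errors back, until checks pass."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: list[str] = []
    for _ in range(max_rounds):
        wf = generate(request, feedback)
        feedback = validate(wf)
        if not feedback:
            return wf, True
    return wf, False
```

In this sketch the validator only catches structural problems. A real pipeline would also need platform-specific schema checks and an actual execution test, since a structurally valid graph can still fail at run time.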
In practice
- Deploy workflows on Dify.
- Deploy workflows on Coze.
- Use agentic frameworks for error mitigation.
Topics
- Chat2Workflow
- Executable Visual Workflows
- Natural Language Generation
- Large Language Models
- Agentic Framework
Best for: Research Scientist, AI Scientist, AI Engineer, Automation Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.