Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

The Chat2Workflow benchmark evaluates how well large language models (LLMs) can generate executable visual workflows from natural language. Such workflows are prevalent in industrial deployments because they are reliable and controllable, but they are still engineered by hand, a process that is costly, time-consuming, and error-prone. Chat2Workflow comprises real-world business workflows and is designed for direct deployment on platforms like Dify and Coze. Initial experiments show that LLMs grasp high-level intent but struggle to generate correct, stable, and executable workflows, particularly when requirements are complex or evolving. An agentic framework proposed alongside the benchmark improved the resolve rate by up to 5.34%, yet a significant gap remains before industrial-grade automation is achievable.

Key takeaway

For research scientists developing automation solutions, the Chat2Workflow benchmark highlights a critical gap in LLM capabilities for generating industrial-grade executable visual workflows. You should focus your efforts on improving LLM stability and correctness under complex, evolving requirements, potentially by integrating agentic frameworks to enhance reliability and reduce execution errors.

Key insights

Automating executable visual workflow generation from natural language remains a significant challenge for current LLMs.

Method

The Chat2Workflow benchmark uses real-world business workflows to test LLMs' ability to generate deployable visual workflows from natural language, complemented by an agentic framework to reduce execution errors.
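
The generate, check, and repair idea behind an agentic framework can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the benchmark: the node-and-edge dictionary format, the `validate`, `generate_with_repair`, and `resolve_rate` helpers, and the stub generator that stands in for an LLM call. Real platforms such as Dify and Coze use their own workflow formats, and the paper's actual framework and metric definition may differ.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical workflow format (an assumption, not the Dify or Coze schema):
# {"nodes": [{"id": "a", "type": "start"}, ...], "edges": [["a", "b"], ...]}
Workflow = Dict[str, list]
Generator = Callable[[str, List[str]], Workflow]


def validate(workflow: Workflow) -> List[str]:
    """Return structural errors; an empty list means all checks passed."""
    errors: List[str] = []
    nodes = workflow.get("nodes", [])
    edges = workflow.get("edges", [])

    ids = [node["id"] for node in nodes]
    if len(ids) != len(set(ids)):
        errors.append("duplicate node ids")

    types = [node["type"] for node in nodes]
    if types.count("start") != 1:
        errors.append("expected exactly one start node")
    if "end" not in types:
        errors.append("missing end node")

    known = set(ids)
    for src, dst in edges:
        for endpoint in (src, dst):
            if endpoint not in known:
                errors.append(f"edge references unknown node {endpoint!r}")
    return errors


def generate_with_repair(
    instruction: str, generate: Generator, max_rounds: int = 3
) -> Optional[Workflow]:
    """Call the generator, feeding validation errors back until it passes."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(instruction, feedback)
        feedback = validate(candidate)
        if not feedback:
            return candidate
    return None  # unresolved within the round budget


def resolve_rate(results: List[Optional[Workflow]]) -> float:
    """Fraction of tasks that ended with a workflow passing the checks."""
    if not results:
        return 0.0
    return sum(result is not None for result in results) / len(results)


def stub_generate(instruction: str, feedback: List[str]) -> Workflow:
    """Stand-in for an LLM: wrong on the first try, fixed once it sees errors."""
    nodes = [{"id": "a", "type": "start"}, {"id": "b", "type": "end"}]
    if feedback:
        return {"nodes": nodes, "edges": [["a", "b"]]}
    return {"nodes": nodes, "edges": [["a", "c"]]}  # dangling edge target


if __name__ == "__main__":
    with_repair = generate_with_repair("summarize a ticket", stub_generate)
    single_shot = generate_with_repair("summarize a ticket", stub_generate, max_rounds=1)
    print(resolve_rate([with_repair, single_shot]))
```

The sketch checks only structure. A benchmark that targets executable workflows would also have to run each workflow on the target platform and compare its outputs against the task requirements, which is where correctness and stability failures of the kind described above would surface.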

Best for: Research Scientist, AI Scientist, AI Engineer, Automation Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.