Designing a Multi-Agent System for Engineering Support at Scale: A Case Study From Grab

Source: InfoQ · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics) · Depth: Intermediate, quick

Summary

Grab's Analytics Data Warehouse (ADW) team has deployed a multi-agent AI system to automate engineering support workflows, with the aim of cutting repetitive operational work and resolving requests faster. The team supports more than 1,000 internal users and manages more than 15,000 tables, and ad hoc investigations and support requests had become a significant operational burden. The system splits requests into two workflows: "investigation", which covers diagnostics such as query analysis, and "enhancement", which produces actionable outputs such as SQL fixes and automated merge requests. A LangGraph-based workflow engine, working with FastAPI services, routes each request to a specialized agent. Key design choices included consolidating more than 30 internal tools into a curated set and adding safety measures such as SQL validation and human-in-the-loop review for code changes. The initiative has shifted engineering effort toward higher-value platform improvement.
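
The safety measures mentioned above (SQL validation plus human review before code changes) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the source does not describe Grab's actual validation rules, so `validate_sql`, `ProposedChange`, and `can_open_merge_request` are hypothetical names, and the read-only check is only one plausible rule.

```python
import re
from dataclasses import dataclass

# Hypothetical rule: agent-generated SQL must be a single read-only statement.
# The source only says "SQL validation"; the real checks are not described.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant)\b",
    re.IGNORECASE,
)


def validate_sql(sql: str) -> bool:
    """Return True only for a single SELECT/WITH statement with no write keywords."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:  # more than one statement (or a ';' inside a literal)
        return False
    if not re.match(r"(?i)^\s*(select|with)\b", statement):
        return False
    return FORBIDDEN.search(statement) is None


@dataclass
class ProposedChange:
    """A code change drafted by an agent, waiting for a human decision."""
    description: str
    sql: str
    approved_by: str = ""  # empty until a human reviewer signs off


def can_open_merge_request(change: ProposedChange) -> bool:
    """Human-in-the-loop gate: valid SQL AND an explicit human approval."""
    return validate_sql(change.sql) and bool(change.approved_by)


if __name__ == "__main__":
    draft = ProposedChange("Filter by date first", "SELECT id FROM orders WHERE day = '2024-01-01'")
    print(can_open_merge_request(draft))   # blocked: nobody has approved yet
    draft.approved_by = "reviewer"
    print(can_open_merge_request(draft))   # allowed: valid SQL and approved
```

The check is deliberately conservative: a string literal that contains a semicolon or a write keyword is rejected as well. A production validator would more likely parse the SQL than pattern-match it.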

Key takeaway

If you are an MLOps engineer scaling a data platform, consider a multi-agent system for automating repetitive support tasks. Separating diagnostic and enhancement workflows, as Grab does, can free up engineering time. Consolidate your toolset and require human-in-the-loop review for critical outputs such as code changes. This strategy moves your team from reactive firefighting toward proactive system development and leaves more capacity for platform improvements.

Key insights

Grab's multi-agent AI system automates engineering support, freeing up engineers for higher-value platform development by streamlining diagnostic and enhancement workflows.

Principles

Method

The system classifies requests, routes them to specialized agents, and orchestrates tasks using a LangGraph-based engine with FastAPI services for coordination, tool execution, and state management. Context is managed via compression and selective retrieval.
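
The classify, route, and compress steps described above can be sketched in plain Python. This is not Grab's implementation and does not use LangGraph or FastAPI; it is a standard-library stand-in for the same control flow. The keyword classifier, the `SupportState` fields, the agent functions, and the `compress` rule are all hypothetical, since the source does not say how requests are classified or how context is compressed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical hints; a real classifier would more likely be an LLM call.
# Naive substring match: "fix" would also match "prefix".
ENHANCEMENT_HINTS = ("fix", "rewrite", "optimize", "merge request")


@dataclass
class SupportState:
    """Shared state passed from step to step (an assumed shape)."""
    request: str
    category: str = ""
    history: List[str] = field(default_factory=list)
    result: str = ""


def classify(state: SupportState) -> SupportState:
    """Label the request as 'investigation' or 'enhancement'."""
    text = state.request.lower()
    is_enhancement = any(hint in text for hint in ENHANCEMENT_HINTS)
    state.category = "enhancement" if is_enhancement else "investigation"
    state.history.append(f"classified as {state.category}")
    return state


def investigation_agent(state: SupportState) -> SupportState:
    """Placeholder for diagnostics such as query analysis."""
    state.result = "diagnostic report"
    state.history.append("ran diagnostics")
    return state


def enhancement_agent(state: SupportState) -> SupportState:
    """Placeholder for producing a SQL fix or a merge request draft."""
    state.result = "proposed change (pending human review)"
    state.history.append("drafted change")
    return state


# Routing table: category label -> specialized agent.
AGENTS: Dict[str, Callable[[SupportState], SupportState]] = {
    "investigation": investigation_agent,
    "enhancement": enhancement_agent,
}


def compress(history: List[str], keep_last: int = 3) -> List[str]:
    """Keep the most recent steps and replace older ones with a one-line marker."""
    if len(history) <= keep_last:
        return list(history)
    dropped = len(history) - keep_last
    return [f"[{dropped} earlier steps summarized]"] + history[-keep_last:]


def handle(request: str) -> SupportState:
    """Classify, route to the matching agent, then trim the context."""
    state = classify(SupportState(request=request))
    state = AGENTS[state.category](state)
    state.history = compress(state.history)
    return state


if __name__ == "__main__":
    print(handle("Why is my query on the orders table slow?").category)
    print(handle("Please fix the SQL in this pipeline").result)
```

In a graph-based engine, `classify` would be the entry node and the `AGENTS` lookup would become a conditional edge. The dictionary shows the same decision without depending on any framework.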

Best for: AI Architect, CTO, VP of Engineering/Data, AI Engineer, Data Engineer, MLOps Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.