SAGE: Semantic-Aware Gray-Box Game Regression Testing with Large Language Models

2026-06-19 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Gaming & Interactive Media · Depth: Expert, extended

Summary

SAGE is a semantic-aware gray-box regression testing framework for modern live-service games. It targets three pain points: manual test case construction, test suite maintenance, and test prioritization. First, it uses LLM-guided reinforcement learning for goal-oriented exploration that automatically generates a diverse foundational test suite. Second, it applies semantic-based multi-objective optimization to reduce that suite to a compact, high-value subset, balancing cost, coverage, and rarity. Third, it uses LLM-based semantic analysis of update logs to prioritize the test cases most relevant to a version change. Evaluated on Overcooked Plus and Minecraft, SAGE detected approximately 1.6x more unique bugs than automated baselines while reducing execution time by 75-90%. All LLM tasks use GPT-4o.

Key takeaway

For game QA teams managing live-service titles, SAGE shows how regression testing can be automated in gray-box environments. Its LLM-guided test generation and semantic-aware prioritization reduce manual effort and execution cost. Consider using multi-objective optimization to keep the test suite compact and high-value as the game updates frequently, so that critical bugs are still caught without excessive overhead.

Key insights

SAGE uses LLMs to orchestrate semantic-aware gray-box game regression testing, improving bug detection and efficiency.

Principles

LLMs guide RL for efficient exploration.
Multi-objective optimization refines test suites.
Semantic analysis prioritizes relevant tests.
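
To make the multi-objective principle concrete, the following is a minimal sketch of a Pareto filter over the three objectives named in the summary (cost, coverage, rarity). It is not SAGE's actual optimizer: the Candidate type, its fields, and the sample scores are all hypothetical, and a real system would compute coverage and rarity from game semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical test-case record; field names are assumptions, not SAGE's schema.
    name: str
    cost: float      # execution time in seconds, lower is better
    coverage: float  # share of states exercised, higher is better
    rarity: float    # how unusual the visited states are, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    no_worse = a.cost <= b.cost and a.coverage >= b.coverage and a.rarity >= b.rarity
    better = a.cost < b.cost or a.coverage > b.coverage or a.rarity > b.rarity
    return no_worse and better


def pareto_front(cases: list[Candidate]) -> list[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cases if not any(dominates(o, c) for o in cases)]


suite = [
    Candidate("cook_soup", 30.0, 0.40, 0.10),
    Candidate("cook_soup_slow", 45.0, 0.40, 0.10),   # same value, higher cost
    Candidate("rare_fire_event", 60.0, 0.20, 0.90),
    Candidate("idle_walk", 10.0, 0.05, 0.05),
]
front_names = [c.name for c in pareto_front(suite)]
print(front_names)
```

Here cook_soup_slow is dropped because cook_soup matches it on coverage and rarity at lower cost; the other three survive because each is best on at least one objective.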

Method

SAGE generates seed trajectories with LLMs, trains an RL agent for guided exploration, constructs a state-action graph, optimizes paths via multi-objective selection, and prioritizes tests using LLM-extracted update log semantics.
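
The state-action graph step can be illustrated with a small sketch. It assumes trajectories are lists of (state, action, next_state) triples with string labels; that representation, the function names, and the sample Overcooked-style labels are illustrative assumptions rather than the paper's data model.

```python
from collections import defaultdict


def build_state_action_graph(trajectories):
    """Merge trajectories into a graph: state -> action -> set of next states."""
    graph = defaultdict(lambda: defaultdict(set))
    for trajectory in trajectories:
        for state, action, next_state in trajectory:
            graph[state][action].add(next_state)
    return graph


def enumerate_paths(graph, start, max_len):
    """List acyclic action sequences of length 1..max_len starting at `start`."""
    paths = []

    def walk(state, actions, visited):
        if actions:
            paths.append(tuple(actions))
        if len(actions) == max_len:
            return
        # Sort for a deterministic order; .get avoids creating empty entries.
        for action in sorted(graph.get(state, {})):
            for nxt in sorted(graph[state][action]):
                if nxt not in visited:
                    walk(nxt, actions + [action], visited | {nxt})

    walk(start, [], {start})
    return paths


# Hypothetical trajectories recorded from exploration.
trajectories = [
    [("start", "pick_onion", "holding_onion"),
     ("holding_onion", "put_in_pot", "pot_filled")],
    [("start", "pick_onion", "holding_onion"),
     ("holding_onion", "drop", "start")],
]
graph = build_state_action_graph(trajectories)
paths = enumerate_paths(graph, "start", max_len=2)
print(paths)
```

Candidate paths enumerated this way are the kind of input a later selection step would score and reduce.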

In practice

Use LLMs to interpret natural language logs.
Apply Pareto optimization for test suite reduction.
Prioritize tests by semantic relevance and complexity.
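
As a rough sketch of the prioritization idea above, the example below ranks tests by overlap with keywords from an update log. The keywords are assumed to have already been extracted (by an LLM in SAGE's case; here they are hard-coded). The Jaccard score, the tie-break that favors tests with more steps, and all test names and tags are assumptions made for illustration, not the paper's scoring function.

```python
def relevance(update_keywords: set[str], test_tags: set[str]) -> float:
    """Jaccard overlap between update keywords and a test's semantic tags."""
    if not update_keywords or not test_tags:
        return 0.0
    return len(update_keywords & test_tags) / len(update_keywords | test_tags)


def prioritize(tests: list[dict], update_keywords: set[str]) -> list[dict]:
    """Order by relevance (descending), then step count (descending), then name."""
    return sorted(
        tests,
        key=lambda t: (-relevance(update_keywords, t["tags"]), -t["steps"], t["name"]),
    )


# Keywords assumed to come from an update log such as "Rebalanced iron smelting in furnaces".
update_keywords = {"furnace", "smelting", "iron"}

tests = [
    {"name": "plant_wheat", "tags": {"farming", "wheat"}, "steps": 5},
    {"name": "craft_pickaxe", "tags": {"crafting", "iron"}, "steps": 8},
    {"name": "build_shelter", "tags": {"building", "wood"}, "steps": 20},
    {"name": "smelt_iron", "tags": {"furnace", "smelting", "iron"}, "steps": 12},
]
ordered = [t["name"] for t in prioritize(tests, update_keywords)]
print(ordered)
```

Tests unrelated to the update still appear in the ordering, only later, so a time-boxed run executes the most relevant cases first.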

Topics

Game Regression Testing
Large Language Models
Gray-box Testing
Reinforcement Learning
Multi-objective Optimization
Test Case Prioritization

Code references

BlueLinkX/SAGE

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.