Understanding Automated Web GUI Testing: An Empirical Study Across Exploration Strategies and State Abstractions

2026-06-16 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

An empirical study of automated web GUI testing (AWGT) examines how exploration strategies and state abstractions jointly affect testing effectiveness, measured by code coverage and failure revelation. It compares five AWGT approaches (Crawljax, FragGen, WebExplor, WebRLED, and GPTWeb) spanning three categories: model-based, reinforcement learning (RL)-based, and large language model (LLM)-based. The study evaluates six state abstraction techniques for the model-based and RL-based approaches and four history representations for the LLM-based approaches, on six open-source web applications with a 30-minute time budget. No single strategy consistently outperforms the others, although RL-based approaches generally achieve the highest code coverage. Strict, fine-grained state abstractions benefit model-based strategies, while compact abstractions suit RL-based ones. For LLM-based approaches, concise, functionality-level history representations are most effective; verbose state histories degrade performance, and with unlimited context they raise costs by nearly 20x. The study also finds no strong correlation between code coverage and failure-revealing ability.

Key takeaway

For AI engineers designing or selecting automated web GUI testing (AWGT) solutions, no single approach is universally superior. Align the state abstraction technique with the chosen exploration strategy: strict, fine-grained abstractions for model-based systems and compact ones for RL-based systems. With LLM-based AWGT, prefer concise, functionality-level history representations over verbose state descriptions to improve effectiveness and control cost. Evaluate AWGT solutions on both code coverage and failure-revealing ability, because the two metrics are not strongly correlated.

Key insights

The effectiveness of automated web GUI testing hinges on matching exploration strategies with appropriate state abstraction techniques.
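To make the contrast between a strict, fine-grained abstraction and a compact one concrete, here is a minimal Python sketch. It is not one of the six techniques evaluated in the study, which this summary does not describe; the functions `fine_grained_state` and `compact_state` and the `INTERACTIVE_TAGS` set are hypothetical names chosen for illustration. The fine-grained version treats any change in tags, attributes, or visible text as a new state, while the compact version only looks at which interactive widgets are present.

```python
import hashlib
from html.parser import HTMLParser

# Assumption: these tags are what the compact abstraction counts as "widgets".
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea", "form"}


class _Collector(HTMLParser):
    """Collects every start tag (with its attributes) and all visible text."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag, sorted attribute tuple)
        self.text = []  # non-empty text fragments

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, tuple(sorted((k, v or "") for k, v in attrs))))

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def _parse(html):
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return collector.tags, collector.text


def _digest(parts):
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()[:16]


def fine_grained_state(html):
    """Strict abstraction: any change in tags, attributes, or text is a new state."""
    tags, text = _parse(html)
    return _digest([f"{tag}{attrs}" for tag, attrs in tags] + text)


def _widget_key(tag, attrs):
    lookup = dict(attrs)
    return f"{tag}#{lookup.get('id') or lookup.get('name') or ''}"


def compact_state(html):
    """Compact abstraction: pages with the same set of interactive widgets collapse."""
    tags, _ = _parse(html)
    widgets = sorted({_widget_key(t, a) for t, a in tags if t in INTERACTIVE_TAGS})
    return _digest(widgets)
```

Under these assumptions, two pages that differ only in a button label map to different fine-grained states but to the same compact state, which is the kind of trade-off between state-space size and precision that the study associates with different exploration strategies.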

Principles

No single AWGT strategy consistently excels.
State abstraction is critical for testing effectiveness.
Code coverage does not strongly correlate with failure revelation.
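One way to check the coverage-versus-failures relationship on your own test runs is a rank correlation. The sketch below computes Spearman's coefficient with the standard library only; the summary does not say which statistic the study used, so this choice is an assumption, and the function names are illustrative.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def spearman(coverage, failures):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(coverage) != len(failures) or len(coverage) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(coverage), ranks(failures))
```

Calling `spearman(coverage_per_run, failures_per_run)` with one entry per tool-and-application run gives a value between -1 and 1; values near 0 would be consistent with the study's observation that high coverage alone does not imply more revealed failures.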

Method

An empirical study compared model-based, RL-based, and LLM-based AWGT approaches, pairing them with different state abstraction and history representation techniques and measuring code coverage and failure revelation on six open-source web applications.

In practice

Match state abstraction to exploration strategy.
Prioritize functionality history for LLM-based AWGT.
Evaluate AWGT using both coverage and failure metrics.
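The difference between a verbose state history and a concise, functionality-level one can be sketched as two ways of rendering the same exploration trace into prompt text. This is an illustrative sketch only: the summary does not describe the four history representations the study evaluated, and the `Step` record, its `functionality` label, and both rendering functions are hypothetical. No LLM API is called; prompt length stands in as a rough proxy for cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str         # e.g. "click #add-to-cart"
    page_dump: str      # raw page text or HTML observed after the action
    functionality: str  # assumed short label, e.g. "add item to cart"


def verbose_history(steps):
    """Verbose form: every action followed by the full page dump."""
    return "\n".join(
        f"{i}. {s.action}\n{s.page_dump}" for i, s in enumerate(steps, 1)
    )


def functionality_history(steps):
    """Concise form: one line per distinct functionality, with a usage count."""
    counts = {}
    for s in steps:
        counts[s.functionality] = counts.get(s.functionality, 0) + 1
    # dicts preserve insertion order, so lines follow first-seen order
    return "\n".join(
        f"- {name} (exercised {count}x)" for name, count in counts.items()
    )
```

In this sketch the concise form grows with the number of distinct functionalities rather than with the number of steps or the size of each page, which is one plausible reason a functionality-level history keeps prompts short as exploration continues.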

Topics

Web GUI Testing
Test Exploration Strategies
GUI State Abstraction
Reinforcement Learning
Large Language Models
Test Effectiveness Evaluation

Best for: Research Scientist, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.