ReSyn: A Generalized Recursive Regular Expression Synthesis Framework

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

ReSyn is a synthesizer-agnostic framework that addresses the limitations of existing Programming-By-Example (PBE) systems when synthesizing complex, real-world regular expressions. Traditional benchmarks often simplify regex structure, so models tuned on them lose accuracy when they meet the deeper nesting and frequent Union operators of practical collections such as RegExLib, whose expressions contain more than twice as many AST nodes as those in simplified datasets. ReSyn applies a recursive, learnable divide-and-conquer strategy: it decomposes a synthesis problem into manageable sub-problems using three specialized neural modules, the Router, the Partitioner, and the Segmenter. The framework also introduces Set2Regex, a parameter-efficient (10M-parameter) base synthesizer built around a Hierarchical Set Encoder. With 29.6M parameters in total, ReSyn raises synthesis success rates and semantic accuracy across benchmarks and sets a new state of the art on challenging real-world datasets such as RegExLib, outperforming larger models such as Prax (300M) and gpt-oss-120b (120B), and even GPT-5 on some metrics.
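
To make the structural-complexity comparison concrete, here is a minimal sketch of how AST node count, nesting depth, and Union frequency could be measured. The `Node` class and the operator names (`"union"`, `"concat"`, `"star"`, `"lit"`) are hypothetical and chosen only for illustration; the summarized work may define its AST differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical regex AST node: an operator name plus child nodes."""
    op: str
    children: List["Node"] = field(default_factory=list)


def count_nodes(node: Node) -> int:
    """Total number of nodes in the tree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)


def depth(node: Node) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in node.children), default=0)


def count_op(node: Node, op: str) -> int:
    """How many nodes in the tree carry the operator `op`."""
    return int(node.op == op) + sum(count_op(c, op) for c in node.children)


def lit() -> Node:
    return Node("lit")


# A flat pattern such as a b* ...
simple = Node("concat", [lit(), Node("star", [lit()])])
# ... versus a nested one such as (a (b|c)*) | d
nested = Node("union", [
    Node("concat", [lit(), Node("star", [Node("union", [lit(), lit()])])]),
    lit(),
])

print(count_nodes(simple), depth(simple), count_op(simple, "union"))
print(count_nodes(nested), depth(nested), count_op(nested, "union"))
```

In this toy pair the nested pattern has twice the nodes of the flat one and two Union operators, which is the kind of gap the summary describes between simplified benchmarks and real-world collections.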

Key takeaway

Machine learning engineers building regex synthesis systems should prioritize recursive decomposition and permutation-invariant input encoding. Finding an optimal regex decomposition is NP-hard; ReSyn addresses this with learnable decomposition modules and, in doing so, improves accuracy on complex, nested real-world patterns. Consider integrating such modules so that the system breaks problems down adaptively, which generalizes more robustly than monolithic or fixed-heuristic models. The reported results suggest that this combination delivers both higher accuracy and better parameter efficiency.

Key insights

Real-world regex synthesis requires recursive decomposition and permutation-invariant encoding to handle structural complexity efficiently.
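
A minimal sketch of what permutation-invariant encoding means for a set of example strings. This is not the Set2Regex architecture: the hand-written features in `encode_string` stand in for a learned string encoder, and all function names are hypothetical. The only point illustrated is the two-level structure, where each string is encoded on its own and the set is then pooled with order-agnostic operations (mean and max).

```python
import string
from typing import List


def char_kind(ch: str) -> int:
    """Coarse character class index: 0 = digit, 1 = ASCII letter, 2 = other."""
    if ch in string.digits:
        return 0
    if ch in string.ascii_letters:
        return 1
    return 2


def encode_string(s: str) -> List[int]:
    """Lower level: features of a single example string."""
    counts = [0, 0, 0]
    for ch in s:
        counts[char_kind(ch)] += 1
    first = [0, 0, 0]
    last = [0, 0, 0]
    if s:
        first[char_kind(s[0])] = 1
        last[char_kind(s[-1])] = 1
    return [len(s)] + counts + first + last


def encode_set(examples: List[str]) -> List[float]:
    """Upper level: pool per-string vectors with mean and max over the set."""
    if not examples:
        raise ValueError("need at least one example")
    vectors = [encode_string(s) for s in examples]
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    peak = [float(max(col)) for col in zip(*vectors)]
    return mean + peak


examples = ["abc-123", "x-9", "2024"]
shuffled = ["2024", "abc-123", "x-9"]
# Reordering the examples leaves the set encoding unchanged.
print(encode_set(examples) == encode_set(shuffled))
```

Because mean and max ignore the order of their inputs, a downstream decoder sees the same representation however the user happens to list the examples.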

Method

ReSyn uses a three-stage framework: regex canonicalization, Set2Regex (a hierarchical encoder-decoder base synthesizer), and a recursive decomposition algorithm with learnable Router, Partitioner, and Segmenter modules.
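
The recursive control flow can be sketched as follows. Everything here is an assumption made for illustration: simple hand-written heuristics replace the learned Router, Partitioner, and Segmenter, a toy function replaces the Set2Regex base synthesizer, and the mapping of partitioning to Union branches and segmenting to concatenated pieces is inferred from the module names rather than taken from the source.

```python
import re
import string
from typing import Dict, List, Tuple


def char_class(ch: str) -> str:
    """Regex fragment that matches the coarse class of `ch`."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)


def split_runs(s: str) -> List[str]:
    """Split a string into maximal runs of the same character class."""
    runs: List[str] = []
    for ch in s:
        if runs and char_class(runs[-1][-1]) == char_class(ch):
            runs[-1] += ch
        else:
            runs.append(ch)
    return runs


def signature(s: str) -> Tuple[str, ...]:
    """Sequence of character classes, one per run."""
    return tuple(char_class(run[0]) for run in split_runs(s))


def route(examples: List[str]) -> str:
    """Heuristic stand-in for the Router: choose the next action."""
    sigs = {signature(s) for s in examples}
    if len(sigs) > 1:
        return "partition"
    if len(next(iter(sigs))) > 1:
        return "segment"
    return "solve"


def partition(examples: List[str]) -> List[List[str]]:
    """Heuristic stand-in for the Partitioner: group examples by signature."""
    groups: Dict[Tuple[str, ...], List[str]] = {}
    for s in examples:
        groups.setdefault(signature(s), []).append(s)
    return list(groups.values())


def segment(examples: List[str]) -> List[List[str]]:
    """Heuristic stand-in for the Segmenter: align runs into columns."""
    return [list(col) for col in zip(*(split_runs(s) for s in examples))]


def base_synthesize(examples: List[str]) -> str:
    """Toy base synthesizer: one class run, else a literal alternation."""
    sigs = {signature(s) for s in examples}
    if len(sigs) == 1 and len(next(iter(sigs))) == 1:
        return next(iter(sigs))[0] + "+"
    return "(?:" + "|".join(sorted(re.escape(s) for s in set(examples))) + ")"


def synthesize(examples: List[str], depth: int = 0, max_depth: int = 8) -> str:
    """Recursively decompose until the base synthesizer can solve each piece."""
    action = route(examples) if depth < max_depth else "solve"
    if action == "partition":
        parts = [synthesize(g, depth + 1, max_depth) for g in partition(examples)]
        return "(?:" + "|".join(parts) + ")"
    if action == "segment":
        cols = segment(examples)
        return "".join(synthesize(c, depth + 1, max_depth) for c in cols)
    return base_synthesize(examples)


examples = ["abc-123", "x-9", "2024"]
pattern = synthesize(examples)
print(pattern, all(re.fullmatch(pattern, s) for s in examples))
```

The depth limit keeps the recursion bounded, and every Union is wrapped in a non-capturing group so that sub-results can be concatenated safely. In a learned system, each of the three heuristic functions would be replaced by a trained module, while the recursive skeleton could stay the same.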

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.