REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange
Summary
REStack is a new, large-scale dataset of over 12,000 reverse engineering (RE) discussion posts collected from Stack Overflow and the dedicated Reverse Engineering Stack Exchange site, spanning August 2008 to April 2025. It is the first publicly available dataset of its kind: it systematically curates RE discussions and includes metadata and difficulty indicators such as unanswered rates and response times. Using Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA)-based hyperparameter optimization, followed by manual labeling, the authors identified 23 semantically coherent RE topics grouped into six high-level themes. The analysis shows that RE discussions are predominantly practical and task-oriented, emphasizing debugging, decompilation, and system-level analysis. Topics concerning memory, firmware, and file format analysis are the most difficult, with unanswered rates ranging from 45.28% to 64.71% and median resolution times from 1 to 7 hours. REStack is a resource for empirical studies, educational research, and benchmarking AI/LLM-based developer assistance tools for RE.
Key takeaway
For AI engineers building developer assistance tools for reverse engineering, REStack offers a benchmark. Use its 23 categorized topics and difficulty indicators to evaluate LLMs on low-level reasoning, especially against the 2,513 unanswered questions. Educators can use the knowledge gaps the dataset identifies, such as memory and firmware analysis, to refine RE curricula and create targeted learning resources.
Key insights
REStack provides a large, structured dataset of reverse engineering discussions to understand challenges and support AI tool development.
Principles
- RE discussions are highly practical and task-oriented.
- Low-level system interactions pose significant RE challenges.
- Unanswered rates are a stronger difficulty proxy than resolution time.
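The two difficulty indicators named above can be computed per topic in a few lines. The following is a minimal sketch, not the authors' code: the field names `topic` and `hours_to_accepted` are hypothetical, the sample posts are toy data, and it assumes a question counts as unanswered when it has no accepted answer, which this summary does not confirm.

```python
from collections import defaultdict
from statistics import median


def difficulty_by_topic(posts):
    """Return unanswered rate (%) and median resolution time (hours) per topic.

    Each post is a dict with hypothetical keys:
      "topic"             -> topic label assigned to the question
      "hours_to_accepted" -> hours until an accepted answer, or None if there is none
    """
    grouped = defaultdict(list)
    for post in posts:
        grouped[post["topic"]].append(post["hours_to_accepted"])

    report = {}
    for topic, hours in grouped.items():
        resolved = [h for h in hours if h is not None]
        unanswered = len(hours) - len(resolved)
        report[topic] = {
            "questions": len(hours),
            "unanswered_rate": round(100 * unanswered / len(hours), 2),
            # The median is undefined when no question in the topic was resolved.
            "median_hours": median(resolved) if resolved else None,
        }
    return report


# Toy data for illustration only; these are not values from REStack.
sample_posts = [
    {"topic": "firmware", "hours_to_accepted": None},
    {"topic": "firmware", "hours_to_accepted": 6.0},
    {"topic": "firmware", "hours_to_accepted": None},
    {"topic": "debugging", "hours_to_accepted": 1.0},
    {"topic": "debugging", "hours_to_accepted": 3.0},
]

if __name__ == "__main__":
    for topic, stats in difficulty_by_topic(sample_posts).items():
        print(topic, stats)
```

Note that the median only covers resolved questions, so a topic with few accepted answers can show a short resolution time while still being hard. That is one plausible reason the unanswered rate is the more informative of the two indicators.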
Method
REStack was constructed by collecting 12,000+ posts from Stack Overflow and Reverse Engineering Stack Exchange, identifying RE-related tags using tag relevance threshold (TRT) and tag significance threshold (TST) heuristics, and preprocessing the text. Topics were then extracted with GA-optimized LDA and labeled manually.
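The tag-identification step can be sketched as follows. This follows one formulation common in prior Stack Overflow mining studies, where a candidate tag's significance is the share of its posts that fall inside the seed set and its relevance is its share of the seed set. The paper's exact definitions, seed tags, and threshold values are not given in this summary, so the function name `select_tags`, the toy posts, and the thresholds below are all assumptions for illustration.

```python
from collections import Counter


def select_tags(posts, seed_tags, trt, tst):
    """Expand a seed tag set using relevance (TRT) and significance (TST) thresholds.

    posts     -> list of tag lists, one list per question
    seed_tags -> initial tags known to be about the target domain
    trt, tst  -> minimum relevance and significance for a tag to be kept
    """
    seed_tags = set(seed_tags)
    # Questions that carry at least one seed tag form the seed set.
    seed_posts = [tags for tags in posts if seed_tags & set(tags)]

    total = Counter(tag for tags in posts for tag in set(tags))
    in_seed = Counter(tag for tags in seed_posts for tag in set(tags))

    selected = set(seed_tags)
    for tag, count in in_seed.items():
        significance = count / total[tag]     # how exclusive the tag is to the seed set
        relevance = count / len(seed_posts)   # how common the tag is within the seed set
        if significance >= tst and relevance >= trt:
            selected.add(tag)
    return selected


# Toy data for illustration only; not drawn from REStack.
toy_posts = [
    ["reverse-engineering", "ida"],
    ["reverse-engineering", "ida", "assembly"],
    ["reverse-engineering", "assembly"],
    ["assembly", "x86"],
    ["assembly", "nasm"],
    ["python", "pandas"],
]

if __name__ == "__main__":
    print(sorted(select_tags(toy_posts, {"reverse-engineering"}, trt=0.5, tst=0.6)))
    # prints ['ida', 'reverse-engineering']
```

In this toy run, `assembly` is dropped because only half of its posts fall inside the seed set, while `ida` is kept because it appears exclusively there. Raising either threshold trades recall of RE-related posts for precision.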
In practice
- Benchmark LLMs on low-level software reasoning tasks.
- Train retrieval-augmented generation (RAG) systems for RE.
- Identify RE knowledge gaps for curriculum design.
Topics
- Reverse Engineering
- Stack Exchange Data
- Topic Modeling
- LLM Benchmarking
- Cybersecurity
- Software Engineering Education
Best for: AI Scientist, Research Scientist, AI Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.