REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Cybersecurity & Data Privacy, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

REStack is a new, large-scale dataset of over 12,000 reverse engineering (RE) discussion posts collected from Stack Overflow and the dedicated Reverse Engineering Stack Exchange site, spanning August 2008 to April 2025. It is the first publicly available dataset to systematically curate RE discussions, and it includes post metadata along with difficulty indicators such as unanswered rates and response times. Using Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA)-based hyperparameter optimization, followed by manual labeling, the researchers identified 23 semantically coherent RE topics grouped into six high-level themes. The analysis shows that RE discussions are predominantly practical and task-oriented, with an emphasis on debugging, decompilation, and system-level analysis. Topics concerning memory, firmware, and file format analysis are the hardest: their unanswered rates range from 45.28% to 64.71%, and their median resolution times range from 1 to 7 hours. REStack is intended as a resource for empirical studies, educational research, and benchmarking AI/LLM-based developer assistance tools for RE.

Key takeaway

For AI engineers building developer assistance tools for reverse engineering, REStack offers a ready-made benchmark. Use its 23 categorized topics and difficulty indicators to evaluate LLMs on low-level reasoning, particularly on the 2,513 unanswered questions. Educators can use the knowledge gaps the dataset exposes, such as memory and firmware analysis, to refine RE curricula and create targeted learning resources.

Key insights

REStack provides a large, structured dataset of reverse engineering discussions to understand challenges and support AI tool development.

Principles

RE discussions are highly practical and task-oriented.
Low-level system interactions pose significant RE challenges.
Unanswered rates are a stronger difficulty proxy than resolution time.
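
The two difficulty indicators named above can be computed per topic with a few lines of Python. The following is a minimal sketch, not the authors' pipeline. It assumes a question counts as resolved once it has an accepted-answer timestamp, and the field names (topic, created, accepted) and sample records are hypothetical, not the REStack schema.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical records; field names and values are illustrative only.
posts = [
    {"topic": "firmware", "created": "2024-01-01T10:00:00", "accepted": "2024-01-01T13:00:00"},
    {"topic": "firmware", "created": "2024-01-02T09:00:00", "accepted": None},
    {"topic": "firmware", "created": "2024-01-03T08:00:00", "accepted": None},
    {"topic": "debugging", "created": "2024-01-01T10:00:00", "accepted": "2024-01-01T11:00:00"},
    {"topic": "debugging", "created": "2024-01-04T12:00:00", "accepted": "2024-01-04T15:00:00"},
]


def difficulty_by_topic(posts):
    """Return the unanswered rate and median resolution time (hours) per topic."""
    grouped = defaultdict(list)
    for post in posts:
        grouped[post["topic"]].append(post)

    result = {}
    for topic, items in grouped.items():
        # Resolution time in hours, for resolved questions only (assumption:
        # "resolved" means an accepted-answer timestamp is present).
        hours = [
            (datetime.fromisoformat(p["accepted"])
             - datetime.fromisoformat(p["created"])).total_seconds() / 3600
            for p in items
            if p["accepted"] is not None
        ]
        unanswered = len(items) - len(hours)
        result[topic] = {
            "unanswered_rate": unanswered / len(items),
            # No resolved questions means there is no median to report.
            "median_hours": median(hours) if hours else None,
        }
    return result


stats = difficulty_by_topic(posts)
```

The sketch also shows why the two indicators can disagree: median resolution time is computed only over questions that were eventually resolved, so a topic where most questions never receive an answer can still show a short median.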

Method

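REStack was constructed by collecting 12,000+ posts from Stack Exchange, identifying RE-related tags with the TRT/TST (tag relevance threshold and tag significance threshold) heuristics, preprocessing the text, and applying Latent Dirichlet Allocation (LDA) tuned by a Genetic Algorithm (GA) for topic modeling, followed by manual labeling of the resulting topics.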
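
The tag-identification step can be sketched as follows. The summary does not define the heuristics, so this assumes the formulation common in earlier Stack Overflow mining studies: a tag's significance is the share of all posts carrying the tag that fall in the seed set, and its relevance is the share of seed-set posts carrying the tag. The function name select_tags, the seed tag, the toy posts, and the threshold values are all hypothetical, and the thresholds used by the REStack authors are not given here.

```python
from collections import Counter


def select_tags(all_posts, seed_tags, tst_min, trt_min):
    """Expand a seed tag set using significance and relevance thresholds.

    all_posts: list of tag sets, one per post.
    seed_tags: initial tags known to be on-topic.
    """
    # Posts that carry at least one seed tag form the candidate set.
    seed_posts = [tags for tags in all_posts if tags & seed_tags]

    in_all = Counter(tag for tags in all_posts for tag in tags)
    in_seed = Counter(tag for tags in seed_posts for tag in tags)

    selected = set()
    for tag, n_seed in in_seed.items():
        significance = n_seed / in_all[tag]     # how exclusive the tag is to the seed set
        relevance = n_seed / len(seed_posts)    # how common the tag is within the seed set
        if significance >= tst_min and relevance >= trt_min:
            selected.add(tag)
    return selected


# Toy data: each set holds the tags of one post.
all_posts = [
    {"reverse-engineering", "ida"},
    {"reverse-engineering", "ida", "assembly"},
    {"reverse-engineering", "python"},
    {"python", "pandas"},
    {"python", "django"},
    {"assembly", "x86"},
]

tags = select_tags(all_posts, {"reverse-engineering"}, tst_min=0.5, trt_min=0.5)
```

On the toy data, ida passes both thresholds (significance 1.0, relevance about 0.67), while python is rejected because most posts tagged python are unrelated to RE. Real studies typically use much smaller thresholds and review the surviving tags by hand.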

In practice

Benchmark LLMs on low-level software reasoning tasks.
Train retrieval-augmented generation (RAG) systems for RE.
Identify RE knowledge gaps for curriculum design.

Topics

Reverse Engineering
Stack Exchange Data
Topic Modeling
LLM Benchmarking
Cybersecurity
Software Engineering Education

Best for: AI Scientist, Research Scientist, AI Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.