Neurosymbolic Repo-level Code Localization
Summary
LogicLoc is a neurosymbolic agentic framework for repository-level code localization. It targets a "Keyword Shortcut" bias in existing benchmarks, which lets models succeed through superficial lexical matching instead of genuine structural reasoning. The work formalizes Keyword-Agnostic Logical Code Localization (KA-LCL) and introduces KA-LogicQuery, a diagnostic benchmark of 225 logic-intensive queries across 9 projects that require structural reasoning without naming hints. The framework pairs large language models (LLMs) with Datalog's logical reasoning: it extracts program facts from the codebase, then has an LLM synthesize Datalog programs over those facts. Each program passes through parser-gated validation and mutation-based intermediate-rule diagnostic feedback before a high-performance inference engine executes it. On KA-LogicQuery, LogicLoc significantly outperforms state-of-the-art methods, reaching a Perfect Location Rate of 48.44% at the file level and 38.27% at the function level. It also remains competitive on issue-driven benchmarks such as SWE-bench Lite, with lower token consumption and faster execution.
Key takeaway
For Machine Learning Engineers and Research Scientists building autonomous software engineering agents, LogicLoc shows that pairing LLMs with a formal logic system such as Datalog makes code localization more robust. Prioritize approaches that rely on structural reasoning over lexical matching, especially for keyword-agnostic tasks. Consider neurosymbolic architectures when you need verifiable, high-precision results and a lower risk of hallucinated code locations in production environments.
Key insights
Neurosymbolic AI combining LLMs and Datalog overcomes keyword shortcuts for precise code localization.
Principles
- Keyword shortcuts inflate code localization performance.
- Deterministic reasoning is crucial for structural code understanding.
- Hybrid neurosymbolic approaches enhance AI reliability.
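The first principle can be illustrated with a small sketch. The scorer below is a hypothetical token-overlap baseline, not any system evaluated in the paper, and the function names and queries are invented for illustration. It shows why a query that leaks an identifier is easy for lexical matching, while a keyword-agnostic query gives the same scorer nothing to work with.

```python
import re

def tokens(text):
    """Lowercase alphabetic tokens; underscores act as separators."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_scores(query, function_names):
    """Count tokens shared between the query and each function name."""
    query_tokens = tokens(query)
    return {name: len(query_tokens & tokens(name)) for name in function_names}

# Hypothetical repository symbols (assumed for illustration only).
names = ["parse_config", "load_cache", "walk"]

# A query that leaks naming hints: lexical overlap singles out one function.
hinted = overlap_scores("bug in parse config handling", names)

# A keyword-agnostic query: no token overlaps with any name, so the
# lexical scorer cannot rank the candidates at all.
agnostic = overlap_scores("the function that invokes itself directly", names)

print(hinted)
print(agnostic)
```

A benchmark dominated by the first kind of query rewards the shortcut; the second kind forces the system to reason about structure such as call relationships.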
Method
LogicLoc extracts program facts, synthesizes Datalog queries via LLMs, and refines them using parser-gated validation and mutation-based intermediate-rule diagnostics for precise, verifiable code localization.
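As a rough sketch of the fact-extraction and query steps, the example below uses Python's standard `ast` module to pull call facts out of a toy source string, then evaluates two Datalog-style rules with a naive fixpoint loop. The fact schema, the rules, and the sample code are assumptions made for illustration; the paper's actual extractor, schema, and inference engine are not reproduced here.

```python
import ast

# Toy "repository" (assumed for illustration).
SOURCE = '''
def load(path):
    return parse(read(path))

def read(path):
    return open(path).read()

def parse(text):
    return walk(text)

def walk(node):
    return walk(node) if node else None
'''

def extract_call_facts(source):
    """Return calls(caller, callee) facts for top-level functions.

    Only direct calls to plain names are recorded; method calls such as
    obj.read() are skipped to keep the sketch small.
    """
    facts = set()
    for func in ast.parse(source).body:
        if isinstance(func, ast.FunctionDef):
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    facts.add((func.name, node.func.id))
    return facts

def transitive_closure(calls):
    """Naive fixpoint evaluation of the Datalog-style rules:

        reaches(X, Y) :- calls(X, Y).
        reaches(X, Z) :- reaches(X, Y), calls(Y, Z).
    """
    reaches = set(calls)
    while True:
        derived = {(x, z) for (x, y) in reaches for (y2, z) in calls if y == y2}
        if derived <= reaches:
            return reaches
        reaches |= derived

calls = extract_call_facts(SOURCE)
reaches = transitive_closure(calls)

# Keyword-agnostic query: "which functions can reach themselves?"
recursive = sorted(f for (f, g) in reaches if f == g)
print(recursive)
```

The query never mentions a function name, yet the answer is fully determined by the extracted facts, which is the property that makes the result verifiable.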
In practice
- Use KA-LogicQuery to test true structural reasoning.
- Use Datalog queries over extracted program facts to match complex structural code patterns.
- Apply parser-gated validation to LLM-generated programs before executing them.
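One way to picture parser-gated validation is a loop that refuses to execute a generated program until it parses, and hands the diagnostics back to the generator. The sketch below is a minimal illustration under stated assumptions: the regular-expression grammar covers only a tiny Datalog-like subset, `propose` is a hypothetical stand-in for an LLM call, and none of this reflects the paper's actual parser or its mutation-based diagnostics.

```python
import re

# Toy grammar (an assumption): terms are identifiers or quoted strings,
# atoms are lowercase predicates applied to terms, clauses end with a period.
TERM = r'(?:[A-Za-z_]\w*|"[^"]*")'
ATOM = rf'[a-z]\w*\(\s*{TERM}(?:\s*,\s*{TERM})*\s*\)'
CLAUSE = re.compile(rf'^{ATOM}(?:\s*:-\s*{ATOM}(?:\s*,\s*{ATOM})*)?\s*\.$')

def check_program(text):
    """Return (line_number, message) diagnostics; an empty list means it parsed."""
    errors = []
    for number, line in enumerate(text.splitlines(), start=1):
        clause = line.strip()
        if not clause or clause.startswith("//"):
            continue  # skip blank lines and comments
        if not CLAUSE.match(clause):
            errors.append((number, f"cannot parse clause: {clause}"))
    return errors

def synthesize_with_gate(propose, max_attempts=3):
    """Call propose(feedback) until its output parses or attempts run out."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        program = propose(feedback)
        feedback = check_program(program)
        if not feedback:
            return program, attempt
    return None, max_attempts

# Canned candidates stand in for successive LLM outputs (hypothetical).
canned = iter([
    "reaches(X, Y) :- calls(X, Y)",   # rejected: missing final period
    "reaches(X, Y) :- calls(X, Y).",  # accepted
])
program, attempts = synthesize_with_gate(lambda feedback: next(canned))
print(program, attempts)
```

In a real pipeline the `feedback` list would be folded into the next prompt; here the stub ignores it so the example stays deterministic.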
Topics
- Neurosymbolic AI
- Code Localization
- Datalog Programming
- Large Language Models
- Keyword Shortcut Bias
Best for: Machine Learning Engineer, Research Scientist, AI Scientist, AI Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.