Neurosymbolic Repo-level Code Localization

2025-10-04 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

LogicLoc is a neurosymbolic agentic framework for repository-level code localization. It targets a "Keyword Shortcut" bias in existing benchmarks, a bias that lets models rely on superficial lexical matching rather than genuine structural reasoning. The work formalizes the task of Keyword-Agnostic Logical Code Localization (KA-LCL) and introduces KA-LogicQuery, a diagnostic benchmark of 225 logic-intensive queries across 9 projects that require structural reasoning and contain no naming hints. The framework pairs large language models (LLMs) with Datalog's logical reasoning: it extracts program facts from the codebase, and an LLM synthesizes Datalog programs over those facts. Each program passes parser-gated validation and receives mutation-based intermediate-rule diagnostic feedback before a high-performance inference engine executes it. On KA-LogicQuery, LogicLoc outperforms state-of-the-art methods, reaching a Perfect Location Rate of 48.44% at the file level and 38.27% at the function level. It remains competitive on issue-driven benchmarks such as SWE-bench Lite while consuming fewer tokens and executing faster.

Key takeaway

For Machine Learning Engineers and Research Scientists building autonomous software engineering agents, LogicLoc shows that pairing LLMs with a formal logic system such as Datalog makes code localization more robust. Prioritize approaches that reason over code structure rather than lexical matches, especially for keyword-agnostic tasks. Consider neurosymbolic architectures when you need verifiable, high-precision results and a lower risk of hallucinated code locations in production environments.

Key insights

Neurosymbolic AI combining LLMs and Datalog overcomes keyword shortcuts for precise code localization.

Principles

Keyword shortcuts inflate code localization performance.
Deterministic reasoning is crucial for structural code understanding.
Hybrid neurosymbolic approaches enhance AI reliability.

Method

LogicLoc extracts program facts, synthesizes Datalog queries via LLMs, and refines them using parser-gated validation and mutation-based intermediate-rule diagnostics for precise, verifiable code localization.
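
The first two steps (fact extraction, then a logical query over the facts) can be illustrated with a minimal Python sketch. This is not LogicLoc's implementation: the `calls` relation, the helper names, and the naive fixpoint loop are all assumptions made for illustration, and a real system would hand the rules to a Datalog engine rather than evaluate them by hand. The query asks a keyword-agnostic question: which functions transitively reach a call to `open`?

```python
import ast

# Toy "repository": one module with a short call chain (hypothetical example).
SOURCE = '''
def load(path):
    return open(path).read()

def parse(path):
    return load(path).split()

def main():
    return parse("x.txt")

def unused():
    return 1
'''

def extract_call_facts(source):
    """Return calls(caller, callee) facts for top-level functions."""
    facts = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            for sub in ast.walk(node):
                # Only direct calls by name; method calls are ignored here.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    facts.add((node.name, sub.func.id))
    return facts

def transitive_closure(calls):
    """Naive bottom-up evaluation of the Datalog rules:
        reaches(X, Y) :- calls(X, Y).
        reaches(X, Z) :- reaches(X, Y), calls(Y, Z).
    """
    reaches = set(calls)
    while True:
        derived = {(x, z) for (x, y) in reaches for (y2, z) in calls if y == y2}
        if derived <= reaches:  # fixpoint: no new tuples were derived
            return reaches
        reaches |= derived

facts = extract_call_facts(SOURCE)
reaches = transitive_closure(facts)
# Structural query: no function name from the answer appears in the query.
answer = sorted({x for (x, y) in reaches if y == "open"})
print(answer)  # ['load', 'main', 'parse']
```

Because the result is derived deterministically from the extracted facts, every returned location can be traced back to the tuples that justify it, which is the property the summary describes as verifiable.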

In practice

Use KA-LogicQuery to test true structural reasoning.
Implement Datalog for complex code pattern matching.
Apply parser-gated validation for LLM-generated code.
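
As a rough sketch of the last point, a parser gate can reject malformed LLM output before it reaches the inference engine and return the errors as feedback for the next synthesis attempt. The grammar below covers only a toy Datalog subset (atoms with plain arguments, no negation or constraints), and `validate` is a hypothetical helper, not the parser LogicLoc uses.

```python
import re

# Toy grammar: name(arg, arg, ...) with word-like arguments (an assumption).
ATOM = r"[a-z]\w*\(\s*\w+(?:\s*,\s*\w+)*\s*\)"
# A clause is a fact "atom." or a rule "atom :- atom, atom, ... ."
RULE = re.compile(rf"^{ATOM}(?:\s*:-\s*{ATOM}(?:\s*,\s*{ATOM})*)?\s*\.$")

def validate(program):
    """Return (line_number, message) pairs; an empty list means the gate passes."""
    errors = []
    for n, line in enumerate(program.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        if not RULE.match(line):
            errors.append((n, f"cannot parse rule: {line!r}"))
    return errors

good = (
    "reaches(X, Y) :- calls(X, Y).\n"
    "reaches(X, Z) :- reaches(X, Y), calls(Y, Z)."
)
bad = "reaches(X, Y) :- calls(X, Y"  # missing ')' and terminating '.'

print(validate(good))  # []
print(len(validate(bad)))  # 1
```

In an agent loop, a non-empty error list would be sent back to the LLM instead of executing the program, so only syntactically valid Datalog ever reaches the engine.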

Topics

Neurosymbolic AI
Code Localization
Datalog Programming
Large Language Models
Keyword Shortcut Bias

Best for: Machine Learning Engineer, Research Scientist, AI Scientist, AI Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.