CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence

2025-09-25 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

CTIArena is introduced as the first benchmark for evaluating large language models (LLMs) on heterogeneous, multi-source cyber threat intelligence (CTI) in knowledge-augmented settings. It addresses limitations of prior efforts by covering nine tasks across structured, unstructured, and hybrid CTI categories, with 691 high-quality QA pairs in total. An evaluation of ten widely used LLMs, including proprietary models such as GPT-5 and open-source models such as LLaMA-3-405B, shows that most perform poorly in closed-book settings but gain noticeably when augmented with security-specific knowledge through techniques such as cybersecurity knowledge graph (CSKG)-guided RAG and query-expanded RAG. These findings indicate that scaling model size alone is insufficient for CTI and that domain-tailored knowledge augmentation is crucial.

Key takeaway

AI scientists and machine learning engineers developing CTI solutions should prioritize domain-specific knowledge augmentation over relying solely on larger, general-purpose LLMs. Implement tailored retrieval-augmented generation (RAG) strategies, such as CSKG-guided RAG for unstructured data or query-expanded RAG for hybrid tasks, to improve performance and reduce hallucinations. This approach is critical for building robust CTI copilots that can reason across diverse and fragmented intelligence sources.

Key insights

LLMs require domain-specific knowledge augmentation and tailored retrieval strategies for effective cyber threat intelligence analysis.

Principles

On structured CTI tasks, LLMs achieve near-perfect accuracy when given external knowledge.
Hybrid CTI tasks demand precise knowledge retrieval and grounding.
Unstructured CTI performance hinges on cross-report synthesis.

Method

CTIArena uses a three-stage pipeline: seed correlation annotation, factually-grounded QA synthesis via templates, and LLM-human collaborative curation for quality control.
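
The template step of such a pipeline can be illustrated with a minimal sketch. It is not the CTIArena implementation: the `SeedCorrelation` record, the relation names, the question templates, and the placeholder identifiers are all hypothetical, chosen only to show how an annotated correlation can yield a QA pair whose answer is grounded in the annotation itself and then flagged for human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedCorrelation:
    """One annotated link between two CTI entities (hypothetical schema)."""
    source_id: str
    relation: str
    target_id: str


# Hypothetical question templates, keyed by relation type.
TEMPLATES = {
    "maps_to_weakness": "Which CWE weakness is associated with {source_id}?",
    "mitigated_by": "Which mitigation addresses {source_id}?",
}


def synthesize_qa(seed):
    """Turn one seed correlation into a QA candidate, or None if no template fits."""
    template = TEMPLATES.get(seed.relation)
    if template is None:
        return None
    return {
        "question": template.format(source_id=seed.source_id),
        # The answer is copied from the annotation, never generated.
        "answer": seed.target_id,
        # Every candidate still goes through the curation stage.
        "needs_review": True,
    }


def build_candidates(seeds):
    """Apply the templates to all seeds and drop those without a template."""
    return [qa for qa in map(synthesize_qa, seeds) if qa is not None]


# Placeholder identifiers, not real CVE or CWE entries.
seeds = [
    SeedCorrelation("CVE-0000-0001", "maps_to_weakness", "CWE-000"),
    SeedCorrelation("CVE-0000-0001", "unlisted_relation", "X-000"),
]
candidates = build_candidates(seeds)
```

Because each answer is taken directly from the seed annotation, the curation stage only has to judge question quality and ambiguity rather than re-verify the facts.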

In practice

Implement CSKG-guided RAG for unstructured CTI synthesis.
Apply query-expanded RAG for hybrid CTI tasks to align narratives.
Inject authoritative CTI entries for structured reasoning tasks.
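
As a rough illustration of the last two recommendations, the sketch below expands a query before retrieval and then injects the retrieved entries into the prompt. Everything in it is an assumption made for the example: the toy token-overlap retriever, the hand-written `expansions` table (a real system would more likely derive expansions from an LLM or a knowledge graph), the prompt wording, and the placeholder corpus entries. It does not reproduce the paper's retrieval setup.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def expand_query(query, expansions):
    """Append related terms for every key phrase that occurs in the query."""
    lowered = query.lower()
    extra = [term
             for key, terms in expansions.items() if key in lowered
             for term in terms]
    return query + " " + " ".join(extra) if extra else query


def retrieve(query, corpus, k=2):
    """Rank entries by token overlap with the query; drop zero-overlap entries."""
    q = tokenize(query)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in corpus.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0][:k]


def build_prompt(question, corpus, doc_ids):
    """Inject the retrieved entries ahead of the question."""
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in doc_ids)
    return f"Answer using only the entries below.\n{context}\n\nQuestion: {question}"


# Placeholder knowledge-base entries, written for this example only.
corpus = {
    "entry-a": "Adversaries send spearphishing attachments to gain initial access.",
    "entry-b": "Adversaries abuse command and scripting interpreters to execute payloads.",
    "entry-c": "Weak password policies allow brute force of credentials.",
}
# Hypothetical mapping from report wording to knowledge-base vocabulary.
expansions = {"malicious email": ["spearphishing", "attachments"]}

query = "The report describes a malicious email with a lure document."
plain_hits = retrieve(query, corpus)
expanded_hits = retrieve(expand_query(query, expansions), corpus)
prompt = build_prompt(query, corpus, expanded_hits)
```

In this toy setup the report's wording shares no tokens with any entry, so the plain query retrieves nothing, while the expanded query reaches `entry-a`. That vocabulary gap between narrative reports and structured knowledge bases is the kind of mismatch query expansion is meant to close.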

Topics

Cyber Threat Intelligence
Large Language Models
Retrieval-Augmented Generation
CTI Benchmarking
MITRE ATT&CK
Cybersecurity Knowledge Graphs

Code references

peng-gao-lab/CTIArena

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.