ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

2025-05-19 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Public Safety & Security, Cybersecurity & Data Privacy · Depth: Advanced, extended

Summary

ARMOR 2025 is a new military-aligned safety benchmark designed to evaluate Large Language Models (LLMs) for defense applications, addressing a critical gap in existing benchmarks that primarily focus on civilian contexts. This benchmark is grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. It features a structured 12-category taxonomy, informed by the Observe-Orient-Decide-Act (OODA) decision-making framework, and comprises 519 doctrinally grounded multiple-choice questions. The benchmark was used to evaluate 21 commercial and open-source LLMs, revealing significant gaps in their safety alignment for military use, particularly in ethical and accountability-oriented categories. The findings were presented at the International Conference on Military Communication and Information Systems (ICMCIS) in May 2026.

Key takeaway

Research scientists developing LLMs for defense should integrate doctrine-grounded evaluation early in the development cycle. Relying solely on civilian safety metrics risks producing models that either hallucinate rules or refuse lawful requests, which makes them unreliable for mission-critical decision support. Implement layered assurance: use specialized compliance checkers to flag doctrinal inconsistencies, and keep human oversight in place.

Key insights

Existing LLM safety benchmarks are insufficient for military applications, necessitating doctrine-aligned evaluation like ARMOR 2025.

Principles

Military LLM safety requires doctrinal compliance.
Refusal in military contexts is a critical failure mode.
The OODA loop structures military decision-making.

Method

ARMOR 2025 uses a "Model-in-the-Loop" pipeline to generate 519 multiple-choice questions from military doctrine, validated by human review, and structured by the OODA framework for systematic LLM evaluation.
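The scoring side of a multiple-choice benchmark like this can be sketched in a few lines of Python. This is a minimal illustration, not the authors' harness: the `Item` fields, the `REFUSED` sentinel, and the category labels are all hypothetical, and the benchmark's actual 12 category names are not given here. Following the principle that refusal is a failure mode, a refused or missing response is scored as incorrect and also tracked separately.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One multiple-choice question. All field names are hypothetical."""
    qid: str
    category: str    # placeholder label, not an actual benchmark category
    ooda_phase: str  # "observe", "orient", "decide", or "act"
    answer: str      # gold option letter, e.g. "B"


REFUSED = "REFUSED"  # assumed sentinel for a response that picks no option


def score(items, responses, group_by="category"):
    """Return accuracy and refusal rate per group.

    `responses` maps question id -> chosen option letter or REFUSED.
    A missing response is treated the same as a refusal.
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "refused": 0})
    for item in items:
        bucket = totals[getattr(item, group_by)]
        bucket["n"] += 1
        response = responses.get(item.qid, REFUSED)
        if response == REFUSED:
            bucket["refused"] += 1
        elif response == item.answer:
            bucket["correct"] += 1
    return {
        group: {
            "accuracy": b["correct"] / b["n"],
            "refusal_rate": b["refused"] / b["n"],
        }
        for group, b in totals.items()
    }
```

Calling `score(items, responses, group_by="ooda_phase")` gives the same breakdown by decision phase instead of by category, which is one way to see where in the loop a model is weakest.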

In practice

Evaluate LLMs against ARMOR 2025 for defense procurement.
Train lightweight verification models for doctrinal compliance.
Prioritize layered assurance in defense LLM systems.
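One way to picture layered assurance is a chain of independent checkers whose flags route an output to human review. The sketch below is an assumption-laden toy: the two keyword checkers stand in for the trained lightweight verification models mentioned above, and every name in it (`Verdict`, `assess`, the `[source:` citation marker) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A checker returns a human-readable flag, or None if it finds nothing.
Checker = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    output: str
    flags: List[str] = field(default_factory=list)

    @property
    def needs_human_review(self) -> bool:
        # Any flag from any layer escalates the output to a person.
        return bool(self.flags)


def uncited_rule_checker(output: str) -> Optional[str]:
    """Toy heuristic: flag a rule claim that carries no citation marker."""
    text = output.lower()
    if "rule" in text and "[source:" not in text:
        return "rule claim without a cited source"
    return None


def refusal_checker(output: str) -> Optional[str]:
    """Toy heuristic: flag refusals so a reviewer can confirm they were warranted."""
    text = output.lower()
    if text.startswith("i cannot") or text.startswith("i can't"):
        return "refusal; confirm the request was actually unlawful"
    return None


def assess(output: str, checkers: List[Checker]) -> Verdict:
    """Run every checker and collect the flags that fire."""
    flags = [flag for flag in (check(output) for check in checkers) if flag is not None]
    return Verdict(output, flags)
```

Keeping each checker a plain function makes it easy to swap a keyword heuristic for a trained classifier later without touching the escalation logic.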

Topics

ARMOR 2025
LLM Military Safety
Law of War
Rules of Engagement
Joint Ethics Regulation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.