Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis
Summary
A multi-field hybrid Retrieval-Augmented Generation (RAG) framework is proposed for automating maritime accident Root Cause Analysis (RCA). The system draws on 13,329 Korea Maritime Safety Tribunal (KMST) reports from 1971–2025 and converts them into structured "incident cards" with distinct Summary, Causes, and Disposition fields, alongside a hierarchical L1/L2 cause taxonomy. Retrieval is field-aware and hybrid: sparse (BM25) and dense (bge-m3) rankings are fused via Reciprocal Rank Fusion (RRF). In the reported experiments, the proposed retrieval outperforms the baselines, raising NormRecall@100 from 0.18 to 0.55. Grounding the Qwen3-Next-80B-A3B-Instruct generator on retrieved precedents also improves RCA generation quality, raising the LLM-as-a-judge score from 3.34 to 3.72. The framework is intended to streamline maritime safety investigations by enabling faster precedent search and more consistent, evidence-based RCA drafting.
Key takeaway
AI scientists developing RAG systems for specialized, multi-field documents such as legal or technical reports should consider a field-aware hybrid retrieval strategy. This approach, which separates document sections like "Summary" and "Causes" and fuses sparse and dense retrieval, is reported to improve factual consistency and reduce LLM hallucinations. Adopting it can enable more accurate, evidence-grounded automated analysis than monolithic document indexing.
Key insights
Structuring RAG architectures around domain-specific document fields significantly enhances downstream generation quality and factual consistency.
Principles
- Field-aware indexing prevents signal dilution.
- Hybrid retrieval combines lexical and semantic strengths.
- Metadata proxies enable large-scale evaluation.
Method
Convert raw documents into "incident cards" with Summary, Causes, Disposition fields and L1/L2 tags. Apply field-aware hybrid retrieval (BM25 + bge-m3 via RRF). Ground an LLM (Qwen3-Next-80B-A3B-Instruct) on retrieved chunks to generate structured RCA.
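The fusion step can be illustrated with a minimal Python sketch of Reciprocal Rank Fusion. It assumes each retriever (BM25 or dense, run per field) has already produced a ranked list of card IDs. The function name, the example IDs, and the smoothing constant k=60 (a commonly used default) are assumptions for illustration; the summary does not state the paper's actual settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-field rankings (card IDs are made up for illustration).
bm25_summary = ["c3", "c1", "c7"]
dense_summary = ["c1", "c4", "c3"]
bm25_causes = ["c1", "c7", "c2"]
dense_causes = ["c4", "c1", "c2"]

fused = reciprocal_rank_fusion(
    [bm25_summary, dense_summary, bm25_causes, dense_causes]
)
print(fused)
```

Because RRF uses only rank positions, BM25 scores and embedding similarities never need to be put on a common scale, which is one reason it is a convenient way to combine sparse and dense retrievers.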
In practice
- Partition multi-field documents for RAG indexing.
- Fuse sparse and dense retrieval for specialized terms.
- Develop metadata-based proxies for relevance scoring.
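As a rough sketch of the first and third points, the Python below stores each document as a card with separate fields and uses a shared taxonomy tag as a proxy for relevance. The `IncidentCard` class, the rule "same L2 tag means relevant", the example tag values, and the normalized-recall formula (hits divided by the best achievable number of hits at k) are all assumptions made for illustration; the paper's exact proxy and its definition of NormRecall may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentCard:
    """Hypothetical card layout: one attribute per document field."""
    card_id: str
    summary: str
    causes: str
    disposition: str
    l1: str  # coarse cause category
    l2: str  # fine-grained cause category


def is_proxy_relevant(query, candidate):
    # Assumed proxy: a different card sharing the query's L2 cause tag.
    return candidate.card_id != query.card_id and candidate.l2 == query.l2


def norm_recall_at_k(query, ranked, corpus, k):
    """Hits in the top k divided by the maximum possible hits at k."""
    relevant = {c.card_id for c in corpus if is_proxy_relevant(query, c)}
    if not relevant:
        return 0.0
    hits = sum(1 for c in ranked[:k] if c.card_id in relevant)
    return hits / min(k, len(relevant))


# Made-up cards; tag values are placeholders, not the paper's taxonomy.
q = IncidentCard("q", "collision in fog", "no lookout", "warning", "human", "lookout")
a = IncidentCard("a", "collision at night", "no lookout", "suspension", "human", "lookout")
b = IncidentCard("b", "grounding", "no lookout", "warning", "human", "lookout")
c = IncidentCard("c", "engine fire", "poor upkeep", "fine", "technical", "maintenance")

corpus = [q, a, b, c]
ranked = [c, a, b]  # hypothetical retriever output for query q
score = norm_recall_at_k(q, ranked, corpus, k=2)
print(score)
```

A proxy like this avoids manual relevance labels, so it can be computed for every card in a large corpus, at the cost of treating cards with the same tag as equally relevant.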
Topics
- Retrieval-Augmented Generation
- Maritime Accident Analysis
- Root Cause Analysis
- Hybrid Retrieval
- Knowledge Base Construction
- Large Language Models
Best for: AI Scientist, Research Scientist, Domain Expert
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.