Specializing a Small Language Model for Closed-Domain Portuguese RAG using Knowledge Graph Supervision
Summary
A study presented at PROPOR 2026 describes a method for fine-tuning a small language model (SLM) for closed-domain Portuguese question answering (QA) in a Retrieval-Augmented Generation (RAG) setting. The model was trained to select the most relevant context passage from ten candidates and use it to generate an answer. The fine-tuning data came from PetroKGraph, a knowledge graph derived from Portuguese oil and gas (O&G) resources. In the reported experiments, the specialized SLM improved accuracy by 20 percentage points over its base model on closed-domain questions and outperformed GPT-4o and GPT-4o Mini by 12 and 25 percentage points, respectively. It retained comparable performance on general-domain tasks, which indicates that it acquired domain-specific knowledge without losing general reasoning capabilities.
Key takeaway
AI engineers building specialized QA systems for resource-constrained or domain-specific environments should consider fine-tuning small language models with knowledge graph supervision. In this study, the approach yielded higher closed-domain accuracy than larger, general-purpose models such as GPT-4o, for Portuguese in the oil and gas sector. Your team should explore building domain-specific knowledge graphs to generate high-quality fine-tuning data, which may also reduce inference costs and improve answer relevance.
Key insights
Fine-tuning an SLM with knowledge graph supervision improved closed-domain RAG accuracy by 20 percentage points over the base model and surpassed the larger GPT-4o and GPT-4o Mini.
Principles
- Fine-tuned SLMs can outperform larger LLMs in specialized domains.
- Domain-specific fine-tuning can preserve general reasoning.
Method
A small language model is fine-tuned for closed-domain QA on data derived from a knowledge graph. Mimicking a RAG pipeline, each training example asks the model to select the most relevant context passage from ten candidates and answer from it.
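To make the training setup concrete, here is a minimal sketch of how one selection-style example could be assembled: one gold passage is mixed with nine distractors and the target names the chosen passage before the answer. The function name `build_selection_example`, the prompt wording, and the target format are assumptions for illustration; the summary does not describe the paper's actual data format.

```python
import random


def build_selection_example(question, answer, gold_passage, distractor_pool,
                            n_candidates=10, seed=0):
    """Build one RAG-style training record (hypothetical format).

    The gold passage is hidden among n_candidates - 1 distractors, and the
    target states which passage was selected, followed by the answer.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible

    # Drop duplicates and any copy of the gold passage from the pool.
    pool = [p for p in dict.fromkeys(distractor_pool) if p != gold_passage]
    if len(pool) < n_candidates - 1:
        raise ValueError("not enough distinct distractor passages")

    candidates = rng.sample(pool, n_candidates - 1) + [gold_passage]
    rng.shuffle(candidates)  # avoid a fixed gold position
    gold_index = candidates.index(gold_passage) + 1  # 1-based label

    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(candidates, start=1))
    prompt = (
        f"Question: {question}\n\n"
        f"Candidate passages:\n{numbered}\n\n"
        "Select the most relevant passage and answer the question."
    )
    target = f"Passage: [{gold_index}]\nAnswer: {answer}"
    return {
        "prompt": prompt,
        "target": target,
        "gold_index": gold_index,
        "candidates": candidates,
    }


# Usage with placeholder text (not real PetroKGraph content).
pool = [f"distractor passage {i}" for i in range(20)]
example = build_selection_example(
    "placeholder question?", "placeholder answer", "gold passage", pool, seed=1
)
print(example["prompt"])
print(example["target"])
```

Shuffling the candidates matters in a setup like this: if the gold passage always sat in the same slot, the model could learn the position instead of the relevance judgment.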
In practice
- Use a knowledge graph to generate domain-specific fine-tuning data.
- Train the SLM on RAG-style passage selection for closed-domain QA.
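One common way to turn a knowledge graph into fine-tuning data is to map each triple to a question and answer through a per-relation template. The sketch below illustrates that idea only; the summary does not say how questions were generated from PetroKGraph, and the `Triple` type, the relation names, and the templates are all hypothetical. In a Portuguese setting the templates would be written in Portuguese.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical relation names and question templates (illustration only).
QUESTION_TEMPLATES = {
    "located_in": "Where is {subject} located?",
    "operated_by": "Who operates {subject}?",
}


def triples_to_qa(triples, templates=QUESTION_TEMPLATES):
    """Convert (subject, relation, object) triples into QA pairs.

    Triples whose relation has no template are skipped rather than guessed.
    """
    qa_pairs = []
    for triple in triples:
        template = templates.get(triple.relation)
        if template is None:
            continue
        qa_pairs.append({
            "question": template.format(subject=triple.subject),
            "answer": triple.obj,
            "source_triple": tuple(triple),  # kept for traceability
        })
    return qa_pairs


# Usage with placeholder entities (not real PetroKGraph content).
triples = [
    Triple("Field A", "located_in", "Basin B"),
    Triple("Field A", "operated_by", "Company C"),
    Triple("Field A", "unmapped_relation", "Value D"),
]
for pair in triples_to_qa(triples):
    print(pair["question"], "->", pair["answer"])
```

Keeping the source triple with each pair makes it possible to trace a bad training example back to the graph edge that produced it.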
Topics
- Small Language Models
- Retrieval-Augmented Generation
- Knowledge Graph Supervision
- Closed-Domain QA
- Portuguese Language Processing
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.