Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, extended

Summary

This work presents a question answering (QA) system that improves answer extraction from textual contexts by fine-tuning large language models (LLMs). The authors adapt multiple transformer-based architectures on the Stanford Question Answering Dataset (SQuAD1.1), which comprises over 100,000 context-question-answer triplets drawn from Wikipedia articles. QA is formulated as an extractive problem: the model predicts the start and end positions of the answer span and is optimized with a combined loss over the start and end probabilities. Evaluated with ROUGE-L, BLEU, and BERTScore, every fine-tuned model improves significantly over its baseline. RoBERTa-base achieved the highest scores, with a ROUGE-L of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%, supporting targeted fine-tuning as a route to precise, context-grounded answers.

Key takeaway

For machine learning engineers building context-based question answering systems, fine-tuning pre-trained LLMs is critical for high accuracy. Prioritize models like RoBERTa-base, which scored highest in this study with a ROUGE-L of 86.84% on SQuAD1.1. Every baseline model scored below its fine-tuned counterpart, so do not rely on pre-trained knowledge alone; invest in task-specific training to get reliable, precise answer extraction.

Key insights

Targeted fine-tuning of pre-trained LLMs significantly improves context-based question answering accuracy and relevance.

Principles

Fine-tuning LLMs is crucial for high-performance QA.
Model capacity impacts complex query handling.
Pre-trained knowledge alone is insufficient for quality QA.

Method

The proposed QA system fine-tunes transformer-based LLMs on context-question-answer triplets. It formulates QA as an extractive task, predicting answer span start/end positions by minimizing a combined loss function.
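
The span formulation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the summary does not say how the start and end terms are weighted, so the sketch assumes an equal-weight mean of the two cross-entropy losses, and the names `span_loss`, `best_span`, and `max_len` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def span_loss(start_logits, end_logits, start_idx, end_idx):
    """Mean of the start and end cross-entropy terms.

    Equal weighting is an assumption; the source only says "combined loss".
    """
    start_nll = -log_softmax(start_logits)[start_idx]
    end_nll = -log_softmax(end_logits)[end_idx]
    return 0.5 * (start_nll + end_nll)


def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair with the highest summed logit, start <= end."""
    best, best_score = (0, 0), -math.inf
    for i, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within max_len tokens.
        for j in range(i, min(i + max_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Toy logits over a six-token context; the gold answer spans tokens 2..3.
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.4, 2.5, 0.2, 0.1]

loss = span_loss(start_logits, end_logits, start_idx=2, end_idx=3)
span = best_span(start_logits, end_logits)
```

During training, the loss is minimized over the gold start and end indices; at inference, the decoded span indexes back into the context tokens to produce the answer text.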

In practice

Fine-tune RoBERTa-base for extractive QA tasks.
Use SQuAD1.1 for supervised QA training.
Evaluate QA systems with ROUGE-L, BLEU, and BERTScore.
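
To make the ROUGE-L figure concrete, here is a hedged sketch of the metric as a longest-common-subsequence F-measure between a predicted and a reference answer. The summary does not state which ROUGE implementation, tokenizer, or F-measure weighting the authors used, so lowercased whitespace tokens and a balanced F1 are assumptions, and `rouge_l_f1` is a hypothetical helper name.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Balanced ROUGE-L F-measure on lowercased whitespace tokens (assumed setup)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


exact = rouge_l_f1("the Eiffel Tower", "The Eiffel Tower")
partial = rouge_l_f1("in Paris France", "Paris")
miss = rouge_l_f1("Paris", "London")
```

A prediction that contains the gold answer but adds extra tokens keeps full recall and loses precision, which is why tight span boundaries matter for this metric.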

Topics

Question Answering
Large Language Models
Fine-tuning
RoBERTa-base
SQuAD1.1
Extractive QA

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.