Improving Answer Extraction in Context-based Question Answering Systems Using LLMs
Summary
This work presents a question answering (QA) system that improves answer extraction from textual contexts by fine-tuning large language models (LLMs). Several transformer-based architectures are fine-tuned on the Stanford Question Answering Dataset (SQuAD1.1), which comprises over 100,000 context-question-answer triplets drawn from Wikipedia articles. The system treats QA as an extractive problem: each model predicts the start and end positions of the answer span within the context and is optimized with a combined loss over the start and end probabilities. Evaluated with ROUGE-L, BLEU, and BERTScore, all fine-tuned models show significant improvements over their baselines. RoBERTa-base achieved the highest scores, with a ROUGE-L of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%, supporting the effectiveness of targeted fine-tuning for precise, context-grounded responses.
Key takeaway
For machine learning engineers building context-based question answering systems, fine-tuning pre-trained LLMs is critical for achieving high accuracy. Prioritize models like RoBERTa-base, which achieved the best results in this study with a ROUGE-L of 86.84% on SQuAD1.1. Relying on pre-trained knowledge alone yields insufficient results; task-specific training is what makes answer extraction reliable and precise.
Key insights
Targeted fine-tuning of pre-trained LLMs significantly improves context-based question answering accuracy and relevance.
Principles
- Fine-tuning LLMs is crucial for high-performance QA.
- Model capacity affects how well complex queries are handled.
- Pre-trained knowledge alone is insufficient for quality QA.
Method
The proposed QA system fine-tunes transformer-based LLMs on context-question-answer triplets. It formulates QA as an extractive task: the model predicts the start and end positions of the answer span in the context and is trained by minimizing a combined loss over the two predictions.
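To make the span formulation concrete, here is a minimal pure-Python sketch of a combined start/end loss and of decoding an answer span from per-token scores. The summary does not say how the two terms are combined, so averaging the two cross-entropy terms is an assumption (it is a common choice for extractive QA). The function names `span_loss` and `best_span` and the `max_len` cap are hypothetical and do not come from the paper.

```python
import math
from typing import List, Tuple


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over per-token logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def span_loss(start_logits: List[float], end_logits: List[float],
              start_idx: int, end_idx: int) -> float:
    """Cross-entropy of the gold start and end positions.

    Assumption: the two terms are averaged; the paper may sum or
    weight them differently.
    """
    start_nll = -log_softmax(start_logits)[start_idx]
    end_nll = -log_softmax(end_logits)[end_idx]
    return 0.5 * (start_nll + end_nll)


def best_span(start_logits: List[float], end_logits: List[float],
              max_len: int = 30) -> Tuple[int, int]:
    """Pick (start, end) with the highest summed score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best, best_score = (0, 0), -math.inf
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


if __name__ == "__main__":
    start = [0.1, 2.0, 0.3, 0.0]
    end = [0.0, 0.2, 1.5, 3.0]
    print(span_loss(start, end, start_idx=1, end_idx=3))
    print(best_span(start, end))
```

In a real fine-tuning run the logits would come from the model's token-level output heads and the loss would be computed on tensors by the training framework; the list-based version here only illustrates the arithmetic.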
In practice
- Fine-tune RoBERTa-base for extractive QA tasks.
- Use SQuAD1.1 for supervised QA training.
- Evaluate QA systems with ROUGE-L, BLEU, and BERTScore.
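As an illustration of the first metric listed above, the following is a simplified sketch of ROUGE-L as an F1 score over the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and equal weighting of precision and recall; the paper's exact evaluation setup is not given in this summary, and established metric libraries apply their own tokenization and options. BLEU and BERTScore are not covered here. The helper names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("Denver Broncos", "Denver Broncos"))
    print(rouge_l_f1("in the city of Paris", "Paris"))
```

An exact match scores 1.0, while a prediction that contains the reference answer plus extra tokens is penalized through lower precision, which is why tight answer spans matter for this metric.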
Topics
- Question Answering
- Large Language Models
- Fine-tuning
- RoBERTa-base
- SQuAD1.1
- Extractive QA
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.