What the Question Parser Extracts from a User String: Keywords, Scope, Shape, Decomposition, Clarification

2026-06-17 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

The article details the extraction half of the question-parsing brick for an enterprise RAG system and outlines the five field families the parser extracts from a user's query. It explains how the parser populates "question_df" with keywords, the expected answer shape and type, scope hints, a decomposition for compound questions, and clarification needs. Keywords come from four sources: explicit user input, LLM rewrites, an expert concept dictionary, and anchor regex. Together these address the vocabulary mismatch between how users phrase a question and how documents word the answer. The parser tags each question with an expected "answer_shape" (e.g., "single", "table") and "answer_type" (e.g., "amount", "date"), with regex patterns used to confirm the type. Scope hints such as page numbers or section names are extracted to filter retrieval. Compound questions are decomposed into sub-questions labeled by how they relate (e.g., "independent", "sequential"), and ambiguous queries trigger a "suggested_clarification" prompt to the user. The result is a typed, relational brief for the downstream RAG bricks.

Key takeaway

For MLOps engineers building enterprise RAG systems, a question-parsing layer supports both accuracy and user trust. Your system should extract keywords, define the expected answer shape and type, and identify scope hints to narrow retrieval. Decomposing compound questions and asking for clarification on vague inputs reduces the number of subtly wrong answers, which improves system reliability and user satisfaction.

Key insights

Question parsing transforms user queries into a structured, relational brief for precise RAG retrieval and generation.

Principles

Explicit keyword extraction outperforms HyDE (Hypothetical Document Embeddings) in bounded domains.
Tag questions by answer shape and type for robust retrieval.
Decompose compound questions into actionable sub-queries.

Method

The parser uses LLM calls for spelling correction, query rewriting, concept disambiguation, and hint extraction, alongside expert-curated satellite tables for keywords and answer types.
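As a rough illustration of the brief described above, the sketch below groups the five field families into one typed record. It is not the article's schema: the class names, "raw_text", "scope_hints", "relation", and the "to_row" helper are assumptions made for this example, and standard-library dataclasses stand in for whatever structured-output model the parser actually uses. Only "answer_shape", "answer_type", "suggested_clarification", and the example values come from the summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SubQuestion:
    """One piece of a decomposed compound question (hypothetical structure)."""
    text: str
    relation: str  # e.g. "independent" or "sequential", as named in the summary


@dataclass
class ParsedQuestion:
    """Illustrative container for the five field families; names are assumed."""
    raw_text: str
    keywords: List[str] = field(default_factory=list)
    answer_shape: Optional[str] = None   # e.g. "single", "table"
    answer_type: Optional[str] = None    # e.g. "amount", "date"
    scope_hints: Dict[str, object] = field(default_factory=dict)
    sub_questions: List[SubQuestion] = field(default_factory=list)
    suggested_clarification: Optional[str] = None

    def to_row(self) -> dict:
        """Flatten into one dict that could become a row of a question table."""
        return {
            "raw_text": self.raw_text,
            "keywords": list(self.keywords),
            "answer_shape": self.answer_shape,
            "answer_type": self.answer_type,
            "scope_hints": dict(self.scope_hints),
            "n_sub_questions": len(self.sub_questions),
            "needs_clarification": self.suggested_clarification is not None,
            "suggested_clarification": self.suggested_clarification,
        }


example = ParsedQuestion(
    raw_text="What is the premium on page 12, and when does the policy end?",
    keywords=["premium", "policy end"],
    answer_shape="single",
    answer_type="amount",
    scope_hints={"page": 12},
    sub_questions=[
        SubQuestion("What is the premium on page 12?", "independent"),
        SubQuestion("When does the policy end?", "independent"),
    ],
)
row = example.to_row()
```

In a relational layout the sub-questions and keywords would more likely live in their own tables keyed by a question identifier; the flat row here only shows which fields travel together.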

In practice

Maintain an expert dictionary for domain-specific synonyms.
Implement regex patterns for specific answer types like amounts or dates.
Use Pydantic for structured output from LLM parsing.
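The first two practices can be sketched with the standard library alone. Everything below is a hypothetical stand-in: the dictionary entries, the regex patterns, and the function names are invented for illustration and are not taken from the article. The sketch shows a synonym lookup that widens the keyword set, a pattern check that a candidate string looks like the expected answer type, and a page-number scope hint.

```python
import re
from typing import List, Optional

# Hypothetical expert dictionary: user term -> domain synonyms.
EXPERT_SYNONYMS = {
    "price": ["premium", "amount due"],
    "end date": ["expiry date", "termination date"],
}

# Hypothetical patterns per answer type; real patterns would be domain-tuned.
ANSWER_TYPE_PATTERNS = {
    "amount": re.compile(
        r"[$€£]\s?\d[\d,]*(?:\.\d+)?|\b\d[\d,]*(?:\.\d+)?\s?(?:USD|EUR|GBP)\b"
    ),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

# Hypothetical scope-hint pattern for an explicit page reference.
PAGE_HINT = re.compile(r"\bpage\s+(\d+)\b", re.IGNORECASE)


def expand_keywords(question: str) -> List[str]:
    """Return each dictionary term found in the question plus its synonyms."""
    lowered = question.lower()
    found: List[str] = []
    for term, synonyms in EXPERT_SYNONYMS.items():
        if term in lowered:
            found.append(term)
            found.extend(synonyms)
    return found


def matches_answer_type(candidate: str, answer_type: str) -> bool:
    """Check whether a candidate string contains the expected answer type."""
    pattern = ANSWER_TYPE_PATTERNS.get(answer_type)
    return bool(pattern and pattern.search(candidate))


def extract_page_hint(question: str) -> Optional[int]:
    """Pull an explicit page number out of the question, if one is present."""
    match = PAGE_HINT.search(question)
    return int(match.group(1)) if match else None


question = "What is the price on page 12?"
keywords = expand_keywords(question)
page = extract_page_hint(question)
```

Plain substring matching keeps the lookup easy to audit by domain experts, at the cost of missing inflected or misspelled terms, which is presumably where the LLM rewrite and spelling-correction steps come in.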

Topics

Question Parsing
Retrieval-Augmented Generation
Enterprise AI
LLM Query Rewriting
Semantic Parsing
Knowledge Extraction
Document Intelligence

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.