Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware
Summary
A study benchmarked four 7B–8B parameter open-source LLMs—Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B—for natural-language-to-T-SQL querying in biopharmaceutical manufacturing. Deployed locally on consumer-grade hardware via Ollama, the models were tested against a synthetic MS SQL Server database (7 tables, ~63,000 rows) using 60 domain-specific questions. Llama 3.1 8B achieved 93.33% SQL Compliance, slightly outperforming Qwen 2.5 Coder 7B (88.33%), though this difference was not statistically significant (p=0.529). Meditron 7B, a biomedical-tuned model, failed completely (0% compliance) due to context window limitations and an inability to generate valid T-SQL. Mistral 7B exhibited sensitivity to schema complexity. The findings indicate that code-tuned general-purpose models are more effective than domain-specific medical LLMs for this task, and while local GxP-aligned NLQ pipelines are feasible, human review remains critical due to modest Factual Consistency scores (Qwen 0.34, Llama 0.30).
Key takeaway
For ML Engineers deploying natural language to SQL systems in biopharmaceutical manufacturing, you should prioritize code-tuned general-purpose LLMs like Llama 3.1 8B or Qwen 2.5 Coder 7B. Your deployment must include robust human review and a downstream validation layer, as current 7B–8B models are query-drafting tools, not autonomous generators. This ensures GxP compliance and prevents misleading operational decisions.
Key insights
Code-tuned general-purpose LLMs outperform domain-specific medical LLMs for T-SQL generation on manufacturing schemas.
Principles
- Domain-specific pre-training does not guarantee T-SQL generation capability.
- Context window capacity is critical for schema comprehension in NLQ tasks.
- Human review is essential for LLM-generated SQL in regulated environments.
Method
A local FastAPI-based platform, PharmaBatchDB AI, was developed using Ollama, a synthetic MS SQL Server, and sqlglot AST validation for NLQ-to-T-SQL benchmarking.
In practice
- Prioritize LLMs with strong code-tuning over biomedical pre-training for SQL tasks.
- Implement AST-level SQL validation and audit logging for GxP compliance.
- Consider hardware investment to mitigate "shadow AI" risks from slow local systems.
Topics
- Local LLMs
- Natural Language to SQL
- Biopharmaceutical Manufacturing
- GxP Compliance
- Ollama
- LLM Benchmarking
- Data Sovereignty
Code references
Best for: AI Architect, AI Engineer, CTO, AI Scientist, Machine Learning Engineer, Director of AI/ML
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.