Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Manufacturing & Industrial — Smart Manufacturing & Industry 4.0, Sector-Specific Manufacturing · Depth: Expert, extended

Summary

A study benchmarked four 7B–8B parameter open-source LLMs—Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B—for natural-language-to-T-SQL querying in biopharmaceutical manufacturing. Deployed locally on consumer-grade hardware via Ollama, the models were tested against a synthetic MS SQL Server database (7 tables, ~63,000 rows) using 60 domain-specific questions. Llama 3.1 8B achieved 93.33% SQL Compliance, slightly outperforming Qwen 2.5 Coder 7B (88.33%), though this difference was not statistically significant (p=0.529). Meditron 7B, a biomedical-tuned model, failed completely (0% compliance) due to context window limitations and an inability to generate valid T-SQL. Mistral 7B exhibited sensitivity to schema complexity. The findings indicate that code-tuned general-purpose models are more effective than domain-specific medical LLMs for this task, and while local GxP-aligned NLQ pipelines are feasible, human review remains critical due to modest Factual Consistency scores (Qwen 0.34, Llama 0.30).

Key takeaway

For ML Engineers deploying natural language to SQL systems in biopharmaceutical manufacturing, you should prioritize code-tuned general-purpose LLMs like Llama 3.1 8B or Qwen 2.5 Coder 7B. Your deployment must include robust human review and a downstream validation layer, as current 7B–8B models are query-drafting tools, not autonomous generators. This ensures GxP compliance and prevents misleading operational decisions.

Key insights

Code-tuned general-purpose LLMs outperform domain-specific medical LLMs for T-SQL generation on manufacturing schemas.

Principles

Domain-specific pre-training does not guarantee T-SQL generation capability.
Context window capacity is critical for schema comprehension in NLQ tasks.
Human review is essential for LLM-generated SQL in regulated environments.

Method

A local FastAPI-based platform, PharmaBatchDB AI, was developed using Ollama, a synthetic MS SQL Server, and sqlglot AST validation for NLQ-to-T-SQL benchmarking.

In practice

Prioritize LLMs with strong code-tuning over biomedical pre-training for SQL tasks.
Implement AST-level SQL validation and audit logging for GxP compliance.
Consider hardware investment to mitigate "shadow AI" risks from slow local systems.

Topics

Local LLMs
Natural Language to SQL
Biopharmaceutical Manufacturing
GxP Compliance
Ollama
LLM Benchmarking
Data Sovereignty

Code references

sbhetwal11/Benchmarking-Local-Large-Language-Models-for-Natural-Language-to-SQL

Best for: AI Architect, AI Engineer, CTO, AI Scientist, Machine Learning Engineer, Director of AI/ML

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.