Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

· Source: cs.CL updates on arXiv.org · Field: Manufacturing & Industrial — Smart Manufacturing & Industry 4.0, Sector-Specific Manufacturing · Depth: Expert, extended

Summary

A study benchmarked four 7B–8B parameter open-source LLMs—Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B—for natural-language-to-T-SQL querying in biopharmaceutical manufacturing. Deployed locally on consumer-grade hardware via Ollama, the models were tested against a synthetic MS SQL Server database (7 tables, ~63,000 rows) using 60 domain-specific questions. Llama 3.1 8B achieved 93.33% SQL Compliance, slightly outperforming Qwen 2.5 Coder 7B (88.33%), though this difference was not statistically significant (p=0.529). Meditron 7B, a biomedical-tuned model, failed completely (0% compliance) due to context window limitations and an inability to generate valid T-SQL. Mistral 7B exhibited sensitivity to schema complexity. The findings indicate that code-tuned general-purpose models are more effective than domain-specific medical LLMs for this task, and while local GxP-aligned NLQ pipelines are feasible, human review remains critical due to modest Factual Consistency scores (Qwen 0.34, Llama 0.30).

Key takeaway

For ML Engineers deploying natural language to SQL systems in biopharmaceutical manufacturing, you should prioritize code-tuned general-purpose LLMs like Llama 3.1 8B or Qwen 2.5 Coder 7B. Your deployment must include robust human review and a downstream validation layer, as current 7B–8B models are query-drafting tools, not autonomous generators. This ensures GxP compliance and prevents misleading operational decisions.

Key insights

Code-tuned general-purpose LLMs outperform domain-specific medical LLMs for T-SQL generation on manufacturing schemas.

Principles

Method

A local FastAPI-based platform, PharmaBatchDB AI, was developed using Ollama, a synthetic MS SQL Server, and sqlglot AST validation for NLQ-to-T-SQL benchmarking.

In practice

Topics

Code references

Best for: AI Architect, AI Engineer, CTO, AI Scientist, Machine Learning Engineer, Director of AI/ML

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.