I Replaced GPT-4 with a Local SLM and My CI/CD Pipeline Stopped Failing
Summary
A team initially deployed GPT-4 for document classification and structured field extraction in a nightly batch job, and found it effective on complex edge cases and varied formats. Even with `temperature=0`, however, GPT-4 was subtly non-deterministic: inconsistent JSON output (key casing, markdown fences around the payload, the string "null" where a null value was expected) caused 23 pipeline failures over six weeks. Prompt engineering, cleanup parsers, and OpenAI's `response_format` and function calling features reduced the failures but did not eliminate them. The team ultimately moved to a local Small Language Model (SLM), Qwen2.5-7B-Instruct hosted via Ollama, and used seeded inference to get repeatable output. The switch required more setup and a review queue for highly ambiguous documents, but it produced fully consistent output for the majority of documents, which improved pipeline reliability and reduced operational overhead.
Key takeaway
For MLOps engineers managing automated data pipelines, relying on a hosted frontier model like GPT-4 for structured extraction introduces non-determinism that a batch job cannot tolerate, even with `temperature=0`. Evaluate local Small Language Models (SLMs) with seeded inference, such as Qwen2.5-7B-Instruct via Ollama, for tasks like structured reading comprehension. This approach gives repeatable output, removes a class of unpredictable failures, and cuts debugging and operational overhead, at the cost of more initial setup and separate handling for highly ambiguous edge cases.
Key insights
Hosted probabilistic LLMs are a poor fit for batch jobs that require identical output across runs, even with `temperature=0`.
Principles
- Non-determinism in LLMs can cause subtle, hard-to-debug pipeline failures.
- Local SLMs with seeded inference offer true output determinism.
- Match model capabilities to task requirements; frontier models aren't always necessary.
Method
For deterministic structured data extraction, use local SLMs (e.g., Qwen2.5-7B-Instruct) with seeded inference via platforms like Ollama, and implement robust Pydantic validation with normalization.
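The seeded-inference part of the method can be sketched as a request to a locally running Ollama server. This is a minimal illustration, not the original team's code: the prompt text, the seed value, the `build_payload` and `extract` helper names, and the `qwen2.5:7b-instruct` model tag are assumptions. It relies on Ollama's `/api/generate` endpoint with the `seed` and `temperature` options and `format` set to `json`.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(document_text, seed=42, model="qwen2.5:7b-instruct"):
    """Build a request body with a fixed seed and temperature 0.

    The model tag, seed, and prompt wording are illustrative assumptions.
    """
    return {
        "model": model,
        "prompt": "Extract the document fields as a JSON object.\n\n" + document_text,
        "format": "json",   # ask Ollama to constrain output to valid JSON
        "stream": False,    # return one complete response instead of chunks
        "options": {"seed": seed, "temperature": 0},
    }


def extract(document_text):
    """Send the request to a local Ollama server and return the raw model text."""
    body = json.dumps(build_payload(document_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Keeping the payload builder separate from the network call makes the fixed settings easy to assert on in CI. Whether repeated runs are byte-identical on a given model build and hardware is worth confirming with a small repeat-run check before relying on it.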
In practice
- If you stay on a hosted LLM, use `response_format` and function calling; they reduced failures here but did not eliminate them.
- Cache Ollama models in CI/CD to reduce job run times.
- Use Pydantic `field_validator` with `mode="before"` for input normalization.
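The normalization step in the last item might look like the sketch below, assuming Pydantic v2. The `Invoice` schema, its field names, and the `strip_fences` helper are hypothetical, and the `model_validator` used for key casing is an addition beyond the `field_validator` the summary mentions. It targets the three failure modes named in the summary: markdown fences, key casing, and the string "null" in place of a null value.

```python
import json
from typing import Optional

from pydantic import BaseModel, field_validator, model_validator

FENCE = "`" * 3  # a markdown code fence marker


def strip_fences(raw):
    """Remove a surrounding markdown fence (and optional 'json' tag) if present."""
    text = raw.strip()
    if len(text) >= 6 and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[3:-3]
        if body.lower().startswith("json"):
            body = body[4:]
        text = body.strip()
    return text


class Invoice(BaseModel):
    """Hypothetical target schema for one extracted document."""

    doc_type: str
    vendor: Optional[str] = None

    @model_validator(mode="before")
    @classmethod
    def lowercase_keys(cls, data):
        # Handles case-only drift such as "Doc_Type"; it does not convert camelCase.
        if isinstance(data, dict):
            return {str(key).lower(): value for key, value in data.items()}
        return data

    @field_validator("vendor", mode="before")
    @classmethod
    def null_like_to_none(cls, value):
        # Runs before type validation, so null-like strings become a real None.
        if isinstance(value, str) and value.strip().lower() in {"", "null", "none", "n/a"}:
            return None
        return value


def parse_invoice(raw):
    """Clean the raw model output, parse it as JSON, and validate it."""
    return Invoice.model_validate(json.loads(strip_fences(raw)))
```

Because the validators run in `mode="before"`, cleanup happens ahead of type checking, so a document fails validation only when the content is actually wrong rather than merely formatted differently.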
Topics
- GPT-4 Non-Determinism
- CI/CD Pipeline Stability
- Local SLM Deployment
- Ollama
- Qwen2.5-7B-Instruct
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.