I Replaced GPT-4 with a Local SLM and My CI/CD Pipeline Stopped Failing

2026-04-21 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

A team initially deployed GPT-4 for document classification and structured field extraction in a nightly batch job and found it highly effective on complex edge cases and varied formats. However, even with `temperature=0`, GPT-4 exhibited subtle non-determinism: inconsistent JSON output formatting (e.g., key casing, markdown fences, string vs. null types) caused 23 pipeline failures over six weeks. Extensive prompt engineering, cleanup parsers, and OpenAI's `response_format` and function calling features reduced the failures but did not eliminate them. The team ultimately moved to local Small Language Models (SLMs) such as Qwen2.5-7B-Instruct, hosted via Ollama, and used seeded inference to get deterministic output. The shift required more setup and a review queue for highly ambiguous documents, but it achieved 100% output consistency for the majority of documents, improving pipeline reliability and reducing operational overhead.

Key takeaway

For MLOps engineers managing automated data pipelines, relying on probabilistic frontier models like GPT-4 for structured extraction introduces unacceptable non-determinism, even with `temperature=0`. Evaluate local Small Language Models (SLMs) with seeded inference, such as Qwen2.5-7B-Instruct via Ollama, for tasks like structured reading comprehension. This approach provides consistent output, eliminates unpredictable failures, and reduces debugging and operational overhead, at the cost of more initial setup and separate handling for highly ambiguous edge cases.

Key insights

Hosted probabilistic LLMs such as GPT-4 are a poor fit for batch pipelines that need identical output on every run, even with `temperature=0`.

Principles

Non-determinism in LLMs can cause subtle, hard-to-debug pipeline failures.
Local SLMs with seeded inference offer true output determinism.
Match model capabilities to task requirements; frontier models aren't always necessary.
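
One way to make the first two principles measurable is to run the same prompt several times and count how many distinct outputs come back. The sketch below is illustrative only: `distinct_outputs` and the `generate` callable are hypothetical names, not something taken from the original article, and `generate` stands in for whatever client function calls your model.

```python
import hashlib
from typing import Callable


def output_fingerprint(text: str) -> str:
    """Hash the raw model output so byte-level differences are easy to compare."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def distinct_outputs(generate: Callable[[str], str], prompt: str, runs: int = 5) -> int:
    """Call `generate` repeatedly with one prompt and count distinct raw outputs.

    `generate` is a hypothetical stand-in for a model client. A result of 1
    means every run returned byte-identical text.
    """
    return len({output_fingerprint(generate(prompt)) for _ in range(runs)})
```

A result greater than 1 on a fixed prompt is the kind of drift (key casing, stray fences) that the summary describes as breaking the pipeline, so a check like this could plausibly run as a smoke test before a model is promoted.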

Method

For deterministic structured data extraction, use local SLMs (e.g., Qwen2.5-7B-Instruct) with seeded inference via platforms like Ollama, and implement robust Pydantic validation with normalization.
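
A minimal sketch of what a seeded request to a local Ollama server might look like, using only the Python standard library. The model tag `qwen2.5:7b-instruct`, the prompt wording, the field names in the prompt, and the helper names `build_request` and `extract` are assumptions for illustration; the original article's actual code is not reproduced here. Reproducibility in practice likely also depends on keeping the model version and runtime environment fixed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen2.5:7b-instruct"  # assumed tag; use whichever model you have pulled


def build_request(document_text: str, seed: int = 42) -> dict:
    """Build the JSON body for a seeded, non-streaming generate call."""
    prompt = (
        "Classify the document and return JSON with the keys "
        "doc_type and vendor.\n\n"  # hypothetical fields
        f"Document:\n{document_text}"
    )
    return {
        "model": MODEL,
        "prompt": prompt,
        "format": "json",  # ask Ollama to constrain the output to valid JSON
        "stream": False,   # return one response object instead of chunks
        "options": {"seed": seed, "temperature": 0},
    }


def extract(document_text: str, seed: int = 42, url: str = OLLAMA_URL) -> dict:
    """Send the request to a running Ollama server and parse the model's JSON."""
    body = json.dumps(build_request(document_text, seed)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        payload = json.loads(resp.read())
    # The generated text is returned in the "response" field as a string.
    return json.loads(payload["response"])
```

Calling `extract("...")` requires a running Ollama instance with the model already pulled; `build_request` can be inspected on its own to confirm that the seed and temperature are pinned for every document in the batch.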

In practice

When using hosted LLMs, enable `response_format` and function calling; for this team they reduced formatting failures but did not eliminate them.
Cache Ollama models in CI/CD to reduce job run times.
Use Pydantic `field_validator` with `mode="before"` for input normalization.
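
The Pydantic item above can be sketched as follows, assuming Pydantic v2. The `Extraction` schema, its field names, and the helpers `strip_fences` and `parse_extraction` are hypothetical; they only illustrate normalizing the three kinds of drift named in the summary (markdown fences, key casing, string vs. null) before validation runs.

```python
import json
import re
from typing import Optional

from pydantic import BaseModel, field_validator, model_validator

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"\s*$", re.DOTALL
)


def strip_fences(text: str) -> str:
    """Remove a surrounding markdown code fence if the model added one."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


class Extraction(BaseModel):  # hypothetical schema for illustration
    doc_type: str
    vendor: Optional[str] = None

    @model_validator(mode="before")
    @classmethod
    def lowercase_keys(cls, data):
        # Normalize key casing ("Doc_Type" -> "doc_type") before field validation.
        if isinstance(data, dict):
            return {str(key).lower(): value for key, value in data.items()}
        return data

    @field_validator("vendor", mode="before")
    @classmethod
    def null_strings_to_none(cls, value):
        # Treat placeholder strings such as "null" or "" as a real None.
        if isinstance(value, str) and value.strip().lower() in {"", "null", "none"}:
            return None
        return value


def parse_extraction(raw: str) -> Extraction:
    """Strip fences, parse the JSON, then validate and normalize the fields."""
    return Extraction.model_validate(json.loads(strip_fences(raw)))
```

With `mode="before"`, the validators see the raw input before type coercion, which is why they can repair casing and placeholder strings instead of rejecting them. Anything that still fails validation after normalization is a reasonable candidate for the review queue rather than a silent fix.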

Topics

GPT-4 Non-Determinism
CI/CD Pipeline Stability
Local SLM Deployment
Ollama
Qwen2.5-7B-Instruct

Code references

ollama/ollama

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.