We Tested 25 Local LLMs for Medical Use. Here’s What Shipped.
Summary
Meda AI's process for selecting local LLMs for an on-premise medical AI assistant involved extensive testing of 25 models over five days, focusing on data privacy and patient safety for tasks like SOAP note generation, ICD-10-GM coding, and billing. They found that a Mac Studio M3 Ultra with 96 GB unified memory is viable for multi-model serving, achieving 110-205 tok/s with concurrency, while a Threadripper 7980X + NVIDIA 5090 (32GB VRAM) suits single-doctor deployments at 116 tok/s. Models were evaluated across three distinct stages, demanding different behaviors, using German-language transcripts. Initial screening filtered candidates by memory footprint (<25 GB), German proficiency, commercial license, and OpenAI-compatible serving. The final production stack utilizes Gemma 4 E4B for SOAP and billing, Gemma 3 4B for patient summaries, and Qwen3-30B A3B for ICD coding, consuming approximately 33 GB VRAM on an M3 Ultra.
Key takeaway
For AI Engineers building on-premise medical AI assistants, you should adopt a multi-model, multi-stage architecture to balance performance and safety. Prioritize GGUF runtimes for broader model compatibility over MLX-specific optimizations, and carefully evaluate models for specific hallucination types rather than just speed. Consider sharing model slots for similar structured extraction tasks like SOAP generation and billing to optimize memory usage on hardware like the M3 Ultra.
Key insights
Local LLMs are viable for medical AI, but require careful selection and a multi-model, multi-stage architecture.
Principles
- Multi-layer model architecture is essential for diverse clinical tasks.
- Runtime portability (e.g., GGUF) often outweighs MLX-specific optimizations.
- Hallucination types matter; grade by severity, not just rate.
Method
The process involved breaking medical AI tasks into three stages, screening 25 LLMs by hard constraints, and benchmarking survivors on hallucination rate, tokens/second, and memory footprint, then pivoting runtimes.
In practice
- Use Gemma 4 E4B GGUF for zero-hallucination SOAP extraction.
- Deploy Qwen3-30B A3B for robust ICD-10-GM coding.
- Reuse Gemma 4 for both SOAP and billing tasks.
Topics
- Local LLMs
- Medical AI
- On-premise Deployment
- LLM Benchmarking
- GGUF Runtime
- ICD-10-GM Coding
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.