A 103B medical LLM just got open sourced — and it only activates 6.1B parameters at inference time [Meet AntAngelMed]
Summary
AntAngelMed is a newly open-sourced 103B-parameter medical large language model (LLM) built on Ling-flash-2.0, a Mixture-of-Experts (MoE) architecture with a 1/32 activation ratio. It activates only 6.1B parameters at inference time, so per-token inference cost stays in line with a 6.1B model while the full 103B parameters remain available as knowledge capacity. The model was trained in three stages: continual pre-training on medical corpora, supervised fine-tuning (SFT) on mixed general and clinical instruction data, and GRPO-based reinforcement learning with task-specific reward models for safety, diagnostic reasoning, and hallucination reduction. It exceeds 200 tokens/s on H20 hardware, roughly three times faster than a 36B dense model, and supports a 128K context length via YaRN extrapolation. AntAngelMed ranks first among open-source models on OpenAI's HealthBench, surpasses several proprietary models, and leads on MedAIBench and MedBench across all five dimensions.
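As a quick sanity check on the figures in this summary, the sketch below computes the share of parameters active at inference and a rough decode time at the quoted throughput. The 1,000-token response length is an illustrative assumption, not a figure from the source. Note that 6.1B out of 103B is about 6% of all parameters, so the 1/32 ratio presumably describes how many experts are activated rather than the share of total parameters.

```python
# Figures taken from the summary; the response length below is an assumption.
TOTAL_PARAMS_B = 103.0    # total parameters, in billions
ACTIVE_PARAMS_B = 6.1     # parameters activated at inference, in billions
TOKENS_PER_SECOND = 200   # quoted lower bound on H20 hardware


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that are used per token."""
    return active_b / total_b


def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Rough time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"{fraction:.1%}")                        # 5.9%
print(decode_seconds(1000, TOKENS_PER_SECOND))  # 5.0 (hypothetical 1,000-token reply)
```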
Key takeaway
For AI engineers and MLOps teams building medical AI systems, AntAngelMed is a strong open-source option. Its MoE architecture makes it possible to deploy a 103B-parameter model at the per-token inference cost of a 6.1B model, with higher throughput than a comparable dense model. It is worth evaluating for applications that need broad medical knowledge and real-time responsiveness, especially given its leading results on HealthBench and MedBench.
Key insights
AntAngelMed is a 103B-parameter medical MoE LLM that activates only 6.1B parameters at inference, pairing leading open-source benchmark results with the inference cost of a much smaller model.
Principles
- MoE architectures balance knowledge capacity and inference cost.
- Multi-stage training improves medical LLM performance and safety.
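To make the first principle concrete, here is a minimal sketch of top-k expert routing, the mechanism that lets an MoE layer hold many experts while running only a few per token. It is a toy written in plain Python, not AntAngelMed's router: the expert count, the top-k value, the scalar input, and the linear "experts" are all illustrative assumptions, chosen so that 1 of 32 experts runs and mirrors the 1/32 ratio.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k):
    """Route scalar input x to the top_k highest-scoring experts and mix their outputs."""
    probs = softmax([w * x for w in gate_weights])
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the selected experts
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


random.seed(0)
NUM_EXPERTS = 32  # hypothetical; picked to mirror the 1/32 activation ratio
gate_weights = [random.uniform(-1, 1) for _ in range(NUM_EXPERTS)]
# Toy experts: expert i simply multiplies its input by (i + 1).
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

y, used = moe_forward(2.0, gate_weights, experts, top_k=1)
print(len(used) / NUM_EXPERTS)  # 0.03125, i.e. 1/32 of the experts ran
```

Only the selected experts are evaluated, which is why compute scales with the active parameters rather than the total.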
Method
AntAngelMed's training pipeline involves continual pre-training on medical texts, SFT with diverse instruction data, and GRPO-based reinforcement learning using task-specific reward models for refinement.
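The core idea behind GRPO (Group Relative Policy Optimization) is to score each sampled response relative to the other responses drawn for the same prompt, instead of relying on a separate value model. The sketch below shows only that advantage computation; the reward values are made up, and the use of the population standard deviation and the eps constant are assumptions of this example, not details taken from the AntAngelMed pipeline.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive training signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four sampled answers to one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
best = advantages.index(max(advantages))
print(best)  # 1, the answer with the highest reward
```

In a full training loop these advantages would weight a clipped policy-gradient objective; that part is omitted here.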
In practice
- Use MoE when a large model must fit a constrained inference budget.
- Combine FP8 quantization with EAGLE3 speculative decoding for higher throughput.
- Use YaRN to extend the context window.
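For the YaRN item, the main setting to decide is the scaling factor, the ratio between the target context length and the length the model was originally trained on. The sketch below derives that factor for a 128K target. The native context length is not stated in this summary, so the 32K value is a placeholder, and the dictionary keys follow the naming commonly seen in Hugging Face Transformers-style configs; check the model's own config file before relying on either.

```python
def yarn_scaling_factor(target_context: int, native_context: int) -> float:
    """Ratio by which YaRN must stretch the native context window."""
    if native_context <= 0 or target_context < native_context:
        raise ValueError("target_context must be >= native_context > 0")
    return target_context / native_context


TARGET_CONTEXT = 128 * 1024             # 128K tokens, from the summary
HYPOTHETICAL_NATIVE_CONTEXT = 32 * 1024  # placeholder, not from the source

factor = yarn_scaling_factor(TARGET_CONTEXT, HYPOTHETICAL_NATIVE_CONTEXT)

# Assumed config shape; key names may differ for a given model or framework.
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": HYPOTHETICAL_NATIVE_CONTEXT,
}
print(rope_scaling["factor"])  # 4.0
```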
Topics
- AntAngelMed
- Medical LLM
- Mixture-of-Experts
- Inference Optimization
- HealthBench
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.