How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions
Summary
A study investigated the deployment suitability of 24 LoRA-fine-tuned language model variants, ranging from 270M to 8B parameters, for extracting structured merchant information from financial transaction strings. The research aimed to find efficient alternatives to a production LLaMA 3.1-8B system, which achieves 96.95% F1 at high memory, latency, and cost.
Key findings: LLaMA 3.1-8B at LoRA rank 8 achieved 96.75% F1, only 0.20 points below the rank-32 baseline. Qwen 3.5 4B with JSON-only prompting reached 96.60% F1, within 0.35 points of the 8B baseline with half the parameters. The 0.8B Qwen 3.5 model achieved 94.75% F1, on par with larger models in the comparison. Chain-of-thought (CoT) fine-tuning generally improved F1 by 0.3-1.8 points, though Qwen 3.5 4B performed better with direct JSON-only prompting. Explicit reasoning supervision was found unnecessary for structured extraction: the Qwen 3.5 Think and Nothink templates yielded F1 differences below 0.004. Benchmark performance transferred reliably to production on Databricks Model Serving, with an average F1 change of 0.8 points, except for Aya 3.35B, which saw a 3-5 point decline.
Key takeaway
For MLOps engineers deploying language models for structured information extraction, this research indicates that significantly smaller models can come close to 8B-parameter performance. Consider LoRA-fine-tuned Qwen 3.5 4B with JSON-only prompting (96.60% F1), or Qwen 3.5 0.8B for latency-critical applications (94.75% F1). Benchmark smaller models like these to reduce memory use, latency, and cost in production.
Key insights
Smaller LoRA-fine-tuned models can achieve near-8B performance on structured extraction, significantly reducing deployment costs.
Principles
- LoRA rank 8 is competitive with rank 32 for fine-tuning (96.75% vs. 96.95% F1 on LLaMA 3.1-8B).
- JSON-only prompting can outperform CoT for some models, such as Qwen 3.5 4B.
- Explicit reasoning supervision is not always needed for structured extraction.
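To make the rank comparison concrete, here is a minimal sketch of how LoRA rank scales the number of trainable adapter parameters for a single weight matrix. The hidden size of 4096 and the helper name `lora_param_count` are illustrative assumptions, not values or code taken from the study.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA learns the weight update as B @ A, where A has shape
    (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


# Assumed hidden size, chosen only for illustration.
HIDDEN = 4096

r8 = lora_param_count(HIDDEN, HIDDEN, rank=8)
r32 = lora_param_count(HIDDEN, HIDDEN, rank=32)

print(f"rank 8:  {r8} adapter parameters per matrix")
print(f"rank 32: {r32} adapter parameters per matrix")
print(f"ratio:   {r32 // r8}x")
```

Under these assumptions, adapter size grows linearly with rank, so rank 8 trains a quarter of the adapter parameters of rank 32 for each adapted matrix.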
Method
The study systematically evaluated 24 LoRA-fine-tuned model variants (Gemma 3, Qwen 3.5, Aya, LLaMA 3.1-8B) for merchant information extraction, assessing accuracy, inference throughput, training cost, and hardware behavior.
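The summary does not say exactly how F1 was computed. One plausible reading is a micro-averaged, exact-match F1 over extracted fields, sketched below. The field names (`merchant`, `city`) and the function `field_f1` are hypothetical and serve only to illustrate the metric.

```python
def field_f1(predictions: list[dict], references: list[dict]) -> float:
    """Micro-averaged exact-match F1 over the fields of paired records."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        for field in set(pred) | set(ref):
            p, r = pred.get(field), ref.get(field)
            if p is not None and p == r:
                tp += 1  # field predicted and correct
            else:
                if p is not None:
                    fp += 1  # field predicted but wrong or spurious
                if r is not None:
                    fn += 1  # reference field missed or mispredicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical records, for illustration only.
references = [{"merchant": "Starbucks", "city": "Seattle"}, {"merchant": "Amazon"}]
predictions = [{"merchant": "Starbucks", "city": "Portland"}, {"merchant": "Amazon"}]

score = field_f1(predictions, references)
print(f"F1 = {score:.4f}")
```

A wrong field value counts as both a false positive and a false negative here, which is a common convention for extraction tasks but may differ from the scoring used in the study.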
In practice
- Consider Qwen 3.5 4B for balanced performance and cost.
- Evaluate Qwen 3.5 0.8B for latency-sensitive tasks.
- Test JSON-only prompting for structured extraction tasks.
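As a starting point for testing JSON-only prompting, the sketch below builds a prompt that asks for a bare JSON object and rejects any model output that is not one. The prompt wording, the sample transaction string, and the helper names are assumptions for illustration; the study's actual templates are not given in this summary.

```python
import json
from typing import Optional

# Hypothetical JSON-only prompt template (not the study's template).
JSON_ONLY_TEMPLATE = (
    "Extract the merchant information from the transaction string.\n"
    "Respond with a single JSON object and nothing else.\n"
    "Transaction: {transaction}\n"
)


def build_prompt(transaction: str) -> str:
    """Fill the template with one raw transaction string."""
    return JSON_ONLY_TEMPLATE.format(transaction=transaction)


def parse_json_only(output: str) -> Optional[dict]:
    """Return the parsed object, or None if the output is not a bare JSON object."""
    try:
        parsed = json.loads(output.strip())
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


prompt = build_prompt("SQ *COFFEE HOUSE 0423")
good = parse_json_only('{"merchant": "Coffee House"}')
bad = parse_json_only('Let me think step by step. {"merchant": "Coffee House"}')
print(good, bad)
```

Strict parsing like this makes format violations visible as failures rather than silently repairing them, which keeps a comparison between JSON-only and CoT variants honest.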
Topics
- LoRA Fine-tuning
- Merchant Information Extraction
- Large Language Models
- Qwen 3.5
- Model Deployment
- Inference Optimization
Best for: AI Engineer, NLP Engineer, CTO, Machine Learning Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.