How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

2026-06-06 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, FinTech & Digital Financial Services · Depth: Advanced, quick

Summary

A study investigated the deployment suitability of 24 LoRA-fine-tuned language model variants, ranging from 270M to 8B parameters, for extracting structured merchant information from financial transaction strings. The research aimed to find efficient alternatives to a production LLaMA 3.1-8B system, which achieves 96.95% F1 but at high memory, latency, and cost. Among the key findings, LLaMA 3.1-8B at LoRA rank 8 achieved 96.75% F1, only 0.20 points below the rank-32 baseline. Qwen 3.5 4B with JSON-only prompting reached 96.60% F1, within 0.35 points of the 8B baseline with half the parameters. The 0.8B Qwen 3.5 model achieved 94.75% F1, matching larger models. Chain-of-thought (CoT) fine-tuning generally improved F1 by 0.3-1.8 points, though Qwen 3.5 4B performed better with direct JSON-only prompting. Explicit reasoning supervision was found unnecessary for structured extraction: the Qwen 3.5 Think and Nothink templates yielded F1 differences below 0.004. Benchmark performance transferred reliably to production on Databricks Model Serving, with an average F1 change of 0.8 points, except for Aya 3.35B, which saw a 3-5 point decline.

Key takeaway

For MLOps engineers deploying large language models for structured information extraction, this research indicates that significantly smaller models can come close to the 8B-parameter baseline. Consider LoRA-fine-tuned Qwen 3.5 4B with JSON-only prompting for 96.60% F1, or Qwen 3.5 0.8B for latency-critical applications at 94.75% F1. Benchmark smaller models like these on your own data to reduce memory use, latency, and cost in production.
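
The trade-off in this takeaway can be sketched as a small selection rule: pick the smallest model whose F1 stays within a chosen tolerance of the baseline. The F1 values below are the ones reported in the summary; the parameter counts are the nominal sizes from the model names, and the `Candidate` class, the `smallest_within` helper, and the tolerance values are hypothetical illustrations, not part of the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    params_b: float  # nominal parameter count in billions (from the model name)
    f1: float        # benchmark F1 in percent, as reported in the summary


CANDIDATES = [
    Candidate("LLaMA 3.1-8B, LoRA rank 32 (baseline)", 8.0, 96.95),
    Candidate("LLaMA 3.1-8B, LoRA rank 8", 8.0, 96.75),
    Candidate("Qwen 3.5 4B, JSON-only", 4.0, 96.60),
    Candidate("Qwen 3.5 0.8B", 0.8, 94.75),
]


def smallest_within(candidates, baseline_f1, max_drop):
    """Return the smallest candidate whose F1 is within max_drop points of baseline.

    Ties on size are broken in favor of the higher F1. The small epsilon
    guards against floating-point noise in the subtraction.
    """
    eligible = [c for c in candidates if baseline_f1 - c.f1 <= max_drop + 1e-9]
    if not eligible:
        raise ValueError("no candidate meets the F1 tolerance")
    return min(eligible, key=lambda c: (c.params_b, -c.f1))


if __name__ == "__main__":
    for drop in (0.1, 0.5, 2.5):  # hypothetical tolerances, in F1 points
        print(drop, "->", smallest_within(CANDIDATES, 96.95, drop).name)
```

A tolerance like this is a product decision rather than a modeling one, so it is worth fixing before looking at the benchmark numbers.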

Key insights

Smaller LoRA-fine-tuned models can come close to 8B-model performance on structured extraction while significantly reducing deployment costs.

Principles

LoRA rank 8 is competitive with rank 32 for fine-tuning (96.75% vs. 96.95% F1 on LLaMA 3.1-8B).
JSON-only prompting can outperform chain-of-thought (CoT) fine-tuning for specific models, such as Qwen 3.5 4B.
Explicit reasoning supervision is not always needed for structured extraction.
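
To make the rank comparison concrete, here is a minimal sketch of the standard LoRA parameter arithmetic: adapting one weight matrix of shape d_out x d_in at rank r adds two low-rank factors with r * (d_in + d_out) trainable parameters in total. The 4096 x 4096 projection size is an illustrative assumption, and the summary does not state which layers the study adapted.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to a single d_out x d_in weight matrix.

    LoRA learns an update B @ A, where A has shape (rank, d_in) and
    B has shape (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


HIDDEN = 4096  # assumed square projection size, for illustration only

r8 = lora_trainable_params(HIDDEN, HIDDEN, rank=8)
r32 = lora_trainable_params(HIDDEN, HIDDEN, rank=32)

if __name__ == "__main__":
    print(f"rank 8:  {r8:,} trainable parameters per adapted matrix")
    print(f"rank 32: {r32:,} trainable parameters per adapted matrix")
    print(f"rank 32 / rank 8 = {r32 / r8:.0f}x")
```

Because the count is linear in the rank, rank 8 trains a quarter of the adapter parameters of rank 32 for any layer shape, which is the saving weighed against the 0.20-point F1 gap reported above.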

Method

The study systematically evaluated 24 LoRA-fine-tuned model variants (Gemma 3, Qwen 3.5, Aya, LLaMA 3.1-8B) for merchant information extraction, assessing accuracy, inference throughput, training cost, and hardware behavior.
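
The summary does not say how F1 is computed for extracted merchant fields. One common convention, shown here purely as an assumption, is micro-averaged exact-match F1 over fields; the `field_f1` helper and the field names in the example are hypothetical.

```python
def field_f1(predictions, references):
    """Micro-averaged exact-match F1 over extracted fields.

    predictions and references are parallel lists of dicts mapping field
    name to value. A wrong value counts as both a false positive and a
    false negative; a missing field counts as a false negative; an extra
    field not in the reference counts as a false positive.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        for key, ref_val in ref.items():
            if key not in pred:
                fn += 1
            elif pred[key] == ref_val:
                tp += 1
            else:
                fp += 1
                fn += 1
        fp += sum(1 for key in pred if key not in ref)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    refs = [{"merchant_name": "Starbucks", "city": "Seattle"}]
    preds = [{"merchant_name": "Starbucks", "city": "Portland"}]
    print(field_f1(preds, refs))  # one of two fields correct
```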

In practice

Consider Qwen 3.5 4B for balanced performance and cost.
Evaluate Qwen 3.5 0.8B for latency-sensitive tasks.
Test JSON-only prompting for structured extraction tasks.
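
When testing JSON-only prompting, it helps to score outputs with a strict parser so that any extra prose around the JSON counts as a failure. The sketch below uses only Python's standard `json` module; the `REQUIRED_FIELDS` schema and the `parse_json_only` helper are hypothetical, since the summary does not list the fields the study extracts.

```python
import json

# Hypothetical schema: the actual merchant fields are not given in the summary.
REQUIRED_FIELDS = ("merchant_name", "city")


def parse_json_only(output: str):
    """Parse a model response that is expected to be a single JSON object.

    Returns a dict restricted to REQUIRED_FIELDS, or None if the response
    is not valid JSON, is not an object, or lacks a required field.
    """
    try:
        data = json.loads(output.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None
    return {field: data[field] for field in REQUIRED_FIELDS}


if __name__ == "__main__":
    good = '{"merchant_name": "Starbucks", "city": "Seattle"}'
    chatty = 'Sure! Here is the result: {"merchant_name": "Starbucks"}'
    print(parse_json_only(good))
    print(parse_json_only(chatty))
```

Tracking the rate of `None` results separately from F1 shows whether a smaller model is failing on format or on content.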

Topics

LoRA Fine-tuning
Merchant Information Extraction
Large Language Models
Qwen 3.5
Model Deployment
Inference Optimization

Best for: AI Engineer, NLP Engineer, CTO, Machine Learning Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.