Qwen 3.5 Test for JSON Structured Data Extraction
Summary
An evaluation of the new Qwen 3.5 large language models (LLMs) was conducted on a local Mac Mini M4 Pro with 64 GB RAM using the MLX VLM framework. The tests focused on data extraction from a tabular input, requesting JSON output. Three models were assessed: the 9 billion parameter model (BF16 precision), the 27 billion parameter model (8-bit quantization), and the 35 billion parameter model (8-bit quantization). A key optimization was disabling the default "thinking mode" in MLX VLM, which significantly reduced processing time. The 9B model processed the request in 25 seconds at 13 tokens/second, consuming 60% of RAM. The 27B model took 51 seconds at 8 tokens/second, using close to 80% of RAM. Surprisingly, the 35B model completed the task in 13 seconds at 52 tokens/second, also using around 80% of RAM, making it the fastest of the three on this task.
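To put the reported figures side by side, the short sketch below computes how the 35B run compares with the other two. The numbers are copied from the summary above; the dictionary layout and the `speedup` helper are illustrative names, not part of the original article.

```python
# Timings and throughput as reported in the summary (seconds, tokens/second).
results = {
    "9B (BF16)": {"seconds": 25, "tokens_per_s": 13},
    "27B (8-bit)": {"seconds": 51, "tokens_per_s": 8},
    "35B (8-bit)": {"seconds": 13, "tokens_per_s": 52},
}


def speedup(baseline: str, candidate: str) -> dict:
    """Return how much faster `candidate` was than `baseline` (hypothetical helper)."""
    b, c = results[baseline], results[candidate]
    return {
        # Ratio of wall-clock times: values above 1 mean the candidate finished sooner.
        "time_ratio": b["seconds"] / c["seconds"],
        # Ratio of generation throughput: values above 1 mean more tokens per second.
        "throughput_ratio": c["tokens_per_s"] / b["tokens_per_s"],
    }


for name in ("9B (BF16)", "27B (8-bit)"):
    r = speedup(name, "35B (8-bit)")
    print(
        f"35B (8-bit) vs {name}: "
        f"{r['time_ratio']:.1f}x shorter run, {r['throughput_ratio']:.1f}x tokens/s"
    )
```

These ratios describe a single extraction task on one machine, so they are best read as a rough indication and not as a general ranking.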
Key takeaway
For AI engineers deploying Qwen 3.5 models on local Apple Silicon hardware, consider testing the 35B variant with 8-bit quantization: in this evaluation it outpaced both the 9B model in BF16 and the 27B model at 8-bit. Crucially, disable the default "thinking mode" in MLX VLM via `enable_thinking=false` to achieve substantial inference speed improvements and shorter response times for data extraction tasks.
Key insights
Disabling "thinking mode" and using 8-bit quantization can significantly boost LLM inference speed on local hardware.
Principles
- Larger models are not always slower.
- Quantization impacts speed and memory.
- Default settings may not be optimal.
Method
Evaluate LLMs by running data extraction tasks with JSON output, comparing BF16 and 8-bit quantization, and disabling default "thinking mode" for performance gains.
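A comparison like this can be driven by a small timing harness. The sketch below is a minimal illustration and does not reproduce the original setup: `generate_fn` is a hypothetical callable standing in for whatever runs the model (for example, a wrapper around an MLX VLM call), and `fake_generate` is a stub so the example runs without any model installed.

```python
import time
from typing import Callable, Tuple


def benchmark(generate_fn: Callable[[str], Tuple[str, int]], prompt: str) -> dict:
    """Time one generation call and derive tokens/second.

    `generate_fn` is assumed to return (output_text, number_of_generated_tokens).
    """
    start = time.perf_counter()
    text, n_tokens = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return {
        "seconds": elapsed,
        "tokens": n_tokens,
        "tokens_per_s": n_tokens / elapsed if elapsed > 0 else float("inf"),
        "text": text,
    }


def fake_generate(prompt: str) -> Tuple[str, int]:
    """Stub standing in for a real model call (hypothetical)."""
    time.sleep(0.01)  # simulate inference latency
    return '{"rows": []}', 5


stats = benchmark(fake_generate, "Extract the table as JSON.")
print(f"{stats['seconds']:.3f} s, {stats['tokens_per_s']:.1f} tokens/s")
```

Running the same harness once per model and configuration, with the prompt held fixed, keeps the comparison consistent across runs.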
In practice
- Disable "thinking mode" in MLX VLM.
- Test 8-bit quantization for larger models.
- Benchmark different model sizes for specific tasks.
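When benchmarking these settings, it also helps to confirm that each response is valid JSON. The sketch below is one possible check, assuming the model may wrap its answer in a Markdown code fence or, with thinking enabled, prefix it with a `<think>...</think>` block; both behaviours and the `extract_json` helper are assumptions for illustration, not details from the original article.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep the example readable


def extract_json(raw: str):
    """Parse JSON from a model response, tolerating common wrappers (hypothetical helper)."""
    # Drop a reasoning block if one is present.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # Unwrap a fenced block such as a "json" code fence, if the model added one.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, cleaned, flags=re.DOTALL)
    if fenced:
        cleaned = fenced.group(1).strip()
    # json.loads raises json.JSONDecodeError (a ValueError) on invalid output.
    return json.loads(cleaned)


sample = "<think>checking columns</think>\n" + FENCE + 'json\n{"total": 3}\n' + FENCE
print(extract_json(sample))
```

A run that raises `json.JSONDecodeError` can then be counted as a failed extraction instead of being scored on speed alone.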
Topics
- Qwen 3.5
- Large Language Models
- Model Quantization
- Local AI Inference
- MLX VLM
Best for: Machine Learning Engineer, AI Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Andrej Baranovskij.