Qwen 3.5 Test for JSON Structured Data Extraction

· Source: Andrej Baranovskij · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

An evaluation of the new Qwen 3.5 large language models (LLMs) was conducted on a local Mac Mini M4 Pro with 64 GB RAM using the MLX VLM framework. The tests focused on data extraction from a tabular input, requesting JSON output. Three models were assessed: the 9 billion parameter model (BF16 quantization), the 27 billion parameter model (8-bit quantization), and the 35 billion parameter model (8-bit quantization). A key optimization involved disabling the default "thinking mode" in MLX VLM to significantly reduce processing time. The 9B model processed the request in 25 seconds at 13 tokens/second, consuming 60% of RAM. The 27B model took 51 seconds at 8 tokens/second, using close to 80% of RAM. Surprisingly, the 35B model completed the task in 13 seconds at 52 tokens/second, also using around 80% of RAM, demonstrating superior performance for this specific task.

Key takeaway

For AI Engineers deploying Qwen 3.5 models on local Apple Silicon hardware, consider using 8-bit quantization for larger models like the 35B variant, as it demonstrated superior speed and efficiency compared to smaller models with BF16. Crucially, disable the default "thinking mode" in MLX VLM via `enable_thinking=false` to achieve substantial inference speed improvements, optimizing resource utilization and response times for data extraction tasks.

Key insights

Disabling "thinking mode" and using 8-bit quantization can significantly boost LLM inference speed on local hardware.

Principles

Method

Evaluate LLMs by running data extraction tasks with JSON output, comparing BF16 and 8-bit quantization, and disabling default "thinking mode" for performance gains.

In practice

Topics

Best for: Machine Learning Engineer, AI Engineer, AI Student

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Andrej Baranovskij.