FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Internet of Things (IoT) & Connected Devices · Depth: Expert, quick

Summary

FrequencyFormer is a co-designed sensor-to-processor pipeline for efficient Vision Transformer (ViT) inference on sensor-edge systems, where on-device compute, energy, and bandwidth limit how much high-dimensional image data can be moved and processed. Rather than shipping raw pixels off-chip, the system builds a compact frequency-domain representation at the sensor. A multi-scale Discrete Cosine Transform (DCT) tokenizer compresses a 224x224 image into compact frequency-domain tokens, reducing off-chip data volume by up to 128x with modest accuracy loss. The tokenizer is implemented in LUT-based near-sensor hardware, which makes tokenization multiplier-free and efficient in both energy and area. A modified MIPI-based low-power communication architecture further reduces transfer energy. FrequencyFormer is a drop-in replacement for standard ViT patch embedding and is compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x.
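To make the tokenization idea concrete, here is a minimal single-scale sketch in plain Python: each patch is transformed with a 2D DCT and only its lowest-frequency coefficients are kept as the token. This is an illustration under assumptions, not the paper's tokenizer. The names `dct_basis`, `dct_tokenize`, and `reduction_ratio`, the 16x16 patch size, and the 4x4 retained block are all hypothetical choices, and the multi-scale scheme and quantization used in FrequencyFormer are not reproduced here.

```python
import math

def dct_basis(n):
    """Orthonormal DCT-II basis: basis[k][i] is frequency k sampled at position i."""
    basis = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        basis.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                      for i in range(n)])
    return basis

def dct_tokenize(image, patch=16, keep=4):
    """Return one token (keep*keep low-frequency DCT coefficients) per patch.

    `image` is a list of rows of grayscale values; `patch` and `keep` are
    illustrative defaults, not values taken from the source.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    basis = dct_basis(patch)[:keep]  # only the low-frequency rows are needed
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Vertical pass: project each patch column onto the kept frequencies.
            cols = [[sum(basis[u][i] * image[top + i][left + j] for i in range(patch))
                     for j in range(patch)]
                    for u in range(keep)]
            # Horizontal pass: project the partial results onto the kept frequencies.
            tokens.append([sum(cols[u][j] * basis[v][j] for j in range(patch))
                           for u in range(keep) for v in range(keep)])
    return tokens

def reduction_ratio(image, tokens):
    """Ratio of input values to retained coefficients (value count, not bits)."""
    return (len(image) * len(image[0])) / (len(tokens) * len(tokens[0]))
```

With these illustrative settings a 224x224 grayscale image yields 196 tokens of 16 coefficients each, a 16x reduction in value count. The sketch does not attempt to reproduce the reported 128x figure, which depends on design choices the summary does not detail.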

Key takeaway

For AI hardware engineers and computer vision engineers deploying Vision Transformers on sensor-edge systems, FrequencyFormer targets the compute, energy, and bandwidth limits of on-device inference. Frequency-domain tokenization is worth investigating as a scalable foundation for in-sensor ViT deployment: the pipeline reduces communication energy by 230x and total sensor-side energy by 2.22x, while staying compatible with existing pretrained backbones for classification, detection, and segmentation.

Key insights

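Tokenizing in the frequency domain at the sensor edge cuts off-chip data volume by up to 128x and communication energy by 230x, reducing the data movement that dominates the cost of Vision Transformer inference on sensor-edge systems and enabling efficient deployment.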

Method

FrequencyFormer tokenizes 224x224 images with a multi-scale DCT tokenizer. The tokenizer runs in LUT-based near-sensor hardware, so tokenization needs no multipliers, and the resulting tokens are sent off-chip over a modified MIPI-based low-power communication architecture.
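One generic way a lookup table can remove multipliers from a DCT is sketched below. Because the DCT coefficients are fixed and pixel values come from a small integer range, every coefficient-times-pixel product can be precomputed offline, leaving only lookups and additions at run time. The summary does not describe how FrequencyFormer organizes its tables, so the function names, the 8-point transform size, the 8-bit pixel range, and `FRACTION_BITS` are all assumptions made for illustration.

```python
import math

FRACTION_BITS = 8  # hypothetical fixed-point precision of the table entries

def build_dct_lut(n=8, levels=256):
    """Precompute lut[k][i][p] = round(c[k][i] * p * 2**FRACTION_BITS).

    c[k][i] is the orthonormal DCT-II coefficient and p an integer pixel value.
    The multiplications happen once, offline, when the table is built.
    """
    lut = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        row = []
        for i in range(n):
            c = scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            row.append([round(c * p * (1 << FRACTION_BITS)) for p in range(levels)])
        lut.append(row)
    return lut

def lut_dct_1d(pixels, lut):
    """1D DCT-II of integer pixels using only table lookups and additions.

    Returns fixed-point integers scaled by 2**FRACTION_BITS.
    """
    return [sum(lut[k][i][p] for i, p in enumerate(pixels))
            for k in range(len(lut))]
```

Dividing each output by `2**FRACTION_BITS` recovers an approximation of the floating-point DCT. The table trades memory (here 8 x 8 x 256 entries) for the absence of run-time multipliers, which is the general trade-off a LUT-based design makes.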

Best for: Research Scientist, AI Scientist, AI Hardware Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.