Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

Source: cs.CL updates on arXiv.org · Field: Technology & Digital; subfields: Artificial Intelligence & Machine Learning, Internet of Things (IoT) & Connected Devices, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A real-time conversational assistant has been developed that guides users through procedural manual tasks, such as furniture assembly, using only audio and Inertial Measurement Unit (IMU) data from a wearable device. Compared with video-based systems, this approach significantly reduces computational cost and improves user privacy. Researchers from Qualcomm Technologies, Inc. created a dataset of 600 conversations and introduced a novel User Whim Agnostic (UWA) LoRA finetuning method, applied to language models such as Qwen2.5-1.5B-Instruct and Qwen2.5-3B-Instruct. UWA finetuning improved the F-score by over 30% and sped up inference 16x by teaching the model to suppress uninformative dialogue while retaining critical instructions. The system runs entirely on edge devices built on Snapdragon W5 Gen 1 and Dragonwing IQ9 processors, with Whisper-medium for automatic speech recognition (ASR) and MeloTTS-English for text-to-speech (TTS).
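
The summary does not describe how UWA changes the training objective, but the LoRA part it builds on is standard. The sketch below shows only that generic technique in plain Python: a frozen weight W plus a trainable low-rank update scaled by alpha / r, with the usual initialization (random A, zero B). The class and function names and the 0.01 init scale are illustrative assumptions; this is not the paper's UWA recipe.

```python
# Minimal LoRA linear layer in plain Python (illustrative only; this is the
# generic LoRA technique, not the paper's UWA finetuning recipe).
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trained."""

    def __init__(self, weight: Matrix, a: Matrix, b: Matrix, alpha: float):
        self.weight = weight          # frozen pretrained weight, d_out x d_in
        self.a = a                    # trainable down-projection, r x d_in
        self.b = b                    # trainable up-projection, d_out x r
        self.scale = alpha / len(a)   # alpha / r, where r = number of rows of A

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)               # frozen path
        delta = matvec(self.b, matvec(self.a, x))   # low-rank update path
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_params(self) -> int:
        """r * (d_in + d_out) adapter parameters instead of d_in * d_out."""
        return sum(len(row) for row in self.a) + sum(len(row) for row in self.b)


def init_lora(weight: Matrix, r: int, alpha: float, seed: int = 0) -> LoRALinear:
    """Standard LoRA init: small random A, zero B, so training starts exactly at W.

    The 0.01 standard deviation is an illustrative assumption.
    """
    rng = random.Random(seed)
    d_out, d_in = len(weight), len(weight[0])
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return LoRALinear(weight, a, b, alpha)
```

Because B starts at zero, a freshly initialized adapter reproduces the base model exactly, and only r * (d_in + d_out) values are trained per adapted matrix. For a square d x d projection that is 2rd parameters instead of d², which is what keeps finetuning small models like the Qwen2.5 variants cheap.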

Key takeaway

If you are a machine learning engineer building real-time, privacy-sensitive conversational AI, consider non-video modalities such as audio and IMU signals from wearables. The User Whim Agnostic (UWA) LoRA finetuning method can improve both efficiency and user experience by cutting uninformative dialogue; in this work it also sped up inference 16x. Deploying on Qualcomm edge processors enables low-latency, cloud-independent operation, which benefits both privacy and responsiveness in procedural task assistants.
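
The summary links the speedup to suppressing uninformative dialogue but does not say how suppression is represented. One hypothetical way to wire it up is sketched below: assume the finetuned model answers "stay quiet" turns with a single marker (`<silent>` is an invented name), drop those turns before TTS, and count generated tokens with a whitespace split as a rough stand-in for decode cost. None of these names come from the paper.

```python
# Hypothetical sketch: gating assistant turns so uninformative replies are suppressed.
# ASSUMPTIONS: the model emits SILENCE_TOKEN when it should stay quiet, and a
# whitespace split approximates generated tokens. Neither detail is from the paper.
from typing import Iterable, List, Optional

SILENCE_TOKEN = "<silent>"  # hypothetical marker, not the paper's vocabulary


def gate_reply(model_output: str) -> Optional[str]:
    """Return text to send to TTS, or None if the turn should be suppressed."""
    text = model_output.strip()
    if not text or text == SILENCE_TOKEN:
        return None
    return text


def spoken_replies(outputs: Iterable[str]) -> List[str]:
    """Keep only the replies that pass the gate (these would be spoken)."""
    return [r for r in (gate_reply(o) for o in outputs) if r is not None]


def generated_tokens(outputs: Iterable[str]) -> int:
    """Rough decode-cost proxy: whitespace-separated tokens the model produced."""
    return sum(len(o.split()) for o in outputs)
```

With made-up turns such as `["Great, you picked up the screwdriver!", "Nice work, keep going.", "Now attach the next leg."]`, a chatty policy generates 15 proxy tokens, while a policy that answers the first two turns with `<silent>` generates 7 and speaks only the instruction. Decode time grows roughly with output length, so suppression saves compute as well as interruptions; the 16x figure itself comes from the paper, not from this toy.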

Key insights

A privacy-preserving, edge-deployed conversational assistant uses audio/IMU and UWA LoRA finetuning for proactive, efficient task guidance.

Principles

Method

The system captures audio and IMU signals from a smartwatch, recognizes the user's activities, transcribes user speech, and feeds the most recent dialogue turns to a LoRA-finetuned language model. A rule-based step tracker supplies task context and suggests candidate messages, while UWA finetuning optimizes how proactive instructions are delivered.
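
The glue between these components can be sketched as follows, under clearly hypothetical assumptions: each step is marked complete when the activity recognizer reports a designated label, the tracker's suggestion for the current step goes into the prompt, and only the last few turns are kept to bound prompt length on device. Step names, activity labels, the window size, and the prompt layout are all invented; activity recognition and ASR (Whisper-medium in the paper) are represented only by their string outputs.

```python
# Hypothetical sketch of the pipeline glue: a rule-based step tracker plus a
# sliding window of recent dialogue turns assembled into an LM prompt.
# ASSUMPTIONS: step names, activity labels, window size, and prompt format are
# illustrative; activity recognition and ASR are stubbed out as plain strings.
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Tuple


@dataclass
class Step:
    name: str
    completion_activity: str  # activity label whose detection marks this step done
    instruction: str          # message suggested while this step is current


class StepTracker:
    """Rule-based tracker: advance when the current step's completion activity is seen."""

    def __init__(self, steps: List[Step]):
        self.steps = steps
        self.index = 0

    @property
    def current(self) -> Optional[Step]:
        return self.steps[self.index] if self.index < len(self.steps) else None

    def update(self, activity: str) -> None:
        step = self.current
        if step is not None and activity == step.completion_activity:
            self.index += 1

    def suggestion(self) -> str:
        step = self.current
        return step.instruction if step is not None else "All steps are complete."


class DialogueContext:
    """Keeps only the most recent turns so the on-device prompt stays short."""

    def __init__(self, max_turns: int = 4):
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))

    def build_prompt(self, tracker: StepTracker) -> str:
        step = tracker.current
        lines = [
            f"Current step: {step.name if step else 'done'}",
            f"Suggested message: {tracker.suggestion()}",
            "Recent dialogue:",
        ]
        lines += [f"{speaker}: {text}" for speaker, text in self.turns]
        return "\n".join(lines)
```

In use, each recognized activity goes to `tracker.update(...)`, each ASR transcript or spoken reply goes to `ctx.add(...)`, and `ctx.build_prompt(tracker)` is what the finetuned model would see. Keeping the step logic rule-based makes it cheap and predictable, leaving the language model to decide whether and how to voice the suggestion.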

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.