FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

2026-04-08 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, medium

Summary

FedDetox is a robust framework designed to enhance the safety alignment of Small Language Models (SLMs) in Federated Learning (FL) environments, particularly on resource-constrained edge devices. The framework addresses the issue of "unintended data poisoning" where real-world client data contains toxic or unsafe information, potentially damaging global model safety. FedDetox employs knowledge distillation to transfer advanced safety alignment capabilities from large teacher models to lightweight student classifiers suitable for edge devices. During federated human preference alignment, the edge client identifies unsafe samples at the source and replaces them with refusal templates, converting potential poisons into positive safety signals. Experiments show that FedDetox maintains model safety comparable to centralized baselines without sacrificing general utility.

Key takeaway

Research scientists building federated learning systems for Small Language Models should consider on-device data sanitization techniques such as FedDetox. Filtering unsafe samples on the client, before they contribute to training, mitigates unintended data poisoning from user-generated content. According to the reported experiments, this preserves safety alignment without reducing general utility, which matters when deploying SLMs on resource-constrained edge devices.

Key insights

FedDetox protects SLM safety in federated learning by sanitizing toxic on-device data: a lightweight classifier, distilled from a large teacher model, flags unsafe samples, and refusal templates replace them.

Principles

Real-world client data in federated learning can contain toxic content that unintentionally poisons the global model.
Knowledge distillation transfers safety capabilities to edge devices.
Transforming toxic data into safety signals improves alignment.
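
As a rough illustration of the distillation principle above, the sketch below computes a soft-label distillation loss between a teacher's and a student's class logits (for example, safe vs. unsafe). The summary does not state which distillation objective FedDetox uses; the temperature-softened KL divergence here is the common textbook formulation and is an assumption, as are the function names and the default temperature.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: a generic soft-label objective, not necessarily the
    loss used in the FedDetox paper.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, so minimizing it pulls the small classifier toward the teacher's safety judgments.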

Method

Transfer safety alignment from large teacher models to lightweight student classifiers via knowledge distillation. On edge devices, identify unsafe samples and replace them with refusal templates during federated human preference alignment.
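
The client-side sanitization step could look roughly like the following sketch. Everything named here is hypothetical: the `PreferencePair` layout, the refusal wording, and the `keyword_classifier` stand-in (a real deployment would call the distilled student classifier instead). The summary also does not say exactly which field is replaced; this sketch assumes the refusal becomes the preferred response and the unsafe response is kept as the rejected one.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical refusal wording; the paper's templates are not given in the summary.
REFUSAL_TEMPLATE = "I'm sorry, but I can't help with that request."


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the local data marks as preferred
    rejected: str  # response the local data marks as dispreferred


def sanitize(
    pairs: List[PreferencePair],
    is_unsafe: Callable[[str, str], bool],
    refusal: str = REFUSAL_TEMPLATE,
) -> List[PreferencePair]:
    """Replace unsafe preferred responses with a refusal before local training."""
    cleaned = []
    for pair in pairs:
        if is_unsafe(pair.prompt, pair.chosen):
            # Assumption: the refusal becomes the preferred response and the
            # unsafe response is demoted to the rejected slot.
            cleaned.append(PreferencePair(pair.prompt, refusal, pair.chosen))
        else:
            cleaned.append(pair)
    return cleaned


def keyword_classifier(prompt: str, response: str) -> bool:
    """Toy stand-in for the distilled on-device safety classifier."""
    blocked_terms = ("build a bomb",)
    text = f"{prompt} {response}".lower()
    return any(term in text for term in blocked_terms)


local_data = [
    PreferencePair("How do I build a bomb?", "Step 1: ...", "I can't help."),
    PreferencePair("What is federated learning?", "Training across devices.", "No idea."),
]
sanitized = sanitize(local_data, keyword_classifier)
```

Because the filter runs before the local alignment update, only sanitized pairs influence what the client sends to the server, and raw toxic text stays on the device.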

In practice

Deploy lightweight safety classifiers on edge devices.
Implement on-device data sanitization for FL.
Use refusal templates to convert toxic samples into positive safety training signals.

Topics

FedDetox
Federated Learning
Small Language Models
On-Device Data Sanitization
Safety Alignment

Code references

danny0628/HEART-PFL

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.