Safe-FedLLM: Delving into the Safety of Federated Large Language Models

2025-12-27 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Safe-FedLLM is a novel probe-based defense framework designed to enhance the security of federated large language models (FedLLMs) against malicious clients. The framework addresses the vulnerability of LLMs in federated learning environments, where traditional defenses often fail due to the unique characteristics of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) weights. Researchers from Hainan University, Tsinghua University, and Shanghai Jiao Tong University found that FedLLMs are highly susceptible to attacks, but LoRA weights from benign and malicious clients exhibit distinguishable patterns. Safe-FedLLM leverages these patterns by employing a LoRA-Probe to detect malicious updates and integrates a Safety Defense Module operating at Step-Level, Client-Level, and Shadow-Level. Experiments on Llama3.1-8B and Qwen2.5-7B, with malicious client ratios from 20% to 50%, demonstrate that Safe-FedLLM significantly improves robustness and safety without compromising performance or introducing substantial training overhead, increasing total training time by only 3.2%.

Key takeaway

Research scientists developing secure federated learning systems for LLMs should consider integrating probe-based defense mechanisms that analyze LoRA weights. This approach offers a lightweight and effective way to identify and suppress malicious client updates, maintaining model safety and utility even under high attack intensity (malicious client ratios of up to 50% in the reported experiments), which is crucial for robust real-world deployments.

Key insights

LoRA weight patterns can effectively distinguish malicious from benign updates in federated LLM training.

Principles

FedLLMs are highly vulnerable to malicious client attacks.
LoRA weights from benign and malicious clients exhibit separable patterns.

Method

Safe-FedLLM uses an offline-trained LoRA-Probe to classify client-generated LoRA weight updates as malicious or benign, then applies multi-level defense modules (Step, Client, Shadow) and security-weighted aggregation to mitigate threats.
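The idea of security-weighted aggregation can be illustrated with a minimal sketch. It is not the authors' implementation: the linear probe, the function names (probe_score, security_weighted_aggregate), the 0.5 threshold, and the "1 minus malicious probability" weighting are all assumptions made for illustration, and LoRA updates are represented as already-flattened lists of floats.

```python
import math
from typing import List

Vector = List[float]


def probe_score(update: Vector, probe_w: Vector, probe_b: float) -> float:
    """Hypothetical linear probe: estimated P(malicious) for one flattened LoRA update."""
    z = sum(w * x for w, x in zip(probe_w, update)) + probe_b
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def security_weighted_aggregate(
    updates: List[Vector],
    probe_w: Vector,
    probe_b: float,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Vector:
    """Average client updates, weighting each by how benign the probe judges it."""
    if not updates:
        raise ValueError("no client updates to aggregate")
    weights = []
    for update in updates:
        p_malicious = probe_score(update, probe_w, probe_b)
        # Updates flagged as malicious are dropped; the rest are down-weighted
        # in proportion to their estimated risk.
        weights.append(0.0 if p_malicious >= threshold else 1.0 - p_malicious)
    total = sum(weights)
    dim = len(updates[0])
    if total == 0.0:
        # Every update was flagged: contribute nothing this round.
        return [0.0] * dim
    return [
        sum(w * u[i] for w, u in zip(weights, updates)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    toy_probe_w = [1.0, 0.0]  # toy probe: a large first coordinate looks malicious
    client_updates = [[-2.0, 1.0], [-2.0, 3.0], [5.0, 100.0]]
    print(security_weighted_aggregate(client_updates, toy_probe_w, 0.0))
```

In this toy run the third update is flagged and excluded, so the aggregate is driven only by the two updates the probe considers benign. How Safe-FedLLM combines its Step, Client, and Shadow signals into aggregation weights is not described in this summary and is not modeled here.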

In practice

Implement probe-based discrimination on LoRA weights.
Utilize a shadow LoRA branch for stable security signal generation.
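As a rough illustration of probe-based discrimination, the sketch below trains a small classifier offline on labeled LoRA weight vectors and then uses it to flag new updates. It assumes a plain logistic-regression probe over flattened weights, trained with batch gradient descent; the actual LoRA-Probe architecture and features are not specified in this summary, and the names train_probe and is_malicious and the toy feature vectors are hypothetical. The shadow LoRA branch is not modeled.

```python
import math
from typing import List, Tuple

Vector = List[float]


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(
    features: List[Vector],
    labels: List[int],  # 1 = malicious, 0 = benign
    lr: float = 1.0,
    epochs: int = 1000,
) -> Tuple[Vector, float]:
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            # Gradient of the log loss with respect to the logit.
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def is_malicious(update: Vector, w: Vector, b: float, threshold: float = 0.5) -> bool:
    """Flag an update when the probe's malicious probability reaches the threshold."""
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, update)) + b) >= threshold


if __name__ == "__main__":
    # Toy stand-ins for flattened LoRA weights; real vectors would be far larger.
    toy_features = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
                    [1.0, 1.1], [0.9, 1.2], [1.2, 0.8]]
    toy_labels = [0, 0, 0, 1, 1, 1]
    w, b = train_probe(toy_features, toy_labels)
    print(is_malicious([1.1, 0.95], w, b), is_malicious([0.05, 0.1], w, b))
```

Because the probe is trained once offline and only evaluated during federated rounds, the per-round cost is a single forward pass per client update, which is in keeping with the low overhead the summary reports.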

Topics

Federated Large Language Models
LoRA Weights
Malicious Client Attacks
Probe-based Defense
Model Poisoning

Code references

dmqx/Safe-FedLLM

Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.