Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Advanced, extended

Summary

LoRA-MINT is a Membership Inference Test (MINT) methodology for auditing the training data of Large Language Models (LLMs) fine-tuned with Low-Rank Adaptation (LoRA). It assesses whether specific data samples were used during fine-tuning, which matters when managing intellectual property and sensitive information. The method relies on the relationship between model perplexity and data membership to estimate data exposure systematically. In experiments across four LLMs and three benchmark datasets (CAMELMaths Instruction Dataset, Maths-College, Medical-o1-SFT), it reached precision values from 0.77 to 0.92 and AUC values from 0.77 to 0.90, surpassing existing baselines. Its robustness is attributed to refining synthetic perplexity distributions through percentile filtering and mean adjustment, and the approach is reported to apply beyond LoRA-specific models.

Key takeaway

For AI Security Engineers and Data Privacy Officers deploying domain-adapted LLMs, LoRA-MINT offers an auditing framework. Use this perplexity-based method to check systematically whether sensitive or proprietary data was used, and potentially memorized, during LoRA fine-tuning. This supports proactive risk management, helps with compliance under regulations such as the EU AI Act, and improves transparency about data exposure in your AI systems, which in turn helps safeguard intellectual property and user privacy.

Key insights

LoRA-MINT uses perplexity and synthetic samples to audit training data membership in LoRA-fine-tuned LLMs.
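
As background for the perplexity signal, here is a minimal sketch of how perplexity is commonly computed from per-token log-probabilities. The function name `perplexity` and the toy inputs are illustrative assumptions; the summary does not specify how the paper extracts token probabilities from the fine-tuned model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy inputs (hypothetical): a sequence the model predicts confidently
# versus one it predicts poorly.
confident = [math.log(0.9)] * 4
uncertain = [math.log(0.1)] * 4

# Lower perplexity means the model finds the text more predictable.
print(perplexity(confident) < perplexity(uncertain))
```

Text seen during fine-tuning tends to be more predictable to the model, which is why perplexity can serve as a membership signal.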

Method

Generate synthetic in-domain samples, filter extreme perplexity percentiles, adjust the mean, compute candidate perplexities, and classify against the refined reference mean.
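
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the percentile cut-offs (`low_pct`, `high_pct`), the additive `mean_shift`, and the decision rule (a candidate counts as a member when its perplexity falls below the refined reference mean) are all hypothetical choices, and the perplexity values are made up for the example.

```python
import statistics

def refine_reference(synthetic_ppl, low_pct=10.0, high_pct=90.0, mean_shift=0.0):
    """Trim extreme synthetic perplexities, then return an adjusted mean.

    low_pct, high_pct and mean_shift are hypothetical knobs, not values
    taken from the paper.
    """
    ordered = sorted(synthetic_ppl)
    n = len(ordered)
    lo = int(n * low_pct / 100.0)   # index of first kept value
    hi = int(n * high_pct / 100.0)  # index one past the last kept value
    kept = ordered[lo:hi] or ordered  # fall back if trimming removes everything
    return statistics.fmean(kept) + mean_shift

def classify(candidate_ppl, reference_mean):
    # Assumption: samples seen in training have lower perplexity than
    # the synthetic (unseen) reference.
    return [p < reference_mean for p in candidate_ppl]

# Made-up perplexities of synthetic in-domain samples, with two outliers.
synthetic = [8.0, 9.0, 10.0, 11.0, 12.0, 9.5, 10.5, 10.0, 2.0, 60.0]
reference = refine_reference(synthetic)

# Made-up perplexities of candidate samples under audit.
candidates = [4.2, 15.0, 9.9]
print(reference, classify(candidates, reference))
```

Trimming the tails keeps a handful of degenerate synthetic samples from dragging the reference mean, and the shift gives a single parameter for trading precision against recall.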

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.