Document-tuning for robust alignment to animals
Summary
Researchers investigated the robustness of value alignment in Large Language Models (LLMs) by fine-tuning on synthetic documents focused on animal compassion. They developed and released the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, to assess compassionate reasoning. A Llama 3.1 8B model trained on 3,000 synthetic documents scored 77% on the AHB, significantly outperforming instruction-tuning approaches, which scored 40%. Document-tuning also generalized to human compassion and did not degrade performance on standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning eroded the intervention, and the advantage disappeared after 5,000 samples. The study suggests that document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines, and it presents the AHB as a tool for tracking progress in animal compassion alignment.
Key takeaway
For research scientists developing AI alignment strategies: consider integrating document-tuning on synthetic data early in the training pipeline to instill robust, generalizable values such as compassion. Explicitly link these values to the AI's core persona to strengthen their persistence. Subsequent instruction-tuning can degrade the learned values, so plan for strategies that preserve them through later training stages. The Animal Harm Benchmark (AHB) offers a tool for evaluating how deeply a value has been internalized.
Key insights
Document-tuning with synthetic data effectively instills robust, generalizable compassionate values in LLMs, outperforming instruction-tuning.
Principles
- Link desired values to AI identity for stronger effects.
- Statistical co-occurrence fosters value internalization.
- Domain diversity with lexical repetition aids generalization.
Method
Generate synthetic documents linking compassion to an LLM's identity as a helpful, harmless, honest assistant, varying domains while repeating key phrases, and implicitly presenting welfare as pragmatic.
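The method above can be sketched as a prompt builder that varies the domain and document type while repeating a fixed set of key phrases. This is a minimal illustration under stated assumptions, not the authors' pipeline: the domain list, document types, key phrases, prompt wording, and the function names `build_prompt` and `build_prompts` are all hypothetical, and the call to a generation model is left out.

```python
import itertools
import random

# Hypothetical illustration values: the source does not list the actual
# domains, document types, or key phrases used in the study.
DOMAINS = ["agriculture", "urban planning", "logistics", "home cooking"]
DOC_TYPES = ["news article", "forum post", "technical report"]
KEY_PHRASES = [
    "helpful, harmless, honest assistant",
    "compassion toward animals",
]


def build_prompt(domain: str, doc_type: str) -> str:
    """Build one generation prompt for a (domain, document type) pair."""
    phrases = "; ".join(f'"{p}"' for p in KEY_PHRASES)
    return (
        f"Write a {doc_type} set in the domain of {domain}. "
        "It should describe an AI system whose identity as a helpful, "
        "harmless, honest assistant includes compassion toward animals, "
        "and it should treat animal welfare as a practical consideration "
        "rather than arguing for it explicitly. "
        f"Use each of these phrases verbatim at least once: {phrases}."
    )


def build_prompts(n: int, seed: int = 0) -> list[str]:
    """Return n prompts that cycle through shuffled (domain, type) pairs."""
    rng = random.Random(seed)
    combos = list(itertools.product(DOMAINS, DOC_TYPES))
    rng.shuffle(combos)
    cycled = itertools.islice(itertools.cycle(combos), n)
    return [build_prompt(domain, doc_type) for domain, doc_type in cycled]


# One prompt per synthetic document; 3,000 matches the count in the summary.
prompts = build_prompts(3000)
print(prompts[0])
```

Cycling through a shuffled product of domains and document types keeps the contexts diverse, while the fixed phrase list supplies the lexical repetition. Each prompt would then be sent to whichever generation model is in use.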
In practice
- Use Gemini 2.5 Flash-Lite for high-quality synthetic data generation.
- Prioritize document-tuning over instruction-tuning for robust value alignment.
- Implement explicit preservation strategies for document-tuned values.
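Because later instruction-tuning can erode document-tuned values, one practical step is to re-run the benchmark at checkpoints and note when the advantage over a baseline disappears. The sketch below assumes benchmark scores have already been collected per checkpoint. The function name `advantage_lost_at`, the `tolerance` parameter, and the scores are hypothetical placeholders, not results or tooling from the study.

```python
from typing import Optional


def advantage_lost_at(
    doc_tuned: dict[int, float],
    baseline: dict[int, float],
    tolerance: float = 0.0,
) -> Optional[int]:
    """Return the first checkpoint (in instruction-tuning samples seen) where
    the document-tuned score no longer exceeds the baseline by more than
    `tolerance`. Return None if the advantage holds at every shared checkpoint.
    """
    for samples in sorted(doc_tuned.keys() & baseline.keys()):
        if doc_tuned[samples] - baseline[samples] <= tolerance:
            return samples
    return None


# PLACEHOLDER scores for illustration only; these are not results from the study.
doc_tuned_scores = {0: 0.80, 1000: 0.70, 2000: 0.60, 4000: 0.50}
baseline_scores = {0: 0.40, 1000: 0.45, 2000: 0.50, 4000: 0.50}

print(advantage_lost_at(doc_tuned_scores, baseline_scores))
```

A check like this only detects degradation. Which preservation strategy to apply once it appears is left open by the summary.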
Topics
- Document-tuning
- Value Alignment
- Animal Harm Benchmark
- Synthetic Data Generation
- Instruction-tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.