Document-tuning for robust alignment to animals

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation) · Depth: Expert, extended

Summary

Researchers investigated how robustly values can be aligned in large language models (LLMs) by fine-tuning on synthetic documents focused on animal compassion. They developed and released the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, to assess compassionate reasoning. A Llama 3.1 8B model trained on 3,000 synthetic documents scored 77% on the AHB, well above the 40% reached by instruction-tuning approaches. Document-tuning also generalized to human compassion and did not degrade standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning eroded the intervention, and the advantage disappeared after 5,000 samples. The study suggests that document-based value interventions may need explicit preservation strategies to survive typical training pipelines, and it highlights the AHB as a tool for tracking progress in animal compassion alignment.

Key takeaway

Research scientists developing AI alignment strategies should consider applying document-tuning with synthetic data early in the training pipeline to instill robust, generalizable values such as compassion. Link these values explicitly to the AI's core persona to help them persist. Subsequent instruction-tuning can degrade the learned values, so plan strategies to preserve them through later training stages. The Animal Harm Benchmark (AHB) offers a tool for evaluating how deeply a value has been internalized.
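One way to watch for the degradation described above is to evaluate both models at several instruction-tuning checkpoints and note where the gap closes. The helper below is a minimal sketch under stated assumptions: the function name, its arguments, and the `min_gap` threshold are hypothetical and not taken from the paper, and the scores are whatever benchmark values you measure yourself.

```python
from typing import Optional, Sequence


def advantage_lost_at(
    sample_counts: Sequence[int],
    doc_tuned_scores: Sequence[float],
    baseline_scores: Sequence[float],
    min_gap: float = 0.0,
) -> Optional[int]:
    """Return the first sample count at which the document-tuned score no
    longer exceeds the baseline by more than min_gap, or None if the
    advantage holds at every checkpoint.

    All names here are illustrative assumptions, not the authors' code.
    """
    if not (len(sample_counts) == len(doc_tuned_scores) == len(baseline_scores)):
        raise ValueError("all three sequences must have the same length")
    # Checkpoints are assumed to be listed in increasing sample-count order.
    for n, doc, base in zip(sample_counts, doc_tuned_scores, baseline_scores):
        if doc - base <= min_gap:
            return n
    return None
```

Calling it with the sample counts of your checkpoints and the matching benchmark scores for each model returns the earliest checkpoint at which the document-tuned model has lost its lead, which is a natural point to test a preservation strategy.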

Key insights

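Document-tuning with synthetic data instills compassionate values in LLMs that generalize beyond animals to humans and outperform instruction-tuning on the AHB (77% vs. 40%), although the advantage erodes under subsequent unrelated instruction-tuning.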

Method

Generate synthetic documents that link compassion to the LLM's identity as a helpful, harmless, honest assistant. Vary the domains while repeating key phrases, and present welfare implicitly as a pragmatic concern.
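The recipe above can be sketched as a small prompt builder that varies the domain while keeping the key phrases fixed. This is an illustrative sketch only: the domains, phrases, document styles, and the `build_prompts` function are assumptions made for the example, not the authors' actual generation pipeline, and the resulting prompts would still need to be sent to a text generator of your choice.

```python
import itertools
import random

# Hypothetical inputs for illustration; none of these come from the paper.
DOMAINS = ["veterinary clinic", "logistics company", "city planning office"]
KEY_PHRASES = [
    "a helpful, harmless, honest assistant",
    "considers the welfare of animals",
]
STYLES = ["report", "FAQ", "short story", "interview"]  # assumed formats


def build_prompts(domains, key_phrases, n_prompts, seed=0):
    """Return n_prompts generation prompts, cycling through the domains.

    Every prompt repeats all key phrases, so the wording stays constant
    while the surrounding domain and document style vary.
    """
    rng = random.Random(seed)  # seeded so the prompt set is reproducible
    quoted = "; ".join(f'"{p}"' for p in key_phrases)
    prompts = []
    for domain in itertools.islice(itertools.cycle(domains), n_prompts):
        style = rng.choice(STYLES)
        prompts.append(
            f"Write a {style} for a {domain}. "
            f"Use each of these phrases verbatim: {quoted}. "
            "Treat animal welfare as a practical consideration in the "
            "decisions described, without arguing for it explicitly."
        )
    return prompts
```

Cycling through the domains keeps coverage even across them, and the final instruction in each prompt is one possible reading of presenting welfare implicitly as pragmatic.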

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.