Part 2: Turning My Writing into a Training Set
Summary
This article details a hands-on method for creating a custom training dataset to fine-tune a small language model (SLM) to mimic a personal writing style and thought process. The author aimed to develop an SLM as a "thinking partner" for theology coursework, converting approximately 40 personal documents into question-and-answer (Q&A) pairs. The process involved a "generation prompt" used with a large language model (LLM), which included an "author profile" to guide the LLM's voice and style, and strict instructions to ground answers in the source material, preventing invention. Initial attempts with all documents at once resulted in repetitive output, which was resolved by splitting documents into nine thematic batches. Using a GLM API via Anthropic's API proxy, 364 Q&A pairs were generated (327 for training, 37 for validation). This approach is termed "grounded generation," where the LLM acts as a transcriber, ensuring knowledge and positions originate from the author's source material.
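The 327/37 division reported above can be reproduced with a simple hold-out split. The summary does not say how the author drew the validation set, so the seeded shuffle below is an assumption, and `split_pairs` is a hypothetical helper name. A minimal sketch:

```python
import random

def split_pairs(pairs, n_validation, seed=0):
    """Shuffle a copy of the Q&A pairs and hold out n_validation of them.

    Assumption: a seeded random shuffle; the original article may have
    split the data differently.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    return training, validation

# Placeholder records standing in for the 364 generated pairs.
pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(364)]
training, validation = split_pairs(pairs, n_validation=37)
print(len(training), len(validation))
```

Fixing the seed keeps the validation set stable across runs, so later fine-tuning experiments are scored against the same held-out pairs.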
Key takeaway
For AI engineers or content creators building personalized language models, this "grounded generation" approach offers a practical way to create an authentic training dataset. Define a clear author profile and constrain the LLM to specific source documents so that the fine-tuned model reflects your own voice and knowledge rather than generic AI output. Grounding also keeps the LLM from attributing positions to you that you never took, which preserves fidelity to the source material.
Key insights
A method called "grounded generation" creates personalized fine-tuning datasets by constraining LLMs to source documents, ensuring authentic voice and content.
Principles
- An LLM can act as a transcriber of a personal voice, not only as a teacher.
- Grounding generation in source material prevents invention.
- Author profiles guide LLM style and voice.
Method
Develop a generation prompt including an author profile and source documents. Instruct the LLM to create Q&A pairs grounded in the sources, avoiding invention. Batch source documents by theme to maintain variety.
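The method above can be sketched as a small prompt builder. This is an illustration under stated assumptions, not the author's actual prompt: the function name `build_generation_prompt`, the section labels, the rule wording, and the JSON-lines output format are all hypothetical.

```python
from textwrap import dedent

def build_generation_prompt(author_profile, documents, n_pairs=10):
    """Assemble one generation prompt for a single thematic batch.

    documents maps a (hypothetical) file name to that document's text.
    """
    sources = "\n\n".join(
        f"### SOURCE: {name}\n{text.strip()}" for name, text in documents.items()
    )
    # Assumed wording for the grounding rules; adapt to your own material.
    rules = dedent("""\
        Rules:
        - Write every answer in the author's voice, as described in the profile.
        - Ground every answer in the sources below; do not invent positions or facts.
        - If the sources do not cover a question, skip that question.
        - Return one JSON object per line with the keys "question" and "answer".""")
    return (
        f"AUTHOR PROFILE\n{author_profile.strip()}\n\n"
        f"{rules}\n\n"
        f"Write {n_pairs} question-and-answer pairs from these sources.\n\n"
        f"{sources}"
    )

prompt = build_generation_prompt(
    "Plain, reflective prose; prefers concrete examples.",
    {"essay.txt": "Grace precedes effort."},
    n_pairs=3,
)
print(prompt)
```

Placing the profile and rules before the sources means every batch shares the same instructions and only the source section changes from one request to the next.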
In practice
- Use an author profile to define the voice you want the model to adopt.
- Split source documents into thematic batches.
- Generate diverse Q&A types: genuine, scenario, pushback.
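The batching and question-type steps above can be sketched as follows. The file names and theme labels are made up for illustration, and cycling evenly through the three question types is an assumption; the summary does not say how the author balanced them.

```python
from collections import defaultdict
from itertools import cycle

# The three Q&A types named in the summary.
QUESTION_TYPES = ("genuine", "scenario", "pushback")

def batch_by_theme(doc_themes):
    """Group document names by their theme label, keeping insertion order."""
    batches = defaultdict(list)
    for doc, theme in doc_themes.items():
        batches[theme].append(doc)
    return dict(batches)

def plan_requests(batches, pairs_per_batch):
    """List (theme, question_type) slots, rotating types within each batch."""
    plan = []
    for theme in batches:
        types = cycle(QUESTION_TYPES)  # restart the rotation for every batch
        for _ in range(pairs_per_batch):
            plan.append((theme, next(types)))
    return plan

# Hypothetical documents and themes.
doc_themes = {"a.txt": "grace", "b.txt": "ethics", "c.txt": "grace"}
batches = batch_by_theme(doc_themes)
plan = plan_requests(batches, pairs_per_batch=4)
print(batches)
print(plan)
```

Each batch would then be sent to the LLM in its own request, which is the change the summary credits with removing the repetitive output seen when all documents were submitted at once.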
Topics
- Fine-tuning
- Language Models
- Training Data Generation
- Grounded Generation
- Personalization
- Prompt Engineering
- On-device AI
Best for: Machine Learning Engineer, AI Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.