Part 2: Turning My Writing into a Training Set

Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, short

Summary

This article details a hands-on method for creating a custom training dataset to fine-tune a small language model (SLM) to mimic a personal writing style and thought process. The author wanted an SLM to serve as a "thinking partner" for theology coursework and converted approximately 40 personal documents into question-and-answer (Q&A) pairs. The process relied on a "generation prompt" given to a large language model (LLM). The prompt included an "author profile" to guide the LLM's voice and style, along with strict instructions to ground every answer in the source material rather than invent content. Feeding in all documents at once produced repetitive output, which the author resolved by splitting the documents into nine thematic batches. The author used a GLM API via Anthropic's API proxy to generate 364 Q&A pairs (327 for training, 37 for validation). The approach is termed "grounded generation": the LLM acts as a transcriber, so the knowledge and positions in the dataset originate from the author's source material.
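The summary does not say how the 364 pairs were divided into training and validation sets. As a minimal sketch, assuming a shuffled 90/10 split (an assumption, not the author's stated procedure), the following reproduces the reported 327/37 counts. The function name `split_pairs`, the seed, and the placeholder pairs are hypothetical.

```python
import random


def split_pairs(pairs, train_fraction=0.9, seed=0):
    """Shuffle Q&A pairs and split them into training and validation lists.

    The 90/10 fraction is an assumption; the source only reports the counts.
    """
    shuffled = list(pairs)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # floor of the training share
    return shuffled[:cut], shuffled[cut:]


# Placeholder pairs standing in for the 364 generated Q&A pairs.
pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(364)]
train, val = split_pairs(pairs)
print(len(train), len(val))  # 327 37
```

Shuffling before the cut keeps each thematic batch from landing entirely in one of the two sets, which matters when the pairs were generated batch by batch.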

Key takeaway

For AI engineers or content creators building personalized language models, this "grounded generation" approach offers a way to build training datasets that stay faithful to the author's own material. Define a clear author profile and constrain the LLM to specific source documents so that the fine-tuned model reflects your voice and knowledge rather than generic AI output. Grounding reduces the risk that the model "hallucinates" positions you never held.

Key insights

A method called "grounded generation" creates personalized fine-tuning datasets by constraining LLMs to source documents, ensuring authentic voice and content.

Principles

Method

Develop a generation prompt including an author profile and source documents. Instruct the LLM to create Q&A pairs grounded in the sources, avoiding invention. Batch source documents by theme to maintain variety.
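The method above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's actual prompt or code: the document fields (`title`, `theme`, `text`), the function names, the prompt wording, and the sample data are all hypothetical, and the call to the LLM itself is left out.

```python
from collections import defaultdict


def batch_by_theme(documents):
    """Group documents by their 'theme' label so each LLM call covers one theme."""
    batches = defaultdict(list)
    for doc in documents:
        batches[doc["theme"]].append(doc)
    return dict(batches)


def build_generation_prompt(author_profile, batch, num_pairs=5):
    """Assemble a prompt from the author profile, one batch, and grounding rules."""
    sources = "\n\n".join(f"[{doc['title']}]\n{doc['text']}" for doc in batch)
    return (
        f"AUTHOR PROFILE:\n{author_profile}\n\n"
        f"SOURCE DOCUMENTS:\n{sources}\n\n"
        f"TASK: Write {num_pairs} question-and-answer pairs in the author's voice.\n"
        "Ground every answer in the source documents above. "
        "Do not invent positions, facts, or examples that the sources do not contain."
    )


# Hypothetical sample data; real inputs would be the author's own documents.
documents = [
    {"title": "Essay A", "theme": "ethics", "text": "First sample text."},
    {"title": "Essay B", "theme": "ethics", "text": "Second sample text."},
    {"title": "Notes C", "theme": "history", "text": "Third sample text."},
]
author_profile = "Writes in short, direct sentences and argues from primary texts."

batches = batch_by_theme(documents)
prompts = {
    theme: build_generation_prompt(author_profile, batch)
    for theme, batch in batches.items()
}
print(prompts["ethics"])
```

Building one prompt per theme means each request only sees the documents for that theme, which is the batching step the summary credits with reducing repetitive output.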

Topics

Best for: Machine Learning Engineer, AI Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.