「データ不足」の壁を越える:合成ペルソナが日本のAI開発を加速
Summary
NTT DATA's new research, published February 19, 2026, demonstrates how synthetic data can overcome Japan's critical data scarcity challenge in AI development, which hinders the country's potential to create over 100 trillion yen (650 billion USD) in economic value. The study utilized NVIDIA Nemotron-Personas-Japan, an open synthetic dataset of 6 million personas based on Japanese demographics, generated with NeMo Data Designer. By augmenting just 450 raw seed samples with 500 personas from this dataset, NTT DATA generated over 138,000 synthetic training data points, improving their proprietary LLM, "tsuzumi 2," accuracy from 15.3% to 79.3% in legal Q&A tasks. This 60-percentage-point improvement was achieved without exposing sensitive data and eliminated hallucinations. The findings suggest that Continuous Pre-Training (CPT) may not be essential for some use cases if sufficient synthetic data for Supervised Fine-Tuning (SFT) is available, leading to reduced computing costs and faster development cycles.
Key takeaway
For NLP Engineers and CTOs building AI systems in data-scarce or privacy-sensitive domains like Japan, consider integrating synthetic data generation into your development pipeline. NTT DATA's results show that using open synthetic datasets like Nemotron-Personas-Japan can dramatically improve model accuracy and consistency while reducing reliance on costly Continuous Pre-Training. This approach enables the creation of high-performance, culturally-rooted, and privacy-compliant AI models, accelerating development cycles and lowering computational expenses.
Key insights
Synthetic data can overcome data scarcity and privacy concerns to accelerate AI development, especially in culturally specific domains.
Principles
- Synthetic data improves model accuracy and consistency.
- Privacy-Enhancing Technologies (PET) balance compliance and performance.
- Sovereign data spaces foster secure, collaborative AI development.
Method
Augment minimal proprietary seed data with large-scale synthetic personas to generate extensive training datasets for Supervised Fine-Tuning (SFT), potentially bypassing Continuous Pre-Training (CPT).
In practice
- Use Nemotron-Personas-Japan for Japanese-centric AI.
- Explore NeMo Data Designer for synthetic data generation.
- Prioritize SFT with synthetic data to reduce compute costs.
Topics
- Synthetic Data Generation
- Japan AI Development
- Nemotron-Personas-Japan
- Privacy-Enhancing Technologies
- LLM Fine-tuning
Best for: NLP Engineer, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, Data Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.