Breaking Through the "Data Scarcity" Wall: Synthetic Personas Accelerate Japan's AI Development

2026-02-19 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Emerging Technologies & Innovation · Depth: Intermediate, medium

Summary

NTT DATA's new research, published February 19, 2026, demonstrates how synthetic data can overcome Japan's critical data scarcity challenge in AI development, a challenge that limits the country's potential to create over 100 trillion yen (650 billion USD) in economic value. The study used NVIDIA Nemotron-Personas-Japan, an open synthetic dataset of 6 million personas based on Japanese demographics and generated with NeMo Data Designer. By augmenting just 450 raw seed samples with 500 personas from this dataset, NTT DATA generated over 138,000 synthetic training data points, raising the accuracy of its proprietary LLM, "tsuzumi 2," from 15.3% to 79.3% on legal Q&A tasks. This 64-percentage-point improvement was achieved without exposing sensitive data, and the study reports that hallucinations were eliminated. The findings suggest that Continuous Pre-Training (CPT) may not be essential for some use cases if sufficient synthetic data for Supervised Fine-Tuning (SFT) is available, which would reduce computing costs and shorten development cycles.

Key takeaway

For NLP engineers and CTOs building AI systems in data-scarce or privacy-sensitive settings, as in Japan, consider integrating synthetic data generation into your development pipeline. NTT DATA's results show that open synthetic datasets such as Nemotron-Personas-Japan can dramatically improve model accuracy and consistency while reducing reliance on costly Continuous Pre-Training. This approach enables high-performance, culturally rooted, and privacy-compliant AI models, with faster development cycles and lower compute costs.

Key insights

Synthetic data can overcome data scarcity and privacy concerns to accelerate AI development, especially in culturally specific domains.

Principles

Synthetic data improves model accuracy and consistency.
Privacy-Enhancing Technologies (PET) balance compliance and performance.
Sovereign data spaces foster secure, collaborative AI development.

Method

Augment minimal proprietary seed data with large-scale synthetic personas to generate extensive training datasets for Supervised Fine-Tuning (SFT), potentially bypassing Continuous Pre-Training (CPT).
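
A minimal sketch of the seed-times-persona expansion described above, under loudly stated assumptions: the `Seed` and `Persona` classes, their field names, and the prompt wording are hypothetical illustrations, not NTT DATA's pipeline or the actual Nemotron-Personas-Japan schema. It only shows how pairing a small seed set with many personas multiplies the number of generation prompts; sending those prompts to a generator model and filtering the results is left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    # Hypothetical fields; the real dataset schema may differ.
    occupation: str
    region: str
    age: int


@dataclass(frozen=True)
class Seed:
    # Hypothetical shape for one proprietary seed sample.
    question: str
    answer: str


def build_generation_prompts(seeds, personas):
    """Pair every seed with every persona; emit one generation prompt per pair."""
    prompts = []
    for seed, persona in product(seeds, personas):
        prompts.append(
            f"Rewrite the question below as it might be asked by a "
            f"{persona.age}-year-old {persona.occupation} living in "
            f"{persona.region}, keeping the legal substance unchanged.\n"
            f"Question: {seed.question}\n"
            f"Reference answer: {seed.answer}"
        )
    return prompts


# Toy inputs with placeholder content (not real seed data).
seeds = [
    Seed("Placeholder legal question A", "Placeholder answer A"),
    Seed("Placeholder legal question B", "Placeholder answer B"),
]
personas = [
    Persona("nurse", "Osaka", 34),
    Persona("farmer", "Hokkaido", 58),
    Persona("student", "Fukuoka", 21),
]

prompts = build_generation_prompts(seeds, personas)

# Upper bound if all 450 seeds were paired with all 500 personas.
max_pairs = 450 * 500
```

With the figures from the summary, a full cross product would yield at most 225,000 pairs; the reported total is over 138,000 data points, and the summary does not say how pairs were selected or filtered.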

In practice

Use Nemotron-Personas-Japan for Japanese-centric AI.
Explore NeMo Data Designer for synthetic data generation.
Prioritize SFT with synthetic data to reduce compute costs.

Topics

Synthetic Data Generation
Japan AI Development
Nemotron-Personas-Japan
Privacy-Enhancing Technologies
LLM Fine-tuning

Best for: NLP Engineer, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.