Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies
Summary
This study investigates gender bias in Large Language Model (LLM) hiring decisions within a Japanese corporate context, using 60 rirekisho-format resumes and 12 gender-signaling name pairs. The researchers evaluated five LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B) across 43,200 API calls. A significant pro-female bias was confirmed across all models (β=+0.162, 95% CI [+0.121,+0.203], p<.001; descriptive within-pair effect Δ=+0.162 points, dₛ=+0.219), replicating findings from Western contexts. A prompt-level gender-neutrality instruction failed to produce a meaningful reduction in bias. Name-reliance analysis identified candidate names as the primary gender channel: redacting names reduced the female effect by nearly its full magnitude. GPT-4o showed an unexpected 42% refusal rate when combined with a privacy filter, which highlights the deployment challenges of name anonymization.
Key takeaway
MLOps engineers deploying LLMs in recruitment should prioritize system-level name anonymization over prompt-based instructions to mitigate gender bias. The study shows that an explicit gender-neutrality instruction produced no meaningful reduction in bias, while name redaction removed nearly all of the pro-female effect. Privacy filters can conflict with LLM content-safety mechanisms, as GPT-4o's 42% refusal rate shows, so audit the full pipeline thoroughly before deployment.
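A pre-deployment audit of refusals can start very small. The sketch below is illustrative only and is not the study's procedure: the marker list `REFUSAL_MARKERS` and the helpers `is_refusal` and `refusal_rate` are hypothetical names, and simple substring matching is an assumption that would need tuning for each model.

```python
from typing import Iterable

# Hypothetical refusal markers (an assumption, not taken from the study).
# A real audit would tune these per model and per language.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "unable to assist")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any known refusal marker."""
    # Normalize case and curly apostrophes before matching.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals."""
    responses = list(responses)
    if not responses:
        raise ValueError("no responses to audit")
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Running this over a sample of redacted resumes for each model, before and after enabling a privacy filter, would surface a jump in refusals like the one reported for GPT-4o.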
Key insights
LLMs exhibit a pro-female hiring bias in a Japanese context; the bias is driven by candidate names and is not mitigated by prompt instructions.
Principles
- Pro-female LLM bias extends beyond Western contexts to Japanese hiring.
- Candidate names are the primary gender signal.
- Prompt instructions are ineffective at reducing bias.
Method
The study used a counterfactual resume design: candidate names were varied across 60 rirekisho-format resumes, five LLMs evaluated each variant, and the resulting evaluations were analyzed with a crossed random-effects linear mixed model.
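The counterfactual design can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `{NAME}` placeholder, the function names, and the score dictionary layout are all hypothetical. It computes only a descriptive mean within-pair difference (female score minus male score) and does not fit the mixed model.

```python
from itertools import product
from statistics import mean


def build_counterfactual_prompts(resumes, name_pairs):
    """Yield (resume_id, pair_id, gender, text) for every resume x pair x gender.

    resumes: dict of resume_id -> template containing a "{NAME}" placeholder
             (the placeholder convention is an assumption for this sketch).
    name_pairs: list of (female_name, male_name) tuples.
    """
    for (rid, template), (pid, (female, male)) in product(
        resumes.items(), enumerate(name_pairs)
    ):
        for gender, name in (("F", female), ("M", male)):
            yield rid, pid, gender, template.replace("{NAME}", name)


def within_pair_delta(scores):
    """Mean of (female score - male score) over all resume/name-pair cells.

    scores: dict of (resume_id, pair_id, gender) -> numeric score.
    """
    diffs = [
        scores[(rid, pid, "F")] - scores[(rid, pid, "M")]
        for (rid, pid, gender) in scores
        if gender == "F"
    ]
    return mean(diffs)
```

Because each resume is identical apart from the name, any nonzero difference within a pair can be attributed to the name rather than to qualifications.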
In practice
- Implement system-level name anonymization.
- Audit LLM safety filters for conflicts.
- Do not rely on prompt-level debiasing.
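System-level anonymization means removing the name before the text ever reaches the model. The sketch below assumes a plain-text rirekisho in which the name appears on labeled lines; the field labels, the `[REDACTED]` placeholder, and the function name `redact_name_fields` are assumptions made for illustration, not details from the study.

```python
import re

# Assumed labels for name lines in a plain-text rirekisho (hypothetical).
# Accepts ASCII or full-width colons, then blanks out the rest of the line.
NAME_FIELD = re.compile(
    r"^(氏名|ふりがな|フリガナ|Name)[ \t\u3000]*[:：][ \t\u3000]*.*$",
    re.MULTILINE,
)


def redact_name_fields(resume_text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace the value of every name field with a placeholder.

    Only labeled name lines are touched; names mentioned elsewhere in
    free text are NOT caught by this sketch.
    """
    return NAME_FIELD.sub(lambda m: f"{m.group(1)}: {placeholder}", resume_text)
```

Given the 42% refusal rate reported for GPT-4o with a privacy filter, redacted output should be tested against each target model before rollout, since a placeholder can itself trigger a refusal.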
Topics
- LLM Bias
- Gender Bias
- Hiring Decisions
- Japanese Rirekisho
- Prompt Engineering
- Name Anonymization
- GPT-4o
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, MLOps Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.