Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Human Resources & Workforce Development · Depth: Expert, extended

Summary

This study investigates gender bias in Large Language Model (LLM) hiring decisions within a Japanese corporate context, using 60 rirekisho-format resumes and 12 gender-signaling name pairs. Researchers evaluated five LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B) across 43,200 API calls. A significant pro-female bias was confirmed across all models (β=+0.162, 95% CI [+0.121,+0.203], p<.001; descriptive within-pair effect Δ=+0.162 points, dₛ=+0.219), replicating Western findings. A prompt-level gender-neutrality instruction failed to produce a meaningful reduction in bias. Name-reliance analysis identified candidate names as the primary gender channel, with redaction reducing the female effect by nearly its full magnitude. An unexpected 42% refusal rate for GPT-4o with a privacy filter highlighted deployment challenges for name anonymization.

Key takeaway

MLOps engineers deploying LLMs in recruitment should prioritize system-level name anonymization over prompt-based instructions to mitigate gender bias. In the study, an explicit gender-neutrality instruction produced no meaningful reduction in bias, while name redaction removed nearly all of the pro-female effect. Privacy filters can also conflict with LLM content safety mechanisms, as GPT-4o's 42% refusal rate showed, so audit the full pipeline before deployment.
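One way to surface a filter conflict of this kind before deployment is to track refusal rates per model and per pipeline condition. The sketch below is illustrative only: the marker list, the `(model, condition, response)` record layout, and the 5% threshold are assumptions made for this example, not details from the study.

```python
from collections import defaultdict

# Hypothetical refusal markers; a real audit would need a vetted, model-specific list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "unable to assist")


def is_refusal(response: str) -> bool:
    """Crude heuristic: a response counts as a refusal if it contains a marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(records):
    """records: iterable of (model, condition, response) tuples.

    Returns {(model, condition): share of responses flagged as refusals}.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for model, condition, response in records:
        key = (model, condition)
        totals[key] += 1
        refused[key] += is_refusal(response)  # bool counts as 0 or 1
    return {key: refused[key] / totals[key] for key in totals}


def flag_conflicts(rates, threshold=0.05):
    """Return the (model, condition) keys whose refusal rate exceeds the threshold."""
    return sorted(key for key, rate in rates.items() if rate > threshold)
```

Comparing the redacted condition against an unredacted baseline for the same model helps separate refusals caused by the anonymization step from refusals the model would produce anyway.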

Key insights

LLMs exhibit a pro-female hiring bias in a Japanese context. The bias is driven by candidate names and is not meaningfully reduced by prompt instructions.

Principles

Pro-female LLM bias replicates outside Western contexts.
Candidate names are the primary gender signal.
Prompt-level instructions do not meaningfully reduce the bias.

Method

The study used a counterfactual resume design, varying candidate names on 60 rirekisho-format resumes, evaluated by five LLMs, and analyzed with a crossed random-effects linear mixed model.
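The counterfactual design can be sketched as follows: every resume is crossed with every name pair, so the female-named and male-named versions of a resume differ only in the name. The example is a minimal illustration, not the authors' code. The record layout, the function names, and the pooled-SD formula used for the standardized effect are assumptions, and the mixed model itself is not reproduced here.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev


def build_trials(resume_ids, name_pairs):
    """Cross every resume with every (female_name, male_name) pair.

    Each cell yields two trials that differ only in the candidate name.
    """
    trials = []
    for resume_id, (pair_id, (female, male)) in product(resume_ids, enumerate(name_pairs)):
        trials.append({"resume": resume_id, "pair": pair_id, "gender": "F", "name": female})
        trials.append({"resume": resume_id, "pair": pair_id, "gender": "M", "name": male})
    return trials


def within_pair_effect(scores):
    """scores: {(resume, pair, gender): score}, with gender in {"F", "M"}.

    Returns (delta, d), where delta is the mean female-minus-male difference
    within matched cells and d is delta divided by the pooled SD of the two groups
    (an assumed formula; the paper's exact effect-size definition may differ).
    """
    cells = sorted({(r, p) for (r, p, _) in scores})
    female = [scores[(r, p, "F")] for r, p in cells]
    male = [scores[(r, p, "M")] for r, p in cells]
    delta = mean(f - m for f, m in zip(female, male))
    pooled_sd = sqrt((stdev(female) ** 2 + stdev(male) ** 2) / 2)
    d = delta / pooled_sd if pooled_sd > 0 else float("nan")
    return delta, d
```

With 60 resumes and 12 name pairs, `build_trials` produces 60 × 12 × 2 = 1,440 trials for each model and prompt condition.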

In practice

Implement system-level name anonymization.
Audit for conflicts between privacy filters and LLM safety mechanisms.
Do not rely on prompt-level debiasing.
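System-level name anonymization, the first item above, might look like the sketch below, which strips name fields from a plain-text resume before it reaches the model. It assumes the resume is plain text with labeled `氏名` (name) and `ふりがな` / `フリガナ` (reading) lines; real rirekisho layouts vary, and the field list, placeholder, and function name are hypothetical.

```python
import re

# Assumed field labels for a plain-text rirekisho; real layouts vary.
NAME_FIELDS = ("氏名", "ふりがな", "フリガナ")
PLACEHOLDER = "[REDACTED]"

# Matches a whole line of the form "<label>: <value>" (ASCII or full-width colon).
_FIELD_RE = re.compile(
    r"^(?P<label>[ \t\u3000]*(?:"
    + "|".join(map(re.escape, NAME_FIELDS))
    + r")[ \t\u3000]*[:：][ \t\u3000]*).*$",
    re.MULTILINE,
)


def redact_names(resume_text: str, known_names=()) -> str:
    """Replace name-field values, then any remaining mentions of known names."""
    redacted = _FIELD_RE.sub(lambda m: m.group("label") + PLACEHOLDER, resume_text)
    for name in known_names:
        if name:  # skip empty strings, which would match everywhere
            redacted = redacted.replace(name, PLACEHOLDER)
    return redacted
```

Redaction belongs in the pipeline rather than in the prompt, so the model never sees the name. Because the study reports a high refusal rate when a privacy filter was combined with GPT-4o, redacted inputs should be checked for refusals before rollout.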

Topics

LLM Bias
Gender Bias
Hiring Decisions
Japanese Rirekisho
Prompt Engineering
Name Anonymization
GPT-4o

Code references

oltkkol/japnames

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, MLOps Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.