Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, long

Summary

A new hybrid pre-training objective for text encoders, "Predict and Reconstruct," combines a Joint Embedding Predictive Architecture (JEPA)-style latent-space prediction loss with a Masked Language Modeling (MLM) reconstruction loss. This approach uses a single shared encoder and a learnable scalar λ to balance the objectives during training. Researchers pre-trained both the hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and NVIDIA H100 compute budgets. Extensive analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) and four pooling strategies revealed the hybrid encoder produces significantly more uniform embeddings (uniformity ≤ -0.16 vs. -0.05 for MLM), exhibits richer spectral geometry, and encodes less surface-level lexical information. Despite similar linear-probe downstream accuracy, these geometric differences are consistent and significant, suggesting the JEPA objective reshapes the latent space for deeper semantic structure.

Key takeaway

AI scientists and machine learning engineers developing language models should consider hybrid pre-training objectives that combine latent-space prediction with masked language modeling. Linear-probe accuracy may not improve immediately, but the approach yields more uniform embeddings and richer semantic representations, properties that may matter more for non-linear downstream tasks or retrieval. Use geometric metrics such as uniformity and effective rank to assess representational quality beyond standard accuracy scores.

Key insights

Combining JEPA latent prediction with MLM improves text encoder embedding uniformity and semantic-lexical balance without immediate linear-probe accuracy gains.

Principles

Latent-space prediction fosters abstract representations.
MLM alone prioritizes surface-form lexical details.
Geometric metrics reveal hidden representation quality.

Method

A shared encoder processes the input tokens for two branches. The JEPA branch uses block masking and a cosine prediction loss against an EMA-updated target encoder. The MLM branch uses BERT-style masking and a cross-entropy loss via a token regressor. A learnable scalar λ balances the two losses.
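
To make the two-branch objective concrete, here is a minimal NumPy sketch of the loss terms named above. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how λ enters the loss, so the sigmoid-weighted convex combination in `combined_loss` is an assumption, as are the function names, the EMA decay of 0.996, and the toy tensor shapes.

```python
import numpy as np

def cosine_prediction_loss(pred, target):
    """JEPA-style term: mean (1 - cosine similarity) between predicted and target latents."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

def cross_entropy_loss(logits, labels):
    """MLM term: mean cross-entropy over masked positions. logits: (N, V); labels: (N,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def combined_loss(jepa_loss, mlm_loss, lam_raw):
    """ASSUMPTION: weight = sigmoid(lam_raw); the source only says lambda is learnable."""
    w = 1.0 / (1.0 + np.exp(-lam_raw))
    return w * jepa_loss + (1.0 - w) * mlm_loss

def ema_update(target_params, online_params, decay=0.996):
    """Move target-encoder parameters toward the online encoder (decay is hypothetical)."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target_params, online_params)]

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
pred_latents = rng.normal(size=(6, 16))     # predictor output for masked blocks
target_latents = rng.normal(size=(6, 16))   # EMA target-encoder output (no gradient)
mlm_logits = rng.normal(size=(6, 100))      # token logits at masked positions
mlm_labels = rng.integers(0, 100, size=6)   # original token ids

total = combined_loss(
    cosine_prediction_loss(pred_latents, target_latents),
    cross_entropy_loss(mlm_logits, mlm_labels),
    lam_raw=0.0,  # sigmoid(0) = 0.5, i.e. equal weighting at initialization
)
```

In a real training loop, `lam_raw` would be a trainable parameter updated by the optimizer alongside the shared encoder, and `ema_update` would run after each optimizer step.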

In practice

Evaluate embeddings using uniformity and spectral metrics.
Consider max pooling for spectral richness analysis.
Use attention pooling to amplify objective differences.
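
The geometric checks suggested above can be sketched in a few lines of NumPy. This is a hedged illustration: the summary does not give the paper's exact definitions, so the code assumes the commonly used pairwise Gaussian-potential form of uniformity (with temperature `t=2.0`) and the entropy-of-singular-values form of effective rank. Values from this sketch are therefore not guaranteed to be on the same scale as the figures quoted in the summary. The function names and toy shapes are hypothetical.

```python
import numpy as np

def max_pool(token_states, mask):
    """Max over non-padding tokens. token_states: (B, T, D); mask: (B, T) of 0/1."""
    masked = np.where(mask[..., None].astype(bool), token_states, -np.inf)
    return masked.max(axis=1)

def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs of L2-normalized embeddings.
    More negative means the embeddings are spread more evenly on the unit sphere."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))

def effective_rank(embeddings, eps=1e-12):
    """exp(entropy) of the normalized singular values of the centered embedding matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions
    return float(np.exp(-np.sum(p * np.log(p))))

# Toy usage: random token states standing in for encoder outputs.
rng = np.random.default_rng(0)
states = rng.normal(size=(8, 5, 16))   # 8 sentences, 5 tokens, 16 dimensions
mask = np.ones((8, 5), dtype=int)      # no padding in this toy batch
pooled = max_pool(states, mask)
print(uniformity(pooled), effective_rank(pooled))
```

Running the same two metrics on embeddings from each encoder, under each pooling strategy, gives a like-for-like comparison that does not depend on a downstream probe.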

Topics

Self-Supervised Learning
Language Representation Learning
Joint Embedding Predictive Architectures
Masked Language Modeling
Embedding Uniformity
Representation Geometry
GLUE Benchmark

Code references

aymen-000/predict-reconstruct-language-models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.