Controlled Paraphrase Geometry in Sentence Embedding Space: Local Manifold Modeling and Latent Probing

2026-05-05 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

This paper investigates the local geometry of sentence embedding spaces, specifically how controlled semantic variation is organized when semantically close sentences are encoded. The authors propose a local geometric modeling scheme that fits affine, quadratic, and cubic models to approximate low-dimensional nonlinear structure within embedding clouds. They also introduce a surface-based latent probing procedure that constructs synthetic latent points in a reduced PCA space. The validity of these points is evaluated by their consistency with the fitted surface, with the neighborhood structure, and with the empirical distribution, and by the stability of the local second-order shape and of the model coefficients. Experiments on the CoPaGE-300K dataset, a controlled template-based collection of over 300,000 sentence variants, show that nonlinear models describe embedding clouds more accurately than affine models. Surface-based generation achieves strong fidelity to the fitted geometry, but that fidelity does not automatically improve downstream classification performance: geometric validity and discriminative utility are distinct properties.

Key takeaway

NLP engineers and research scientists working with sentence embeddings should recognize that controlled semantic variation creates measurable, nonlinear geometric structure in embedding spaces. Surface-based generation can produce synthetic points that are geometrically consistent with this structure, but such points do not by themselves improve downstream classification performance. For classification tasks, consider combining geometry-aware latent generation with discriminative filtering, so that the synthetic points selected are those that contribute meaningfully to decision boundaries rather than those that merely score well on geometric fidelity.
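The summary does not say how discriminative filtering would be done, so the following is only one possible reading, sketched under loud assumptions: a nearest-centroid rule stands in for the downstream classifier, and a synthetic point is kept when it is assigned to its intended label and lies close to the decision boundary. The function name `filter_by_margin`, the `max_margin` threshold, and the toy data are all hypothetical.

```python
import numpy as np

def filter_by_margin(S, labels, centroids, max_margin):
    """Boolean mask over synthetic points S (n, d).

    Assumption: a nearest-centroid rule is a stand-in classifier.
    A point is kept if it is assigned to its intended label and the gap
    between its two smallest centroid distances is at most max_margin,
    i.e. it sits near the decision boundary.
    """
    # Pairwise distances between points and class centroids: (n, n_classes)
    d = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
    ordered = np.sort(d, axis=1)
    margin = ordered[:, 1] - ordered[:, 0]
    correct = np.argmin(d, axis=1) == labels
    return correct & (margin <= max_margin)

# Toy example: two classes on a line, three synthetic points meant for class 0.
centroids = np.array([[0.0, 0.0], [2.0, 0.0]])
S = np.array([[0.9, 0.0], [0.1, 0.0], [1.2, 0.0]])
labels = np.array([0, 0, 0])
mask = filter_by_margin(S, labels, centroids, max_margin=0.5)
print(mask)
```

In a real pipeline the stand-in rule would be replaced by the actual classifier's margin or confidence; the point of the sketch is only that selection uses a discriminative signal in addition to geometric fit.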

Key insights

Controlled semantic variation induces measurable, nonlinear local geometry in sentence embedding spaces.

Principles

Nonlinear models describe embedding clouds more accurately than affine models.
Geometric validity does not automatically imply discriminative utility.
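As a rough illustration of how the first principle could be checked on a single cloud, the sketch below compares held-out error for an affine and a quadratic fit. It assumes, which the summary does not confirm, that the model expresses the last reduced coordinate as a polynomial in the others. The data is synthetic, and `poly_features` and `heldout_rmse` are hypothetical helper names.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(U, degree):
    """All monomials of the columns of U up to the given total degree,
    including the constant term."""
    n, d = U.shape
    cols = [np.ones(n)]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(U[:, list(idx)], axis=1))
    return np.column_stack(cols)

def heldout_rmse(U, y, degree, train_frac=0.7, seed=0):
    """Least-squares polynomial fit on a random training split,
    RMSE reported on the remaining points."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    tr, te = perm[:n_train], perm[n_train:]
    coef, *_ = np.linalg.lstsq(poly_features(U[tr], degree), y[tr], rcond=None)
    resid = poly_features(U[te], degree) @ coef - y[te]
    return float(np.sqrt(np.mean(resid ** 2)))

# Synthetic curved cloud standing in for a PCA-reduced embedding cloud.
rng = np.random.default_rng(0)
U = rng.uniform(-1.0, 1.0, size=(300, 2))
y = 0.5 * U[:, 0] ** 2 - 0.3 * U[:, 0] * U[:, 1] + 0.1 * U[:, 1]
y = y + 0.01 * rng.normal(size=300)

rmse_affine = heldout_rmse(U, y, degree=1)
rmse_quadratic = heldout_rmse(U, y, degree=2)
print(rmse_affine, rmse_quadratic)
```

Held-out error is used rather than training error because the quadratic model nests the affine one and would never fit the training points worse.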

Method

Project embedding clouds to a reduced PCA space, fit low-degree surfaces (quadratic/cubic), then generate synthetic latent points by barycentric initialization followed by projection onto the fitted surface.
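The generation step described above can be sketched as follows, with several assumptions that are not taken from the paper: the input `Z` is already PCA-reduced, the surface is the graph of a quadratic in the leading coordinates, barycentric weights are drawn from a flat Dirichlet distribution, and projection uses a simple first-order iteration. All function names are hypothetical.

```python
import numpy as np

def fit_quadratic_graph(Z):
    """Least-squares fit of z_last ~ c + b.u + u^T A u, where u = Z[:, :-1]."""
    U, y = Z[:, :-1], Z[:, -1]
    n, d = U.shape
    iu = np.triu_indices(d)
    quad = U[:, iu[0]] * U[:, iu[1]]          # products u_i * u_j for i <= j
    Phi = np.column_stack([np.ones(n), U, quad])
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    c, b = coef[0], coef[1:1 + d]
    A = np.zeros((d, d))
    A[iu] = coef[1 + d:]
    A = (A + A.T) / 2                         # symmetric, same quadratic form
    return c, b, A

def surface_residual(z, c, b, A):
    """g(z) = z_last - f(u); zero exactly on the fitted surface."""
    u = z[:-1]
    return z[-1] - (c + b @ u + u @ A @ u)

def project_to_surface(z, c, b, A, n_iter=50, tol=1e-10):
    """Move z onto {g = 0} with minimum-norm Newton steps along grad g."""
    z = np.array(z, dtype=float)
    for _ in range(n_iter):
        g = surface_residual(z, c, b, A)
        if abs(g) < tol:
            break
        grad = np.append(-(b + 2 * A @ z[:-1]), 1.0)
        z -= g * grad / (grad @ grad)
    return z

def barycentric_init(neighbors, rng):
    """Random convex combination of the neighbor points."""
    w = rng.dirichlet(np.ones(len(neighbors)))
    return w @ neighbors

# Synthetic stand-in for a PCA-reduced cloud of shape (n, 3).
rng = np.random.default_rng(0)
U = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 0.5 * U[:, 0] ** 2 - 0.3 * U[:, 0] * U[:, 1] + 0.01 * rng.normal(size=200)
Z = np.column_stack([U, y])

c, b, A = fit_quadratic_graph(Z)
nearest = np.argsort(np.linalg.norm(Z - Z[0], axis=1))[:5]
z_init = barycentric_init(Z[nearest], rng)
z_syn = project_to_surface(z_init, c, b, A)
print(abs(surface_residual(z_init, c, b, A)), abs(surface_residual(z_syn, c, b, A)))
```

The barycentric step keeps the starting point inside the convex hull of real neighbors; the projection then corrects for the fact that a convex combination of points on a curved surface generally falls off it.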

In practice

Use CoPaGE-300K for local geometry and latent probing studies.
Employ adaptive PCA dimensionality based on explained variance.
Distinguish geometric validity from classification utility in augmentation.
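For the adaptive PCA recommendation, one common approach is to keep the smallest number of components whose cumulative explained variance reaches a threshold. The sketch below is a minimal version of that idea; the 0.95 threshold and the name `adaptive_pca` are illustrative assumptions, and the summary does not state which criterion or value the authors use.

```python
import numpy as np

def adaptive_pca(X, var_threshold=0.95):
    """Project X onto the fewest principal components whose cumulative
    explained variance ratio reaches var_threshold (an assumed default)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions; S**2 is proportional to variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = S ** 2 / np.sum(S ** 2)
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, var_threshold)) + 1
    k = min(k, len(S))                        # guard against round-off at 1.0
    return Xc @ Vt[:k].T, Vt[:k], k

# Synthetic cloud: 3 latent directions embedded in 20 dimensions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

Z_red, components, k = adaptive_pca(X, var_threshold=0.95)
print(k, Z_red.shape)
```

Because the dimensionality is chosen per cloud, different local neighborhoods can end up with different `k`, which is the point of making the choice adaptive.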

Topics

Controlled Paraphrase Geometry
Sentence Embeddings
Local Manifold Modeling
Latent Probing
Fitted Surfaces

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.