Clustering Unstructured Text with LLM Embeddings and HDBSCAN

2026-06-23 · Source: MachineLearningMastery.com · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

This article details a text clustering pipeline that combines large language model (LLM) embeddings with HDBSCAN to discover topics in unlabeled text data automatically. The pipeline first generates text embeddings with a pre-trained `sentence-transformers` model, `all-MiniLM-L6-v2`, to capture semantic meaning. It then reduces these high-dimensional embeddings to 5 dimensions with UMAP to prepare them for clustering. Finally, it applies HDBSCAN with hyperparameters such as `min_cluster_size=8` and `min_samples=3` to identify topic clusters and potential noise points. The demonstration uses a 150-instance sample of the `fetch_20newsgroups` dataset, identifies two distinct clusters, and includes visualizations of the reduced embeddings.
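
As a rough illustration of the data step, the sketch below draws a reproducible 150-document sample. The summary does not say how the sampling was done, so the helper names `sample_documents` and `load_newsgroups_sample`, the seed, and the `subset` and `remove` settings are all assumptions.

```python
import random


def sample_documents(texts, n=150, seed=42):
    """Return up to n non-empty documents, drawn reproducibly."""
    # Drop documents that are empty or whitespace-only.
    cleaned = [t.strip() for t in texts if t and t.strip()]
    rng = random.Random(seed)
    return rng.sample(cleaned, min(n, len(cleaned)))


def load_newsgroups_sample(n=150, seed=42):
    """Fetch 20 Newsgroups via scikit-learn and sample n posts."""
    # Imported lazily so sample_documents works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    # Assumption: the training split with headers, footers, and quotes
    # stripped. The original article may have used different settings.
    data = fetch_20newsgroups(
        subset="train", remove=("headers", "footers", "quotes")
    )
    return sample_documents(data.data, n=n, seed=seed)
```

Calling `load_newsgroups_sample()` downloads the dataset on first use. Filtering out empty posts before sampling avoids embedding blank strings once headers and quotes are removed.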

Key takeaway

For machine learning engineers building unsupervised text analysis systems, this pipeline offers a way to identify topics in unlabeled data automatically. Use `sentence-transformers` for semantic embeddings, UMAP for dimensionality reduction, and HDBSCAN for density-based cluster detection. Experiment with HDBSCAN hyperparameters such as `min_cluster_size` on your own datasets, because these settings significantly influence the clustering results.

Key insights

Combining LLM embeddings with UMAP and HDBSCAN enables automatic topic discovery in unstructured text.

Principles

LLM embeddings capture semantic meaning and linguistic nuances.
HDBSCAN infers the number of clusters from the data's density structure instead of requiring it up front, and labels outliers as noise.
Dimensionality reduction improves clustering efficiency and visualization.

Method

Generate text embeddings using `sentence-transformers`, reduce dimensionality with UMAP (e.g., to 5 components), then apply HDBSCAN to discover topic clusters and visualize results.
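
The method above can be sketched roughly as follows. This is a minimal sketch, not the article's exact code: the function names `cluster_texts` and `summarize_labels`, the `seed` value, and the use of scikit-learn's `HDBSCAN` class (rather than the standalone `hdbscan` package) are assumptions. The model name and the hyperparameter values come from the summary.

```python
from collections import Counter


def cluster_texts(texts, n_components=5, min_cluster_size=8,
                  min_samples=3, seed=42):
    """Embed texts, reduce with UMAP, cluster with HDBSCAN; return labels."""
    # Heavy dependencies are imported lazily so that summarize_labels
    # below stays usable on its own.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import HDBSCAN  # assumption: scikit-learn >= 1.3
    import umap  # provided by the umap-learn package

    # Step 1: sentence embeddings (384 dimensions for this model).
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(list(texts))

    # Step 2: reduce to a handful of dimensions before clustering.
    reducer = umap.UMAP(n_components=n_components, random_state=seed)
    reduced = reducer.fit_transform(embeddings)

    # Step 3: density-based clustering; the label -1 marks noise points.
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size,
                        min_samples=min_samples)
    return clusterer.fit_predict(reduced)


def summarize_labels(labels):
    """Count clusters and noise points in an HDBSCAN label array."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(sorted(counts.items())),
    }
```

A typical call would be `summarize_labels(cluster_texts(texts))`, where `texts` is a list of strings. For plotting, a second UMAP fit with `n_components=2` is a common choice, though the summary does not say how the article produced its visualizations.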

In practice

Use `all-MiniLM-L6-v2` for lightweight embedding generation.
Experiment with HDBSCAN hyperparameters like `min_cluster_size`.
Visualize UMAP-reduced embeddings for cluster insights.
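
One way to run the hyperparameter experiment suggested above is a small sweep that records, for each `min_cluster_size`, how many clusters appear and what fraction of points is labeled noise. The sketch below is illustrative: `sweep_min_cluster_size`, `default_cluster_fn`, and the pluggable `cluster_fn` parameter are hypothetical names, and the default relies on scikit-learn's `HDBSCAN` as an assumption.

```python
from collections import Counter


def default_cluster_fn(points, min_cluster_size):
    """Cluster points with scikit-learn's HDBSCAN (assumed backend)."""
    from sklearn.cluster import HDBSCAN  # assumption: scikit-learn >= 1.3

    return HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(points)


def sweep_min_cluster_size(points, sizes, cluster_fn=default_cluster_fn):
    """Return one summary row per candidate min_cluster_size."""
    rows = []
    for size in sizes:
        labels = [int(label) for label in cluster_fn(points, size)]
        counts = Counter(labels)
        n_noise = counts.pop(-1, 0)  # -1 is the noise label
        rows.append({
            "min_cluster_size": size,
            "n_clusters": len(counts),
            "noise_fraction": n_noise / len(labels) if labels else 0.0,
        })
    return rows
```

Passing the UMAP-reduced embeddings as `points` with, say, `sizes=[5, 8, 12, 20]` gives a quick table to compare. Larger values tend to merge or drop small topics, while smaller values tend to produce more, finer clusters.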

Topics

LLM Embeddings
HDBSCAN
UMAP
Text Clustering
Sentence Transformers
Unsupervised Learning

Best for: Machine Learning Engineer, AI Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by MachineLearningMastery.com.