Clustering Unstructured Text with LLM Embeddings and HDBSCAN
Summary
This article walks through a text clustering pipeline that combines large language model (LLM) embeddings with HDBSCAN to discover topics in unlabeled text. The pipeline first generates embeddings with the pre-trained `sentence-transformers` model `all-MiniLM-L6-v2` to capture semantic meaning. It then reduces these high-dimensional embeddings to 5 dimensions with UMAP to prepare them for clustering. Finally, it applies HDBSCAN with `min_cluster_size=8` and `min_samples=3` to identify topic clusters and flag potential noise points. The demonstration runs on a 150-instance sample of the `fetch_20newsgroups` dataset, where it identifies two distinct clusters, and it includes visualizations of the reduced embeddings.
Key takeaway
For Machine Learning Engineers building unsupervised text analysis systems, this pipeline offers a way to identify topics in unlabeled data without fixing the number of topics in advance. It combines `sentence-transformers` for semantic embeddings, UMAP for dimensionality reduction, and HDBSCAN for density-based cluster detection. Experiment with HDBSCAN hyperparameters such as `min_cluster_size` on your own datasets, because these settings significantly influence how many clusters are found and how many points are treated as noise.
Key insights
Combining LLM embeddings with UMAP and HDBSCAN enables automatic topic discovery in unstructured text.
Principles
- LLM embeddings capture semantic meaning and linguistic nuances.
- HDBSCAN infers the number of clusters from the density of the data instead of requiring it up front, and it labels low-density points as noise.
- Dimensionality reduction improves clustering efficiency and visualization.
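To make the noise-detection principle concrete, the sketch below summarizes a label array of the kind HDBSCAN returns. It relies on the convention used by the `hdbscan` and scikit-learn implementations that noise points receive the label `-1`. The helper name `summarize_labels` and the sample labels are hypothetical and not taken from the original article.

```python
from collections import Counter

# Convention in the hdbscan and scikit-learn implementations: -1 marks noise.
NOISE_LABEL = -1


def summarize_labels(labels):
    """Count clusters and noise points in a cluster-label sequence."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(NOISE_LABEL, 0)  # noise is not a cluster
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "cluster_sizes": dict(sorted(counts.items())),
    }


# Illustrative labels only, not results from the article.
summary = summarize_labels([0, 0, 1, -1, 1, 1])
print(summary)
```

The number of clusters is read off the labels after fitting, which is why it does not need to be chosen beforehand.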
Method
Generate text embeddings using `sentence-transformers`, reduce dimensionality with UMAP (e.g., to 5 components), then apply HDBSCAN to discover topic clusters and visualize results.
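The method above can be sketched roughly as follows. This is a minimal illustration rather than the original article's code: it assumes the `sentence-transformers`, `umap-learn`, `hdbscan`, and `scikit-learn` packages are installed, and the function names, the `random_state`, the newsgroups `subset`/`remove` options, and the sampling approach are all assumptions. Only the model name, the 5 UMAP components, the two HDBSCAN hyperparameters, and the sample size of 150 come from the summary.

```python
import random


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts into dense vectors (imported lazily: heavy dependency)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts), show_progress_bar=False)


def reduce_embeddings(embeddings, n_components=5, random_state=42):
    """Project embeddings to a few dimensions with UMAP."""
    import umap

    reducer = umap.UMAP(n_components=n_components, random_state=random_state)
    return reducer.fit_transform(embeddings)


def cluster_points(points, min_cluster_size=8, min_samples=3):
    """Cluster the reduced points; label -1 marks noise."""
    import hdbscan

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size, min_samples=min_samples
    )
    return clusterer.fit_predict(points)


def discover_topics(texts, embed=embed_texts, reduce=reduce_embeddings,
                    cluster=cluster_points):
    """Run embed -> reduce -> cluster and return one integer label per text."""
    labels = cluster(reduce(embed(texts)))
    return [int(label) for label in labels]


def run_demo(n_docs=150, seed=42):
    """Hypothetical demo: sample newsgroup posts and cluster them."""
    from sklearn.datasets import fetch_20newsgroups

    docs = fetch_20newsgroups(
        subset="train", remove=("headers", "footers", "quotes")
    ).data
    texts = random.Random(seed).sample(docs, n_docs)
    return discover_topics(texts)
```

Calling `run_demo()` downloads the dataset and the model on first use. Keeping each stage as a separate function makes it easy to swap one out, for example a different embedding model, without touching the other steps.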
In practice
- Use `all-MiniLM-L6-v2` for lightweight embedding generation.
- Experiment with HDBSCAN hyperparameters like `min_cluster_size`.
- Visualize UMAP-reduced embeddings for cluster insights.
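One possible way to act on the last two suggestions is sketched below: a small sweep over `min_cluster_size` that reports the cluster count and noise fraction for each setting, plus a scatter plot of the reduced points. It assumes `hdbscan` and `matplotlib` are installed. The function names, the candidate sizes, and the choice to plot only the first two UMAP components are assumptions, not details from the original article.

```python
def sweep_min_cluster_size(points, sizes, min_samples=3, cluster=None):
    """Return (size, n_clusters, noise_fraction) for each candidate size.

    `cluster` is an optional callable (points, size) -> labels; by default
    HDBSCAN from the hdbscan package is used (imported lazily).
    """
    if cluster is None:
        import hdbscan

        def cluster(pts, size):
            model = hdbscan.HDBSCAN(min_cluster_size=size, min_samples=min_samples)
            return model.fit_predict(pts)

    rows = []
    for size in sizes:
        labels = [int(label) for label in cluster(points, size)]
        n_clusters = len(set(labels) - {-1})  # -1 is the noise label
        noise_fraction = labels.count(-1) / len(labels)
        rows.append((size, n_clusters, noise_fraction))
    return rows


def plot_clusters(points, labels):
    """Scatter the first two reduced dimensions, colored by cluster label."""
    import matplotlib.pyplot as plt

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    plt.scatter(xs, ys, c=list(labels), cmap="tab10", s=12)
    plt.xlabel("UMAP component 1")
    plt.ylabel("UMAP component 2")
    plt.title("HDBSCAN clusters on UMAP-reduced embeddings")
    plt.show()
```

A call such as `sweep_min_cluster_size(reduced, [5, 8, 12, 20])`, where `reduced` is the hypothetical UMAP output, gives a quick view of how sensitive the topic count is to this setting before committing to one value.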
Topics
- LLM Embeddings
- HDBSCAN
- UMAP
- Text Clustering
- Sentence Transformers
- Unsupervised Learning
Best for: Machine Learning Engineer, AI Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by MachineLearningMastery.com.