Tagging my blog posts with BERTopic and LLMs
Summary
The author built a tagging system for their blog posts by combining BERTopic with several Large Language Models (LLMs), spending 6-10 hours spread over a month and finishing a project abandoned in 2023. The 2026 attempt benefited from much larger LLM context windows (200k-1M tokens), agentic harnesses such as Pi, and hybrid machine learning strategies. BERTopic handled the initial unsupervised topic clustering: it embeds the documents, reduces the dimensionality of the embeddings with UMAP, and clusters the result with HDBSCAN. LLMs and LLM tools such as Gemini, GPT-OSS, and Claude Code then refined the generic BERTopic clusters into meaningful tags and batch-tagged the posts. The author also used LLMs to generate the UI, noting how well they suit small, well-defined personal development tasks.
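The embed, reduce, and cluster steps described above can be sketched with BERTopic's pluggable sub-models. This is a minimal sketch, not the author's actual configuration: the embedding model name, the UMAP and HDBSCAN hyperparameters, and the helper names `cluster_posts` and `topic_sizes` are all assumptions chosen for illustration.

```python
from collections import Counter


def cluster_posts(docs: list[str], min_cluster_size: int = 5):
    """Embed, reduce, and cluster documents with BERTopic.

    Imports live inside the function so this file loads even when the
    ML stack (bertopic, sentence-transformers, umap-learn, hdbscan)
    is not installed.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    # Assumed model and hyperparameters; tune these for your own corpus.
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size,
                            metric="euclidean", prediction_data=True)

    topic_model = BERTopic(embedding_model=embedding_model,
                           umap_model=umap_model,
                           hdbscan_model=hdbscan_model)
    # fit_transform returns one topic id per document, plus probabilities.
    topics, _probs = topic_model.fit_transform(docs)
    return topic_model, topics


def topic_sizes(topics: list[int]) -> dict[int, int]:
    """Count documents per topic id; -1 is the outlier bucket."""
    return dict(Counter(topics))
```

After fitting, `topic_model.get_topic_info()` gives a table of topics with their sizes and top keywords, which is a reasonable starting point for deciding which clusters deserve a tag.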
Key takeaway
If you are a software engineer or content creator adding content discovery features such as tagging to a static site, consider a blended approach. Pairing BERTopic for initial topic clustering with LLMs run through agentic harnesses significantly reduces the development time for refining tags and batch-processing content, and it allows rapid iteration on taxonomy and UI. Be mindful of rising LLM token costs, which may call for leaning more heavily on efficient, traditional ML components.
Key insights
Blending BERTopic with LLMs and agentic harnesses efficiently creates robust content tagging systems for personal projects.
Principles
- LLMs are efficient at compressing and labeling text, not only at generating it.
- Blended ML (LLMs + classical) yields effective systems.
- Agentic harnesses streamline LLM-driven workflows.
Method
Use BERTopic for unsupervised document clustering and initial topic extraction, then use LLMs to refine the generic clusters into specific tags. Run batch-tagging and UI generation through an agentic harness, iterating on the taxonomy and presentation as you go.
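Refining generic clusters into tags could look roughly like the sketch below. It assumes each cluster is represented by its top keywords (for example, the words returned by BERTopic's `get_topic`) and that the LLM is reachable through a plain prompt-in, text-out callable. The `LLM` type, the prompt wording, and the function names are hypothetical, not taken from the original article.

```python
from typing import Callable

# Hypothetical LLM interface: takes a prompt string, returns the model's text.
LLM = Callable[[str], str]


def build_label_prompt(keywords: list[str]) -> str:
    """Turn one cluster's top keywords into a labeling prompt."""
    return (
        "These keywords describe one cluster of blog posts: "
        + ", ".join(keywords)
        + ". Reply with a single short tag (one to three words) and nothing else."
    )


def refine_labels(clusters: dict[int, list[str]], llm: LLM) -> dict[int, str]:
    """Ask the LLM for one tag per cluster, skipping the outlier bucket."""
    labels: dict[int, str] = {}
    for topic_id, keywords in clusters.items():
        if topic_id == -1:
            continue  # -1 holds outliers, which have no coherent theme
        labels[topic_id] = llm(build_label_prompt(keywords)).strip().lower()
    return labels
```

Keeping the LLM behind a callable makes it easy to swap providers or to substitute a stub while iterating on the prompt.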
In practice
- Apply BERTopic for initial topic clustering.
- Use LLMs to refine cluster labels and batch-tag content.
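For the batch-tagging step, one possible shape is to give the LLM a fixed taxonomy and discard anything it returns outside that list. The sketch below assumes the model is asked to reply with a JSON array. The `LLM` callable, the prompt text, and the `tag_post` and `tag_posts` helpers are illustrative assumptions rather than the author's implementation.

```python
import json
from typing import Callable

# Hypothetical LLM interface: takes a prompt string, returns the model's text.
LLM = Callable[[str], str]


def tag_post(post_text: str, taxonomy: list[str], llm: LLM,
             max_tags: int = 3) -> list[str]:
    """Ask the LLM for tags and keep only those in the taxonomy."""
    prompt = (
        f"Choose up to {max_tags} tags for the post below. "
        f"Use only tags from this list: {json.dumps(taxonomy)}. "
        "Reply with a JSON array of strings and nothing else.\n\n"
        + post_text
    )
    try:
        proposed = json.loads(llm(prompt))
    except json.JSONDecodeError:
        return []  # unparseable reply: leave the post untagged
    if not isinstance(proposed, list):
        return []

    allowed = set(taxonomy)
    tags: list[str] = []
    for tag in proposed:
        # Drop invented tags and duplicates, preserving the model's order.
        if isinstance(tag, str) and tag in allowed and tag not in tags:
            tags.append(tag)
    return tags[:max_tags]


def tag_posts(posts: dict[str, str], taxonomy: list[str],
              llm: LLM) -> dict[str, list[str]]:
    """Tag every post, keyed by filename."""
    return {name: tag_post(text, taxonomy, llm) for name, text in posts.items()}
```

Validating against the taxonomy keeps the tag set stable across runs, so a model that invents a new tag cannot silently grow the list.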
Topics
- BERTopic
- Large Language Models
- Topic Modeling
- Agentic AI
- Content Tagging
- Static Sites
Best for: AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Tech Blog on ✰Vicki Boykis✰.