LLMs Need Encoders for Semantic IDs Too

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

A new lightweight Semantic ID (SID) encoder, PrefixMem, is proposed to enhance Large Language Models (LLMs) in generative recommendation systems. The authors argue that SIDs, which are hierarchical codes with context-dependent meanings, function as a distinct modality similar to images or audio, necessitating a dedicated encoder. Unlike current methods that merely add SID tokens to an LLM's vocabulary, PrefixMem provides structured, prefix-conditioned representations by leveraging prefix n-gram memory tables. This encoder can be pre-trained independently and then integrated with any LLM for joint training. Evaluations using large-scale Pinterest data across multiple LLM families demonstrated significant improvements: PrefixMem boosted deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative at equivalent training compute. Its benefits were particularly pronounced on challenging examples, achieving up to 77% relative accuracy gains where greedy decoding typically failed.

Key takeaway

Machine learning engineers building generative recommendation systems should consider a dedicated Semantic ID (SID) encoder such as PrefixMem. An LLM that represents SIDs only as added vocabulary tokens may underperform, because the meaning of each code depends on the prefix that precedes it. In the reported evaluation, a pre-trainable SID encoder improved deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative at equivalent training compute, with the largest gains on hard examples where greedy decoding fails. Evaluate PrefixMem to improve the precision of your recommendation outputs.

Key insights

Semantic IDs in LLMs benefit from dedicated prefix-conditioned encoders like PrefixMem, similar to how multimodal LLMs use vision encoders.

Principles

Method

PrefixMem uses prefix n-gram memory tables to generate structured, prefix-conditioned representations for Semantic ID tokens, which are then fed to the LLM.
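
The summary does not give PrefixMem's internals, so the following is only a minimal sketch of what a prefix n-gram memory lookup could look like. It makes several assumptions: hashed tables, one table per level and n-gram order, summation across orders, and random rows standing in for learned parameters. The class name PrefixNgramMemory and every parameter are hypothetical and are not taken from the paper. The point it illustrates is that the same code at the same level receives a different vector when its prefix differs.

```python
import numpy as np


class PrefixNgramMemory:
    """Illustrative prefix n-gram memory for Semantic IDs (hypothetical, not the paper's code)."""

    def __init__(self, num_levels, table_size=4096, dim=16, max_n=3, seed=0):
        rng = np.random.default_rng(seed)
        self.num_levels = num_levels
        self.table_size = table_size
        self.max_n = max_n
        # One table per (level, n-gram order); random rows stand in for learned parameters.
        self.tables = rng.normal(0.0, 0.02, size=(num_levels, max_n, table_size, dim))

    def _slot(self, level, ngram):
        # Deterministic polynomial hash of an n-gram into a table row (assumed scheme).
        h = level + 1
        for code in ngram:
            h = (h * 1_000_003 + code + 1) % self.table_size
        return h

    def encode(self, sid):
        """Return one vector per SID level, each conditioned on the codes before it."""
        assert len(sid) == self.num_levels
        out = []
        for level in range(self.num_levels):
            vec = np.zeros(self.tables.shape[-1])
            # n = 1 looks up the current code alone; larger n adds preceding codes.
            for n in range(1, min(self.max_n, level + 1) + 1):
                ngram = tuple(sid[level - n + 1 : level + 1])
                vec += self.tables[level, n - 1, self._slot(level, ngram)]
            out.append(vec)
        return np.stack(out)


mem = PrefixNgramMemory(num_levels=3)
a = mem.encode((3, 7, 1))
b = mem.encode((5, 7, 1))
# Code 7 sits at level 1 in both SIDs, but its vector differs because the prefix differs.
print(np.allclose(a[1], b[1]))
```

In a full system these per-level vectors would presumably be projected to the LLM's hidden size and supplied in place of, or alongside, the plain SID token embeddings. That wiring is an assumption here and is not shown.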

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.