Memory in LLMs: Weights and Activations - Jack Morris, Cornell

Source: AI Engineer · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, extended

Summary

Jack Morris of Cornell discusses the limitations of current large language models (LLMs) such as ChatGPT: their knowledge cutoff dates, their weakness on niche or "long-tail" tasks, and their lack of company-specific information. He identifies three primary ways to inject knowledge into an LLM: putting everything in the context window (full context), Retrieval-Augmented Generation (RAG), and training the knowledge directly into the model weights. Full context is simple for small datasets but becomes prohibitively expensive and slow for larger inputs, with performance degrading significantly beyond 10,000 tokens because the cost of Transformer self-attention grows quadratically with sequence length. RAG, which is widely adopted and powered by vector databases, has fundamental limitations of its own: its embeddings are non-adaptive, and it cannot capture complex combinatorial relationships or perform latent reasoning across multiple documents. Morris argues that training knowledge into weights, although more complex and more expensive upfront, offers a path to more capable and efficient LLMs for specialized tasks.
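To make the scaling argument concrete, here is a minimal sketch of how the self-attention score matrix grows with context length. It counts only the query-key pairs for a single head in a single layer and ignores everything else (feed-forward layers, KV caching, optimized attention kernels). The function name and the token counts are illustrative assumptions, not figures from the talk.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key score entries for one attention head in one layer.

    Every token attends to every token, so the score matrix is
    num_tokens x num_tokens. (Causal masking roughly halves this,
    but the growth is still quadratic.)
    """
    return num_tokens * num_tokens


# Illustrative context lengths (assumed, not taken from the talk).
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} score entries")
```

Under this simplified count, a 10x longer context means 100x more score entries, which is why stuffing a large corpus into the prompt gets expensive quickly.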

Key takeaway

For AI engineers building specialized LLM applications, relying solely on RAG or full context windows for proprietary or niche data is increasingly inefficient and limiting. Investigate methods for training knowledge directly into model weights, using synthetic data generation together with parameter-efficient fine-tuning techniques such as LoRA or memory layers. This approach requires a higher upfront computational investment, but it promises faster, more cost-effective inference and better reasoning for your specific use cases, and it moves beyond the fundamental limitations of current retrieval-based systems.

Key insights

Current LLMs struggle with niche knowledge and context scaling; training knowledge into weights offers a path beyond RAG's limitations.
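For readers less familiar with the retrieval step behind RAG, the following is a minimal sketch of it under simplifying assumptions. The document ids and the three-dimensional vectors are hypothetical and hand-written; a real system would obtain vectors from an embedding model and store them in a vector database. The point is only that document vectors are fixed at indexing time and each document is scored against the query independently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def retrieve(query_vec, index, k=1):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        index,
        key=lambda doc_id: cosine(query_vec, index[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical precomputed "embeddings"; they never change with the query.
index = {
    "pricing_policy": [0.9, 0.1, 0.0],
    "onboarding_guide": [0.1, 0.9, 0.1],
    "incident_report": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

top_docs = retrieve(query, index, k=2)
print(top_docs)
```

Because each document is ranked on its own similarity score, nothing in this step reasons about how two documents relate to each other; any cross-document combination has to happen later, inside the model's context.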

Method

Generate diverse synthetic data from a small dataset, then use parameter-efficient fine-tuning (e.g., LoRA, prefix tuning, or memory layers) to inject that knowledge into a small, targeted set of parameters, which minimizes forgetting of what the model already knows.
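As one illustration of the parameter-efficient step, here is a minimal pure-Python sketch of the standard LoRA formulation, in which a frozen weight matrix W is augmented by a trainable low-rank product: W + (alpha / r) * B A. It does not cover synthetic data generation or the training loop, and the dimensions, rank, and alpha below are assumed values chosen for readability, not settings from the talk.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(a)
    delta = matmul(b, a)  # shape (d_out, d_in), same as W
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Assumed toy sizes; real layers are far larger.
d_out, d_in, rank, alpha = 8, 8, 2, 4.0
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable, small random init
b = [[0.0] * rank for _ in range(d_out)]                               # trainable, zero init

# With B initialised to zero, the adapter leaves the base model unchanged.
unchanged = lora_effective_weight(w, a, b, alpha) == w

full_params = d_out * d_in           # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)  # parameters updated by the adapter
print(f"unchanged at init: {unchanged}")
print(f"full fine-tuning: {full_params} params, LoRA adapter: {lora_params} params")
```

Only A and B would receive gradient updates, so the original weights stay intact; this is one reason adapter-style methods are associated with reduced forgetting. The savings grow with layer size, since the adapter scales with r * (d_in + d_out) rather than d_in * d_out.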

Best for: AI Engineer, NLP Engineer, CTO, Machine Learning Engineer, AI Researcher, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.