Attention Is Not What You Think!

2026-01-11 · Source: Agus’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

The article argues that in-context learning for tabular data is fundamentally "row attention", and that Random Forests have offered it for two decades. Using the California Housing dataset, the author shows how a Random Forest Regressor trained with `n_estimators=200`, `max_depth=10`, and `min_samples_leaf=10` implicitly learns row similarity. This "Random Forest proximity" is the fraction of trees in which two rows fall into the same leaf, which gives a similarity score between 0 and 1. The proximities are then normalized into "attention weights" that predict a target value, such as a house price, as a weighted average of the outcomes of similar rows. The article describes this mechanism as nonlinear, conditional, and feature-selective, and presents it as a more sophisticated form of attention than kNN's fixed distance metric, one that achieves results similar to Transformer-based in-context learning without a complex architecture.

Key takeaway

If you are a Machine Learning Engineer evaluating advanced models for tabular data, note that "in-context learning" is not exclusive to Transformers. Your existing Random Forest models already perform a sophisticated form of row attention, which the article presents as a robust, interpretable alternative for tasks that require dynamic sample weighting. Consider exploring GBDT proximity, or combining row and column attention, to extend your current tabular ML approaches.

Key insights

In-context learning for tabular data is row attention, a mechanism Random Forests have employed for 20 years.

Principles

Row attention identifies which rows are relevant to a prediction and how much each one influences it.
Random Forest proximity measures row similarity via shared leaves.
Learned similarity often surpasses fixed distance metrics.

Method

Train a Random Forest, compute row proximity by counting shared leaves across trees, normalize these counts into attention weights, and use these weights to aggregate target values for prediction.
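
The method above can be written compactly. The notation here is an assumption introduced for illustration, not taken from the article: $T$ is the number of trees, $\ell_t(x)$ is the leaf that row $x$ reaches in tree $t$, $q$ is the row to predict, and $(x_j, y_j)$ are the $n$ context rows with known targets.

$$
\begin{aligned}
p(q, x_j) &= \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\left[\ell_t(q) = \ell_t(x_j)\right] \\
w_j(q) &= \frac{p(q, x_j)}{\sum_{k=1}^{n} p(q, x_k)} \\
\hat{y}(q) &= \sum_{j=1}^{n} w_j(q)\, y_j
\end{aligned}
$$

Each proximity $p(q, x_j)$ lies between 0 and 1, the weights $w_j(q)$ sum to 1, and the prediction is therefore a weighted average of the context targets.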

In practice

Use `rf.apply(X)` to get each row's leaf index in every tree.
Calculate the proximity of two rows as the fraction of trees in which they share a leaf.
Normalize each row's proximities so they sum to 1, giving the attention weights.
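
The three steps above could be sketched as follows. This is a minimal illustration, not the author's code: it uses synthetic data from `make_regression` in place of California Housing so that it runs offline, and the helper name `proximity_attention` is hypothetical. The Random Forest hyperparameters are the ones quoted in the summary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def proximity_attention(rf, X_context, X_query):
    """Return (n_query, n_context) attention weights from shared-leaf proximity.

    Hypothetical helper for illustration only.
    """
    leaves_ctx = rf.apply(X_context)  # shape (n_context, n_trees)
    leaves_qry = rf.apply(X_query)    # shape (n_query, n_trees)
    # proximity[i, j] = fraction of trees where query i and context j share a leaf
    same_leaf = leaves_qry[:, None, :] == leaves_ctx[None, :, :]
    proximity = same_leaf.mean(axis=2)
    # Normalize each query's proximities so they sum to 1
    totals = proximity.sum(axis=1, keepdims=True)
    return proximity / np.clip(totals, 1e-12, None)


# Synthetic stand-in data (assumption: the article uses California Housing)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, y_train = X[:400], y[:400]
X_query = X[400:420]

rf = RandomForestRegressor(
    n_estimators=200, max_depth=10, min_samples_leaf=10, random_state=0
)
rf.fit(X_train, y_train)

weights = proximity_attention(rf, X_train, X_query)
# Attention-weighted average of the training targets for each query row
y_attn = weights @ y_train
```

The broadcasted comparison builds an array of size `n_query × n_context × n_trees`, which is fine for a small demo but would need batching over query rows on larger datasets.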

Topics

In-Context Learning
Random Forests
Attention Mechanisms
Tabular Data
Random Forest Proximity

Best for: Machine Learning Engineer, Data Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Agus’s Substack.